Fault Tolerance Interface
api.c File Reference

API functions for the FTI library.

#include "interface.h"

Functions

int FTI_Init (char *configFile, MPI_Comm globalComm)
 Initializes FTI.

int FTI_Status ()
 It returns the current status of the recovery flag.

int FTI_InitType (FTIT_type *type, int size)
 It initializes a data type.

int FTI_InitComplexType (FTIT_type *newType, FTIT_complexType *typeDefinition, int length, size_t size, char *name, FTIT_H5Group *h5group)
 It initializes a complex data type.

void FTI_AddSimpleField (FTIT_complexType *typeDefinition, FTIT_type *ftiType, size_t offset, int id, char *name)
 It adds a simple field to a complex data type.

void FTI_AddComplexField (FTIT_complexType *typeDefinition, FTIT_type *ftiType, size_t offset, int rank, int *dimLength, int id, char *name)
 It adds a complex field to a complex data type.

int FTI_GetStageDir (char *stageDir, int maxLen)
 Places the FTI staging directory path into 'stageDir'.

int FTI_GetStageStatus (int ID)
 Returns status of staging request.

int FTI_SendFile (char *lpath, char *rpath)
 Copies file asynchronously from 'lpath' to 'rpath'.

int FTI_InitGroup (FTIT_H5Group *h5group, char *name, FTIT_H5Group *parent)
 It initializes an HDF5 group.

int FTI_RenameGroup (FTIT_H5Group *h5group, char *name)
 Renames an HDF5 group.

int FTI_Protect (int id, void *ptr, long count, FTIT_type type)
 It sets or resets the pointer and type of a protected variable.

int FTI_DefineDataset (int id, int rank, int *dimLength, char *name, FTIT_H5Group *h5group)
 Defines the dataset.

long FTI_GetStoredSize (int id)
 Returns the size of a variable as saved in the metadata.

void * FTI_Realloc (int id, void *ptr)
 Reallocates dataset to last checkpoint size.

int FTI_FloatBitFlip (float *target, int bit)
 It corrupts a bit of the given float.

int FTI_DoubleBitFlip (double *target, int bit)
 It corrupts a bit of the given double.

int FTI_BitFlip (int datasetID)
 Bit-flip injection following the injection instructions.

int FTI_Checkpoint (int id, int level)
 It takes the checkpoint and triggers the post-checkpoint work.

int FTI_Recover ()
 It loads the checkpoint data.

int FTI_Snapshot ()
 Takes an FTI snapshot or recovers the data if it is a restart.

int FTI_Finalize ()
 It closes FTI properly on the application processes.

int FTI_RecoverVar (int id)
 During the restart, recovers the given variable.

void FTI_Print (char *msg, int priority)
 Prints FTI messages.

Variables

MPI_Comm FTI_COMM_WORLD
 
FTIT_type FTI_CHAR
 
FTIT_type FTI_SHRT
 
FTIT_type FTI_INTG
 
FTIT_type FTI_LONG
 
FTIT_type FTI_UCHR
 
FTIT_type FTI_USHT
 
FTIT_type FTI_UINT
 
FTIT_type FTI_ULNG
 
FTIT_type FTI_SFLT
 
FTIT_type FTI_DBLE
 
FTIT_type FTI_LDBE
 

Detailed Description

API functions for the FTI library.

Copyright (c) 2017 Leonardo A. Bautista-Gomez All rights reserved

FTI - A multi-level checkpointing library for C/C++/Fortran applications

Revision 1.0 : Fault Tolerance Interface (FTI)

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Date
October, 2017

Function Documentation

void FTI_AddComplexField ( FTIT_complexType *  typeDefinition,
FTIT_type *  ftiType,
size_t  offset,
int  rank,
int *  dimLength,
int  id,
char *  name
)

It adds a complex field to a complex data type.

Parameters
typeDefinition  Structure definition of the complex data type.
ftiType         Type of the field.
offset          Offset of the field (use offsetof).
rank            Rank of the array.
dimLength       Dimension length for each rank.
id              ID of the field (starts at 0).
name            Name of the field (pass NULL to use the default).
Returns
integer FTI_SCES if successful.

This function adds a field to the complex data type. Use the offsetof macro to set the offset. The first ID must be 0, and each subsequent ID must increase by 1. If name is NULL, FTI sets the name to "T${id}".

void FTI_AddSimpleField ( FTIT_complexType *  typeDefinition,
FTIT_type *  ftiType,
size_t  offset,
int  id,
char *  name
)

It adds a simple field to a complex data type.

Parameters
typeDefinition  Structure definition of the complex data type.
ftiType         Type of the field.
offset          Offset of the field (use offsetof).
id              ID of the field (starts at 0).
name            Name of the field (pass NULL to use the default).
Returns
integer FTI_SCES if successful.

This function adds a field to the complex data type. Use the offsetof macro to set the offset. The first ID must be 0, and each subsequent ID must increase by 1. If name is NULL, FTI sets the name to "T${id}". The field's rank and dimLength are set to 1.
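
The offset and ID rules above can be illustrated without FTI itself. The following is a minimal sketch, assuming a hypothetical field_desc record and a hypothetical add_field helper (neither is part of FTI); it only shows how offsetof yields the offsets, how IDs run 0, 1, 2, and how a NULL name falls back to "T<id>".

```c
#include <assert.h>
#include <stddef.h>   /* offsetof, size_t */
#include <stdio.h>    /* snprintf */

/* Hypothetical stand-in for one field entry; NOT FTI's FTIT_complexType. */
typedef struct {
    int    id;
    size_t offset;
    char   name[32];
} field_desc;

/* Example application structure whose layout is being described. */
typedef struct {
    int    step;
    double energy;
    char   tag;
} particle;

/* Record one field; a NULL name falls back to "T<id>", as the text describes. */
static void add_field(field_desc *fields, int id, size_t offset, const char *name)
{
    assert(id >= 0);
    fields[id].id = id;
    fields[id].offset = offset;
    if (name != NULL) {
        snprintf(fields[id].name, sizeof fields[id].name, "%s", name);
    } else {
        snprintf(fields[id].name, sizeof fields[id].name, "T%d", id);
    }
}

/* Describe the three fields of `particle` with consecutive IDs 0, 1, 2. */
static void describe_particle(field_desc fields[3])
{
    add_field(fields, 0, offsetof(particle, step),   "step");
    add_field(fields, 1, offsetof(particle, energy), NULL);   /* default name */
    add_field(fields, 2, offsetof(particle, tag),    "tag");
}
```

Using offsetof rather than hand-computed byte counts keeps the description correct when the compiler inserts padding between members.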

int FTI_BitFlip ( int  datasetID)

Bit-flip injection following the injection instructions.

Parameters
datasetID  ID of the dataset in which to inject.
Returns
integer FTI_SCES if successful.

This function injects the given number of bit-flips, at the given frequency and in the given location (rank, dataset, bit position).

int FTI_Checkpoint ( int  id,
int  level 
)

It takes the checkpoint and triggers the post-checkpoint work.

Parameters
id     Checkpoint ID.
level  Checkpoint level.
Returns
integer FTI_SCES if successful.

This function starts by blocking on a receive if the previous checkpoint was offline. It then updates the checkpoint information, writes the checkpoint data, creates the metadata, and triggers the post-processing work. In terms of communication, this function is the counterpart of the FTI_Listen function.

int FTI_DefineDataset ( int  id,
int  rank,
int *  dimLength,
char *  name,
FTIT_H5Group *  h5group
)

Defines the dataset.

Parameters
id         ID for searches and updates.
rank       Rank of the array.
dimLength  Dimension length for each rank.
name       Name of the dataset in the HDF5 file.
h5group    Group of the dataset. If NULL, the root group "/" is used.
Returns
integer FTI_SCES if successful.

This function gives FTI all information needed by HDF5 to correctly save the dataset in the checkpoint file.

int FTI_DoubleBitFlip ( double *  target,
int  bit 
)

It corrupts a bit of the given double.

Parameters
target  Pointer to the double to corrupt.
bit     Position of the bit to corrupt.
Returns
integer FTI_SCES if successful.

This function flips the given bit of the target double.

int FTI_Finalize ( )

It closes FTI properly on the application processes.

Returns
integer FTI_SCES if successful.

This function notifies the FTI processes that the execution is over, frees some data structures, and shuts FTI down. If it is not called on the application processes, the FTI processes never finish (deadlock).

int FTI_FloatBitFlip ( float *  target,
int  bit 
)

It corrupts a bit of the given float.

Parameters
target  Pointer to the float to corrupt.
bit     Position of the bit to corrupt.
Returns
integer FTI_SCES if successful.

This function flips the given bit of the target float.
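
A single-bit flip of a float can be sketched in portable C as follows. This is an illustration of the technique only, not FTI's implementation: the helper name flip_float_bit, the 0 / -1 return codes, and the convention that bit 0 is the least significant bit are all assumptions, and the example relies on a 32-bit IEEE 754 float.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes a 32-bit float");

/* Flip bit `bit` of *target (assumed numbering: 0 = least significant bit).
 * Returns 0 on success, -1 if the bit index is out of range. */
static int flip_float_bit(float *target, int bit)
{
    uint32_t raw;

    assert(target != NULL);
    if (bit < 0 || bit >= 32) {
        return -1;
    }
    /* memcpy avoids the undefined behaviour of casting float* to uint32_t*. */
    memcpy(&raw, target, sizeof raw);
    raw ^= (uint32_t)1 << bit;
    memcpy(target, &raw, sizeof raw);
    return 0;
}
```

Under these assumptions, flipping bit 31 of 1.0f toggles the sign and gives -1.0f, and flipping the same bit again restores the original value.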

int FTI_GetStageDir ( char *  stageDir,
int  maxLen 
)

Places the FTI staging directory path into 'stageDir'.

Parameters
stageDir  Pointer to an allocated memory region.
maxLen    Size of the allocated memory region in bytes.
Returns
integer FTI_SCES if successful, FTI_NSCS otherwise.

This function places the FTI staging directory path in 'stageDir'. If the allocated size is not sufficient, no action is performed and FTI_NSCS is returned.
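
The "copy only if it fits, otherwise do nothing" contract can be sketched as below. The function copy_path_checked, the SKETCH_* return codes, and the choice to count the terminating NUL against maxLen are assumptions made for this illustration; they are not taken from FTI's source.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_SCES 0      /* hypothetical success code */
#define SKETCH_NSCS (-1)   /* hypothetical failure code */

/* Copy `src` into `dst` only if the whole string, including its
 * terminating NUL, fits into maxLen bytes. On failure dst is untouched. */
static int copy_path_checked(const char *src, char *dst, int maxLen)
{
    size_t needed;

    assert(src != NULL && dst != NULL);
    needed = strlen(src) + 1;
    if (maxLen <= 0 || needed > (size_t)maxLen) {
        return SKETCH_NSCS;   /* buffer too small: perform no action */
    }
    memcpy(dst, src, needed);
    return SKETCH_SCES;
}
```

A caller would typically size the buffer generously, check the return code, and only then use the path.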

int FTI_GetStageStatus ( int  ID)

Returns status of staging request.

Parameters
ID  ID of the staging request.
Returns
integer Status of the staging request on success, FTI_NSCS otherwise.

This function returns the status of the staging request corresponding to ID. The ID is returned by the function 'FTI_SendFile'. The status is one of the following five values:

  • FTI_SI_FAIL - Stage request failed
  • FTI_SI_SCES - Stage request succeeded
  • FTI_SI_ACTV - Stage request is currently being processed
  • FTI_SI_PEND - Stage request is pending
  • FTI_SI_NINI - There is no stage request with this ID

Note
If the status is FTI_SI_NINI, the ID is either invalid or the request has already finished (succeeded or failed). In the latter case, the first call to 'FTI_GetStageStatus' after completion returns FTI_SI_FAIL or FTI_SI_SCES and frees the stage request resources; the next call then returns FTI_SI_NINI.
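
The query-then-release behaviour described in the note can be modelled with a small table. This is a sketch under stated assumptions: the ST_* enumerators, the fixed-size requests table, and the helpers set_status and query_status are hypothetical and only mirror the documented lifecycle, not FTI's data structures.

```c
#include <assert.h>

/* Hypothetical status values; the real FTI_SI_* constants come from FTI's headers. */
enum stage_status { ST_NINI, ST_PEND, ST_ACTV, ST_SCES, ST_FAIL };

#define MAX_REQ 8
static enum stage_status requests[MAX_REQ];   /* zero-initialised: all ST_NINI */

/* Test helper: force the state of one request slot. */
static void set_status(int id, enum stage_status s)
{
    assert(id >= 0 && id < MAX_REQ);
    requests[id] = s;
}

/* Report the status of request `id`. A finished request is released on the
 * first query, so a second query for the same id reports ST_NINI. */
static enum stage_status query_status(int id)
{
    enum stage_status s;

    if (id < 0 || id >= MAX_REQ) {
        return ST_NINI;                 /* invalid ID */
    }
    s = requests[id];
    if (s == ST_SCES || s == ST_FAIL) {
        requests[id] = ST_NINI;         /* free the slot after reporting */
    }
    return s;
}
```

Because the final status is reported only once, a caller that needs it later should store the value returned by the first query after completion.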

long FTI_GetStoredSize ( int  id)

Returns the size of a variable as saved in the metadata.

Parameters
id  Variable ID.
Returns
long Size of the variable, or 0 if no size is saved.

This function returns the size of the variable with the given ID as saved in the metadata. This may differ from the size of the variable in the program. If called during recovery, it returns the size from the metadata file; if called after a checkpoint, it returns the size saved in the temporary metadata. If no size is saved in the metadata, it returns 0.

int FTI_Init ( char *  configFile,
MPI_Comm  globalComm 
)

Initializes FTI.

Parameters
configFile  FTI configuration file.
globalComm  Main MPI communicator of the application.
Returns
integer FTI_SCES if successful.

This function initializes the FTI context and prepares the heads to wait for checkpoints. FTI processes should never return from this function. In case of a restart, checkpoint files should be recovered and in place by the end of this function.

int FTI_InitComplexType ( FTIT_type *  newType,
FTIT_complexType *  typeDefinition,
int  length,
size_t  size,
char *  name,
FTIT_H5Group *  h5group
)

It initializes a complex data type.

Parameters
newType         The data type to be initialized.
typeDefinition  Structure definition of the new type.
length          Number of fields in the structure.
size            Size of the structure.
name            Name of the structure.
h5group         Group of the type.
Returns
integer FTI_SCES if successful.

This function initializes a complex data type. The new type can only consist of fields of flat FTI types (no arrays). The type definition must include:

  • length => number of fields in the new type
  • field[].type => type of the field in the new type
  • field[].name => name of the field in the new type
  • field[].rank => number of dimensions of the field
  • field[].dimLength[] => length of each dimension of the field

int FTI_InitGroup ( FTIT_H5Group *  h5group,
char *  name,
FTIT_H5Group *  parent
)

It initializes an HDF5 group.

Parameters
h5group  H5 group to initialize.
name     Name of the H5 group.
parent   Parent H5 group.
Returns
integer FTI_SCES if successful.

This function initializes a group defined by the user. If parent is NULL, the parent is set to the root group.

int FTI_InitType ( FTIT_type *  type,
int  size
)

It initializes a data type.

Parameters
type  The data type to be initialized.
size  The size of the data type to be initialized.
Returns
integer FTI_SCES if successful.

This function initializes a data type. The only information needed is the size of the data type; the rest is a black box for FTI. In the HDF5 format, such types are saved as byte arrays.

void FTI_Print ( char *  msg,
int  priority 
)

Prints FTI messages.

Parameters
msg       Message to print.
priority  Priority of the message to be printed.
Returns
void

This function prints messages depending on their priority and the verbosity level set by the user. DEBUG messages are printed by all processes with their rank. INFO messages are printed by one process. ERROR messages are printed with errno.

int FTI_Protect ( int  id,
void *  ptr,
long  count,
FTIT_type  type 
)

It sets or resets the pointer and type of a protected variable.

Parameters
id     ID for searches and updates.
ptr    Pointer to the data structure.
count  Number of elements in the data structure.
type   Type of elements in the data structure.
Returns
integer FTI_SCES if successful.

This function stores a pointer to a data structure, its size, its ID, its number of elements and the type of the elements. The resulting list of structures is the data that is stored during a checkpoint and loaded during a recovery. If the dataset was already registered, the function instead resets its pointer, size, number of elements and element type.
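
The register-or-reset behaviour keyed on the ID can be sketched with a small lookup table. Everything here is hypothetical and only illustrates the pattern: the protected_var record, the fixed-capacity vars table, the protect helper, and its 0 / -1 return codes are not FTI's actual types or values.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one protected variable; not FTI's internal type. */
typedef struct {
    int    id;
    void  *ptr;
    long   count;
    size_t eleSize;
    size_t size;      /* total bytes: count * eleSize */
} protected_var;

#define MAX_VARS 16
static protected_var vars[MAX_VARS];
static int n_vars = 0;

/* Register a variable, or reset the existing entry that has the same id.
 * Returns 0 on success, -1 if the table is full. */
static int protect(int id, void *ptr, long count, size_t eleSize)
{
    int i;

    assert(ptr != NULL && count >= 0);
    for (i = 0; i < n_vars; i++) {
        if (vars[i].id == id) {
            break;                      /* already registered: reset below */
        }
    }
    if (i == n_vars) {
        if (n_vars == MAX_VARS) {
            return -1;
        }
        n_vars++;                       /* new entry appended at index i */
    }
    vars[i].id = id;
    vars[i].ptr = ptr;
    vars[i].count = count;
    vars[i].eleSize = eleSize;
    vars[i].size = (size_t)count * eleSize;
    return 0;
}
```

Calling the helper a second time with the same ID after a buffer has been reallocated updates the entry in place instead of creating a duplicate.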

void* FTI_Realloc ( int  id,
void *  ptr 
)

Reallocates dataset to last checkpoint size.

Parameters
id   Variable ID.
ptr  Pointer to the variable.
Returns
ptr Pointer if successful, NULL otherwise.

This function loads the checkpoint data size from the metadata file, reallocates memory, and updates the data size information.
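
The resize step can be sketched with the standard realloc function. The var_meta record and the realloc_to_stored helper are hypothetical stand-ins for whatever FTI keeps in its metadata; the sketch only shows the pattern of resizing to a recorded size and updating the bookkeeping on success.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-variable bookkeeping; not FTI's metadata layout. */
typedef struct {
    long storedSize;    /* bytes recorded at the last checkpoint */
    long currentSize;   /* bytes currently allocated */
} var_meta;

/* Resize `ptr` to the size recorded at the last checkpoint.
 * Returns the new pointer, or NULL on failure (the old block stays valid). */
static void *realloc_to_stored(var_meta *meta, void *ptr)
{
    void *resized;

    assert(meta != NULL);
    if (meta->storedSize <= 0) {
        return NULL;                    /* no size recorded */
    }
    resized = realloc(ptr, (size_t)meta->storedSize);
    if (resized == NULL) {
        return NULL;                    /* allocation failed */
    }
    meta->currentSize = meta->storedSize;
    return resized;
}
```

As with realloc, the caller should assign the result to a temporary first, so that the original pointer is not lost when NULL is returned.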

int FTI_Recover ( )

It loads the checkpoint data.

Returns
integer FTI_SCES if successful.

This function loads the checkpoint data from the checkpoint file and updates some basic checkpoint information.

int FTI_RecoverVar ( int  id)

During the restart, recovers the given variable.

Parameters
id  ID of the variable to recover.
Returns
int FTI_SCES if successful.

During a restart, this function recovers the variable specified by the given id. It has no effect during a regular execution. The variable must already be protected; otherwise, FTI_NSCS is returned.

Planned improvements:

  • Open the checkpoint file at FTI_Init and close it at FTI_Snapshot.
  • Maintain a variable that accumulates the offset as variables are protected during the restart, to avoid looping to calculate the offset in the checkpoint file.

int FTI_RenameGroup ( FTIT_H5Group *  h5group,
char *  name
)

Renames an HDF5 group.

Parameters
h5group  H5 group to rename.
name     New name of the H5 group.
Returns
integer FTI_SCES if successful.

This function renames an HDF5 group defined by the user.

int FTI_SendFile ( char *  lpath,
char *  rpath 
)

Copies file asynchronously from 'lpath' to 'rpath'.

Parameters
lpath  Absolute path of the local file.
rpath  Absolute path of the remote file.
Returns
integer Request handle (ID) on success, FTI_NSCS otherwise.

This function may be used to copy a node-local file asynchronously to the parallel file system (PFS) via the FTI head process. The file is not removed after a successful transfer; however, if it is stored in the directory returned by 'FTI_GetStageDir', it is removed during 'FTI_Finalize'.

If staging is enabled but there is no head process, staging is performed synchronously (i.e. by the calling rank).

int FTI_Snapshot ( )

Takes an FTI snapshot or recovers the data if it is a restart.

Returns
integer FTI_SCES if successful.

This function loads the checkpoint data from the checkpoint file in case of a restart. Otherwise, it checks whether the current iteration requires checkpointing. If it does, the function determines the checkpoint level, writes the data to the files, and informs the head of the node that a checkpoint has been taken. The checkpoint ID and counters are then updated.
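
The decision flow (recover on restart, otherwise checkpoint only when one is due) can be sketched as a pure function. The policy used here is an assumption chosen for illustration: each of four levels has an interval in iterations, and the highest level whose interval divides the current iteration wins. The function snapshot_decide and its return codes are hypothetical, and this reference does not specify how FTI actually schedules checkpoints.

```c
#include <assert.h>

#define DECIDE_RECOVER (-1)   /* hypothetical: restart detected, load data */
#define DECIDE_NONE    0      /* hypothetical: nothing to do this iteration */

/* Decide what a snapshot call should do.
 * interval[k] is the assumed checkpoint period of level k+1 in iterations;
 * a value of 0 disables that level. Returns DECIDE_RECOVER, DECIDE_NONE,
 * or the checkpoint level (1..4) that is due. */
static int snapshot_decide(int is_restart, long iter, const long interval[4])
{
    int level;

    assert(iter >= 0);
    if (is_restart) {
        return DECIDE_RECOVER;
    }
    for (level = 4; level >= 1; level--) {
        long every = interval[level - 1];
        if (every > 0 && iter > 0 && iter % every == 0) {
            return level;               /* highest due level takes priority */
        }
    }
    return DECIDE_NONE;
}
```

Keeping the decision separate from the I/O makes the policy easy to test on its own; the caller would perform the recovery or write the checkpoint based on the returned code.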

int FTI_Status ( )

It returns the current status of the recovery flag.

Returns
integer FTI_Exec.reco.

This function returns the current status of the recovery flag.

Variable Documentation

FTIT_type FTI_CHAR

FTI data type for chars.

MPI_Comm FTI_COMM_WORLD

MPI communicator that splits the global communicator into separate application and FTI parts.

FTIT_type FTI_DBLE

FTI data type for double-precision floating point.

FTIT_type FTI_INTG

FTI data type for integers.

FTIT_type FTI_LDBE

FTI data type for long double floating point.

FTIT_type FTI_LONG

FTI data type for long integers.

FTIT_type FTI_SFLT

FTI data type for single-precision floating point.

FTIT_type FTI_SHRT

FTI data type for short integers.

FTIT_type FTI_UCHR

FTI data type for unsigned chars.

FTIT_type FTI_UINT

FTI data type for unsigned integers.

FTIT_type FTI_ULNG

FTI data type for unsigned long integers.

FTIT_type FTI_USHT

FTI data type for unsigned short integers.