Fault Tolerance Interface
API functions for the FTI library.
#include "interface.h"
Functions
int | FTI_Init (char *configFile, MPI_Comm globalComm) |
Initializes FTI.
int | FTI_Status () |
It returns the current status of the recovery flag.
int | FTI_InitType (FTIT_type *type, int size) |
It initializes a data type.
int | FTI_InitComplexType (FTIT_type *newType, FTIT_complexType *typeDefinition, int length, size_t size, char *name, FTIT_H5Group *h5group) |
It initializes a complex data type.
void | FTI_AddSimpleField (FTIT_complexType *typeDefinition, FTIT_type *ftiType, size_t offset, int id, char *name) |
It adds a simple field to a complex data type.
void | FTI_AddComplexField (FTIT_complexType *typeDefinition, FTIT_type *ftiType, size_t offset, int rank, int *dimLength, int id, char *name) |
It adds a complex field to a complex data type.
int | FTI_GetStageDir (char *stageDir, int maxLen) |
Places the FTI staging directory path into 'stageDir'.
int | FTI_GetStageStatus (int ID) |
Returns status of staging request.
int | FTI_SendFile (char *lpath, char *rpath) |
Copies file asynchronously from 'lpath' to 'rpath'.
int | FTI_InitGroup (FTIT_H5Group *h5group, char *name, FTIT_H5Group *parent) |
It initializes an HDF5 group.
int | FTI_RenameGroup (FTIT_H5Group *h5group, char *name) |
Renames an HDF5 group.
int | FTI_Protect (int id, void *ptr, long count, FTIT_type type) |
It sets/resets the pointer and type to a protected variable.
int | FTI_DefineDataset (int id, int rank, int *dimLength, char *name, FTIT_H5Group *h5group) |
Defines the dataset.
long | FTI_GetStoredSize (int id) |
Returns size saved in metadata of variable.
void * | FTI_Realloc (int id, void *ptr) |
Reallocates dataset to last checkpoint size.
int | FTI_FloatBitFlip (float *target, int bit) |
It corrupts a bit of the given float.
int | FTI_DoubleBitFlip (double *target, int bit) |
It corrupts a bit of the given double.
int | FTI_BitFlip (int datasetID) |
Bit-flip injection following the injection instructions.
int | FTI_Checkpoint (int id, int level) |
It takes the checkpoint and triggers the post-checkpoint work.
int | FTI_Recover () |
It loads the checkpoint data.
int | FTI_Snapshot () |
Takes an FTI snapshot or recovers the data if it is a restart.
int | FTI_Finalize () |
It closes FTI properly on the application processes.
int | FTI_RecoverVar (int id) |
During the restart, recovers the given variable.
void | FTI_Print (char *msg, int priority) |
Prints FTI messages.
Copyright (c) 2017 Leonardo A. Bautista-Gomez All rights reserved
FTI - A multi-level checkpointing library for C/C++/Fortran applications
Revision 1.0 : Fault Tolerance Interface (FTI)
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
void FTI_AddComplexField | ( | FTIT_complexType * | typeDefinition, |
FTIT_type * | ftiType, | ||
size_t | offset, | ||
int | rank, | ||
int * | dimLength, | ||
int | id, | ||
char * | name | ||
) |
It adds a complex field (one with its own rank and dimension lengths) to a complex data type.
typeDefinition | Structure definition of the complex data type. |
ftiType | Type of the field. |
offset | Offset of the field (use offsetof). |
rank | Rank of the array. |
dimLength | Dimension length for each rank. |
id | ID of the field (starts at 0). |
name | Name of the field (pass NULL for the default). |
This function adds a field to the complex data type. Use the offsetof macro to compute the offset. The first field ID must be 0, and each subsequent ID must be one greater than the previous. If name is NULL, FTI sets the name to "T${id}".
void FTI_AddSimpleField | ( | FTIT_complexType * | typeDefinition, |
FTIT_type * | ftiType, | ||
size_t | offset, | ||
int | id, | ||
char * | name | ||
) |
It adds a simple field to a complex data type.
typeDefinition | Structure definition of the complex data type. |
ftiType | Type of the field. |
offset | Offset of the field (use offsetof). |
id | ID of the field (starts at 0). |
name | Name of the field (pass NULL for the default). |
This function adds a field to the complex data type. Use the offsetof macro to compute the offset. The first field ID must be 0, and each subsequent ID must be one greater than the previous. If name is NULL, FTI sets the name to "T${id}". The rank and dimLength of the field are set to 1.
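The offset and ID rules above can be illustrated without FTI itself. The following is a minimal sketch, assuming a hypothetical application struct `particle_t`; the names `field_desc_t`, `describe_particle`, and `default_field_name` are illustrative and not part of the FTI API. It computes the (id, offset) pairs one would pass to FTI_AddSimpleField and builds the default field name, assuming that `${id}` expands to the decimal field ID.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical application struct; not part of FTI. */
typedef struct {
    int    step;
    double mass;
    float  charge;
} particle_t;

/* Hypothetical record of the id/offset pair describing one field. */
typedef struct {
    int    id;
    size_t offset;
} field_desc_t;

/* Fills out[0..2] with consecutive IDs starting at 0 and
 * offsets computed with the offsetof macro. */
static void describe_particle(field_desc_t out[3])
{
    assert(out != NULL);
    out[0] = (field_desc_t){ 0, offsetof(particle_t, step)   };
    out[1] = (field_desc_t){ 1, offsetof(particle_t, mass)   };
    out[2] = (field_desc_t){ 2, offsetof(particle_t, charge) };
}

/* Writes "T<id>" into buf (assumed expansion of "T${id}").
 * Returns 0 on success, -1 if buf is too small. */
static int default_field_name(char *buf, size_t len, int id)
{
    assert(buf != NULL);
    int n = snprintf(buf, len, "T%d", id);
    return (n >= 0 && (size_t)n < len) ? 0 : -1;
}
```

Each pair produced by `describe_particle` corresponds to one FTI_AddSimpleField call, with name set to NULL when the default name is acceptable.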
int FTI_BitFlip | ( | int | datasetID | ) |
Bit-flip injection following the injection instructions.
datasetID | ID of the dataset where to inject. |
This function injects the given number of bit-flips, at the given frequency and in the given location (rank, dataset, bit position).
int FTI_Checkpoint | ( | int | id, |
int | level | ||
) |
It takes the checkpoint and triggers the post-checkpoint work.
id | Checkpoint ID. |
level | Checkpoint level. |
This function starts by blocking on a receive if the previous checkpoint was offline. Then it updates the checkpoint information, writes the checkpoint data, and creates the metadata and the post-processing work. In terms of communication, this function complements the FTI_Listen function.
int FTI_DefineDataset | ( | int | id, |
int | rank, | ||
int * | dimLength, | ||
char * | name, | ||
FTIT_H5Group * | h5group | ||
) |
Defines the dataset.
id | ID for searches and update. |
rank | Rank of the array |
dimLength | Dimension length for each rank. |
name | Name of the dataset in HDF5 file. |
h5group | Group of the dataset. If NULL, "/" is used. |
This function gives FTI all information needed by HDF5 to correctly save the dataset in the checkpoint file.
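As a rough illustration of how rank and dimLength describe an array shape, the sketch below computes the number of elements a shape implies. The helper `shape_element_count` is hypothetical and not part of the FTI API, and whether FTI checks this product against the count given to FTI_Protect is not stated here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of FTI: returns the number of elements
 * implied by a shape with 'rank' dimensions, where dimLength[i] is the
 * extent of dimension i. Returns -1 for an invalid shape. */
static long shape_element_count(int rank, const int *dimLength)
{
    if (rank <= 0 || dimLength == NULL) {
        return -1;
    }
    long total = 1;
    for (int i = 0; i < rank; i++) {
        if (dimLength[i] <= 0) {
            return -1;
        }
        total *= dimLength[i];   /* no overflow check in this sketch */
    }
    return total;
}
```

For example, a rank of 2 with dimLength {4, 3} describes a 4 x 3 array of 12 elements.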
int FTI_DoubleBitFlip | ( | double * | target, |
int | bit | ||
) |
It corrupts a bit of the given double.
target | Pointer to the double to corrupt. |
bit | Position of the bit to corrupt. |
This function flips the specified bit of the target double.
int FTI_Finalize | ( | ) |
It closes FTI properly on the application processes.
This function notifies the FTI processes that the execution is over, frees some data structures, and closes FTI. If this function is not called on the application processes, the FTI processes will never finish (deadlock).
int FTI_FloatBitFlip | ( | float * | target, |
int | bit | ||
) |
It corrupts a bit of the given float.
target | Pointer to the float to corrupt. |
bit | Position of the bit to corrupt. |
This function flips the specified bit of the target float.
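To make the effect of a single bit-flip concrete, here is a minimal sketch of toggling one bit in a float. It is not FTI's implementation: `flip_float_bit` is a hypothetical helper, the return values 0 and -1 are placeholders rather than FTI status codes, and the numbering of bit 0 as the least significant bit of the 32-bit pattern is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* This sketch assumes a 32-bit float, as on IEEE-754 platforms. */
static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Hypothetical helper, not FTI's code: flips one bit of *target.
 * Returns 0 on success, -1 on a NULL pointer or out-of-range bit. */
static int flip_float_bit(float *target, int bit)
{
    if (target == NULL || bit < 0 || bit >= 32) {
        return -1;
    }
    uint32_t raw;
    memcpy(&raw, target, sizeof raw);   /* copy out the bit pattern */
    raw ^= (uint32_t)1 << bit;          /* toggle the selected bit */
    memcpy(target, &raw, sizeof raw);   /* write the corrupted value back */
    return 0;
}
```

Under these assumptions, flipping bit 31 of 1.0f toggles the sign bit and yields -1.0f; flipping it again restores the original value.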
int FTI_GetStageDir | ( | char * | stageDir, |
int | maxLen | ||
) |
Places the FTI staging directory path into 'stageDir'.
stageDir | Pointer to an allocated memory region. |
maxLen | Size of the allocated memory region in bytes. |
This function places the FTI staging directory path in 'stageDir'. If the allocated size is not sufficient, no action is performed and FTI_NSCS is returned.
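The "no action if the buffer is too small" contract can be sketched as follows. This is an illustration only, not FTI's code: `copy_path_checked` is a hypothetical helper, -1 stands in for a failure code, and counting the terminating NUL against maxLen is an assumption.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not FTI's code: copies 'src' into 'dst' only if
 * it fits, terminating NUL included, in maxLen bytes. On failure the
 * destination is left untouched and -1 is returned. */
static int copy_path_checked(char *dst, int maxLen, const char *src)
{
    assert(src != NULL);
    size_t need = strlen(src) + 1;   /* bytes needed including the NUL */
    if (dst == NULL || maxLen < 0 || (size_t)maxLen < need) {
        return -1;                   /* dst is not modified */
    }
    memcpy(dst, src, need);
    return 0;
}
```

A caller would typically size the buffer generously, check the return value, and only then use the path.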
int FTI_GetStageStatus | ( | int | ID | ) |
Returns status of staging request.
ID | ID of staging request. |
This function returns the status of the staging request corresponding to ID. The ID is returned by the function 'FTI_SendFile'. The status may be one of the five possible statuses:
long FTI_GetStoredSize | ( | int | id | ) |
Returns size saved in metadata of variable.
id | Variable ID. |
This function returns the size of the variable with the given ID as saved in the metadata. This may differ from the size of the variable in the program. If called during recovery, it returns the size from the metadata file; if called after a checkpoint, it returns the size saved in the temporary metadata. If no size is saved in the metadata, it returns 0.
int FTI_Init | ( | char * | configFile, |
MPI_Comm | globalComm | ||
) |
Initializes FTI.
configFile | FTI configuration file. |
globalComm | Main MPI communicator of the application. |
This function initializes the FTI context and prepares the heads to wait for checkpoints. FTI processes should never get out of this function. In case of a restart, checkpoint files should be recovered and in place at the end of this function.
int FTI_InitComplexType | ( | FTIT_type * | newType, |
FTIT_complexType * | typeDefinition, | ||
int | length, | ||
size_t | size, | ||
char * | name, | ||
FTIT_H5Group * | h5group | ||
) |
It initializes a complex data type.
newType | The data type to be initialized. |
typeDefinition | Structure definition of the new type. |
length | Number of fields in structure |
size | Size of the structure. |
name | Name of the structure. |
h5group | Group of the type. |
This function initializes a complex data type. The new type can only consist of fields of flat FTI types (no arrays). The type definition must include:
int FTI_InitGroup | ( | FTIT_H5Group * | h5group, |
char * | name, | ||
FTIT_H5Group * | parent | ||
) |
It initializes an HDF5 group.
h5group | H5 group that we want to initialize. |
name | Name of the H5 group. |
parent | Parent H5 group. |
This function initializes a group defined by the user. If parent is NULL, the parent is set to the root group.
int FTI_InitType | ( | FTIT_type * | type, |
int | size | ||
) |
It initializes a data type.
type | The data type to be initialized. |
size | The size of the data type to be initialized. |
This function initializes a data type. The only information needed is the size of the data type; the rest is a black box to FTI. In the HDF5 format, such types are saved as byte arrays.
void FTI_Print | ( | char * | msg, |
int | priority | ||
) |
Prints FTI messages.
msg | Message to print. |
priority | Priority of the message to be printed. |
This function prints messages depending on their priority and the verbosity level set by the user. DEBUG messages are printed by all processes with their rank. INFO messages are printed by one process. ERROR messages are printed with errno.
int FTI_Protect | ( | int | id, |
void * | ptr, | ||
long | count, | ||
FTIT_type | type | ||
) |
It sets/resets the pointer and type to a protected variable.
id | ID for searches and update. |
ptr | Pointer to the data structure. |
count | Number of elements in the data structure. |
type | Type of elements in the data structure. |
This function stores a pointer to a data structure, its size, its ID, its number of elements and the type of the elements. This list of structures is the data that will be stored during a checkpoint and loaded during a recovery. It resets the pointer to a data structure, its size, its number of elements and the type of the elements if the dataset was already previously registered.
void* FTI_Realloc | ( | int | id, |
void * | ptr | ||
) |
Reallocates dataset to last checkpoint size.
id | Variable ID. |
ptr | Pointer to the variable. |
int FTI_Recover | ( | ) |
It loads the checkpoint data.
This function loads the checkpoint data from the checkpoint file and it updates some basic checkpoint information.
int FTI_RecoverVar | ( | int | id | ) |
During the restart, recovers the given variable.
id | Variable to recover |
During a restart process, this function recovers the variable specified by the given id. No effect during a regular execution. The variable must have already been protected, otherwise, FTI_NSCS is returned. Improvements to be done:
int FTI_RenameGroup | ( | FTIT_H5Group * | h5group, |
char * | name | ||
) |
Renames an HDF5 group.
h5group | H5 group that we want to rename. |
name | New name of the H5 group. |
This function renames an HDF5 group defined by the user.
int FTI_SendFile | ( | char * | lpath, |
char * | rpath | ||
) |
Copies file asynchronously from 'lpath' to 'rpath'.
lpath | Absolute path of the local file. |
rpath | Absolute path of the remote file. |
This function may be used to asynchronously copy a node-local file to the PFS via the FTI head process. The file is not removed after a successful transfer; however, if it is stored in the directory returned by 'FTI_GetStageDir', it is removed during 'FTI_Finalize'.
int FTI_Snapshot | ( | ) |
Takes an FTI snapshot or recovers the data if it is a restart.
This function loads the checkpoint data from the checkpoint file in case of a restart. Otherwise, it checks whether the current iteration requires checkpointing. If it does, the function determines the checkpoint level, writes the data to the files, and informs the head of the node that a checkpoint has been taken. The checkpoint ID and counters are updated.
int FTI_Status | ( | ) |
It returns the current status of the recovery flag.
This function returns the current status of the recovery flag.
MPI_Comm FTI_COMM_WORLD |
MPI communicator that splits the global communicator into application processes and FTI processes.