Fault Tolerance Interface
checkpoint.c File Reference

Checkpointing functions for the FTI library.

#include <string.h>
#include "interface.h"

Macros

#define _POSIX_C_SOURCE   200809L
 

Functions

int FTI_UpdateIterTime (FTIT_execution *FTI_Exec)
 It updates the local and global mean iteration time.
 
int FTI_WriteCkpt (FTIT_configuration *FTI_Conf, FTIT_execution *FTI_Exec, FTIT_topology *FTI_Topo, FTIT_checkpoint *FTI_Ckpt, FTIT_dataset *FTI_Data)
 It writes the checkpoint data to the target file.
 
int FTI_PostCkpt (FTIT_configuration *FTI_Conf, FTIT_execution *FTI_Exec, FTIT_topology *FTI_Topo, FTIT_checkpoint *FTI_Ckpt)
 Decides which action to start depending on the checkpoint level.
 
int FTI_Listen (FTIT_configuration *FTI_Conf, FTIT_execution *FTI_Exec, FTIT_topology *FTI_Topo, FTIT_checkpoint *FTI_Ckpt)
 It listens for checkpoint notifications.
 
int FTI_HandleCkptRequest (FTIT_configuration *FTI_Conf, FTIT_execution *FTI_Exec, FTIT_topology *FTI_Topo, FTIT_checkpoint *FTI_Ckpt)
 Handles checkpoint requests from application ranks (if head).
 
int FTI_WritePosix (FTIT_configuration *FTI_Conf, FTIT_execution *FTI_Exec, FTIT_topology *FTI_Topo, FTIT_checkpoint *FTI_Ckpt, FTIT_dataset *FTI_Data)
 Writes the checkpoint to the PFS using POSIX.
 
int FTI_WriteMPI (FTIT_configuration *FTI_Conf, FTIT_execution *FTI_Exec, FTIT_topology *FTI_Topo, FTIT_dataset *FTI_Data)
 Writes the checkpoint to the PFS using MPI I/O.
 
int FTI_WriteSionlib (FTIT_configuration *FTI_Conf, FTIT_execution *FTI_Exec, FTIT_topology *FTI_Topo, FTIT_dataset *FTI_Data)
 Writes the checkpoint to the PFS using SIONlib.
 

Detailed Description

Checkpointing functions for the FTI library.

Copyright (c) 2017 Leonardo A. Bautista-Gomez All rights reserved

FTI - A multi-level checkpointing library for C/C++/Fortran applications

Revision 1.0 : Fault Tolerance Interface (FTI)

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Date
October, 2017

Macro Definition Documentation

#define _POSIX_C_SOURCE   200809L

Function Documentation

int FTI_HandleCkptRequest ( FTIT_configuration * FTI_Conf,
FTIT_execution * FTI_Exec,
FTIT_topology * FTI_Topo,
FTIT_checkpoint * FTI_Ckpt
)

Handles checkpoint requests from application ranks (if head).

Parameters
FTI_Conf    Configuration metadata.
FTI_Exec    Execution metadata.
FTI_Topo    Topology metadata.
FTI_Ckpt    Checkpoint metadata.
Returns
integer FTI_SCES if successful.

int FTI_Listen ( FTIT_configuration * FTI_Conf,
FTIT_execution * FTI_Exec,
FTIT_topology * FTI_Topo,
FTIT_checkpoint * FTI_Ckpt
)

It listens for checkpoint notifications.

Parameters
FTI_Conf    Configuration metadata.
FTI_Exec    Execution metadata.
FTI_Topo    Topology metadata.
FTI_Ckpt    Checkpoint metadata.
Returns
integer FTI_SCES if successful.

This function listens for notifications from the application processes and takes the required actions after each notification. It is executed only by the head of each node, and it complements the FTI_Checkpoint function in terms of communication.

int FTI_PostCkpt ( FTIT_configuration * FTI_Conf,
FTIT_execution * FTI_Exec,
FTIT_topology * FTI_Topo,
FTIT_checkpoint * FTI_Ckpt
)

Decides which action to start depending on the checkpoint level.

Parameters
FTI_Conf    Configuration metadata.
FTI_Exec    Execution metadata.
FTI_Topo    Topology metadata.
FTI_Ckpt    Checkpoint metadata.
Returns
integer FTI_SCES if successful.

This function launches the required action depending on the checkpoint level. It does so for each group (application process in the node) if executed by the head, or only locally if executed by an application process. The parameter pr determines whether the for loops run for a single iteration or for one iteration per application process. The group parameter helps determine the groupID in both cases.
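
The loop structure described above could look roughly like the following sketch. It is an illustration only: run_post_action, level_action, record_action and the group numbering (i + 1 on the head) are hypothetical and are not taken from the FTI sources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: one post-checkpoint action for one group. */
typedef int (*level_action)(int group_id);

/* Example action that only records how it was called. */
static int seen_calls = 0;
static int last_group = -1;

static int record_action(int group_id)
{
    seen_calls += 1;
    last_group = group_id;
    return 0;
}

/* Run the action once per application process when called by the head,
 * or exactly once when called by an application process. */
static int run_post_action(int is_head, int app_procs, int own_group,
                           level_action act)
{
    assert(app_procs >= 1 && act != NULL);
    int pr = is_head ? app_procs : 1;            /* loop bound, as in the text */
    for (int i = 0; i < pr; i++) {
        /* Assumed numbering: the head walks groups 1..pr, an application
         * process uses its own group ID. */
        int group = is_head ? i + 1 : own_group;
        if (act(group) != 0) {
            return -1;                           /* stop at the first failure */
        }
    }
    return 0;
}
```

With four application processes per node, a call with is_head set runs the action four times, while a call from an application process runs it once with that process's own group.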

int FTI_UpdateIterTime ( FTIT_execution * FTI_Exec )

It updates the local and global mean iteration time.

Parameters
FTI_Exec    Execution metadata.
Returns
integer FTI_SCES if successful.

This function updates the local and global mean iteration time. It also recomputes the checkpoint interval in iterations and corrects the next checkpointing iteration based on the observed mean iteration duration.
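
A minimal sketch of this kind of bookkeeping is shown below. The struct iter_state, its fields, and the two helper functions are hypothetical and do not mirror FTIT_execution. The global mean is passed in as a plain argument here; in an MPI program it would typically come from a reduction across ranks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical state for the sketch; not the FTIT_execution layout. */
typedef struct {
    double mean_iter_sec;   /* running mean of the local iteration duration */
    long   iter_count;      /* number of iterations measured so far */
    long   ckpt_interval;   /* checkpoint interval, in iterations */
    long   next_ckpt_iter;  /* iteration at which the next checkpoint is due */
} iter_state;

/* Fold one measured iteration duration into the local running mean. */
static void update_mean(iter_state *s, double last_iter_sec)
{
    assert(s != NULL && last_iter_sec >= 0.0);
    s->iter_count += 1;
    s->mean_iter_sec += (last_iter_sec - s->mean_iter_sec) / (double)s->iter_count;
}

/* Convert a wall-clock checkpoint interval into a number of iterations
 * (rounded down, at least one) and reschedule the next checkpoint. */
static void reschedule(iter_state *s, double global_mean_sec,
                       double interval_sec, long current_iter)
{
    assert(s != NULL && global_mean_sec > 0.0 && interval_sec > 0.0);
    long iters = (long)(interval_sec / global_mean_sec);
    if (iters < 1) {
        iters = 1;
    }
    s->ckpt_interval = iters;
    s->next_ckpt_iter = current_iter + iters;
}
```

For example, two iterations measured at 2 s and 4 s give a mean of 3 s. With a 60 s checkpoint interval this corresponds to 20 iterations, so a reschedule at iteration 10 places the next checkpoint at iteration 30.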

int FTI_WriteCkpt ( FTIT_configuration * FTI_Conf,
FTIT_execution * FTI_Exec,
FTIT_topology * FTI_Topo,
FTIT_checkpoint * FTI_Ckpt,
FTIT_dataset * FTI_Data
)

It writes the checkpoint data to the target file.

Parameters
FTI_Conf    Configuration metadata.
FTI_Exec    Execution metadata.
FTI_Topo    Topology metadata.
FTI_Ckpt    Checkpoint metadata.
FTI_Data    Dataset metadata.
Returns
integer FTI_SCES if successful.

This function checks whether the checkpoint needs to be local or remote, opens the target file, and writes the checkpoint data dataset by dataset. Finally, it flushes and closes the checkpoint file.
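
The open, write, flush, and close sequence can be sketched with standard C stream I/O as follows. The type sketch_dataset and the function write_ckpt are hypothetical stand-ins, not FTI types or functions, and the local-versus-remote decision is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for one protected dataset; not FTIT_dataset. */
typedef struct {
    const void *ptr;    /* start of the data in memory */
    size_t      size;   /* size of the data in bytes */
} sketch_dataset;

/* Write all datasets back to back into one file.
 * Returns 0 on success and -1 on any I/O failure. */
static int write_ckpt(const char *path, const sketch_dataset *data, size_t n)
{
    assert(path != NULL && (data != NULL || n == 0));
    FILE *fd = fopen(path, "wb");
    if (fd == NULL) {
        return -1;
    }
    for (size_t i = 0; i < n; i++) {
        /* fwrite returns the number of items written; a short count is an error. */
        if (data[i].size > 0 &&
            fwrite(data[i].ptr, 1, data[i].size, fd) != data[i].size) {
            fclose(fd);
            return -1;
        }
    }
    if (fflush(fd) != 0) {      /* push buffered data out before closing */
        fclose(fd);
        return -1;
    }
    return fclose(fd) == 0 ? 0 : -1;
}
```

Checking the result of every call matters here, because a checkpoint that was only partially written is of no use at recovery time.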

int FTI_WriteMPI ( FTIT_configuration * FTI_Conf,
FTIT_execution * FTI_Exec,
FTIT_topology * FTI_Topo,
FTIT_dataset * FTI_Data
)

Writes the checkpoint to the PFS using MPI I/O.

Parameters
FTI_Conf    Configuration metadata.
FTI_Exec    Execution metadata.
FTI_Topo    Topology metadata.
FTI_Data    Dataset metadata.
Returns
integer FTI_SCES if successful.

This function takes into account that in MPI I/O the count parameter of both MPI_Type_contiguous and MPI_File_write_at is an integer type. The checkpoint data is therefore split into chunks of at most (MAX_INT-1)/2 elements to form contiguous data types. Sizes greater than that were observed to cause problems.
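
The chunking arithmetic can be illustrated without MPI. The sketch below assumes that MAX_INT refers to INT_MAX from <limits.h>; CHUNK_MAX, chunk_count and chunk_len are hypothetical names, not FTI symbols.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Largest element count per chunk, following the bound quoted above
 * (assuming MAX_INT means INT_MAX). */
#define CHUNK_MAX ((size_t)((INT_MAX - 1) / 2))

/* Number of chunks needed to cover `total` elements. */
static size_t chunk_count(size_t total)
{
    return total / CHUNK_MAX + (total % CHUNK_MAX != 0 ? 1 : 0);
}

/* Element count of chunk `idx` (0-based). The result always fits in an
 * int, so it can be passed as an integer count argument. */
static int chunk_len(size_t total, size_t idx)
{
    assert(idx < chunk_count(total));
    size_t left = total - idx * CHUNK_MAX;   /* elements not yet covered */
    return (int)(left < CHUNK_MAX ? left : CHUNK_MAX);
}
```

Every chunk except possibly the last holds exactly CHUNK_MAX elements, and the last one holds the remainder. A writer would loop over idx from 0 to chunk_count(total) - 1 and advance its file offset by the length of each chunk.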

int FTI_WritePosix ( FTIT_configuration * FTI_Conf,
FTIT_execution * FTI_Exec,
FTIT_topology * FTI_Topo,
FTIT_checkpoint * FTI_Ckpt,
FTIT_dataset * FTI_Data
)

Writes the checkpoint to the PFS using POSIX.

Parameters
FTI_Conf    Configuration metadata.
FTI_Exec    Execution metadata.
FTI_Topo    Topology metadata.
FTI_Ckpt    Checkpoint metadata.
FTI_Data    Dataset metadata.
Returns
integer FTI_SCES if successful.

int FTI_WriteSionlib ( FTIT_configuration * FTI_Conf,
FTIT_execution * FTI_Exec,
FTIT_topology * FTI_Topo,
FTIT_dataset * FTI_Data
)

Writes the checkpoint to the PFS using SIONlib.

Parameters
FTI_Conf    Configuration metadata.
FTI_Exec    Execution metadata.
FTI_Topo    Topology metadata.
FTI_Data    Dataset metadata.
Returns
integer FTI_SCES if successful.
