What for?
FTI writes the meta data related to the checkpoint files to separate text files. This meta data is needed to restart from the last checkpoint. FTI-FF includes the meta data in the checkpoint files themselves and thus reduces the number of files on the PFS. This can be beneficial for executions with several thousand processes.
It can also be useful for restarting from checkpoint files other than the last ones. The current implementation (V1.1) does not automate this, so the checkpoint files must be copied by hand to be available for an alternative restart.
The main motivation for the new file format, however, is to facilitate the future implementation of differential and incremental checkpointing. FTI-FF records within the file where every chunk of data is stored, which makes it possible to write changed data blocks to their corresponding positions.
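The idea of writing a changed block to its known position can be illustrated with standard C file I/O. This is only a sketch of the general technique and not FTI's implementation: update_block and demo_update are hypothetical helpers, and the 16-byte "checkpoint" is made-up demo data.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Overwrite `len` bytes at byte position `offset` of an open file.
 * Returns 0 on success, -1 on failure. Hypothetical helper, not FTI API. */
static int update_block(FILE *f, long offset, const void *data, size_t len)
{
    assert(f != NULL && data != NULL);
    if (fseek(f, offset, SEEK_SET) != 0) return -1;
    if (fwrite(data, 1, len, f) != len) return -1;
    return fflush(f) == 0 ? 0 : -1;
}

/* Demo: write a 16-byte file, then patch only bytes 4..7 in place.
 * The resulting file content is copied to `out` as a C string. */
static int demo_update(char out[17])
{
    FILE *f = tmpfile();
    if (f == NULL) return -1;
    if (fwrite("AAAABBBBCCCCDDDD", 1, 16, f) != 16) { fclose(f); return -1; }
    if (update_block(f, 4, "bbbb", 4) != 0) { fclose(f); return -1; }
    rewind(f);
    size_t n = fread(out, 1, 16, f);
    out[n] = '\0';
    fclose(f);
    return n == 16 ? 0 : -1;
}
```

Only the four changed bytes are rewritten; the rest of the file is left untouched, which is the property a differential checkpoint relies on.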
Structure
The basic structure of the file format consists of a meta block (FB) and a data block (VB):
+--------------+ +------------------------+
|              | |                        |
|      FB      | |           VB           |
|              | |                        |
+--------------+ +------------------------+
The FB (file block) holds meta data related to the file, whereas the VB (variable block) holds the meta data and the actual data of the variables protected by FTI.
The FB has the following structure (FTIFF_metaInfo):
The VB contains the substructures VCB_i (variable chunk blocks). Each VCB_i consists of the variable chunks (VC_ij) stored in it and the corresponding variable chunk meta data (VMB_i):
|<-------------------------------------------------- VB -------------------------------------------------->|
|<------------- VCB_1 --------------------------->|      |<------------- VCB_n --------------------------->|
+-------------------------------------------------+      +-------------------------------------------------+
| +--------++-------------+      +--------------+ |      | +--------++-------------+      +--------------+ |
| |        ||             |      |              | |      | |        ||             |      |              | |
| | VMB_1  ||    VC_11    | ---- |    VC_1k     | | ---- | | VMB_n  ||    VC_n1    | ---- |    VC_nl     | |
| |        ||             |      |              | |      | |        ||             |      |              | |
| +--------++-------------+      +--------------+ |      | +--------++-------------+      +--------------+ |
+-------------------------------------------------+      +-------------------------------------------------+
The number of data chunks per block (k and l in the sketch) generally differs from block to block. The protected variable to which a chunk VC_ij belongs is recorded in the corresponding data structure (FTIFF_dbvar), which is part of the VMB_i.
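Because each chunk's meta data names the variable it belongs to, the data of one variable can be found by scanning the meta data of all blocks. The following is a minimal sketch under stated assumptions: DemoVarMeta and DemoBlock are hypothetical stand-ins for the per-chunk and per-block meta data, and their field names are chosen for illustration only; they are not the definitions of FTIFF_dbvar or FTIFF_db.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the meta data of one chunk (VMD_ij). */
typedef struct {
    int  id;        /* id of the protected variable the chunk belongs to */
    long chunksize; /* bytes of that variable stored in this chunk */
} DemoVarMeta;

/* Hypothetical stand-in for one variable chunk block (VCB_i). */
typedef struct {
    int                numvars; /* number of chunks in this block */
    const DemoVarMeta *vars;    /* meta data of those chunks */
} DemoBlock;

/* Sum the bytes stored for variable `id` over all blocks. */
static long stored_bytes(const DemoBlock *blocks, size_t nblocks, int id)
{
    assert(blocks != NULL || nblocks == 0);
    long total = 0;
    for (size_t i = 0; i < nblocks; i++)
        for (int j = 0; j < blocks[i].numvars; j++)
            if (blocks[i].vars[j].id == id)
                total += blocks[i].vars[j].chunksize;
    return total;
}
```

A variable whose data is spread over several blocks is thus reassembled by visiting every block and picking the chunks that carry its id.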
The VMB_i have the following substructure:
|<-------------- VMB_i ------------->|
+-------++---------+      +----------+
| BMD_i || VMD_i1  | ---- |  VMD_ij  |
+-------++---------+      +----------+
The BMD_i (block meta data) keeps information related to the variable chunk block and has the following structure (FTIFF_db):
The VMD_ij have the following structure (FTIFF_dbvar):
Example
The size of a container is fixed as soon as an additional container has been created. An additional container is created if the size of a protected variable increases between two invocations of FTI_Checkpoint(), or if an additional variable is protected by a call to FTI_Protect() between invocations.
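The rule can be sketched for a single protected variable as follows. The sketch assumes that a new container holds exactly the bytes that no longer fit into the existing containers, that existing containers are always treated as fixed, and that containers are kept when the variable shrinks. These assumptions are inferred from the container sizes in the example outputs on this page, not taken from FTI's code; DemoVar, capacity and on_checkpoint are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CONTAINERS 16

/* Hypothetical bookkeeping for one protected variable; not an FTI type. */
typedef struct {
    long   containersize[MAX_CONTAINERS]; /* never changed once set */
    size_t ncontainers;
} DemoVar;

/* Total capacity in bytes of all containers created so far. */
static long capacity(const DemoVar *v)
{
    assert(v != NULL);
    long sum = 0;
    for (size_t i = 0; i < v->ncontainers; i++)
        sum += v->containersize[i];
    return sum;
}

/* Called at each checkpoint with the variable's current size in bytes.
 * Returns the size of the container created, or 0 if none was created
 * (the data still fits, or the demo's fixed-size table is full). */
static long on_checkpoint(DemoVar *v, long bytes)
{
    long missing = bytes - capacity(v);
    if (missing <= 0 || v->ncontainers == MAX_CONTAINERS)
        return 0;
    v->containersize[v->ncontainers++] = missing;
    return missing;
}
```

Feeding the sketch the sizes of the second variable from the example (2000000, 6000000, 5000000, 8000000 and 1000000 elements, assuming 4 bytes per element) yields containers of 8000000, 16000000 and 8000000 bytes, and no new container for the two shrinking steps.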
The following example shows the status of the data structures immediately before each invocation of FTI_Checkpoint(). To obtain the correct values for fptr, one has to take into account the size of the FB part at the beginning of the file. The base offset is therefore sizeof(FTIFF_metaInfo), which in this example equals 96 bytes.
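The offset arithmetic can be sketched as follows, assuming the parts of a block are stored back to back, without padding, in the order shown in the diagrams above: block meta data, then one meta data entry per chunk, then the chunks. block_offsets is a hypothetical helper, and the meta data sizes are passed in as parameters because their actual values are not given here; only the base offset of 96 bytes comes from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the file offsets of the chunks of one variable chunk block.
 * base      : offset at which the block starts in the file
 * bmd_size  : size of the block meta data (BMD_i), in bytes
 * vmd_size  : size of one chunk meta data entry (VMD_ij), in bytes
 * chunksize : sizes of the k chunks, in file order
 * fptr      : output array of length k, offset of each chunk
 * Returns the offset just past the block, where the next block would start. */
static long block_offsets(long base, long bmd_size, long vmd_size,
                          const long *chunksize, size_t k, long *fptr)
{
    assert(chunksize != NULL && fptr != NULL);
    long pos = base + bmd_size + (long)k * vmd_size; /* skip the meta data */
    for (size_t j = 0; j < k; j++) {
        fptr[j] = pos;
        pos += chunksize[j];
    }
    return pos;
}
```

For the first block, base would be the 96 bytes of the FB part; for every later block it is the value returned for the block before it.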
Checkpoint 1
size1=1000000;
size2=2000000;
size3=3000000;
Output:
------------------- DATASTRUCTURE BEGIN -------------------
containersize: 4000000
containersize: 8000000
containersize: 12000000
------------------- DATASTRUCTURE END ---------------------
Checkpoint 2
Output:
------------------- DATASTRUCTURE BEGIN -------------------
containersize: 4000000
containersize: 8000000
containersize: 12000000
containersize: 16000000
------------------- DATASTRUCTURE END ---------------------
Checkpoint 3
size2=6000000;
size3=7000000;
Output:
------------------- DATASTRUCTURE BEGIN -------------------
containersize: 4000000
containersize: 8000000
containersize: 12000000
containersize: 16000000
containersize: 16000000
containersize: 16000000
------------------- DATASTRUCTURE END ---------------------
Checkpoint 4
Output:
------------------- DATASTRUCTURE BEGIN -------------------
containersize: 4000000
containersize: 8000000
containersize: 12000000
containersize: 16000000
containersize: 16000000
containersize: 16000000
containersize: 20000000
------------------- DATASTRUCTURE END ---------------------
Checkpoint 5
size2=5000000;
size3=6000000;
Output:
------------------- DATASTRUCTURE BEGIN -------------------
containersize: 4000000
containersize: 8000000
containersize: 12000000
containersize: 16000000
containersize: 16000000
containersize: 16000000
containersize: 20000000
------------------- DATASTRUCTURE END ---------------------
Checkpoint 6
size2=8000000;
size3=9000000;
Output:
------------------- DATASTRUCTURE BEGIN -------------------
containersize: 4000000
containersize: 8000000
containersize: 12000000
containersize: 16000000
containersize: 16000000
containersize: 16000000
containersize: 20000000
containersize: 8000000
containersize: 8000000
------------------- DATASTRUCTURE END ---------------------
Checkpoint 7
size2 = 1000000;
size3 = 2000000;
Output:
------------------- DATASTRUCTURE BEGIN -------------------
containersize: 4000000
containersize: 8000000
containersize: 12000000
containersize: 16000000
containersize: 16000000
containersize: 16000000
containersize: 20000000
containersize: 8000000
containersize: 8000000
------------------- DATASTRUCTURE END ---------------------