Fault Tolerance Interface
FTI File Format (FTI-FF)

What for?

FTI writes the meta data related to the checkpoint files into separate text files. This meta data is needed to restart from the last checkpoint. FTI-FF embeds the meta data in the checkpoint files themselves and thus reduces the number of files on the PFS. This can be beneficial for executions with several thousand processes.

FTI-FF can also be useful for restarting from checkpoint files other than the most recent ones. The current implementation (V1.1) does not do this automatically, so the checkpoint files must be copied by hand to make them available for an alternative restart.

The main motivation for the new file format, however, is to facilitate the future implementation of differential and incremental checkpointing. FTI-FF records within the file where every chunk of data is stored, which makes it possible to write changed data blocks to their corresponding positions.

Structure

The basic structure of the file format consists of a meta block (FB) and a data block (VB):

+--------------+ +------------------------+
|              | |                        |
|      FB      | |           VB           |
|              | |                        |
+--------------+ +------------------------+

The FB (file block) holds meta data related to the file whereas the VB (variable block) holds meta and actual data of the variables protected by FTI.

The FB has the following structure (FTIFF_metaInfo):

typedef struct FTIFF_metaInfo {
    char checksum[MD5_DIGEST_STRING_LENGTH]; // Hash of the VB block in hex representation (33 bytes)
    unsigned char myHash[MD5_DIGEST_LENGTH]; // Hash of FB without 'myHash' in unsigned char (16 bytes)
    long ckptSize;                           // Size of actual data stored in file
    long fs;                                 // Size of FB + VB
    long maxFs;                              // Maximum size of FB + VB in group
    long ptFs;                               // Size of FB + VB of partner process
    long timestamp;                          // Time in ns of FB block creation
} FTIFF_metaInfo;

The VB contains the sub structures VCB_i (variable chunk blocks), which consist of the variable chunks (VC_ij) stored in the current VCB_i and the corresponding variable chunk meta data (VMB_i):

|<-------------------------------------------------- VB -------------------------------------------------->|
#                                                                                                          #
|<------------- VCB_1 --------------------------->|      |<------------- VCB_n --------------------------->|
#                                                 #      #                                                 #
+-------------------------------------------------+      +-------------------------------------------------+
| +--------++-------------+      +--------------+ |      | +--------++-------------+      +--------------+ |
| |        ||             |      |              | |      | |        ||             |      |              | |
| | VMB_1  ||    VC_11    | ---- |    VC_1k     | | ---- | | VMB_n  ||    VC_n1    | ---- |    VC_nl     | |
| |        ||             |      |              | |      | |        ||             |      |              | |
| +--------++-------------+      +--------------+ |      | +--------++-------------+      +--------------+ |
+-------------------------------------------------+      +-------------------------------------------------+

The number of data chunks (e.g. k and l in the sketch) generally differs from block to block. The protected variable to which the chunk VC_ij belongs is recorded in the corresponding data structure (FTIFF_dbvar), which is part of the VMB_i.

The VMB_i have the following sub structure:

|<-------------- VMB_i ------------->|
#                                    #
+-------++---------+      +----------+
|       ||         |      |          |
| BMD_i || VMD_i1  | ---- |  VMD_ij  |
|       ||         |      |          |
+-------++---------+      +----------+

The BMD_i (block meta data) keeps information related to the variable chunk block and has the following structure (FTIFF_db):

typedef struct FTIFF_db {
    int numvars;               // Number of variable chunks in data block
    long dbsize;               // Size of entire block VCB_i (meta + actual data)
    FTIFF_dbvar *dbvars;       // pointer to related VMD_ij array
    struct FTIFF_db *previous; // link to BMD_(i-1) or NULL if first block (FTI_Exec->firstdb)
    struct FTIFF_db *next;     // link to BMD_(i+1) or NULL if last block (FTI_Exec->lastdb)
} FTIFF_db;
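
Since the BMD_i are chained through previous and next, the blocks can be visited in file order by following next from the first block. The following is a minimal sketch of such a traversal, not FTI's own code: db_node is a simplified stand-in for FTIFF_db (without the dbvars pointer), and the helper names db_append and vb_total_size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for FTIFF_db (assumption: dbvars omitted for brevity). */
typedef struct db_node {
    int numvars;              /* number of variable chunks in the block */
    long dbsize;              /* size of the whole block, meta + actual data */
    struct db_node *previous; /* NULL for the first block */
    struct db_node *next;     /* NULL for the last block */
} db_node;

/* Append a block at the end of the list and keep first/last up to date. */
void db_append(db_node **first, db_node **last, db_node *node) {
    node->previous = *last;
    node->next = NULL;
    if (*last != NULL) {
        (*last)->next = node;
    } else {
        *first = node;
    }
    *last = node;
}

/* Walk the list from the first block and add up the block sizes.
   The result is the size of the VB part of the file. */
long vb_total_size(const db_node *first) {
    long total = 0;
    for (const db_node *db = first; db != NULL; db = db->next) {
        assert(db->dbsize >= 0);
        total += db->dbsize;
    }
    return total;
}
```

With two blocks of dbsize 24000204 and 16000076, vb_total_size returns 40000280.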

The VMD_ij have the following structure (FTIFF_dbvar):

typedef struct FTIFF_dbvar {
    int id;                                // Id of protected variable the data chunk belongs to
    int idx;                               // Index of corresponding element in FTI_Data
    int containerid;                       // Id of this container
    bool hascontent;                       // Boolean value indicating if container holds data or not
    long dptr;                             // offset of chunk in runtime-data (i.e. virtual address ptr = FTI_Data[idx].ptr + dptr)
    long fptr;                             // offset of chunk in file
    long chunksize;                        // Size of chunk stored in container
    long containersize;                    // Total size of container
    unsigned char hash[MD5_DIGEST_LENGTH]; // Hash of variable chunk 'VC_ij'
} FTIFF_dbvar;
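
The fields dptr, fptr and chunksize together describe where a chunk lives in memory and in the file. As an illustration only, the sketch below copies one chunk from an in-memory image of the checkpoint file back into the variable's buffer. The type chunk_meta and the function restore_chunk are hypothetical names. A real restart would read from the file (for instance with fseek and fread) instead of using a memory image.

```c
#include <stdbool.h>
#include <string.h>

/* Subset of the FTIFF_dbvar fields needed to locate one chunk (assumption). */
typedef struct chunk_meta {
    long dptr;       /* offset of the chunk inside the protected variable */
    long fptr;       /* offset of the chunk inside the file */
    long chunksize;  /* number of bytes stored in the container */
    bool hascontent; /* false if the container currently holds no data */
} chunk_meta;

/* Copy one chunk from the file image to its place in the variable.
   var_base plays the role of FTI_Data[idx].ptr. */
void restore_chunk(unsigned char *var_base,
                   const unsigned char *file_image,
                   const chunk_meta *m) {
    if (!m->hascontent || m->chunksize <= 0) {
        return; /* empty container: nothing to restore */
    }
    memcpy(var_base + m->dptr, file_image + m->fptr, (size_t)m->chunksize);
}
```

Containers with hascontent set to false are skipped, so their stale bytes in the file never reach the variable.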

Example

The size of a container is fixed once the container has been created. An additional container is created if the size of a protected variable increases between two invocations of FTI_Checkpoint(), or if an additional variable is protected by a call to FTI_Protect() between invocations.
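
How a variable is spread over its fixed-size containers can be sketched as follows. This is an assumption inferred from the chunksize and containersize values in the outputs of this example, not a copy of FTI's implementation, and the function name distribute_chunks is hypothetical. The existing containers are filled in order, and whatever does not fit is the size of the additional container that has to be created.

```c
#include <assert.h>

/* Distribute varsize bytes over n existing containers, in order.
   chunksize[i] receives the number of bytes stored in container i.
   Returns the number of bytes left over, i.e. the size of the new
   container that would be needed (0 if everything fits). */
long distribute_chunks(long varsize, const long *containersize,
                       long *chunksize, int n) {
    assert(varsize >= 0 && n >= 0);
    long remaining = varsize;
    for (int i = 0; i < n; i++) {
        long take = remaining < containersize[i] ? remaining : containersize[i];
        chunksize[i] = take; /* 0 means the container holds no content */
        remaining -= take;
    }
    return remaining;
}
```

For the variable with id 2 in the example, containers of 8000000 and 16000000 bytes hold a 20000000 byte variable as chunks of 8000000 and 12000000 bytes. Growing the variable to 32000000 bytes leaves 8000000 bytes over, which become a third container.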

The following example shows the status of the data structures just before each invocation of FTI_Checkpoint(). To obtain the correct values for fptr, one has to take into account the size of the FB at the beginning of the file. The base offset is therefore sizeof(FTIFF_metaInfo), which equals 96 bytes in this example.

Checkpoint 1

size1=1000000;
size2=2000000;
size3=3000000;
FTI_Protect(1, arr1, size1, FTI_INTG);
FTI_Protect(2, arr2, size2, FTI_INTG);
FTI_Protect(3, arr3, size3, FTI_INTG);
FTI_Checkpoint(1,level);

Output:

1 ------------------- DATASTRUCTURE BEGIN -------------------
2 
3  DataBase-id: 0
4  numvars: 3
5  dbsize: 24000204
6  *dbvars: 0xe4b4b0
7  *previous: (nil)
8  *next: (nil)
9  [size metadata: 204]
10 
11  Var-id: 0
12  id: 1
13  idx: 0
14  containerid: 0
15  hascontent: true
16  dptr: 0
17  fptr: 300
18  chunksize: 4000000
19  containersize: 4000000
20  [hash not printed]
21 
22  Var-id: 1
23  id: 2
24  idx: 1
25  containerid: 0
26  hascontent: true
27  dptr: 0
28  fptr: 4000300
29  chunksize: 8000000
30  containersize: 8000000
31  [hash not printed]
32 
33  Var-id: 2
34  id: 3
35  idx: 2
36  containerid: 0
37  hascontent: true
38  dptr: 0
39  fptr: 12000300
40  chunksize: 12000000
41  containersize: 12000000
42  [hash not printed]
43 
44 
45 ------------------- DATASTRUCTURE END ---------------------
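
The fptr values in this output can be reproduced with a little arithmetic. The sketch below is illustrative only: the constants 12 and 64 are assumptions inferred from the "[size metadata: ...]" lines of this example (204 = 12 + 3 * 64 and 76 = 12 + 1 * 64), not values taken from the FTI source, and the function names are hypothetical. The FB size of 96 bytes is the base offset stated above.

```c
#include <assert.h>

#define FB_SIZE         96L /* base offset, sizeof(FTIFF_metaInfo) in the example */
#define DB_META_SIZE    12L /* assumed size of the stored BMD_i */
#define DBVAR_META_SIZE 64L /* assumed size of one stored VMD_ij */

/* Size of the meta data (VMB_i) in front of the chunks of one block. */
long block_meta_size(int numvars) {
    return DB_META_SIZE + DBVAR_META_SIZE * numvars;
}

/* Compute the file offset of every container in a block that starts at
   block_start. Returns the offset at which the next block would start. */
long assign_fptr(long block_start, const long *containersize,
                 long *fptr, int numvars) {
    assert(numvars >= 0);
    long offset = block_start + block_meta_size(numvars);
    for (int i = 0; i < numvars; i++) {
        fptr[i] = offset;           /* chunk i starts here */
        offset += containersize[i]; /* skip the full container */
    }
    return offset;
}
```

Starting at FB_SIZE with containers of 4000000, 8000000 and 12000000 bytes gives the offsets 300, 4000300 and 12000300, and the next block starts at 24000300.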

Checkpoint 2

size4=4000000;
FTI_Protect(4, arr4, size4, FTI_INTG);
FTI_Checkpoint(2,level);

Output:

1 ------------------- DATASTRUCTURE BEGIN -------------------
2 
3  DataBase-id: 0
4  numvars: 3
5  dbsize: 24000204
6  *dbvars: 0xe4b4b0
7  *previous: (nil)
8  *next: 0xe78f30
9  [size metadata: 204]
10 
11  Var-id: 0
12  id: 1
13  idx: 0
14  containerid: 0
15  hascontent: true
16  dptr: 0
17  fptr: 300
18  chunksize: 4000000
19  containersize: 4000000
20  [hash not printed]
21 
22  Var-id: 1
23  id: 2
24  idx: 1
25  containerid: 0
26  hascontent: true
27  dptr: 0
28  fptr: 4000300
29  chunksize: 8000000
30  containersize: 8000000
31  [hash not printed]
32 
33  Var-id: 2
34  id: 3
35  idx: 2
36  containerid: 0
37  hascontent: true
38  dptr: 0
39  fptr: 12000300
40  chunksize: 12000000
41  containersize: 12000000
42  [hash not printed]
43 
44  DataBase-id: 1
45  numvars: 1
46  dbsize: 16000076
47  *dbvars: 0xe77470
48  *previous: 0xe4fea0
49  *next: (nil)
50  [size metadata: 76]
51 
52  Var-id: 0
53  id: 4
54  idx: 3
55  containerid: 0
56  hascontent: true
57  dptr: 0
58  fptr: 24000376
59  chunksize: 16000000
60  containersize: 16000000
61  [hash not printed]
62 
63 
64 ------------------- DATASTRUCTURE END ---------------------

Checkpoint 3

size2=6000000;
size3=7000000;
FTI_Protect(2, arr2, size2, FTI_INTG);
FTI_Protect(3, arr3, size3, FTI_INTG);
FTI_Checkpoint(3,level);

Output:

1 ------------------- DATASTRUCTURE BEGIN -------------------
2 
3  DataBase-id: 0
4  numvars: 3
5  dbsize: 24000204
6  *dbvars: 0xe4b4b0
7  *previous: (nil)
8  *next: 0xe78f30
9  [size metadata: 204]
10 
11  Var-id: 0
12  id: 1
13  idx: 0
14  containerid: 0
15  hascontent: true
16  dptr: 0
17  fptr: 300
18  chunksize: 4000000
19  containersize: 4000000
20  [hash not printed]
21 
22  Var-id: 1
23  id: 2
24  idx: 1
25  containerid: 0
26  hascontent: true
27  dptr: 0
28  fptr: 4000300
29  chunksize: 8000000
30  containersize: 8000000
31  [hash not printed]
32 
33  Var-id: 2
34  id: 3
35  idx: 2
36  containerid: 0
37  hascontent: true
38  dptr: 0
39  fptr: 12000300
40  chunksize: 12000000
41  containersize: 12000000
42  [hash not printed]
43 
44  DataBase-id: 1
45  numvars: 1
46  dbsize: 16000076
47  *dbvars: 0xe77470
48  *previous: 0xe4fea0
49  *next: 0xe50620
50  [size metadata: 76]
51 
52  Var-id: 0
53  id: 4
54  idx: 3
55  containerid: 0
56  hascontent: true
57  dptr: 0
58  fptr: 24000376
59  chunksize: 16000000
60  containersize: 16000000
61  [hash not printed]
62 
63  DataBase-id: 2
64  numvars: 2
65  dbsize: 32000140
66  *dbvars: 0xe79260
67  *previous: 0xe78f30
68  *next: (nil)
69  [size metadata: 140]
70 
71  Var-id: 0
72  id: 2
73  idx: 1
74  containerid: 1
75  hascontent: true
76  dptr: 8000000
77  fptr: 40000516
78  chunksize: 16000000
79  containersize: 16000000
80  [hash not printed]
81 
82  Var-id: 1
83  id: 3
84  idx: 2
85  containerid: 1
86  hascontent: true
87  dptr: 12000000
88  fptr: 56000516
89  chunksize: 16000000
90  containersize: 16000000
91  [hash not printed]
92 
93 
94 ------------------- DATASTRUCTURE END ---------------------

Checkpoint 4

size5=5000000;
FTI_Protect(5, arr5, size5, FTI_INTG);
FTI_Checkpoint(4,level);

Output:

1 ------------------- DATASTRUCTURE BEGIN -------------------
2 
3  DataBase-id: 0
4  numvars: 3
5  dbsize: 24000204
6  *dbvars: 0xe4b4b0
7  *previous: (nil)
8  *next: 0xe78f30
9  [size metadata: 204]
10 
11  Var-id: 0
12  id: 1
13  idx: 0
14  containerid: 0
15  hascontent: true
16  dptr: 0
17  fptr: 300
18  chunksize: 4000000
19  containersize: 4000000
20  [hash not printed]
21 
22  Var-id: 1
23  id: 2
24  idx: 1
25  containerid: 0
26  hascontent: true
27  dptr: 0
28  fptr: 4000300
29  chunksize: 8000000
30  containersize: 8000000
31  [hash not printed]
32 
33  Var-id: 2
34  id: 3
35  idx: 2
36  containerid: 0
37  hascontent: true
38  dptr: 0
39  fptr: 12000300
40  chunksize: 12000000
41  containersize: 12000000
42  [hash not printed]
43 
44  DataBase-id: 1
45  numvars: 1
46  dbsize: 16000076
47  *dbvars: 0xe77470
48  *previous: 0xe4fea0
49  *next: 0xe50620
50  [size metadata: 76]
51 
52  Var-id: 0
53  id: 4
54  idx: 3
55  containerid: 0
56  hascontent: true
57  dptr: 0
58  fptr: 24000376
59  chunksize: 16000000
60  containersize: 16000000
61  [hash not printed]
62 
63  DataBase-id: 2
64  numvars: 2
65  dbsize: 32000140
66  *dbvars: 0xe79260
67  *previous: 0xe78f30
68  *next: 0xe79230
69  [size metadata: 140]
70 
71  Var-id: 0
72  id: 2
73  idx: 1
74  containerid: 1
75  hascontent: true
76  dptr: 8000000
77  fptr: 40000516
78  chunksize: 16000000
79  containersize: 16000000
80  [hash not printed]
81 
82  Var-id: 1
83  id: 3
84  idx: 2
85  containerid: 1
86  hascontent: true
87  dptr: 12000000
88  fptr: 56000516
89  chunksize: 16000000
90  containersize: 16000000
91  [hash not printed]
92 
93  DataBase-id: 3
94  numvars: 1
95  dbsize: 20000076
96  *dbvars: 0xe79170
97  *previous: 0xe50620
98  *next: (nil)
99  [size metadata: 76]
100 
101  Var-id: 0
102  id: 5
103  idx: 4
104  containerid: 0
105  hascontent: true
106  dptr: 0
107  fptr: 72000592
108  chunksize: 20000000
109  containersize: 20000000
110  [hash not printed]
111 
112 
113 ------------------- DATASTRUCTURE END ---------------------

Checkpoint 5

size2=5000000;
size3=6000000;
FTI_Protect(2, arr2, size2, FTI_INTG);
FTI_Protect(3, arr3, size3, FTI_INTG);
FTI_Checkpoint(5,level);

Output:

1 ------------------- DATASTRUCTURE BEGIN -------------------
2 
3  DataBase-id: 0
4  numvars: 3
5  dbsize: 24000204
6  *dbvars: 0xe734b0
7  *previous: (nil)
8  *next: 0xea0f30
9  [size metadata: 204]
10 
11  Var-id: 0
12  id: 1
13  idx: 0
14  containerid: 0
15  hascontent: true
16  dptr: 0
17  fptr: 300
18  chunksize: 4000000
19  containersize: 4000000
20  [hash not printed]
21 
22  Var-id: 1
23  id: 2
24  idx: 1
25  containerid: 0
26  hascontent: true
27  dptr: 0
28  fptr: 4000300
29  chunksize: 8000000
30  containersize: 8000000
31  [hash not printed]
32 
33  Var-id: 2
34  id: 3
35  idx: 2
36  containerid: 0
37  hascontent: true
38  dptr: 0
39  fptr: 12000300
40  chunksize: 12000000
41  containersize: 12000000
42  [hash not printed]
43 
44  DataBase-id: 1
45  numvars: 1
46  dbsize: 16000076
47  *dbvars: 0xe9f470
48  *previous: 0xe77ea0
49  *next: 0xe78620
50  [size metadata: 76]
51 
52  Var-id: 0
53  id: 4
54  idx: 3
55  containerid: 0
56  hascontent: true
57  dptr: 0
58  fptr: 24000376
59  chunksize: 16000000
60  containersize: 16000000
61  [hash not printed]
62 
63  DataBase-id: 2
64  numvars: 2
65  dbsize: 32000140
66  *dbvars: 0xea1260
67  *previous: 0xea0f30
68  *next: 0xea1230
69  [size metadata: 140]
70 
71  Var-id: 0
72  id: 2
73  idx: 1
74  containerid: 1
75  hascontent: true
76  dptr: 8000000
77  fptr: 40000516
78  chunksize: 12000000
79  containersize: 16000000
80  [hash not printed]
81 
82  Var-id: 1
83  id: 3
84  idx: 2
85  containerid: 1
86  hascontent: true
87  dptr: 12000000
88  fptr: 56000516
89  chunksize: 12000000
90  containersize: 16000000
91  [hash not printed]
92 
93  DataBase-id: 3
94  numvars: 1
95  dbsize: 20000076
96  *dbvars: 0xea1170
97  *previous: 0xe78620
98  *next: (nil)
99  [size metadata: 76]
100 
101  Var-id: 0
102  id: 5
103  idx: 4
104  containerid: 0
105  hascontent: true
106  dptr: 0
107  fptr: 72000592
108  chunksize: 20000000
109  containersize: 20000000
110  [hash not printed]
111 
112 
113 ------------------- DATASTRUCTURE END ---------------------

Checkpoint 6

size2=8000000;
size3=9000000;
FTI_Protect(2, arr2, size2, FTI_INTG);
FTI_Protect(3, arr3, size3, FTI_INTG);
FTI_Checkpoint(6,level);

Output:

1 ------------------- DATASTRUCTURE BEGIN -------------------
2 
3  DataBase-id: 0
4  numvars: 3
5  dbsize: 24000204
6  *dbvars: 0xe734b0
7  *previous: (nil)
8  *next: 0xea0f30
9  [size metadata: 204]
10 
11  Var-id: 0
12  id: 1
13  idx: 0
14  containerid: 0
15  hascontent: true
16  dptr: 0
17  fptr: 300
18  chunksize: 4000000
19  containersize: 4000000
20  [hash not printed]
21 
22  Var-id: 1
23  id: 2
24  idx: 1
25  containerid: 0
26  hascontent: true
27  dptr: 0
28  fptr: 4000300
29  chunksize: 8000000
30  containersize: 8000000
31  [hash not printed]
32 
33  Var-id: 2
34  id: 3
35  idx: 2
36  containerid: 0
37  hascontent: true
38  dptr: 0
39  fptr: 12000300
40  chunksize: 12000000
41  containersize: 12000000
42  [hash not printed]
43 
44  DataBase-id: 1
45  numvars: 1
46  dbsize: 16000076
47  *dbvars: 0xe9f470
48  *previous: 0xe77ea0
49  *next: 0xe78620
50  [size metadata: 76]
51 
52  Var-id: 0
53  id: 4
54  idx: 3
55  containerid: 0
56  hascontent: true
57  dptr: 0
58  fptr: 24000376
59  chunksize: 16000000
60  containersize: 16000000
61  [hash not printed]
62 
63  DataBase-id: 2
64  numvars: 2
65  dbsize: 32000140
66  *dbvars: 0xea1260
67  *previous: 0xea0f30
68  *next: 0xea1230
69  [size metadata: 140]
70 
71  Var-id: 0
72  id: 2
73  idx: 1
74  containerid: 1
75  hascontent: true
76  dptr: 8000000
77  fptr: 40000516
78  chunksize: 16000000
79  containersize: 16000000
80  [hash not printed]
81 
82  Var-id: 1
83  id: 3
84  idx: 2
85  containerid: 1
86  hascontent: true
87  dptr: 12000000
88  fptr: 56000516
89  chunksize: 16000000
90  containersize: 16000000
91  [hash not printed]
92 
93  DataBase-id: 3
94  numvars: 1
95  dbsize: 20000076
96  *dbvars: 0xea1170
97  *previous: 0xe78620
98  *next: 0xea0cf0
99  [size metadata: 76]
100 
101  Var-id: 0
102  id: 5
103  idx: 4
104  containerid: 0
105  hascontent: true
106  dptr: 0
107  fptr: 72000592
108  chunksize: 20000000
109  containersize: 20000000
110  [hash not printed]
111 
112  DataBase-id: 4
113  numvars: 2
114  dbsize: 16000140
115  *dbvars: 0xea0dc0
116  *previous: 0xea1230
117  *next: (nil)
118  [size metadata: 140]
119 
120  Var-id: 0
121  id: 2
122  idx: 1
123  containerid: 2
124  hascontent: true
125  dptr: 24000000
126  fptr: 92000732
127  chunksize: 8000000
128  containersize: 8000000
129  [hash not printed]
130 
131  Var-id: 1
132  id: 3
133  idx: 2
134  containerid: 2
135  hascontent: true
136  dptr: 28000000
137  fptr: 100000732
138  chunksize: 8000000
139  containersize: 8000000
140  [hash not printed]
141 
142 
143 ------------------- DATASTRUCTURE END ---------------------

Checkpoint 7

size2 = 1000000;
size3 = 2000000;
FTI_Protect(2, arr2, size2, FTI_INTG);
FTI_Protect(3, arr3, size3, FTI_INTG);
FTI_Checkpoint(7,level);

Output:

1 ------------------- DATASTRUCTURE BEGIN -------------------
2 
3  DataBase-id: 0
4  numvars: 3
5  dbsize: 24000204
6  *dbvars: 0xe734b0
7  *previous: (nil)
8  *next: 0xea0f30
9  [size metadata: 204]
10 
11  Var-id: 0
12  id: 1
13  idx: 0
14  containerid: 0
15  hascontent: true
16  dptr: 0
17  fptr: 300
18  chunksize: 4000000
19  containersize: 4000000
20  [hash not printed]
21 
22  Var-id: 1
23  id: 2
24  idx: 1
25  containerid: 0
26  hascontent: true
27  dptr: 0
28  fptr: 4000300
29  chunksize: 4000000
30  containersize: 8000000
31  [hash not printed]
32 
33  Var-id: 2
34  id: 3
35  idx: 2
36  containerid: 0
37  hascontent: true
38  dptr: 0
39  fptr: 12000300
40  chunksize: 8000000
41  containersize: 12000000
42  [hash not printed]
43 
44  DataBase-id: 1
45  numvars: 1
46  dbsize: 16000076
47  *dbvars: 0xe9f470
48  *previous: 0xe77ea0
49  *next: 0xe78620
50  [size metadata: 76]
51 
52  Var-id: 0
53  id: 4
54  idx: 3
55  containerid: 0
56  hascontent: true
57  dptr: 0
58  fptr: 24000376
59  chunksize: 16000000
60  containersize: 16000000
61  [hash not printed]
62 
63  DataBase-id: 2
64  numvars: 2
65  dbsize: 32000140
66  *dbvars: 0xea1260
67  *previous: 0xea0f30
68  *next: 0xea1230
69  [size metadata: 140]
70 
71  Var-id: 0
72  id: 2
73  idx: 1
74  containerid: 1
75  hascontent: false
76  dptr: 8000000
77  fptr: 40000516
78  chunksize: 0
79  containersize: 16000000
80  [hash not printed]
81 
82  Var-id: 1
83  id: 3
84  idx: 2
85  containerid: 1
86  hascontent: false
87  dptr: 12000000
88  fptr: 56000516
89  chunksize: 0
90  containersize: 16000000
91  [hash not printed]
92 
93  DataBase-id: 3
94  numvars: 1
95  dbsize: 20000076
96  *dbvars: 0xea1170
97  *previous: 0xe78620
98  *next: 0xea0cf0
99  [size metadata: 76]
100 
101  Var-id: 0
102  id: 5
103  idx: 4
104  containerid: 0
105  hascontent: true
106  dptr: 0
107  fptr: 72000592
108  chunksize: 20000000
109  containersize: 20000000
110  [hash not printed]
111 
112  DataBase-id: 4
113  numvars: 2
114  dbsize: 16000140
115  *dbvars: 0xea0dc0
116  *previous: 0xea1230
117  *next: (nil)
118  [size metadata: 140]
119 
120  Var-id: 0
121  id: 2
122  idx: 1
123  containerid: 2
124  hascontent: false
125  dptr: 24000000
126  fptr: 92000732
127  chunksize: 0
128  containersize: 8000000
129  [hash not printed]
130 
131  Var-id: 1
132  id: 3
133  idx: 2
134  containerid: 2
135  hascontent: false
136  dptr: 28000000
137  fptr: 100000732
138  chunksize: 0
139  containersize: 8000000
140  [hash not printed]
141 
142 
143 ------------------- DATASTRUCTURE END ---------------------