ctdb/doc/recovery-process.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436

Valid as of 1.0.66, may/will change in the future


RECMASTER
=========
Recovery Master, this is one of the nodes in the cluster that has been designated to
be the "recovery master".
The recovery master is responsible for performing full checks of cluster and cluster node consistency and is also responsible for performing the actual database recovery procedure.

Only one node at a time can be the recovery master.
This is ensured by CTDB using a lock on a single file in the shared gpfs filesystem:
  /etc/sysconfig/ctdb :
  ...
  # Options to ctdbd. This is read by /etc/init.d/ctdb
  # you must specify the location of a shared lock file across all the
  # nodes. This must be on shared storage
  # there is no default here
  CTDB_RECOVERY_LOCK=/gpfs/.ctdb/shared
  ...

In order to prevent that two nodes become recovery master at the same time (==split brain)
CTDB here relies on GPFS that GPFS will guarantee coherent locking across the cluster.
Thus CTDB relies on that GPFS MUST only allow one ctdb process on one node to take out and
hold this lock.

The recovery master is designated through an election process. 


VNNMAP
======
The VNNMAP is a list of all nodes in the cluster that is currently part of the cluster
and participates in hosting the cluster databases.
All nodes that are CONNECTED but not BANNED be present in the VNNMAP.

The VNNMAP is the list of LMASTERS for the cluster as reported by 'ctdb status' "
  ...
  Size:3
  hash:0 lmaster:0
  hash:1 lmaster:1
  hash:2 lmaster:2
  ...


CLUSTER MONITORING
==================
All nodes in the cluster monitor its own health and its own consistency regards to the
recovery master. How and what the nodes monitor for differs between the node which is
the recovery master and normal nodes.
This monitoring it to ensure that the cluster is healthy and consistent. 
This is not related to monitoring of inidividual node health, a.k.a. eventscript monitroing.

At the end of each step in the process are listed some of the most common and important
error messages that can be generated during that step.


NORMAL NODE CLUSTER MONITORING
------------------------------
Monitoring is performed in the dedicated recovery daemon process.
The implementation can be found in server/ctdb_recoverd.c:monitor_cluster()
This is an overview of the more important tasks during monitoring.
These tests are to verify that the local node is consistent with the recovery master.

Once every second the following monitoring loop is performed :

1, Verify that the parent ctdb daemon on the local node is still running.
   If it is not, the recovery daemon logs an error and terminates.
   "CTDB daemon is no longer available. Shutting down recovery daemon"

2, Check if any of the nodes has been recorded to have misbehaved too many times.
   If so we ban the node and log a message :
   "Node %u has caused %u failures in %.0f seconds - banning it for %u seconds"

3, Check that there is a recovery master.
   If not we initiate a clusterwide election process and log :
   "Initial recovery master set - forcing election"
   and we restart monitoring from 1.

4, Verify that recovery daemon and the local ctdb daemon agreed on all the
   node BANNING flags.
   If the recovery daemon and the local ctdb daemon disagrees on these flags we update
   the local ctdb daemon, logs one of two messages and restarts monitoring from 1 again.
   "Local ctdb daemon on recmaster does not think this node is BANNED but the recovery master disagrees. Unbanning the node"
   "Local ctdb daemon on non-recmaster does not think this node is BANNED but the recovery master disagrees. Re-banning the node"

5, Verify that the node designated to be recovery master exists in the local list of all nodes.
   If the recovery master is not in the list of all cluster nodes a new recovery master
   election is triggered and monitoring restarts from 1.
   "Recmaster node %u not in list. Force reelection"

6, Check if the recovery master has become disconnected.
   If is has, log an error message, force a new election and restart monitoring from 1.
   "Recmaster node %u is disconnected. Force reelection"

7, Read the node flags off the recovery master and verify that it has not become banned.
   If is has, log an error message, force a new election and restart monitoring from 1.
   "Recmaster node %u no longer available. Force reelection"

8, Verify that the recmaster and the local node agrees on the flags (BANNED/DISABLED/...)
   for the local node. 
   If there is an inconsistency, push the flags for the local node out to all other nodes.
   "Recmaster disagrees on our flags flags:0x%x recmaster_flags:0x%x  Broadcasting out flags."

9, Verify that the local node hosts all public ip addresses it should host and that it does
   NOT host any public addresses it should not host.
   If there is an inconsistency we log an error, trigger a recovery to occur and restart
   monitoring from 1 again.
   "Public address '%s' is missing and we should serve this ip"
   "We are still serving a public address '%s' that we should not be serving."

These are all the checks we perform during monitoring for a normal node.
These tests are performed on all nodes in the cluster which is why it is optimized to perform
as few network calls to other nodes as possible.
Each node only performs 1 call to the recovery master in each loop and to no other nodes.

NORMAL NODE CLUSTER MONITORING
------------------------------
The recovery master performs a much more extensive test. In addition to tests 1-9 above
the recovery master also performs the following tests:

10, Read the list of nodes and flags from all other CONNECTED nodes in the cluster.
    If there is a failure to read this list from one of the nodes, then log an 
    error, mark this node as a candidate to become BANNED and restart monitoring from 1.
    "Unable to get nodemap from remote node %u"

11, Verify that the local recovery master and the remote node agrees on the flags
    for the remote node. If there is a inconsistency for the BANNING flag,
    log an error, trigger a new recmaster election and restart monitoring from 1.
    "Remote node %u had different BANNED flags 0x%x, local had 0x%x - trigger a re-election"
    "Remote node %u had flags 0x%x, local had 0x%x - updating local"

12, Verify that the local recovery master and the remote node agrees on the flags
    for the remote node. If one of the flags other than the BANNING flag was inconsistent,
    just update the set of flags for the local recovery daemon, log an information message
    and continue monitoring.
    "Remote node %u had flags 0x%x, local had 0x%x - updating local"

13, Read the list of public ip addresses from all of the CONNECTED nodes and merge into a
    single clusterwide list.
    If we fail to read the list of ips from a node, log an error and restart monitoring from 1.
    "Failed to read public ips from node : %u"

14, Verify that all other nodes agree that this node is the recovery master.
    If one of the other nodes discgrees this is the recovery master, log an error,
    force a new election and restart monitoring from 1.
    "Node %d does not agree we are the recmaster. Need a new recmaster election"

15, Check if the previous attempt to run a recovery failed, and if it did, try a new recovery.
    After the recovery, restart monitoring from 1.
    "Starting do_recovery"

16, Verify that all CONNECTED nodes in the cluster are in recovery mode NORMAL.
    If one of the nodes were in recovery mode ACTIVE, force a new recovery and restart
    monitoring from 1.
    "Node:%u was in recovery mode. Restart recovery process"

17, Verify that the filehandle to the recovery lock file is valid.
    If it is not, this may mean a split brain and is a critical error.
    Try a new recovery and restart monitoring from 1.
    "recovery master doesn't have the recovery lock"

18, Verify that GPFS allows us to read from the recovery lock file.
    If not there is a critical GPFS issue and we may have a split brain.
    Try forcing a new recovery and restart monitoring from 1.
    "failed read from recovery_lock_fd - %s"

19, Read the list of all nodes and flags from all CONNECTED nodes in the cluster.
    If fail to read the nodemap from one of the remote nodes, log an error and restart
    monitoring from 1.
    "Unable to get nodemap from remote node %u"

20, If the nodemap differs between the local node and the remote node, log an error
    and force a recovery.
    This would happen if the /etc/ctdb/nodes file differs across nodes in the cluster.
    It is unlikely that the recovery will rectify the situation.
    This is a critical error, it is most likely the entire cluster will be unavailable
    until the files are fixed or have became banned.
    "Remote node:%u has different node count. %u vs %u of the local node"

21, If a remote node disagrees on the content of the nodes list, try a recovery and restart
    monitoring from 1.
    It is unlikely that the recovery will rectify the situation.
    This is a critical error, it is most likely the entire cluster will be unavailable
    until the files are fixed or have became banned.
    "Remote node:%u has different nodemap pnn for %d (%u vs %u)."

22, If a remote node disgrees on the node flags in the list, try a recovery to re-sync
    the flags and restart monitoring from 1.
    "Remote node:%u has different nodemap flag for %d (0x%x vs 0x%x)"

23, Verify that all active nodes are part of the VNNMAP.
    If not, this would be a new node that has become CONNECTED but does not yet participate
    in the cluster.
    Perform a recovery to merge the new node to the cluster and restart monitoring from 1.
    "The vnnmap count is different from the number of active nodes. %u vs %u"
    or
    "Node %u is active in the nodemap but did not exist in the vnnmap"

24, Read the VNNMAP from all CONNECTED nodes.
    Verify that all nodes have the same VNNMAP content and that all nodes are in the same
    generation instance of the databases.
    If not, force a recovery to re-synchronize the vnnmap and the databases across the cluster
    and restart monitoring from 1.
    "Remote node %u has different generation of vnnmap. %u vs %u (ours)"
    "Remote node %u has different size of vnnmap. %u vs %u (ours)"
    "Remote node %u has different vnnmap."

25, If there has been changes to the cluster that requires a reallocation of public ip
    addresses. On all nodes run the "startrecovery" event. Run "releaseip" and "takeip"
    events to reassign the ips across the cluster and finally run the "recovered" event.

Finished monitoring, continue monitoring from 1.


CLUSTER RECOVERY
================
Recoveries are driven by the recovery daemon on the node that is currently the recovery
master.
Most of the logging that is performed during recovery is only logged on the node that
is the recovery master.
Make sure to find which node is the recovery master and check the log for that node.

Example log entries that start in column 1 are expected to be present in the
log. Example log entries that are indented 3 columns are optional and might
only be present if an error occured.


1, Log that recovery has been initiated.
"Starting do_recovery"

   It might log an informational message :
"New recovery culprit %u".
   This is only semi-accurate and might may not mean that there is any problem
   at all with the node indicated.


2, Check if a node has caused too many failed recoveries and if so ban it from
   the cluster, giving the other nodes in the cluster a chance to recovery
   operation.
   "Node %u has caused %u recoveries in %.0f seconds - banning it for %u seconds"


3, Verify that the recovery daemon can lock the recovery lock file.
   At this stage this should be recovery master. 
   If this operation fails it means we have a split brain and have to abort recovery.
   "("ctdb_recovery_lock: Unable to open %s - (%s)"
   "ctdb_recovery_lock: Failed to get recovery lock on '%s'"
   "Unable to get recovery lock - aborting recovery"
"ctdb_recovery_lock: Got recovery lock on '%s'"


4, Log which node caused the recovery to be initiated.
   This is a semi-accurate information message only.
   This line does NOT mean that there has to be something wrong with the node listed.
"Recovery initiated due to problem with node %u"


5, Pull the names of all databases from all nodes and verify that these databases also
   exists locally.
   If a database is missing locally, just create it.
   It is not an error if a database is missing locally. Databases are created on demand and
   this could happen if it was one database which samba has never tried to access on the
   local node.


6, Check the list of databases on each remote node and create any databases that may be missing
   on the remote node.
"Recovery - created remote databases"


7, Set recovery mode to ACTIVE on all remote nodes.


8, Run the "startrecovery" eventscript on all nodes.

   At this stage you will also get a few additional log entries, these are not
   from the recovery daemon but from the main ctdb daemon due to running
   the eventscript :
"startrecovery eventscript has been invoked"
"Monitoring has been disabled"
"Executing event script ...
...


9, Create a new generation id and update the generation id and the VNNMAP on the local node
   only.
   This guarantees that the generation id will now be inconsistent across the cluster and
   that if recovery fails a new recovery is attempted in the next iteration of the monitoring
   loop.


10, Start a TDB TRANSACTION on all nodes for all databases.
   This is to ensure that if recovery is aborted or fails that we do not
   modify any databases on only some of the nodes.
"started transactions on all nodes"


11, For each database, pull the content from all CONNECTED nodes and merge it into 
    the TDB databases on the local node.
    This merges the records from the remote nodes based on their serial numbers so we
    only keep the most recent record found.
"Recovery - pulled remote database 0x%x"


12, For each database, perform a fast TDB WIPE operation to delete the entire TDB under the
    transaction started above.


13, For each database, drop all empty records.
    Force the DMASTER field of all records to point to the recovery master.
    Push the database out to all other nodes.

    The PUSH process lists some additional log entries for each database of the
    form :
"Recovery - pulled remote database 0x..."
"Recovery - pushed remote database 0x... of size ..."


14, Commit all changes to all TDB databases.
"Recovery - starting database commits"
"Recovery - committed databases"


15, Create a new VNNMAP of all CONNECTED nodes, create a new generation number
    and piush this new VNNMAP out to all nodes.
"Recovery - updated vnnmap"


16, Update all nodes that the local node is the recovery master.
"Recovery - updated recmaster"


17, synchronize node flags across the cluster.
"Recovery - updated flags"

18, Change recovery mode back to NORMAL.
"Recovery - disabled recovery mode"


19, Re-allocate all public ip addresses across the cluster.
"Deterministic IPs enabled. Resetting all ip allocations"

    If the IP address allocation on the local node changes you might get
    "Takeover of IP 10.0.0.201/24 on interface eth0"
    "Release of IP 10.0.0.204/24 on interface eth0"

"Recovery - takeip finished"


20, Run the "recovered" eventscript on all nodes.
"Recovery - finished the recovered event"

    You will also get an entry from the local ctdb daemon itself that it has 
    switched back to recovery mode NORMAL.
"Recovery has finished"


21, Broadcast a message to all samba daemons in the cluster that the databases have been
    recovered. Samba will now do some additional checking/cleanup of the content in the stored
    records.

"Recovery complete"


22. Finished. At this stage a 10 second timeout (ctdb listvars : rerecoverytimeout) is 
    initiated. The cluster will not allow a new recovery to be performed until this timeout
    has expired.

"New recoveries supressed for the rerecovery timeout"
"Rerecovery timeout elapsed. Recovery reactivated."


Example : RECOVERY LOG ON RECMASTER
====================================
2008/12/01 09:57:28.110732 [ 4933]: 10.0.0.21:4379: node 10.0.0.24:4379 is dead: 2 connected
2008/12/01 09:57:28.110838 [ 4933]: Tearing down connection to dead node :3
2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:2682 The vnnmap count is different from the number of active nodes. 4 vs 3
2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:1327 Starting do_recovery
2008/12/01 09:57:28.967297 [ 4935]: ctdb_recovery_lock: Got recovery lock on '/gpfs/.ctdb/shared'
2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:1355 Recovery initiated due to problem with node 0
2008/12/01 09:57:28.967297 [ 4935]: server/ctdb_recoverd.c:1381 Recovery - created remote databases
2008/12/01 09:57:28.973543 [ 4933]: server/ctdb_recover.c:589 Recovery mode set to ACTIVE
2008/12/01 09:57:28.974823 [ 4933]: server/ctdb_recover.c:904 startrecovery eventscript has been invoked
2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1431 started transactions on all nodes
2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x42fe72c5
2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x42fe72c5 of size 0
2008/12/01 09:57:29.187264 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x1421fb78
2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x1421fb78 of size 0
2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xc0bdde6a
2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xc0bdde6a of size 0
2008/12/01 09:57:29.197262 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x17055d90
2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x17055d90 of size 8
2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x7bbbd26c
2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x7bbbd26c of size 1
2008/12/01 09:57:29.207261 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xf2a58948
2008/12/01 09:57:29.217259 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xf2a58948 of size 51
2008/12/01 09:57:29.217259 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x92380e87
2008/12/01 09:57:29.217259 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x92380e87 of size 17
2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x63501287
2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x63501287 of size 1
2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xe98e08b6
2008/12/01 09:57:29.227258 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xe98e08b6 of size 4
2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0x2672a57f
2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0x2672a57f of size 28
2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1268 Recovery - pulled remote database 0xb775fff6
2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1230 Recovery - pushed remote database 0xb775fff6 of size 6
2008/12/01 09:57:29.237256 [ 4935]: server/ctdb_recoverd.c:1440 Recovery - starting database commits
2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1452 Recovery - committed databases
2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1502 Recovery - updated vnnmap
2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1511 Recovery - updated recmaster
2008/12/01 09:57:29.297247 [ 4935]: server/ctdb_recoverd.c:1522 Recovery - updated flags
2008/12/01 09:57:29.305235 [ 4933]: server/ctdb_recover.c:589 Recovery mode set to NORMAL
2008/12/01 09:57:29.307245 [ 4935]: server/ctdb_recoverd.c:1531 Recovery - disabled recovery mode
2008/12/01 09:57:29.307245 [ 4935]: Deterministic IPs enabled. Resetting all ip allocations
2008/12/01 09:57:29.311071 [ 4933]: takeoverip called for an ip '10.0.0.201' that is not a public address
2008/12/01 09:57:29.311186 [ 4933]: takeoverip called for an ip '10.0.0.202' that is not a public address
2008/12/01 09:57:29.311204 [ 4933]: takeoverip called for an ip '10.0.0.203' that is not a public address
2008/12/01 09:57:29.311299 [ 4933]: takeoverip called for an ip '10.0.0.204' that is not a public address
2008/12/01 09:57:29.537210 [ 4935]: server/ctdb_recoverd.c:1542 Recovery - takeip finished
2008/12/01 09:57:29.545404 [ 4933]: Recovery has finished
2008/12/01 09:57:29.807169 [ 4935]: server/ctdb_recoverd.c:1551 Recovery - finished the recovered event
2008/12/01 09:57:29.807169 [ 4935]: server/ctdb_recoverd.c:1557 Recovery complete
2008/12/01 09:57:29.807169 [ 4935]: server/ctdb_recoverd.c:1565 New recoveries supressed for the rerecovery timeout
2008/12/01 09:57:39.815648 [ 4935]: server/ctdb_recoverd.c:1567 Rerecovery timeout elapsed. Recovery reactivated.