The cluster MD is a shared-device RAID for a cluster.


1. On-disk format

A separate write-intent bitmap is used for each cluster node.
Each bitmap records all writes that may have been started on that node
and may not yet have finished. The on-disk layout is:

0                    4k                     8k                    12k
-------------------------------------------------------------------
| idle                | md super            | bm super [0] + bits |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
| bm bits [3, contd]  |                     |                     |

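The layout above can be turned into a small offset calculation. This is
only a sketch of the diagram as drawn: it assumes every bitmap occupies
exactly two 4k blocks (superblock plus bits, then one continuation
block), which may not hold for other bitmap sizes. The names BLOCK,
BLOCKS_PER_BITMAP and bitmap_super_offset() are hypothetical.

```python
BLOCK = 4096                      # the diagram is drawn in 4k units
MD_SUPER_OFFSET = 1 * BLOCK       # "md super" follows one idle block
FIRST_BITMAP_OFFSET = 2 * BLOCK   # "bm super [0] + bits" starts at 8k
BLOCKS_PER_BITMAP = 2             # assumption taken from the diagram only


def bitmap_super_offset(slot: int) -> int:
    """Byte offset of the bitmap superblock for a zero-based bitmap slot."""
    if slot < 0:
        raise ValueError("bitmap slots start at zero")
    return FIRST_BITMAP_OFFSET + slot * BLOCKS_PER_BITMAP * BLOCK
```
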
During "normal" functioning we assume the filesystem ensures that only one
node writes to any given block at a time, so a write request will
 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.

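The three steps of a write request can be sketched as follows. This is
an illustration, not the md implementation: the class NodeBitmap, the
clear_delay value, the dict-based mirrors and the explicit "now"
argument are all assumptions made to keep the example self-contained.

```python
class NodeBitmap:
    """Toy write-intent bitmap for one node, tracking dirty chunk numbers."""

    def __init__(self, clear_delay=5.0):
        self.bits = set()        # chunks whose bit is currently set
        self.clear_at = {}       # chunk -> time after which the bit may clear
        self.clear_delay = clear_delay

    def write(self, chunk, data, mirrors, now):
        self.bits.add(chunk)                 # set the bit (no-op if already set)
        for mirror in mirrors:               # commit the write to all mirrors
            mirror[chunk] = data
        # schedule the bit to be cleared after a timeout
        self.clear_at[chunk] = now + self.clear_delay

    def expire(self, now):
        """Clear every bit whose timeout has passed."""
        for chunk, deadline in list(self.clear_at.items()):
            if now >= deadline:
                self.bits.discard(chunk)
                del self.clear_at[chunk]
```
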
Reads are just handled normally.  It is up to the filesystem to
ensure one node doesn't read from a location where another node (or the same
node) is writing.


2. DLM Locks for management

There are two locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)

 The bm_lockres protects individual node bitmaps. They are named in the
 form bitmap001 for node 1, bitmap002 for node 2 and so on. When a node
 joins the cluster, it acquires the lock in PW mode and holds it in that
 mode for as long as the node is part of the cluster. The lock resource
 number is based on the slot number returned by the DLM subsystem. Since
 DLM starts node count from one and bitmap slots start from zero, one is
 subtracted from the DLM slot number to arrive at the bitmap slot number.

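The naming and numbering rule can be written down in a few lines. This
is a sketch: the helper names are hypothetical, and the three-digit
zero padding is inferred from the bitmap001/bitmap002 examples above
rather than taken from the code.

```python
def bitmap_slot(dlm_slot: int) -> int:
    """Map a DLM slot number (counted from one) to a bitmap slot (from zero)."""
    if dlm_slot < 1:
        raise ValueError("DLM slot numbers start at one")
    return dlm_slot - 1


def bitmap_lock_name(node: int) -> str:
    """Lock resource name for a node, e.g. node 1 -> "bitmap001" (assumed format)."""
    return "bitmap%03d" % node
```
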
3. Communication

Each node has to communicate with other nodes when starting or ending
resync, and when updating the metadata superblock.

3.1 Message Types

 There are 3 types of messages which are passed:

 3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been
   updated, and the node must re-read the md superblock. This is performed
   synchronously.

 3.1.2 RESYNC: informs other nodes that a resync is initiated or ended
   so that each node may suspend or resume the region.

 3.1.3 NEWDISK: informs other nodes that a device is being added, and
   carries the uuid and slot number of the new device (see section 5).

3.2 Communication mechanism

 The DLM LVB (lock value block) is used to communicate between nodes of
 the cluster. There are three resources used for the purpose:

  3.2.1 Token: The resource which protects the entire communication
   system. The node having the token resource is allowed to
   communicate.

  3.2.2 Message: The lock resource which carries the data to
   communicate.

  3.2.3 Ack: The resource, acquiring which means the message has been
   acknowledged by all nodes in the cluster. The BAST (blocking AST) of
   the resource is used to inform the receiving node that a node wants
   to communicate.

The algorithm is:

 1. receive status: every node holds CR of ACK

   sender                         receiver                   receiver
   ACK:CR                          ACK:CR                     ACK:CR

 2. sender gets EX of TOKEN
    sender gets EX of MESSAGE
    sender                        receiver                 receiver
    TOKEN:EX                       ACK:CR                   ACK:CR
    MESSAGE:EX
    ACK:CR

    Sender checks that it still needs to send a message. Messages received
    or other events that happened while waiting for the TOKEN may have made
    this message inappropriate or redundant.

 3. sender writes LVB
    sender down-converts MESSAGE from EX to CR
    sender tries to get EX of ACK
    [ wait until all receivers have *processed* the MESSAGE ]

                                     [ triggered by BAST of ACK ]
                                     receiver gets CR of MESSAGE
                                     receiver reads LVB
                                     receiver processes the message
                                     [ wait finish ]
                                     receiver releases ACK

   sender                         receiver                   receiver
   TOKEN:EX                       MESSAGE:CR                 MESSAGE:CR
   MESSAGE:CR
   ACK:EX

 4. triggered by grant of EX on ACK (indicating all receivers have processed
    the message)
    sender down-converts ACK from EX to CR
    sender releases MESSAGE
    sender releases TOKEN
                               receiver up-converts to EX of MESSAGE
                               receiver gets CR of ACK
                               receiver releases MESSAGE

   sender                      receiver                   receiver
   ACK:CR                       ACK:CR                     ACK:CR


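The handshake above can be walked through with a toy lock manager. This
is a single-threaded sketch under several assumptions: the Resource
class and send() are hypothetical, only the CR and EX modes are
modelled (CR is compatible with CR, EX with nothing), the BAST is
replaced by a plain loop over the receivers, and the receiver's
up-conversion of MESSAGE to EX in step 4 is left out for brevity. The
point it illustrates is that EX on ACK cannot be granted until every
receiver has released ACK.

```python
# Only CR/CR is compatible among the two modes modelled here.
COMPATIBLE = {("CR", "CR")}


class Resource:
    """Toy lock resource: which node holds which mode, plus an LVB."""

    def __init__(self, name):
        self.name = name
        self.holders = {}    # node -> "CR" or "EX"
        self.lvb = None

    def try_lock(self, node, mode):
        """Grant or convert `node` to `mode` if every other holder allows it."""
        for other, held in self.holders.items():
            if other != node and (held, mode) not in COMPATIBLE:
                return False
        self.holders[node] = mode
        return True

    def unlock(self, node):
        self.holders.pop(node, None)


def send(sender, receivers, payload, token, message, ack, processed):
    # Step 2: sender gets EX of TOKEN and EX of MESSAGE.
    assert token.try_lock(sender, "EX")
    assert message.try_lock(sender, "EX")
    # Step 3: write the LVB, down-convert MESSAGE, then ask for EX of ACK.
    message.lvb = payload
    assert message.try_lock(sender, "CR")
    for r in receivers:
        # EX of ACK is refused while any receiver still holds CR of ACK.
        assert not ack.try_lock(sender, "EX")
        assert message.try_lock(r, "CR")     # receiver gets CR of MESSAGE
        processed[r] = message.lvb           # receiver reads and processes LVB
        ack.unlock(r)                        # receiver releases ACK
    # Step 4: all receivers released ACK, so EX is granted.
    assert ack.try_lock(sender, "EX")
    assert ack.try_lock(sender, "CR")        # down-convert ACK from EX to CR
    message.unlock(sender)
    token.unlock(sender)
    for r in receivers:
        assert ack.try_lock(r, "CR")         # receiver gets CR of ACK
        message.unlock(r)                    # receiver releases MESSAGE


token, message, ack = Resource("token"), Resource("message"), Resource("ack")
nodes = ["n1", "n2", "n3"]
for n in nodes:
    ack.try_lock(n, "CR")                    # step 1: every node holds CR of ACK
processed = {}
send("n1", ["n2", "n3"], "METADATA_UPDATED", token, message, ack, processed)
```
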
4. Handling Failures

4.1 Node Failure
 When a node fails, the DLM informs the cluster with the slot number of
 the failed node. A surviving node starts a cluster recovery thread. The
 cluster recovery thread:
	- acquires the bitmap<number> lock of the failed node
	- opens the bitmap
	- reads the bitmap of the failed node
	- copies the set bits to the local node's bitmap
	- cleans the bitmap of the failed node
	- releases the bitmap<number> lock of the failed node
	- initiates resync of the bitmap on the current node

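The recovery steps can be sketched as a bitmap merge. This is an
illustration only: bitmaps are modelled as Python sets of dirty chunk
numbers, a threading.Lock stands in for the DLM bitmap<number> lock,
and recover_failed_node() is a hypothetical name. The returned list
represents the chunks the local resync would then have to cover.

```python
import threading


def recover_failed_node(failed_slot, local_slot, bitmaps, locks):
    """bitmaps: slot -> set of dirty chunks; locks: slot -> threading.Lock."""
    with locks[failed_slot]:                 # acquire the failed node's lock
        dirty = set(bitmaps[failed_slot])    # open and read the failed bitmap
        bitmaps[local_slot] |= dirty         # copy the set bits to the local node
        bitmaps[failed_slot].clear()         # clean the failed node's bitmap
    # The lock is released on leaving the with-block.
    return sorted(bitmaps[local_slot])       # input for the local resync
```
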
 The resync process is the regular md resync. However, in a clustered
 environment, the node performing a resync needs to tell other nodes
 which areas are suspended. Before a resync starts, the node
 sends out RESYNC_START with the (lo,hi) range of the area which needs
 to be suspended. Each node maintains a suspend_list, which contains
 the list of ranges which are currently suspended. On receiving
 RESYNC_START, the node adds the range to the suspend_list. Similarly,
 when the node performing resync finishes, it sends RESYNC_FINISHED
 to other nodes and other nodes remove the corresponding entry from
 the suspend_list.

 A helper function, should_suspend(), can be used to check if a particular
 I/O range should be suspended or not.

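A minimal model of the suspend_list and the should_suspend() check is
shown below. It rests on assumptions the text does not settle: ranges
are treated as half-open, so an I/O that merely touches the end of a
suspended range does not overlap it, and RESYNC_FINISHED is assumed to
identify its entry by the same (lo,hi) pair. The class and method
names other than should_suspend are hypothetical.

```python
class SuspendList:
    """Ranges currently suspended because another node is resyncing them."""

    def __init__(self):
        self.ranges = []                      # list of (lo, hi) pairs

    def on_resync_start(self, lo, hi):
        self.ranges.append((lo, hi))          # RESYNC_START: add the range

    def on_resync_finished(self, lo, hi):
        self.ranges.remove((lo, hi))          # RESYNC_FINISHED: drop the entry

    def should_suspend(self, lo, hi):
        """True if the I/O range [lo, hi) overlaps any suspended range."""
        return any(lo < s_hi and hi > s_lo for s_lo, s_hi in self.ranges)
```
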
4.2 Device Failure
 Device failures are handled and communicated with the metadata update
 routine.

5. Adding a new Device
For adding a new device, it is necessary that all nodes "see" the new device
to be added. For this, the following algorithm is used:

    1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
    2. Node 1 sends NEWDISK with uuid and slot number
    3. Other nodes issue kobject_uevent_env with uuid and slot number
       (Steps 4,5 could be a udev rule)
    4. In userspace, the node searches for the disk, perhaps
       using blkid -t SUB_UUID=""
    5. Other nodes issue either of the following depending on whether the disk
       was found:
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
                disc.number set to slot number)
       ioctl(CLUSTERED_DISK_NACK)
    6. Other nodes drop lock on no-new-devs (CR) if device is found
    7. Node 1 attempts EX lock on no-new-devs
    8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
       as SpareLocal
    9. If node 1 does not get the no-new-devs lock, it fails the operation and
       sends METADATA_UPDATED
    10. Other nodes get the information whether a disk is added or not
        by the following METADATA_UPDATED.

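Steps 6 to 9 amount to a vote taken through the no-new-devs lock: a
node that did not find the disk keeps its CR lock, and any remaining CR
holder blocks node 1's EX request. The sketch below models only that
decision. The function add_disk_outcome() and its input format are
hypothetical, and no DLM or ioctl calls are made. In both outcomes
node 1 goes on to send METADATA_UPDATED.

```python
def add_disk_outcome(found_by_node):
    """found_by_node: node name -> True if that node found the new disk."""
    # A node drops its CR lock on no-new-devs only if it found the disk.
    still_holding_cr = {n for n, found in found_by_node.items() if not found}
    # EX is incompatible with CR, so node 1 gets EX only if nobody holds CR.
    got_ex = not still_holding_cr
    return ("added" if got_ex else "failed"), sorted(still_holding_cr)
```
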