The cluster MD is a shared-device RAID for a cluster.


1. On-disk format

A separate write-intent bitmap is used for each cluster node.
The bitmaps record all writes that may have been started on that node,
and may not yet have finished. The on-disk layout is:

0                    4k                     8k                    12k
-------------------------------------------------------------------
| idle                | md super            | bm super[0] + bits  |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
| bm super[2] + bits  | bm bits[2, contd]   | bm super[3] + bits  |
| bm bits[3, contd]   |                     |                     |

During "normal" functioning we assume the filesystem ensures that only one
node writes to any given block at a time, so a write request will
 - set the appropriate bit (if not already set)
 - commit the write to all mirrors
 - schedule the bit to be cleared after a timeout.

Reads are just handled normally. It is up to the filesystem to ensure that
one node doesn't read from a location where another node (or the same
node) is writing.


2. DLM Locks for management

There are two locks for managing the device:

2.1 Bitmap lock resource (bm_lockres)

 The bm_lockres protects individual node bitmaps. The lock resources are
 named in the form bitmap001 for node 1, bitmap002 for node 2, and so on.
 When a node joins the cluster, it acquires the lock in PW mode and holds
 it for as long as the node is part of the cluster. The lock resource
 number is based on the slot number returned by the DLM subsystem. Since
 DLM starts node count from one and bitmap slots start from zero, one is
 subtracted from the DLM slot number to arrive at the bitmap slot number.

3. Communication

Each node has to communicate with other nodes when starting or ending a
resync, and when updating the metadata superblock.
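
Returning briefly to section 2.1, the slot arithmetic described there can
be sketched in C as follows. This is an illustration only:
dlm_to_bitmap_slot() and bitmap_lock_name() are hypothetical helpers, not
functions from the md-cluster code, and the three-digit name width is
taken from the bitmap001 example above rather than from the
implementation.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: DLM slots start at 1 and bitmap slots at 0,
 * so the bitmap slot is the DLM slot minus one (section 2.1).
 */
static int dlm_to_bitmap_slot(int dlm_slot)
{
	assert(dlm_slot >= 1);	/* DLM never hands out slot 0 */
	return dlm_slot - 1;
}

/*
 * Hypothetical helper: builds a lock resource name in the form shown
 * in the text ("bitmap001" for node 1). The "%03d" width is an
 * assumption based on that example. Returns what snprintf() returns.
 */
static int bitmap_lock_name(char *buf, size_t len, int node)
{
	return snprintf(buf, len, "bitmap%03d", node);
}
```

With these helpers, the node in DLM slot 1 would use bitmap slot 0.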

3.1 Message Types

 There are 3 types of messages which are passed:

 3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been
   updated, and the node must re-read the md superblock. This is performed
   synchronously.

 3.1.2 RESYNC: informs other nodes that a resync is initiated or ended
   so that each node may suspend or resume the region.

3.2 Communication mechanism

 The DLM LVB (lock value block) is used to communicate between nodes of
 the cluster. There are three resources used for the purpose:

 3.2.1 Token: The resource which protects the entire communication
   system. The node having the token resource is allowed to
   communicate.

 3.2.2 Message: The lock resource which carries the data to
   communicate.

 3.2.3 Ack: The resource, acquiring which means the message has been
   acknowledged by all nodes in the cluster. The BAST (blocking AST) of
   the resource is used to inform the receiving node that a node wants
   to communicate.

The algorithm is:

 1. receive status

   sender                         receiver                   receiver
   ACK:CR                         ACK:CR                     ACK:CR

 2. sender get EX of TOKEN
    sender get EX of MESSAGE

   sender                         receiver                   receiver
   TOKEN:EX                       ACK:CR                     ACK:CR
   MESSAGE:EX
   ACK:CR

    Sender checks that it still needs to send a message. Messages received
    or other events that happened while waiting for the TOKEN may have made
    this message inappropriate or redundant.

 3. sender write LVB
    sender down-convert MESSAGE from EX to CR
    sender try to get EX of ACK
    [ wait until all receivers have *processed* the MESSAGE ]

                                  [ triggered by bast of ACK ]
                                  receiver get CR of MESSAGE
                                  receiver read LVB
                                  receiver processes the message
                                  [ wait finish ]
                                  receiver release ACK

   sender                         receiver                   receiver
   TOKEN:EX                       MESSAGE:CR                 MESSAGE:CR
   MESSAGE:CR
   ACK:EX

 4. triggered by grant of EX on ACK (indicating all receivers have
    processed the message)
    sender down-convert ACK from EX to CR
    sender release MESSAGE
    sender release TOKEN
                                  receiver upconvert to EX of MESSAGE
                                  receiver get CR of ACK
                                  receiver release MESSAGE

   sender                         receiver                   receiver
   ACK:CR                         ACK:CR                     ACK:CR


4. Handling Failures

4.1 Node Failure
 When a node fails, the DLM informs the cluster of the failed node's slot.
 The node starts a cluster recovery thread. The cluster recovery thread:
    - acquires the bitmap<number> lock of the failed node
    - opens the bitmap
    - reads the bitmap of the failed node
    - copies the set bits to the local node's bitmap
    - clears the bitmap of the failed node
    - releases the bitmap<number> lock of the failed node
    - initiates resync of the bitmap on the current node

 The resync process is the regular md resync. However, in a clustered
 environment, a node performing a resync needs to tell the other nodes
 which areas are suspended. Before a resync starts, the node sends out
 RESYNC_START with the (lo,hi) range of the area which needs to be
 suspended. Each node maintains a suspend_list, which contains the list
 of ranges which are currently suspended. On receiving RESYNC_START, the
 node adds the range to the suspend_list. Similarly, when the node
 performing the resync finishes, it sends RESYNC_FINISHED to the other
 nodes, and they remove the corresponding entry from the suspend_list.

 A helper function, should_suspend(), can be used to check if a particular
 I/O range should be suspended or not.

4.2 Device Failure
 Device failures are handled and communicated with the metadata update
 routine.

5. Adding a new Device
For adding a new device, it is necessary that all nodes "see" the new device
to be added. For this, the following algorithm is used:

    1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
    2. Node 1 sends NEWDISK with uuid and slot number
    3. Other nodes issue kobject_uevent_env with uuid and slot number
       (Steps 4,5 could be a udev rule)
    4. In userspace, the node searches for the disk, perhaps
       using blkid -t SUB_UUID=""
    5. Other nodes issue either of the following depending on whether the
       disk was found:
       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
             disc.number set to slot number)
       ioctl(CLUSTERED_DISK_NACK)
    6. Other nodes drop lock on no-new-devs (CR) if device is found
    7. Node 1 attempts EX lock on no-new-devs
    8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking
       the disk as SpareLocal
    9. If node 1 does not get the no-new-devs lock, it fails the operation
       and sends METADATA_UPDATED
    10. Other nodes learn whether the disk was added or not from the
        METADATA_UPDATED that follows.
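
The vote in steps 5 to 9 works because a DLM EX lock is incompatible with
a CR lock: node 1 can only obtain EX on no-new-devs once every other node
has dropped its CR lock. The sketch below models just that counting
logic in C. It is a simplified illustration under stated assumptions, not
the md-cluster implementation: struct no_new_devs, node_reports(),
try_lock_ex() and add_disk_vote() are hypothetical names, and no real DLM
calls are made.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the no-new-devs lock resource. It only tracks
 * how many of the other nodes still hold the lock in CR mode.
 */
struct no_new_devs {
	size_t cr_holders;
};

/*
 * Steps 5-6: a node that found the disk drops its CR lock. A node that
 * did not find it (the CLUSTERED_DISK_NACK case) keeps holding CR.
 */
static void node_reports(struct no_new_devs *res, bool disk_found)
{
	assert(res->cr_holders > 0);
	if (disk_found)
		res->cr_holders--;
}

/*
 * Step 7: EX is incompatible with CR, so a non-blocking EX request
 * succeeds only when no other node still holds CR.
 */
static bool try_lock_ex(const struct no_new_devs *res)
{
	return res->cr_holders == 0;
}

/*
 * Runs steps 5-9 for the given per-node search results. Returns true
 * when node 1 would confirm the add (step 8) and false when it would
 * fail the operation (step 9).
 */
static bool add_disk_vote(const bool *found, size_t n_other_nodes)
{
	struct no_new_devs res = { .cr_holders = n_other_nodes };

	for (size_t i = 0; i < n_other_nodes; i++)
		node_reports(&res, found[i]);
	return try_lock_ex(&res);
}
```

In this model a single node that cannot see the new disk is enough to
make the add fail, which matches the requirement that all nodes "see" the
device.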