1BTT - Block Translation Table 2============================= 3 4 51. Introduction 6--------------- 7 8Persistent memory based storage is able to perform IO at byte (or more 9accurately, cache line) granularity. However, we often want to expose such 10storage as traditional block devices. The block drivers for persistent memory 11will do exactly this. However, they do not provide any atomicity guarantees. 12Traditional SSDs typically provide protection against torn sectors in hardware, 13using stored energy in capacitors to complete in-flight block writes, or perhaps 14in firmware. We don't have this luxury with persistent memory - if a write is in 15progress, and we experience a power failure, the block will contain a mix of old 16and new data. Applications may not be prepared to handle such a scenario. 17 18The Block Translation Table (BTT) provides atomic sector update semantics for 19persistent memory devices, so that applications that rely on sector writes not 20being torn can continue to do so. The BTT manifests itself as a stacked block 21device, and reserves a portion of the underlying storage for its metadata. At 22the heart of it, is an indirection table that re-maps all the blocks on the 23volume. It can be thought of as an extremely simple file system that only 24provides atomic sector updates. 25 26 272. Static Layout 28---------------- 29 30The underlying storage on which a BTT can be laid out is not limited in any way. 31The BTT, however, splits the available space into chunks of up to 512 GiB, 32called "Arenas". 33 34Each arena follows the same layout for its metadata, and all references in an 35arena are internal to it (with the exception of one field that points to the 36next arena). The following depicts the "On-disk" metadata layout: 37 38 39 Backing Store +-------> Arena 40+---------------+ | +------------------+ 41| | | | Arena info block | 42| Arena 0 +---+ | 4K | 43| 512G | +------------------+ 44| | | | 45+---------------+ | | 46| | | | 47| Arena 1 | | Data Blocks | 48| 512G | | | 49| | | | 50+---------------+ | | 51| . | | | 52| . | | | 53| . | | | 54| | | | 55| | | | 56+---------------+ +------------------+ 57 | | 58 | BTT Map | 59 | | 60 | | 61 +------------------+ 62 | | 63 | BTT Flog | 64 | | 65 +------------------+ 66 | Info block copy | 67 | 4K | 68 +------------------+ 69 70 713. Theory of Operation 72---------------------- 73 74 75a. The BTT Map 76-------------- 77 78The map is a simple lookup/indirection table that maps an LBA to an internal 79block. Each map entry is 32 bits. The two most significant bits are special 80flags, and the remaining form the internal block number. 81 82Bit Description 8331 - 30 : Error and Zero flags - Used in the following way: 84 Bit Description 85 31 30 86 ----------------------------------------------------------------------- 87 00 Initial state. Reads return zeroes; Premap = Postmap 88 01 Zero state: Reads return zeroes 89 10 Error state: Reads fail; Writes clear 'E' bit 90 11 Normal Block ��� has valid postmap 91 92 9329 - 0 : Mappings to internal 'postmap' blocks 94 95 96Some of the terminology that will be subsequently used: 97 98External LBA : LBA as made visible to upper layers. 99ABA : Arena Block Address - Block offset/number within an arena 100Premap ABA : The block offset into an arena, which was decided upon by range 101 checking the External LBA 102Postmap ABA : The block number in the "Data Blocks" area obtained after 103 indirection from the map 104nfree : The number of free blocks that are maintained at any given time. 105 This is the number of concurrent writes that can happen to the 106 arena. 107 108 109For example, after adding a BTT, we surface a disk of 1024G. We get a read for 110the external LBA at 768G. This falls into the second arena, and of the 512G 111worth of blocks that this arena contributes, this block is at 256G. Thus, the 112premap ABA is 256G. We now refer to the map, and find out the mapping for block 113'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64. 114 115 116b. The BTT Flog 117--------------- 118 119The BTT provides sector atomicity by making every write an "allocating write", 120i.e. Every write goes to a "free" block. A running list of free blocks is 121maintained in the form of the BTT flog. 'Flog' is a combination of the words 122"free list" and "log". The flog contains 'nfree' entries, and an entry contains: 123 124lba : The premap ABA that is being written to 125old_map : The old postmap ABA - after 'this' write completes, this will be a 126 free block. 127new_map : The new postmap ABA. The map will up updated to reflect this 128 lba->postmap_aba mapping, but we log it here in case we have to 129 recover. 130seq : Sequence number to mark which of the 2 sections of this flog entry is 131 valid/newest. It cycles between 01->10->11->01 (binary) under normal 132 operation, with 00 indicating an uninitialized state. 133lba' : alternate lba entry 134old_map': alternate old postmap entry 135new_map': alternate new postmap entry 136seq' : alternate sequence number. 137 138Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also 139padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are 140done such that for any entry being written, it: 141a. overwrites the 'old' section in the entry based on sequence numbers 142b. writes the 'new' section such that the sequence number is written last. 143 144 145c. The concept of lanes 146----------------------- 147 148While 'nfree' describes the number of concurrent IOs an arena can process 149concurrently, 'nlanes' is the number of IOs the BTT device as a whole can 150process. 151 nlanes = min(nfree, num_cpus) 152A lane number is obtained at the start of any IO, and is used for indexing into 153all the on-disk and in-memory data structures for the duration of the IO. If 154there are more CPUs than the max number of available lanes, than lanes are 155protected by spinlocks. 156 157 158d. In-memory data structure: Read Tracking Table (RTT) 159------------------------------------------------------ 160 161Consider a case where we have two threads, one doing reads and the other, 162writes. We can hit a condition where the writer thread grabs a free block to do 163a new IO, but the (slow) reader thread is still reading from it. In other words, 164the reader consulted a map entry, and started reading the corresponding block. A 165writer started writing to the same external LBA, and finished the write updating 166the map for that external LBA to point to its new postmap ABA. At this point the 167internal, postmap block that the reader is (still) reading has been inserted 168into the list of free blocks. If another write comes in for the same LBA, it can 169grab this free block, and start writing to it, causing the reader to read 170incorrect data. To prevent this, we introduce the RTT. 171 172The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts 173into rtt[lane_number], the postmap ABA it is reading, and clears it after the 174read is complete. Every writer thread, after grabbing a free block, checks the 175RTT for its presence. If the postmap free block is in the RTT, it waits till the 176reader clears the RTT entry, and only then starts writing to it. 177 178 179e. In-memory data structure: map locks 180-------------------------------------- 181 182Consider a case where two writer threads are writing to the same LBA. There can 183be a race in the following sequence of steps: 184 185free[lane] = map[premap_aba] 186map[premap_aba] = postmap_aba 187 188Both threads can update their respective free[lane] with the same old, freed 189postmap_aba. This has made the layout inconsistent by losing a free entry, and 190at the same time, duplicating another free entry for two lanes. 191 192To solve this, we could have a single map lock (per arena) that has to be taken 193before performing the above sequence, but we feel that could be too contentious. 194Instead we use an array of (nfree) map_locks that is indexed by 195(premap_aba modulo nfree). 196 197 198f. Reconstruction from the Flog 199------------------------------- 200 201On startup, we analyze the BTT flog to create our list of free blocks. We walk 202through all the entries, and for each lane, of the set of two possible 203'sections', we always look at the most recent one only (based on the sequence 204number). The reconstruction rules/steps are simple: 205- Read map[log_entry.lba]. 206- If log_entry.new matches the map entry, then log_entry.old is free. 207- If log_entry.new does not match the map entry, then log_entry.new is free. 208 (This case can only be caused by power-fails/unsafe shutdowns) 209 210 211g. Summarizing - Read and Write flows 212------------------------------------- 213 214Read: 215 2161. Convert external LBA to arena number + pre-map ABA 2172. Get a lane (and take lane_lock) 2183. Read map to get the entry for this pre-map ABA 2194. Enter post-map ABA into RTT[lane] 2205. If TRIM flag set in map, return zeroes, and end IO (go to step 8) 2216. If ERROR flag set in map, end IO with EIO (go to step 8) 2227. Read data from this block 2238. Remove post-map ABA entry from RTT[lane] 2249. Release lane (and lane_lock) 225 226Write: 227 2281. Convert external LBA to Arena number + pre-map ABA 2292. Get a lane (and take lane_lock) 2303. Use lane to index into in-memory free list and obtain a new block, next flog 231 index, next sequence number 2324. Scan the RTT to check if free block is present, and spin/wait if it is. 2335. Write data to this free block 2346. Read map to get the existing post-map ABA entry for this pre-map ABA 2357. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num] 2368. Write new post-map ABA into map. 2379. Write old post-map entry into the free list 23810. Calculate next sequence number and write into the free list entry 23911. Release lane (and lane_lock) 240 241 2424. Error Handling 243================= 244 245An arena would be in an error state if any of the metadata is corrupted 246irrecoverably, either due to a bug or a media error. The following conditions 247indicate an error: 248- Info block checksum does not match (and recovering from the copy also fails) 249- All internal available blocks are not uniquely and entirely addressed by the 250 sum of mapped blocks and free blocks (from the BTT flog). 251- Rebuilding free list from the flog reveals missing/duplicate/impossible 252 entries 253- A map entry is out of bounds 254 255If any of these error conditions are encountered, the arena is put into a read 256only state using a flag in the info block. 257 258 2595. In-kernel usage 260================== 261 262Any block driver that supports byte granularity IO to the storage may register 263with the BTT. It will have to provide the rw_bytes interface in its 264block_device_operations struct: 265 266 int (*rw_bytes)(struct gendisk *, void *, size_t, off_t, int rw); 267 268It may register with the BTT after it adds its own gendisk, using btt_init: 269 270 struct btt *btt_init(struct gendisk *disk, unsigned long long rawsize, 271 u32 lbasize, u8 uuid[], int maxlane); 272 273note that maxlane is the maximum amount of concurrency the driver wishes to 274allow the BTT to use. 275 276The BTT 'disk' appears as a stacked block device that grabs the underlying block 277device in the O_EXCL mode. 278 279When the driver wishes to remove the backing disk, it should similarly call 280btt_fini using the same struct btt* handle that was provided to it by btt_init. 281 282 void btt_fini(struct btt *btt); 283 284