xfs-self-describing-metadata.txt - OpenGrok cross reference for /linux-4.4.14/Documentation/filesystems/xfs-self-describing-metadata.txt

Lines Matching refs:the
8 scalability, but of verification of the filesystem structure. Scalability of the
9 structures and indexes on disk and the algorithms for iterating them are
11 is this very scalability that causes the verification problem.
14 metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
15 other metadata structures need to be discovered by walking the filesystem
17 validating and repairing the structure, there are limits to what they can
18 verify, and this in turn limits the supportable size of an XFS filesystem.
21 scripting to analyse the structure of a 100TB filesystem when trying to
22 determine the root cause of a corruption problem, but it is still mainly a
24 weren't the ultimate cause of a corruption event. It may take a few hours to a
28 However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
30 Most of the analysis work is slow and tedious, so as the amount of analysis goes
31 up, the more likely that the cause will be lost in the noise.  Hence the primary
32 concern for supporting PB scale filesystems is minimising the time and effort
33 required for basic forensic analysis of the filesystem structure.
39 One of the problems with the current metadata format is that apart from the
40 magic number in the metadata block, we have no other way of identifying what it
41 is supposed to be. We can't even identify if it is the right place. Put simply,
43 supposed to be there and the contents are valid".
45 Hence most of the time spent on forensic analysis is spent doing basic
49 pointers in a btree end up with loops in them) are the key to understanding what
50 went wrong, but it is impossible to tell what order the blocks were linked into
51 each other or written to disk after the fact.
53 Hence we need to record more information into the metadata to allow us to
54 quickly determine if the metadata is intact and can be ignored for the purpose
56 ensure that common types of errors are easily detectable.  Hence the concept of
59 The first, fundamental requirement of self describing metadata is that the
61 location. This allows us to identify the expected contents of the block and
62 hence parse and verify the metadata object. If we can't independently identify
63 the type of metadata in the object, then the metadata doesn't describe itself
66 Luckily, almost all XFS metadata has magic numbers embedded already - only the
68 magic numbers. Hence we can change the on-disk format of all these objects to
69 add more identifying information and detect this simply by changing the magic
70 numbers in the metadata objects. That is, if it has the current magic number,
71 the metadata isn't self identifying. If it contains a new magic number, it is
72 self identifying and we can do much more expansive automated verification of the
76 integrity checking. We cannot trust the metadata if we cannot verify that it has
78 integrity check, and this is done by adding CRC32c validation to the metadata
79 block. If we can verify the block contains the metadata it was intended to
80 contain, a large amount of the manual verification work can be skipped.
85 fast. So while CRC32c is not the strongest of possible integrity checks that
89 complexity and so there is no provision for changing the integrity checking
92 Self describing metadata needs to contain enough information so that the
93 metadata block can be verified as being in the correct place without needing to
95 Just adding a block number to the metadata is not sufficient to protect against
96 mis-directed writes - a write might be misdirected to the wrong LUN and so be
97 written to the "correct block" of the wrong filesystem. Hence location
100 Another key information point in forensic analysis is knowing who the metadata
101 block belongs to. We already know the type, the location, that it is valid
102 and/or corrupted, and how long ago it was last modified. Knowing the owner
103 of the block is important as it allows us to find other related metadata to
104 determine the scope of the corruption. For example, if we have an extent btree
105 object, we don't know what inode it belongs to and hence have to walk the entire
106 filesystem to find the owner of the block. Worse, the corruption could mean that
108 in the metadata we have no idea of the scope of the corruption. If we have an
109 owner field in the metadata object, we can immediately do top down validation to
110 determine the scope of the problem.
114 freespace btree blocks are owned by an allocation group. Hence the size and
115 contents of the owner field are determined by the type of metadata object we are
117 freespace btree block written to the wrong AG).
120 written to the filesystem. One of the key information points when doing forensic
121 analysis is how recently the block was modified. Correlation of a set of corrupted
123 whether the corruptions are related, whether there's been multiple corruption
124 events that led to the eventual failure, and even whether there are corruptions
125 present that the run-time verification is not detecting.
129 when the free space btree block that contains the block was last written
130 compared to when the metadata object itself was last written.  If the free space
131 block is more recent than the object and the object's owner, then there is a
132 very good chance that the block should have been removed from the owner.
134 To provide this "written timestamp", each metadata block gets the Log Sequence
135 Number (LSN) of the most recent transaction it was modified on written into it.
136 This number will always increase over the life of the filesystem, and the only
137 thing that resets it is running xfs_repair on the filesystem. Further, by use of
138 the LSN we can tell if the corrupted metadata all belonged to the same log
140 the first and last instance of corrupt metadata on disk and, further, how much
141 modification occurred between the corruption being written and when it was
152 The verification is completely stateless - it is done independently of the
153 modification process, and seeks only to check that the metadata is what it says
154 it is and that the metadata fields are within bounds and internally consistent.
156 as there may be certain limitations that operational state enforces on the
158 sibling pointer lists). Hence we still need stateful checking in the main code
159 body, but in general most of the per-field validation is handled by the
162 For read verification, the caller needs to specify the expected type of metadata
163 that it should see, and the IO completion process verifies that the metadata
164 object matches what was expected. If the verification process fails, then it
165 marks the object being read as EFSCORRUPTED. The caller needs to catch this
167 verification error it can do so by catching the EFSCORRUPTED error value. If we
171 The first step in read verification is checking the magic number and determining
172 whether CRC validating is necessary. If it is, the CRC32c is calculated and
173 compared against the value stored in the object itself. Once this is validated,
174 further checks are made against the location information, followed by extensive
175 object specific metadata validation. If any of these checks fail, then the
176 buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.
178 Write verification is the opposite of the read verification - first the object
179 is extensively verified and if it is OK we then update the LSN from the last
180 modification made to the object. After this, we calculate the CRC and insert it
181 into the object. Once this is done the write IO is allowed to continue. If any
182 error occurs during this process, the buffer is again marked with an EFSCORRUPTED
183 error for the higher layers to catch.
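The read and write verification order described above can be sketched as
follows. This is a minimal illustration only, not the XFS implementation:
every demo_ and DEMO_ name, the header layout, the magic values and the error
value are hypothetical, fields are kept in host byte order for brevity, and
the CRC32c is a plain bitwise version rather than whatever accelerated routine
a real kernel would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Everything named demo_ or DEMO_ is hypothetical, not an XFS definition. */
#define DEMO_MAGIC_V4     0x44454d34u  /* old format: magic number only */
#define DEMO_MAGIC_V5     0x44454d35u  /* new format: self describing */
#define DEMO_EFSCORRUPTED (-1)         /* stand-in for the EFSCORRUPTED error */

struct demo_hdr {                      /* assumed layout, host byte order */
	uint32_t magic;
	uint32_t crc;
	uint64_t blkno;                /* where the block should live */
	uint64_t lsn;                  /* last modification stamp */
	uint8_t  uuid[16];             /* which filesystem it belongs to */
	uint64_t owner;
};
static_assert(sizeof(struct demo_hdr) == 48, "sketch assumes no padding");

/* Bitwise CRC32c (reflected polynomial 0x82F63B78), no pre/post inversion. */
static uint32_t demo_crc32c_update(uint32_t crc, const uint8_t *p, size_t len)
{
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* CRC of the whole block, with the crc field itself treated as zero. */
static uint32_t demo_block_crc(const uint8_t *blk, size_t len)
{
	static const uint8_t zero[sizeof(uint32_t)];
	size_t off = offsetof(struct demo_hdr, crc);
	uint32_t crc = ~0u;

	crc = demo_crc32c_update(crc, blk, off);
	crc = demo_crc32c_update(crc, zero, sizeof(zero));
	crc = demo_crc32c_update(crc, blk + off + sizeof(zero),
				 len - off - sizeof(zero));
	return ~crc;
}

/* Read side: magic, then CRC, then location, then object-specific checks. */
int demo_read_verify(const uint8_t *blk, size_t len,
		     uint64_t expect_blkno, const uint8_t fs_uuid[16])
{
	struct demo_hdr h;

	if (len < sizeof(h))
		return DEMO_EFSCORRUPTED;
	memcpy(&h, blk, sizeof(h));
	if (h.magic == DEMO_MAGIC_V4)
		return 0;              /* old format: no CRC to validate */
	if (h.magic != DEMO_MAGIC_V5)
		return DEMO_EFSCORRUPTED;
	if (h.crc != demo_block_crc(blk, len))
		return DEMO_EFSCORRUPTED;
	if (h.blkno != expect_blkno || memcmp(h.uuid, fs_uuid, 16) != 0)
		return DEMO_EFSCORRUPTED;
	/* Object-specific field validation would follow here. */
	return 0;
}

/* Write side: verify first, then stamp the LSN, then insert the CRC. */
int demo_write_verify(uint8_t *blk, size_t len, uint64_t last_lsn)
{
	struct demo_hdr h;

	if (len < sizeof(h))
		return DEMO_EFSCORRUPTED;
	memcpy(&h, blk, sizeof(h));
	if (h.magic == DEMO_MAGIC_V4)
		return 0;              /* old format: nothing to stamp */
	if (h.magic != DEMO_MAGIC_V5)
		return DEMO_EFSCORRUPTED;
	h.lsn = last_lsn;
	memcpy(blk, &h, sizeof(h));
	h.crc = demo_block_crc(blk, len);
	memcpy(blk, &h, sizeof(h));
	return 0;
}
```

In this sketch a block that passes demo_write_verify() will pass
demo_read_verify() only when it is read back from the expected block number of
the filesystem with the matching UUID; a single flipped bit anywhere in the
block, or a read from a different location, yields the corruption error.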
188 A typical on-disk structure needs to contain the following information:
199 Depending on the metadata, this information may be part of a header structure
200 separate to the metadata contents, or may be distributed through an existing
202 information, such as the superblock and AG headers.
204 Other metadata may have different formats for the information, but the same
208 	  number for location. The two of these combined provide the same
212 	- directory/attribute node blocks have a 16 bit magic number, and the
213 	  header that contains the magic number has other information in it as
214 	  well. Hence the additional metadata headers change the overall format
215 	  of the metadata.
236 The code ensures that the CRC is only checked if the filesystem has CRCs enabled
237 by checking the superblock for the feature bit, and then if the CRC verifies OK
238 (or is not needed) it verifies the actual contents of the block.
241 whether the magic number can be used to determine the format of the block. In
242 the case it can't, the code is structured as follows:
268 If there are different magic numbers for the different formats, the verifier
293 Write verifiers are very similar to the read verifiers; they just do things in
294 the opposite order to the read verifiers. A typical write verifier:
320 This will verify the internal structure of the metadata before we go any
321 further, detecting corruptions that have occurred as the metadata has been
322 modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
323 update the LSN field (when it was last modified) and calculate the CRC on the
324 metadata. Once this is done, we can issue the IO.
331 buffer. Hence we do not use per-buffer verifiers to do the work of per-object
333 identification of the buffer - that they contain inodes or dquots, and that
334 there are magic numbers in all the expected spots. All further CRC and
335 verification checks are done when each inode is read from or written back to the
338 The structure of the verifiers and the identifier checks is very similar to the
340 example, inode read verification is done in xfs_iread() when the inode is first
341 read out of the buffer and the struct xfs_inode is instantiated. The inode is
342 already extensively verified during writeback in xfs_iflush_int, so the only
343 addition here is to add the LSN and CRC to the inode as it is copied back into
344 the buffer.
346 XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
347 the unlinked list modifications check or update CRCs, neither during unlink nor