4 Introduction to Re-logging in XFS
9 logged are made up of the changes to in-core structures rather than on-disk
11 logged. The reason for these differences is to reduce the amount of log space
18 modifications to a single object to be carried in the log at any given time.
19 This allows the log to avoid needing to flush each change to disk before
20 recording a new change to the object. XFS does this via a method called
22 new change to the object is recorded with a *new copy* of all the existing
23 changes in the new transaction that is written to the log.
25 That is, if we have a sequence of changes A through to F, and the object was
26 written to disk after change D, we would see in the log the following series
35 <object written to disk>
42 This relogging technique also allows objects to be moved forward in the log so
48 This relogging is also used to implement long-running, multiple-commit
59 Hence it can be seen that the relogging operation is fundamental to the correct
61 people should be able to see why the XFS metadata operations write so much to
62 the log - repeated operations to the same objects write the same changes to
63 the log over and over again. Worse is the fact that objects tend to get
68 asynchronous. That is, they don't commit to disk until either a log buffer is
70 forces the log buffers holding the transactions to disk. This means that XFS is
71 doing aggregation of transactions in memory - batching them, if you like - to
77 to 256kB by use of a mount option.
80 that can be made to the filesystem at any point in time - if all the log
82 the current batch completes. It is now common for a single current CPU core to
83 be able to issue enough transactions to keep the log buffers full and under
84 IO permanently. Hence the XFS journalling subsystem can be considered to be IO
90 The key thing to note about the asynchronous logging combined with the
92 multiple times before they are committed to disk in the log buffers. If we
93 return to the previous relogging example, it is entirely possible that
94 transactions A through D are committed to disk in the same log buffer.
97 but only one of those copies needs to be there - the last one "D", as it
102 buffers. It is clear that reducing the number of stale objects written to the
103 log would greatly reduce the amount of metadata we write to the log, and this
108 logical to physical formatting to do the relogging because there is no
109 infrastructure to keep track of logical changes in memory prior to physically
110 formatting the changes in a transaction to the log buffer. Hence we cannot avoid
113 Delayed logging is the name we've given to keeping and tracking transactional
114 changes to objects in memory outside the log buffer infrastructure. Because of
115 the relogging concept fundamental to the XFS journalling subsystem, this is
116 actually relatively easy to do - all the changes to logged items are already
117 tracked in the current infrastructure. The big problem is how to accumulate
118 them and get them to the log in a consistent, recoverable manner.
122 One of the key changes that delayed logging makes to the operation of the
126 written to the log at any point in time, there may be a much greater amount
135 need to ensure application level data integrity is maintained.
138 warrants rigorous proofs to determine whether it is correct or not. The method
139 of accumulating changes in memory for some period before writing them to the
141 no time is spent in this document trying to convince the reader that the
147 1. Reduce the amount of metadata written to the log by at least
149 2. Supply sufficient statistics to validate Requirement #1.
150 3. Supply sufficient new tracing infrastructure to be able to debug
162 existing log item dirty region tracking) is that when it comes to writing the
163 changes to the log buffers, we need to ensure that the object we are formatting
164 is not changing while we do this. This requires locking the object to prevent
165 concurrent modification. Hence flushing the logical changes to the log would
166 require us to lock every object, format them, and then unlock them again.
170 the delayed logging tracking lock to commit the transaction. However, the
172 trying to get the lock on object A to flush it to the log buffer. This appears
173 to be an unsolvable deadlock condition, and it was solving this problem that
174 was the barrier to implementing delayed logging for so long.
176 The solution is relatively simple - it just took a long time to recognise it.
177 Put simply, the current logging code formats the changes to each item into a
178 vector array that points to the changed regions in the item. The log write code
179 simply copies the memory these vectors point to into the log buffer during
182 allocated memory buffer big enough to fit the formatted vector.
184 If we then copy the vector into the memory buffer and rewrite the vector to
185 point to the memory buffer rather than the object itself, we now have a copy of
187 that does not require us to lock the item to access. This formatting and
190 without needing to lock the owning item.
192 Hence we avoid the need to lock items when we need to flush outstanding
193 asynchronous transactions to the log. The differences between the existing
222 The memory buffer and associated vector need to be passed as a single object,
223 but still need to be associated with the parent object so if the object is
228 buffer is to support splitting vectors across log buffer boundaries correctly.
232 change and as such is not desirable. It also means we'd have to write the log
234 region state that needs to be placed into the headers during the log write.
236 Hence we need to keep the vector, but by attaching the memory buffer to it and
237 rewriting the vector addresses to point at the memory buffer we end up with a
238 self-describing object that can be passed to the log buffer write code to be
240 Hence we avoid needing a new on-disk format to handle items that have been
247 them to be used without limitations, we need to be able to track and accumulate
248 them so that they can be written to the log at some later point in time. The
249 log item is the natural place to store this vector and buffer, and it also makes
250 sense to be the object used to track committed objects as it will always
253 The log item is already used to track the log items that have been written to
254 the log but not yet written to disk. Such log items are considered "active"
257 completion, after which they are unpinned and can be written to disk. An object
258 that is in the AIL can be relogged, which causes the object to be pinned again
263 and relogged, so any tracking must be separate to the AIL infrastructure. As
269 Similar to the AIL, tracking of committed items is done through a new list
271 committed and have formatted memory buffers attached to them. It tracks objects
274 and done to make it easy for debugging - the last items in the list are the
284 We need to write these items in the order that they exist in the CIL, and they
285 need to be written as an atomic transaction. The need for all the objects to be
292 To fulfill this requirement, we need to write the entire CIL in a single log
296 reason for this limit is that to find the head and tail of the log, there must
302 size of a checkpoint to be slightly less than half the log.
305 to any other transaction - it contains a transaction header, a series of
309 might need to tune the recovery transaction object hash size.
311 Because the checkpoint is just another transaction and all the changes to log
313 code to write the changes into the log. To do this efficiently, we need to
315 transaction. The current log write code enables us to do this easily with the
317 the transaction commit record, but tracking this requires us to have a
318 per-checkpoint context that travels through the log write process through to
322 checkpoint from initiation to checkpoint completion. A new context is initiated
326 context and attach that to the CIL for aggregation of new transactions.
328 This allows us to unlock the CIL immediately after transfer of all the
329 committed items and effectively allow new transactions to be issued while we
331 checkpoints to be written into the log buffers in the case of log force heavy
339 to store the list of log vectors that need to be written into the transaction.
340 Hence log vectors need to be able to be chained together to allow them to be
342 buffer and log vector attached to each log item needs to be attached to the
390 start, while the checkpoint flush code works over the log vector chain to
394 attached to the log buffer that the commit record was written to along with a
400 Discussion Point: I am uncertain as to whether the log item is the most
401 efficient way to track vectors, even though it seems like the natural way to do
402 it. The fact that we walk the log items (in the CIL) just to chain the log
405 the log vector chaining. If we track by the log vectors, then we only need to
417 This allows transactions to be issued asynchronously even though there may be
419 committed to the log. In the rare case that a dependent operation occurs (e.g.
421 force can be issued to force the dependent transaction to disk immediately.
423 To do this, transactions need to record the LSN of the commit record of the
431 contexts, and as such it is simple to assign a sequence number to each
433 atomically, it is simple to ensure that each new context has a monotonically
434 increasing sequence number assigned to it without the need for an external
436 one to it for the new context.
438 Then, instead of assigning a log buffer LSN to the transaction commit LSN
441 checkpoint sequence needs to be committed before they can continue. As a
442 result, the code that forces the log to a specific LSN now needs to ensure that
443 the log forces to a specific checkpoint.
445 To ensure that we can do this, we need to track all the checkpoint contexts
446 that are currently committing to the log. When we flush a checkpoint, the
447 context gets added to a "committing" list which can be searched. When a
451 using the existing log force mechanisms to execute synchronous forces.
453 It should be noted that the synchronous forces may need to be extended with
454 mitigation algorithms similar to the current log buffer code to allow
459 The main concern with log forces is to ensure that all the previous checkpoints
460 are also committed to disk before the one we need to wait for. Therefore we
461 need to check that all the prior contexts in the committing list are also
462 complete before waiting on the one we need to complete. We do this
463 synchronisation in the log force code so that we don't need to wait anywhere
466 The only remaining complexity is that a log force now also has to handle the
468 is, we need to flush the CIL and potentially wait for it to complete. This is a
469 simple addition to the existing log forcing code to check the sequence numbers
472 transactions to remain untouched (i.e. commit an asynchronous transaction, then
479 transaction. We don't know how big a checkpoint transaction is going to be
480 ahead of time, nor how many log buffers it will take to write out, nor the
481 number of split log vector regions that are going to be used. We can track the
482 amount of log space required as we add items to the commit item list, but we
483 still need to reserve the space in the log for the checkpoint.
497 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
498 vector is 12 bytes, so the total to be logged is approximately 1.75MB. In
503 not particularly flexible and it is difficult to select the "optimal value" for
506 Further, if we are going to use a static reservation, which bit of the entire
510 relogged. This allows for a checkpoint reservation to only have to account for
515 1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
516 large enough to handle arbitrary sized checkpoint transactions. This
517 reservation needs to be made before the checkpoint is started, and we need to
518 be able to reserve the space without sleeping. For an 8MB checkpoint, we need a
521 A static reservation needs to manipulate the log grant counters - we can take a
522 permanent reservation on the space, but we still need to make sure we refresh
523 the write reservation (the actual space available to the transaction) after
527 The problem with this is that it can lead to deadlocks as we may need to commit
528 checkpoints to be able to free up log space (refer back to the description of
530 space available in the log if we are to use static reservations, and that is
531 very difficult and complex to arrange. It is possible to do, but there is a
535 items in the CIL and using this to dynamically calculate the amount of log
542 will always be less than or equal to the maximal amount in the reservation.
545 are added to the CIL and avoid the need for reserving and regranting log space
549 As mentioned earlier, transactions can't grow to more than half the size of the
550 log. Hence as part of the reservation growing, we need to also check the size
552 the maximum threshold, we need to push the CIL to the log. This is effectively
553 a "background flush" and is done on demand. This is identical to
555 checkpoint commit to complete. This background push is checked and executed by
560 force will push the CIL to disk, and if the transaction subsystem stays idle,
561 allow the idle log to be covered (effectively marked clean) in exactly the same
563 whether this log force needs to be done more frequently than the current rate
572 that items get pinned once for every transaction that is committed to the log
580 For delayed logging, however, we have an asymmetric transaction commit to
583 That is, we now have a many-to-one relationship between transaction commit and
588 To keep pin/unpin symmetry, the algorithm needs to change to a "pin on
590 pinning and unpinning becomes symmetric around a checkpoint context. We have to
595 value according to its context.
597 Just to make matters slightly more complex, this checkpoint level context
605 lock to guarantee that we pin the items correctly.
610 commits must scale to many concurrent commits. The current transaction commit
615 As a result, the delayed logging transaction commit code needs to be designed
620 2. Adding items to the CIL and updating item space accounting
624 that we have a many-to-one interaction here. That is, the only restriction on
625 the number of concurrent transactions that can be trying to commit at once is
630 The amount of time a transaction commit needs to hold out a flush is a
631 relatively long period of time - the pinning of log items needs to be done
635 separately to the pinning of objects could be used to reduce the hold time of
639 really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
648 compared to transaction commit for asynchronous transaction workloads - only
650 transaction commit concurrency due to cache line bouncing of the lock on the
655 concurrently, the CIL needs to be protected separately from the above
656 commit/flush exclusion. It also needs to be an exclusive lock but it is only
666 checkpoints and needs to block waiting for checkpoints to complete their commit
668 sequencing also requires the same lock, list walk, and blocking mechanism to
673 sequencing needs to wait until checkpoint contexts contain a commit LSN
675 sequencing needs to wait until previous checkpoint contexts are removed from
677 broadcast wakeups (thundering herds) has been used to implement these two
681 given separate wait lists to reduce lock contention and the number of processes
692 4. Join item to transaction
695 Attach log item to owner item
696 Attach log item to transaction
704 Attach transaction to log buffer
717 Flush item to disk
737 4. Join item to transaction
740 Attach log item to owner item
741 Attach log item to transaction
747 Attach log vector and buffer to log item
761 attach checkpoint context to log buffer
774 Flush item to disk
783 committing of the log items to the log itself and the completion processing.
788 and the design of the internal structures to avoid on disk format changes, we
791 be able to swap methods automatically and transparently depending on load