xfs-delayed-logging-design.txt - OpenGrok cross reference for /linux-4.4.14/Documentation/filesystems/xfs-delayed-logging-design.txt

Lines Matching refs:log
11 logged. The reason for these differences is to reduce the amount of log space
18 modifications to a single object to be carried in the log at any given time.
19 This allows the log to avoid needing to flush each change to disk before
23 changes in the new transaction that is written to the log.
26 written to disk after change D, we would see in the log the following series
27 of transactions, their contents and the log sequence number (LSN) of the
40 the aggregation of all the previous changes currently held only in the log.
42 This relogging technique also allows objects to be moved forward in the log so
43 that an object being relogged does not prevent the tail of the log from ever
46 direct encoding of the location in the log of the transaction.
50 a special log reservation known as a permanent transaction reservation. A
55 removal operation. This keeps them moving forward in the log as the operation
57 log wraps around.
62 the log - repeated operations to the same objects write the same changes to
63 the log over and over again. Worse is the fact that objects tend to get
65 metadata into the log.
68 asynchronous. That is, they don't commit to disk until either a log buffer is
69 filled (a log buffer can hold multiple transactions) or a synchronous operation
70 forces the log buffers holding the transactions to disk. This means that XFS is
72 minimise the impact of the log IO on transaction throughput.
75 log buffers made available by the log manager. By default there are 8 log
80 that can be made to the filesystem at any point in time - if all the log
83 be able to issue enough transactions to keep the log buffers full and under
92 multiple times before they are committed to disk in the log buffers. If we
94 transactions A through D are committed to disk in the same log buffer.
96 That is, a single log buffer may contain multiple copies of the same object,
99 necessary copy in the log buffer, and three stale copies that are simply
101 objects, these "stale objects" can be over 90% of the space used in the log
103 log would greatly reduce the amount of metadata we write to the log, and this
107 memory == log buffer), only it is doing it extremely inefficiently. It is using
110 formatting the changes in a transaction to the log buffer. Hence we cannot avoid
111 accumulating stale objects in the log buffers.
114 changes to objects in memory outside the log buffer infrastructure. Because of
118 them and get them to the log in a consistent, recoverable manner.
124 metadata changes from the size and number of log buffers available. In other
126 written to the log at any point in time, there may be a much greater amount
130 It should be noted that this does not change the guarantee that log recovery
140 log is used effectively in many filesystems including ext3 and ext4. Hence
147 	1. Reduce the amount of metadata written to the log by at least
152 	4. No on-disk format change (metadata or log format).
162 existing log item dirty region tracking) is that when it comes to writing the
163 changes to the log buffers, we need to ensure that the object we are formatting
165 concurrent modification. Hence flushing the logical changes to the log would
172 trying to get the lock on object A to flush it to the log buffer. This appears
178 vector array that points to the changed regions in the item. The log write code
179 simply copies the memory these vectors point to into the log buffer during
181 using the log buffer as the destination of the formatting code, we can use an
186 the changes in a format that is compatible with the log buffer writing code.
193 asynchronous transactions to the log. The differences between the existing
197 Current format log vector:
228 buffer is to support splitting vectors across log buffer boundaries correctly.
230 are in the item, so we'd need a new encapsulation method for regions in the log
232 change and as such is not desirable.  It also means we'd have to write the log
234 region state that needs to be placed into the headers during the log write.
238 self-describing object that can be passed to the log buffer write code to be
239 handled in exactly the same manner as the existing log vectors are handled.
248 them so that they can be written to the log at some later point in time.  The
249 log item is the natural place to store this vector and buffer, and also makes sense
253 The log item is already used to track the log items that have been written to
254 the log but not yet written to disk. Such log items are considered "active"
256 double linked list. Items are inserted into this list during log buffer IO
259 and then moved forward in the AIL when the log buffer IO completes for that
266 committed item tracking needs its own locks, lists and state fields in the log
270 called the Committed Item List (CIL).  The list tracks log items that have been
282 When we have a log synchronisation event, commonly known as a "log force",
283 all the items in the CIL must be written into the log via the log buffers.
287 log replay - all the changes in all the objects in a given transaction must
288 either be completely replayed during log recovery, or not replayed at all. If
289 a transaction is not replayed because it is not complete in the log, then
292 To fulfill this requirement, we need to write the entire CIL in a single log
293 transaction. Fortunately, the XFS log code has no fixed limit on the size of a
294 transaction, nor does the log replay code. The only fundamental limit is that
295 the transaction cannot be larger than just under half the size of the log.  The
296 reason for this limit is that to find the head and tail of the log, there must
297 be at least one complete transaction in the log at any given time. If a
298 transaction is larger than half the log, then there is the possibility that a
300 only complete previous transaction in the log. This will result in a recovery
302 size of a checkpoint to be slightly less than half the log.
306 formatted log items and a commit record at the tail. From a recovery
311 Because the checkpoint is just another transaction and all the changes to log
312 items are stored as log vectors, we can use the existing log buffer writing
313 code to write the changes into the log. To do this efficiently, we need to
315 transaction. The current log write code enables us to do this easily with the
316 way it separates the writing of the transaction contents (the log vectors) from
318 per-checkpoint context that travels through the log write process through to
330 are formatting the checkpoint into the log. It also allows concurrent
331 checkpoints to be written into the log buffers in the case of log force heavy
333 requires that we strictly order the commit records in the log so that
334 checkpoint sequence order is maintained during log replay.
337 the same time another transaction modifies the item and inserts the log item
338 into the new CIL, then checkpoint transaction commit code cannot use log items
339 to store the list of log vectors that need to be written into the transaction.
340 Hence log vectors need to be able to be chained together to allow them to be
341 detached from the log items. That is, when the CIL is flushed the memory
342 buffer and log vector attached to each log item needs to be attached to the
343 checkpoint context so that the log item can be released. In diagrammatic form,
349 	Log Item <-> log vector 1	-> memory buffer
352 	Log Item <-> log vector 2	-> memory buffer
358 	Log Item <-> log vector N-1	-> memory buffer
361 	Log Item <-> log vector N	-> memory buffer
364 And after the flush the CIL head is empty, and the checkpoint context log
370 	log vector 1	-> memory buffer
374 	log vector 2	-> memory buffer
381 	log vector N-1	-> memory buffer
385 	log vector N	-> memory buffer
390 start, while the checkpoint flush code works over the log vector chain to
393 Once the checkpoint is written into the log buffers, the checkpoint context is
394 attached to the log buffer that the commit record was written to along with a
396 run transaction committed processing for the log items (i.e. insert into AIL
397 and unpin) in the log vector chain and then free the log vector chain and
400 Discussion Point: I am uncertain as to whether the log item is the most
402 it. The fact that we walk the log items (in the CIL) just to chain the log
403 vectors and break the link between the log item and the log vector means that
404 we take a cache line hit for the log item list modification, then another for
405 the log vector chaining. If we track by the log vectors, then we only need to
406 break the link between the log item and the log vector, which means we should
407 dirty only the log item cachelines. Normally I wouldn't be concerned about one
408 vs two dirty cachelines except for the fact I've seen upwards of 80,000 log
416 committed transactions with the log sequence number of the transaction commit.
419 committed to the log. In the rare case that a dependent operation occurs (e.g.
420 re-using a freed metadata extent for a data extent), a special, optimised log
424 transaction. This LSN comes directly from the log buffer the transaction is
427 written directly into the log buffers. Hence some other method of sequencing
438 Then, instead of assigning a log buffer LSN to the transaction commit LSN
442 result, the code that forces the log to a specific LSN now needs to ensure that
443 the log forces to a specific checkpoint.
446 that are currently committing to the log. When we flush a checkpoint, the
450 we can also wait on the log buffer that contains the commit record, thereby
451 using the existing log force mechanisms to execute synchronous forces.
454 mitigation algorithms similar to the current log buffer code to allow
459 The main concern with log forces is to ensure that all the previous checkpoints
463 synchronisation in the log force code so that we don't need to wait anywhere
464 else for such serialisation - it only matters when we do a log force.
466 The only remaining complexity is that a log force now also has to handle the
469 simple addition to the existing log forcing code to check the sequence numbers
471 the log force code enables the current mechanism for issuing synchronous
473 force the log at the LSN of that transaction) and so the higher level code
478 The big issue for a checkpoint transaction is the log space reservation for the
480 ahead of time, nor how many log buffers it will take to write out, nor the
481 number of split log vector regions that are going to be used. We can track the
482 amount of log space required as we add items to the commit item list, but we
483 still need to reserve the space in the log for the checkpoint.
485 A typical transaction reserves enough space in the log for the worst case space
486 usage of the transaction. The reservation accounts for log record headers,
491 of log vectors in the transaction).
495 there are lots of transactions that only contain an inode core and an inode log
502 space.  From this, it should be obvious that a static log space reservation is
511 log buffer metadata used such as log header records.
513 However, even using a static reservation for just the log metadata is
514 problematic. Typically log record headers use at least 16KB of log space per
515 1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
521 A static reservation needs to manipulate the log grant counters - we can take a
528 checkpoints to be able to free up log space (refer back to the description of
530 space available in the log if we are to use static reservations, and that is
534 The simpler way of doing this is tracking the entire log space used by the
535 items in the CIL and using this to dynamically calculate the amount of log
536 space required by the log metadata. If this log metadata space changes as a
541 maximal amount of log metadata space they require, and such a delta reservation
545 are added to the CIL and avoid the need for reserving and regranting log space
550 log. Hence as part of the reservation growing, we need to also check the size
552 the maximum threshold, we need to push the CIL to the log. This is effectively
554 a CIL push triggered by a log force, only that there is no waiting for the
559 they will be flushed by the periodic log force issued by the xfssyncd. This log
561 allow the idle log to be covered (effectively marked clean) in exactly the same
563 whether this log force needs to be done more frequently than the current rate
569 Currently log items are pinned during transaction commit while the items are
572 that items get pinned once for every transaction that is committed to the log
573 buffers. Hence items that are relogged in the log buffers will have a pin count
577 pending transactions. Thus the pinning and unpinning of a log item is symmetric
578 as there is a 1:1 relationship with transaction commit and log item completion.
584 log item completion. The result of this is that pinning and unpinning of the
585 log items becomes unbalanced if we retain the "pin on transaction commit, unpin
626 the amount of space available in the log for their reservations. The practical
628 128MB log, which means that it is generally one per CPU in a machine.
631 relatively long period of time - the pinning of log items needs to be done
641 flushing the CIL could involve walking a list of tens of thousands of log
662 that is run as part of the checkpoint commit and log force sequencing. The code
663 path that triggers a CIL flush (i.e. whatever triggers the log force) will enter
664 an ordering loop after writing all the log vectors into the log buffers but
674 (obtained through completion of a commit record write) while log force
687 The existing log item life cycle is as follows:
694 			Allocate log item
695 			Attach log item to owner item
696 		Attach log item to transaction
698 		Record modifications in log item
701 		Format item into log buffer
704 		Attach transaction to log buffer
706 	<log buffer IO dispatched>
707 	<log buffer IO completes>
710 		Mark log item committed
711 		Insert log item into AIL
712 			Write commit LSN into log item
713 		Unpin log item
716 		Mark log item clean
722 		Moves log tail
728 at the same time. If the log item is in the AIL or between steps 6 and 7
739 			Allocate log item
740 			Attach log item to owner item
741 		Attach log item to transaction
743 		Record modifications in log item
746 		Format item into log vector + buffer
747 		Attach log vector and buffer to log item
748 		Insert log item into CIL
752 	<next log force>
756 		Chain log vectors and buffers together
759 		Write log vectors into log
761 		Attach checkpoint context to log buffer
763 	<log buffer IO dispatched>
764 	<log buffer IO completes>
767 		Mark log item committed
769 			Write commit LSN into log item
770 		Unpin log item
773 		Mark log item clean
777 		Moves log tail
783 committing of the log items to the log itself and the completion processing.
784 Hence delayed logging should not introduce any constraints on log item
790 mount option. Fundamentally, there is no reason why the log manager would not