
These are some notes describing some aspects of the 2.5 block layer in the
context of the bio rewrite. The idea is to bring out some of the key
changes and a glimpse of the rationale behind those changes.
Many aspects of the generic block layer redesign were driven by and evolved
over discussions, prior patches and the collective experience of several
people. See sections 8 and 9 for a list of some related references.
Description of Contents:
1. Scope for tuning of logic to various needs
(instead of using buffer heads at the i/o layer)
5.1 Granular locking: Removal of io_request_lock
7. A few tips on migration of older drivers
8. A list of prior/related/impacted patches/ideas
Let us discuss the changes in the context of how some overall goals for the
depending on the nature of the device and the requirements of the caller.
One of the objectives of the rewrite was to increase the degree of tunability
important especially in the light of ever improving hardware capabilities
and application/middleware software designed to take advantage of these
optimizations, high memory DMA support, etc. may find some of the
Knowledge of some of the capabilities or parameters of the device should be
behalf of the driver.
a per-queue level (e.g. maximum request size, maximum number of segments in
major/minor are now directly associated with the queue. Some of these may
Sets two variables that limit the size of the request.
units of 512 byte sectors, and could be dynamically varied
in units of 512 byte sectors.
255. The upper limit of max_sectors is 1024.
Maximum size of a clustered segment, 64kB default.
(blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out
where a device is capable of handling high memory i/o.
In order to enable high-memory i/o where the device is capable of supporting
a virtual address mapping (unlike the earlier scheme of virtual address
on PCI high mem DMA aspects and mapping of scatter gather lists, and support
the type of the operation. For example, in case of a read operation, the
as it may not be in irq context. Special care is also required (by way of
which case a virtual mapping of the page is required. For SCSI it is also
certain portions of it. The 2.5 rewrite provides improved modularization
of the i/o scheduler. There are more pluggable callbacks, e.g. for init,
i/o scheduling algorithm aspects and details outside of the generic loop.
It also makes it possible to completely hide the implementation details of
I/O scheduler wrappers are to be used instead of accessing the queue directly.
This comes from some of the high-performance database/middleware
decisions based on an understanding of the access patterns and i/o
What kind of support exists at the generic block layer for this?
control (high/med/low) over the priority of an i/o request vs. other pending
tunability. Time based aging avoids starvation of lower priority
to the device bypassing some of the intermediate i/o layers.
capabilities for certain kinds of fitness tests. Having direct interfaces at
it possible to perform bottom up validation of the i/o path, layer by
specify the virtual address of the buffer, if the driver expects buffer
(See 2.3 or Documentation/block/request.txt for a brief explanation of
completion of partial transfers. The driver has to modify these fields
command pre-building, and the type of the request is now indicated
through rq->flags instead of via rq->cmd)
The request structure flags can be set up to indicate the type of request
Prior to 2.5, buffer heads were used as the unit of i/o at the generic block
layer, and the low level request structure was associated with a chain of
when the underlying device was capable of handling the i/o in one shot.
from the buffer cache unnecessarily added to the weight of the descriptors
The following were some of the goals and expectations considered in the
redesign of the block i/o data structure in 2.5.
iv. At the same time, ability to retain independent identity of i/os from
without unnecessarily breaking it up, if the underlying device is capable of
passed around different types of subsystems or layers, maybe even
networking, without duplication or extra copies of data/descriptor fields
vii. Ability to handle the possibility of splits/merges as the structure passes
instead of using the buffer head structure (bh) directly, the idea being
avoidance of some associated baggage and limitations. The bio structure
is uniformly used for all i/o at the block layer; it forms a part of the
bh structure for buffered i/o, and in the case of raw/direct i/o kiobufs are
The bio structure uses a vector representation pointing to an array of tuples
of <page, offset, len> to describe the i/o buffer, and has various other
* main unit of I/O for the block layer and lower layers (i.e. drivers)
of an array of <page, offset, len> fragments (similar to the way fragments
- Splitting of an i/o request across multiple devices (as in the case of
- A linked list of bios is used as before for unrelated merges (*) - this
- Code that traverses the req list can find all the segments of a bio
has multiple bios, each of which can have multiple segments.
field to keep track of the next bio_vec entry to process.
bi_end_io() i/o callback gets called on i/o completion of the entire bio.
The scatter gather list is in the form of an array of <page, offset, len>
covers the range of pages (up to 16 contiguous pages could be covered this
Note: Right now the only user of bios with more than one page is ll_rw_kio,
right now). The intent however is to enable clustering of pages etc to
The same is true of Andrew Morton's work-in-progress multipage bio writeout
use of block layer helper routine elv_next_request to pull the next request
to in some of the discussion here) are listed below, not necessarily in
sector_t sector; /* this field is now of type sector_t instead of int
/* Number of scatter-gather DMA addr+len pairs after
/* Number of scatter-gather addr+len pairs after
* This is the number of scatter-gather entries the driver
unsigned long nr_sectors; /* no. of sectors left: driver modifiable */
unsigned long hard_nr_sectors; /* block internal copy of above */
unsigned int current_nr_sectors; /* no. of sectors left in the
unsigned long hard_cur_sectors; /* block internal copy of the above */
struct bio *bio, *biotail; /* bio list instead of bh */
See the rq_flag_bits definitions for an explanation of the various flags
The behaviour of the various sector counts is almost the same as before,
to the numbers of sectors in the current segment being processed which could
be one of the many segments in the current bio (i.e. i/o completion unit).
The nr_sectors value refers to the total number of sectors in the whole
request that remain to be transferred (no change). The purpose of the
end_that_request_first, i.e. every time the driver completes a part of the
hard_xxx values and the number of bytes transferred) and updates it on
The buffer field is just a virtual address mapping of the current segment
of the i/o buffer in cases where the buffer resides in low-memory. For high
freeing of bios (bio_alloc, bio_get, bio_put).
This makes use of Ingo Molnar's mempool implementation, which enables
subsystem makes use of the block layer to writeout dirty pages in order to be
case of bio, these routines make use of the standard slab allocator.
The caller of bio_alloc is expected to take certain steps to avoid
amount of time (in the case of bio, that would be after the i/o is completed).
This ensures that if part of the pool has been used up, some work (in this
or hierarchy of allocation needs to be consistent, just the way one deals
so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the
The helper routine provides a level of abstraction which makes it easier
to modify the internals of request to scatterlist conversion down the line
without breaking drivers. The blk_rq_map_sg routine takes care of several
- Avoids building segments that would exceed the number of physical
blk_queue_max_hw_segments(): Sets an upper limit of the maximum number of
hw data segments in a request (i.e. the maximum number of address/length
of physical data segments in a request (i.e. the largest sized scatter list
completion (and setting things up so the rest of the i/o or the next
request can be kicked off) as before. With the introduction of multi-page
the number of sectors completed.
size of remaining data in the current segment (that is the maximum it can
end_request, or end_that_request_first/last to take care of all accounting
and transparent mapping of the next bio segment when a segment boundary
is crossed on completion of a transfer. (The end*request* functions should
depth of 'depth'.
of the same request members that are used for normal request queue management.
completion of the request to the block layer. This means ending tag
operations before calling end_that_request_last()! For an example of a user
of these helpers, see the IDE tagged command queueing support.
unsigned long *tag_map; /* bitmap of free tags */
struct list_head busy_list; /* fifo list of busy tags */
Most of the above is simple and straightforward; however, busy_list may need
a bit of explaining. Normally we don't care too much about request ordering,
but in the event of any barrier requests in the tag queue we need to ensure
routines make use of this:
perform the i/o on each of these.
preallocation of bios is done for kiobufs. [The intent is to remove the
of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec.
large bios for submission completely bypassing the usage of buffer
some of the address space ops interfaces to utilize this abstraction rather
than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf
of kiobufs, called a kvec_cb. This contains an array of <page, offset, len>
bios are handled today. The values of the tuples in such a vector passed in
of its request processing, since that would make it hard for the higher layer
results in some sort of conflict internally,
determine when actual execution of a request
All requests seen by I/O schedulers strictly follow one of the following three
iii. better utilization of h/w & CPU time
gives good scalability and good availability of information. Requests are
almost always dispatched in disk sort order, so a cache is kept of the next
AS and deadline use a hash table indexed by the last sector of a request. This
"Front merges", a new request being merged at the front of an existing request,
are far less common than "back merges" due to the nature of most I/O patterns.
iii. Plugging the queue to batch requests in anticipation of opportunities for
advantage of the sorting/merging logic in the elevator. If the
(sort of like plugging the bath tub of a vessel to get fluid to build up)
the queue gets explicitly unplugged as part of waiting for completion on that
buffer. For page driven IO, the address space ->sync_page() takes care of
This is kind of controversial territory, as it's not clear if plugging is
for an example of usage in an i/o scheduler.
The global io_request_lock has been removed as of 2.5, to avoid
request_fn execution, which means that lots of older drivers
In 2.5 some of the gendisk/partition related code has been reorganized.
sent are offset from the beginning of the device.
7. A Few Tips on Migration of older drivers
Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the
rq->sector = 123128; /* offset from start of disk */
As mentioned, there is no virtual mapping of a bio. For DMA, this is
8.3. SGI XFS - pagebuf patches - use of kiobufs
et al - Feb-March 2001 (many of the initial thoughts that led to bio were