4. The I/O scheduler

I/O scheduler, a.k.a. elevator, is implemented in two layers: a generic
dispatch queue and specific I/O schedulers. Unless stated otherwise, elevator
is used to refer to both parts and I/O scheduler to specific I/O schedulers.

The block layer implements the generic dispatch queue in ll_rw_blk.c and
elevator.c. The generic dispatch queue is responsible for properly ordering
barrier requests, requeueing, handling non-fs requests and all other
subtleties.

Specific I/O schedulers are responsible for ordering normal filesystem
requests. They can also choose to delay certain requests to improve
throughput or for whatever other purpose. As the plural form indicates, there
are multiple I/O schedulers. They can be built as modules, but at least one
should be built into the kernel. Each queue can choose a different one and
can also change to another one dynamically.

A block layer call to the i/o scheduler follows the convention elv_xxx().
This calls elevator_xxx_fn in the elevator switch (drivers/block/elevator.c).
Oh, xxx and xxx might not match exactly, but use your imagination. If an
elevator doesn't implement a function, the switch does nothing or some
minimal housekeeping work.

4.1. I/O scheduler API

The functions an elevator may implement are: (* are mandatory)

elevator_merge_fn		called to query requests for merge with a bio

elevator_merge_req_fn		called when two requests get merged. the one
				which gets merged into the other one will be
				never seen by I/O scheduler again. IOW, after
				being merged, the request is gone.

elevator_merged_fn		called when a request in the scheduler has
				been involved in a merge. It is used in the
				deadline scheduler, for example, to reposition
				the request if its sorting order has changed.

elevator_allow_merge_fn		called whenever the block layer determines
				that a bio can be merged into an existing
				request safely. The io scheduler may still
				want to stop a merge at this point if it
				results in some sort of conflict internally;
				this hook allows it to do that.

elevator_dispatch_fn		fills the dispatch queue with ready requests.
				I/O schedulers are free to postpone requests
				by not filling the dispatch queue unless
				@force is non-zero. Once dispatched, I/O
				schedulers are not allowed to manipulate the
				requests - they belong to the generic
				dispatch queue.

elevator_add_req_fn		called to add a new request into the scheduler

elevator_queue_empty_fn		returns true if the merge queue is empty.
				Drivers shouldn't use this, but rather check
				if elv_next_request is NULL (without losing
				the request if one exists!)

elevator_former_req_fn
elevator_latter_req_fn		These return the request before or after the
				one specified in disk sort order. Used by the
				block layer to find merge possibilities.

elevator_completed_req_fn	called when a request is completed.

elevator_may_queue_fn		returns true if the scheduler wants to allow
				the current context to queue a new request
				even if it is over the queue limit. This must
				be used very carefully!!

elevator_set_req_fn
elevator_put_req_fn		Must be used to allocate and free any elevator
				specific storage for a request.

elevator_activate_req_fn	Called when device driver first sees a
				request. I/O schedulers can use this callback
				to determine when actual execution of a
				request starts.

elevator_deactivate_req_fn	Called when device driver decides to delay a
				request by requeueing it.

elevator_init_fn
elevator_exit_fn		Allocate and free any elevator specific
				storage for a queue.
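As a deliberately hedged illustration of how these hooks plug into the
elevator switch, the sketch below is modeled on the no-op scheduler
(block/noop-iosched.c): a plain FIFO with no merging or sorting. Everything
named "sketch" is hypothetical, and the exact hook signatures and the
elv_register() return type varied across 2.6 releases, so consult the headers
of the kernel you build against rather than treating this as definitive:

	#include <linux/blkdev.h>
	#include <linux/elevator.h>
	#include <linux/module.h>
	#include <linux/slab.h>

	struct sketch_data {
		struct list_head queue;	/* FIFO of pending requests */
	};

	/* add_req_fn: queue the request in arrival order */
	static void sketch_add_request(request_queue_t *q, struct request *rq)
	{
		struct sketch_data *sd = q->elevator->elevator_data;

		list_add_tail(&rq->queuelist, &sd->queue);
	}

	/* dispatch_fn: hand one request over; once dispatched it belongs
	 * to the generic dispatch queue and we must not touch it again */
	static int sketch_dispatch(request_queue_t *q, int force)
	{
		struct sketch_data *sd = q->elevator->elevator_data;
		struct request *rq;

		if (list_empty(&sd->queue))
			return 0;
		rq = list_entry(sd->queue.next, struct request, queuelist);
		list_del_init(&rq->queuelist);
		elv_dispatch_sort(q, rq);
		return 1;
	}

	/* queue_empty_fn: true when we hold no requests */
	static int sketch_queue_empty(request_queue_t *q)
	{
		struct sketch_data *sd = q->elevator->elevator_data;

		return list_empty(&sd->queue);
	}

	/* init_fn: allocate per-queue data.  In older 2.6 kernels this hook
	 * returned int and stored the pointer itself - check your tree. */
	static void *sketch_init_queue(request_queue_t *q, elevator_t *e)
	{
		struct sketch_data *sd;

		sd = kmalloc(sizeof(*sd), GFP_KERNEL);
		if (!sd)
			return NULL;
		INIT_LIST_HEAD(&sd->queue);
		return sd;
	}

	static void sketch_exit_queue(elevator_t *e)
	{
		kfree(e->elevator_data);
	}

	static struct elevator_type elevator_sketch = {
		.ops = {
			.elevator_add_req_fn     = sketch_add_request,
			.elevator_dispatch_fn    = sketch_dispatch,
			.elevator_queue_empty_fn = sketch_queue_empty,
			.elevator_init_fn        = sketch_init_queue,
			.elevator_exit_fn        = sketch_exit_queue,
		},
		.elevator_name  = "sketch",
		.elevator_owner = THIS_MODULE,
	};

	static int __init sketch_init(void)
	{
		return elv_register(&elevator_sketch);
	}
	module_init(sketch_init);
	MODULE_LICENSE("GPL");

Once such a module is loaded, a queue can be switched to it at runtime by
echoing the scheduler name into /sys/block/<dev>/queue/scheduler.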
4.2 Request flows seen by I/O schedulers

All requests seen by I/O schedulers strictly follow one of the following
three flows.

 set_req_fn ->

 i.   add_req_fn -> (merged_fn ->)* -> dispatch_fn -> activate_req_fn ->
      (deactivate_req_fn -> activate_req_fn ->)* -> completed_req_fn
 ii.  add_req_fn -> (merged_fn ->)* -> merge_req_fn
 iii. [none]

 -> put_req_fn

4.3 I/O scheduler implementation

The generic i/o scheduler algorithm attempts to sort/merge/batch requests for
optimal disk scan and request servicing performance (based on generic
principles and device capabilities), optimized for:
i.   improved throughput
ii.  improved latency
iii. better utilization of h/w & CPU time

Characteristics:

i. Binary tree
AS and deadline i/o schedulers use red black binary trees for disk position
sorting and searching, and a fifo linked list for time-based searching. This
gives good scalability and good availability of information. Requests are
almost always dispatched in disk sort order, so a cache is kept of the next
request in sort order to prevent binary tree lookups. (A short illustrative
sketch of sector-sorted insertion appears at the end of this section.)

This arrangement is not a generic block layer characteristic however, so
elevators may implement queues as they please.

ii. Merge hash
AS and deadline use a hash table indexed by the last sector of a request.
This enables the merging code to quickly look up "back merge" candidates,
even when multiple I/O streams are being performed at once on one disk.

"Front merges", a new request being merged at the front of an existing
request, are far less common than "back merges" due to the nature of most
I/O patterns. Front merges are handled by the binary trees in AS and
deadline schedulers.

iii. Plugging the queue to batch requests in anticipation of opportunities
     for merge/sort optimizations

This is just the same as in 2.4 so far, though per-device unplugging support
is anticipated for 2.5. Also with a priority-based i/o scheduler, such
decisions could be based on request priorities.

Plugging is an approach that the current i/o scheduling algorithm resorts to
so that it collects up enough requests in the queue to be able to take
advantage of the sorting/merging logic in the elevator. If the queue is
empty when a request comes in, then it plugs the request queue (sort of like
plugging the bottom of a vessel to get fluid to build up) till it fills up
with a few more requests, before starting to service the requests. This
provides an opportunity to merge/sort the requests before passing them down
to the device. There are various conditions when the queue is unplugged (to
open up the flow again), either through a scheduled task or on demand. For
example wait_on_buffer sets the unplugging going (by running tq_disk) so the
read gets satisfied soon. So in the read case, the queue gets explicitly
unplugged as part of waiting for completion; in fact all queues get
unplugged as a side-effect.

Aside:
  This is kind of controversial territory, as it's not clear if plugging is
  always the right thing to do. Devices typically have their own queues, and
  allowing a big queue to build up in software, while letting the device be
  idle for a while may not always make sense. The trick is to handle the
  fine balance between when to plug and when to open up. Also now that we
  have multi-page bios being queued in one shot, we may not need to wait to
  merge a big request from the broken up pieces coming by. Per-queue
  granularity unplugging (still a Todo) may help reduce some of the concerns
  with just a single tq_disk flush approach. Something like blk_kick_queue()
  to unplug a specific queue (right away ?) or optionally, all queues, is
  in the plan.
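As promised under "i. Binary tree" above, here is an illustrative sketch of
inserting a request into a red-black tree keyed by start sector. The
rb_link_node()/rb_insert_color() primitives come from the generic
<linux/rbtree.h>; the sketch_rq wrapper and its fields are assumptions made
up for this example (the schedulers of this era kept the rb_node in their
own per-request data, allocated via elevator_set_req_fn):

	#include <linux/rbtree.h>
	#include <linux/blkdev.h>

	/* hypothetical per-request scheduler data wrapping a request */
	struct sketch_rq {
		struct rb_node rb_node;	/* node in the sector-sorted tree */
		struct request *rq;
	};

	static void sketch_add_rq_rb(struct rb_root *root,
				     struct sketch_rq *srq)
	{
		struct rb_node **p = &root->rb_node;
		struct rb_node *parent = NULL;
		struct sketch_rq *__srq;

		/* walk down to the insertion point, ordered by start
		 * sector, so in-order traversal yields disk sort order */
		while (*p) {
			parent = *p;
			__srq = rb_entry(parent, struct sketch_rq, rb_node);

			if (srq->rq->sector < __srq->rq->sector)
				p = &(*p)->rb_left;
			else
				p = &(*p)->rb_right;
		}

		rb_link_node(&srq->rb_node, parent, p);
		rb_insert_color(&srq->rb_node, root);
	}

Dispatch in sort order then follows the tree with rb_next(), which is why
the schedulers cache the next request in sort order instead of re-walking
the tree for every dispatch.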
4.4 I/O contexts

I/O contexts provide a dynamically allocated per-process data area. They may
be used in I/O schedulers, and in the block layer (could be used for IO
stats or priorities, for example). See *io_context in block/ll_rw_blk.c, and
as-iosched.c for an example of usage in an i/o scheduler.

5. Scalability related changes

5.1 Granular Locking: io_request_lock replaced by a per-queue lock

The global io_request_lock has been removed as of 2.5, to avoid the
scalability bottleneck it was causing, and has been replaced by more
granular locking. The request queue structure has a pointer to the lock to
be used for that queue. As a result, locking can now be per-queue, with a
provision for sharing a lock across queues if necessary (e.g. the scsi layer
sets the queue lock pointers to the corresponding adapter lock, which
results in a per-host locking granularity). The locking semantics are the
same, i.e. locking is still imposed by the block layer, grabbing the lock
before request_fn execution, which means that lots of older drivers should
still be SMP safe. Drivers are free to drop the queue lock themselves, if
required. Drivers that explicitly used the io_request_lock for serialization
need to be modified accordingly. Usually it's as easy as adding a global
lock:

	static spinlock_t my_driver_lock = SPIN_LOCK_UNLOCKED;

and passing the address to that lock to blk_init_queue().

5.2 64 bit sector numbers (sector_t prepares for 64 bit support)

The sector number used in the bio structure has been changed to sector_t,
which could be defined as 64 bit in preparation for 64 bit sector support.

6. Other Changes/Implications

6.1 Partition re-mapping handled by the generic block layer

In 2.5 some of the gendisk/partition related code has been reorganized. Now
the generic block layer performs partition-remapping early and thus provides
drivers with a sector number relative to the whole device, rather than
having to take the partition number into account in order to arrive at the
true sector number. The routine blk_partition_remap() is invoked by
generic_make_request even before invoking the queue specific make_request_fn,
so the i/o scheduler also gets to operate on whole disk sector numbers. This
should typically not require changes to block drivers; a driver just never
gets to invoke its own partition sector offset calculations since all bios
sent are offset from the beginning of the device.

7. A Few Tips on Migration of older drivers

Old-style drivers that just use CURRENT and ignore clustered requests may
not need much change. The generic layer will automatically handle clustered
requests, multi-page bios, etc. for the driver.

For a low-performance driver or hardware that is PIO driven or just doesn't
support scatter-gather, changes should be minimal too.

The following are some points to keep in mind when converting old drivers to
bio; a skeleton request function tying several of them together follows the
list.

Drivers should use elv_next_request to pick up requests and are no longer
supposed to handle looping directly over the request list.
(struct request->queue has been removed)

Now end_that_request_first takes an additional number_of_sectors argument.
It used to always handle just the first buffer_head in a request; now it
will loop and handle as many sectors (on a bio-segment granularity) as
specified.

Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the
right thing to use is bio_endio(bio, uptodate) instead.

If the driver is dropping the io_request_lock from its request_fn strategy,
then it just needs to replace that with q->queue_lock instead.

As described in Sec 1.1, drivers can set max sector size, max segment size,
etc. per queue now. Drivers that used to define their own merge functions to
handle things like this can now just use the blk_queue_* functions at
blk_init_queue time.

Drivers no longer have to map a {partition, sector offset} into the correct
absolute location anymore; this is done by the block layer. So where a
driver received a request like this before:

	rq->rq_dev = mk_kdev(3, 5);	/* /dev/hda5 */
	rq->sector = 0;			/* first sector on hda5 */

it will now see:

	rq->rq_dev = mk_kdev(3, 0);	/* /dev/hda */
	rq->sector = 123128;		/* offset from start of disk */

As mentioned, there is no virtual mapping of a bio. For DMA, this is not a
problem as the driver probably never will need a virtual mapping. Instead it
needs a bus mapping (pci_map_page for a single segment or use blk_rq_map_sg
for scatter gather) to be able to ship it to the driver. For PIO drivers (or
drivers that need to revert to PIO transfer once in a while (IDE for
example)), where the CPU is doing the actual data transfer a virtual mapping
is needed. If the driver supports highmem I/O, (Sec 1.1, (ii)) it needs to
use __bio_kmap_atomic and bio_kmap_irq to temporarily map a bio into the
virtual address space.
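Here is that skeleton: a hedged sketch of a bio-era request function, not a
definitive implementation. my_transfer() is a hypothetical driver routine,
and the end_that_request_*() calling conventions shifted during the 2.6
series, so check the headers of the kernel you target:

	#include <linux/blkdev.h>

	/* hypothetical PIO routine that moves one segment of data */
	extern int my_transfer(sector_t sector, unsigned long nsect,
			       char *buffer, int write);

	static void my_request_fn(request_queue_t *q)
	{
		struct request *req;

		/* q->queue_lock is already held when request_fn runs */
		while ((req = elv_next_request(q)) != NULL) {
			int uptodate;

			if (!blk_fs_request(req)) {
				/* non-fs requests are handled elsewhere */
				end_request(req, 0);
				continue;
			}

			/* req->sector is already remapped relative to the
			 * whole device, per Sec 6.1 */
			uptodate = my_transfer(req->sector,
					       req->current_nr_sectors,
					       req->buffer,
					       rq_data_dir(req) == WRITE);

			/* completes current_nr_sectors worth of the
			 * request; returns 0 once the whole request,
			 * clustered segments included, is done */
			if (!end_that_request_first(req, uptodate,
						req->current_nr_sectors)) {
				blkdev_dequeue_request(req);
				end_that_request_last(req);
			}
		}
	}

Note how the driver never walks the request list itself and never touches
partition offsets; both are now the block layer's business.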
8. Prior/Related/Impacted patches

8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp)
	- orig kiobuf & raw i/o patches (now in 2.4 tree)
	- direct kiobuf based i/o to devices (no intermediate bh's)
	- page i/o using kiobuf
	- kiobuf splitting for lvm (mkp)
	- elevator support for kiobuf request merging (axboe)
8.2. Zero-copy networking (Dave Miller)
8.3. SGI XFS - pagebuf patches - use of kiobufs
8.4. Multi-page pioent patch for bio (Christoph Hellwig)
8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11
8.6. Async i/o implementation patch (Ben LaHaise)
8.7. EVMS layering design (IBM EVMS team)
8.8. Larger page cache size patch (Ben LaHaise) and
     Large page size (Daniel Phillips)
     => larger contiguous physical memory buffers
8.9. VM reservations patch (Ben LaHaise)
8.10. Write clustering patches ? (Marcelo/Quintela/Riel ?)
8.11. Block device in page cache patch (Andrea Arcangeli) - now in 2.4.10+
8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar,
      Badari)
8.13  Priority based i/o scheduler - prepatches (Arjan van de Ven)
8.14  IDE Taskfile i/o patch (Andre Hedrick)
8.15  Multi-page writeout and readahead patches (Andrew Morton)
8.16  Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarthy)

9. Other References:

9.1 The Splice I/O Model - Larry McVoy (and subsequent discussions on lkml,
and Linus' comments - Jan 2001)
9.2 Discussions about kiobuf and bh design on lkml between sct, linus, alan
et al - Feb-March 2001 (many of the initial thoughts that led to bio were
brought up in this discussion thread)
9.3 Discussions on mempool on lkml - Dec 2001.