biodoc.txt
From "Linux Kernel 2.6.9 for OMAP1710" (text, 1,215 lines)
A block layer call to the i/o scheduler follows the convention elv_xxx(). This
calls elevator_xxx_fn in the elevator switch (drivers/block/elevator.c). Oh,
xxx and xxx might not match exactly, but use your imagination. If an elevator
doesn't implement a function, the switch does nothing or some minimal
housekeeping work.

4.1. I/O scheduler API

The functions an elevator may implement are: (* are mandatory)

elevator_merge_fn		called to query requests for merge with a bio

elevator_merge_req_fn		" " "  with another request

elevator_merged_fn		called when a request in the scheduler has been
				involved in a merge. It is used in the deadline
				scheduler for example, to reposition the request
				if its sorting order has changed.

*elevator_next_req_fn		returns the next scheduled request, or NULL
				if there are none (or none are ready).

*elevator_add_req_fn		called to add a new request into the scheduler

elevator_queue_empty_fn		returns true if the merge queue is empty.
				Drivers shouldn't use this, but rather check
				if elv_next_request is NULL (without losing the
				request if one exists!)

elevator_remove_req_fn		This is called when a driver claims ownership
				of the target request - it now belongs to the
				driver. It must not be modified or merged.
				Drivers must not lose the request! A subsequent
				call of elevator_next_req_fn must return the
				_next_ request.

elevator_requeue_req_fn		called to add a request to the scheduler. This
				is used when the request has already been
				returned by elv_next_request, but hasn't
				completed. If this is not implemented then
				elevator_add_req_fn is called instead.

elevator_former_req_fn
elevator_latter_req_fn		These return the request before or after the
				one specified in disk sort order. Used by the
				block layer to find merge possibilities.

elevator_completed_req_fn	called when a request is completed.
				This might come about due to being merged with
				another or when the device completes the
				request.

elevator_may_queue_fn		returns true if the scheduler wants to allow
				the current context to queue a new request
				even if it is over the queue limit. This must
				be used very carefully!!

elevator_set_req_fn
elevator_put_req_fn		Must be used to allocate and free any elevator
				specific storage for a request.

elevator_init_fn
elevator_exit_fn		Allocate and free any elevator specific storage
				for a queue.

4.2 I/O scheduler implementation

The generic i/o scheduler algorithm attempts to sort/merge/batch requests for
optimal disk scan and request servicing performance (based on generic
principles and device capabilities), optimized for:
i.   improved throughput
ii.  improved latency
iii. better utilization of h/w & CPU time

Characteristics:

i. Binary tree
AS and deadline i/o schedulers use red black binary trees for disk position
sorting and searching, and a fifo linked list for time-based searching. This
gives good scalability and good availability of information. Requests are
almost always dispatched in disk sort order, so a cache is kept of the next
request in sort order to prevent binary tree lookups.

This arrangement is not a generic block layer characteristic however, so
elevators may implement queues as they please.

ii. Last merge hint
The last merge hint is part of the generic queue layer. I/O schedulers must do
some management on it. For the most part, the most important thing is to make
sure q->last_merge is cleared (set to NULL) when the request on it is no longer
a candidate for merging (for example if it has been sent to the driver).

The last merge performed is cached as a hint for the subsequent request. If
sequential data is being submitted, the hint is used to perform merges without
any scanning. This is not sufficient when there are multiple processes doing
I/O though, so a "merge hash" is used by some schedulers.

iii. Merge hash
AS and deadline use a hash table indexed by the last sector of a request.
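The back-merge lookup that such a hash makes possible can be sketched in a few
lines of userspace C. This is purely illustrative: the structure layout, hash
size, and function names below are assumptions for the sketch, not the actual
code in the AS or deadline schedulers.

```c
/* Userspace sketch of a "merge hash" keyed by the sector just past the
 * end of each request, so a new bio starting at that sector can find a
 * back-merge candidate without scanning the whole queue.
 * All names and sizes here are illustrative assumptions. */
#include <assert.h>
#include <stddef.h>

#define MHASH_SIZE 64                  /* small power of two for the demo */

struct request {
    unsigned long sector;              /* start sector */
    unsigned long nr_sectors;          /* length in sectors */
    struct request *hash_next;         /* chaining on collision */
};

static struct request *mhash[MHASH_SIZE];

/* hash on the sector just past the end of the request */
static unsigned long rq_hash_key(const struct request *rq)
{
    return (rq->sector + rq->nr_sectors) % MHASH_SIZE;
}

static void mhash_add(struct request *rq)
{
    unsigned long h = rq_hash_key(rq);
    rq->hash_next = mhash[h];
    mhash[h] = rq;
}

/* A bio starting at 'sector' can back-merge onto a request that
 * ends exactly there. */
static struct request *find_back_merge(unsigned long sector)
{
    struct request *rq;
    for (rq = mhash[sector % MHASH_SIZE]; rq; rq = rq->hash_next)
        if (rq->sector + rq->nr_sectors == sector)
            return rq;
    return NULL;
}
```

A request covering sectors 100-107 would be found by a bio starting at sector
108 in a single bucket walk, which is the whole point of indexing on the last
sector rather than the first.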
This enables merging code to quickly look up "back merge" candidates, even
when multiple I/O streams are being performed at once on one disk.

"Front merges", a new request being merged at the front of an existing
request, are far less common than "back merges" due to the nature of most I/O
patterns. Front merges are handled by the binary trees in AS and deadline
schedulers.

iv. Handling barrier cases
A request with flags REQ_HARDBARRIER or REQ_SOFTBARRIER must not be ordered
around. That is, they must be processed after all older requests, and before
any newer ones. This includes merges!

In AS and deadline schedulers, barriers have the effect of flushing the
reorder queue. The performance cost of this will vary from nothing to a lot
depending on i/o patterns and device characteristics. Obviously they won't
improve performance, so their use should be kept to a minimum.

v. Handling insertion position directives
A request may be inserted with a position directive. The directives are one of
ELEVATOR_INSERT_BACK, ELEVATOR_INSERT_FRONT, ELEVATOR_INSERT_SORT.

ELEVATOR_INSERT_SORT is a general directive for non-barrier requests.
ELEVATOR_INSERT_BACK is used to insert a barrier to the back of the queue.
ELEVATOR_INSERT_FRONT is used to insert a barrier to the front of the queue,
and overrides the ordering requested by any previous barriers. In practice
this is harmless and required, because it is used for SCSI requeueing. This
does not require flushing the reorder queue, so does not impose a performance
penalty.

vi. Plugging the queue to batch requests in anticipation of opportunities for
    merge/sort optimizations

This is just the same as in 2.4 so far, though per-device unplugging
support is anticipated for 2.5.
Also with a priority-based i/o scheduler,
such decisions could be based on request priorities.

Plugging is an approach that the current i/o scheduling algorithm resorts to
so that it collects up enough requests in the queue to be able to take
advantage of the sorting/merging logic in the elevator. If the queue is empty
when a request comes in, then it plugs the request queue (sort of like
plugging the bottom of a vessel to get fluid to build up) till it fills up
with a few more requests, before starting to service the requests. This
provides an opportunity to merge/sort the requests before passing them down
to the device. There are various conditions when the queue is unplugged (to
open up the flow again), either through a scheduled task or on demand. For
example wait_on_buffer sets the unplugging going (by running tq_disk) so the
read gets satisfied soon. So in the read case, the queue gets explicitly
unplugged as part of waiting for completion; in fact all queues get unplugged
as a side-effect.

Aside:
  This is kind of controversial territory, as it's not clear if plugging is
  always the right thing to do. Devices typically have their own queues, and
  allowing a big queue to build up in software, while letting the device be
  idle for a while may not always make sense. The trick is to handle the fine
  balance between when to plug and when to open up. Also now that we have
  multi-page bios being queued in one shot, we may not need to wait to merge
  a big request from the broken up pieces coming by. Per-queue granularity
  unplugging (still a Todo) may help reduce some of the concerns with just a
  single tq_disk flush approach. Something like blk_kick_queue() to unplug a
  specific queue (right away ?) or optionally, all queues, is in the plan.

4.3 I/O contexts

I/O contexts provide a dynamically allocated per process data area. They may
be used in I/O schedulers, and in the block layer (could be used for IO stats,
priorities for example).
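A per-process data area of this kind can be pictured as a lazily allocated
lookup keyed by process id. The following userspace C sketch is an assumption
for illustration only; the field names and lookup scheme are not the kernel's
actual io_context layout.

```c
/* Userspace sketch of a dynamically allocated per-process I/O context,
 * found (and lazily created) by process id. Layout and lookup scheme
 * are illustrative assumptions, not the kernel's io_context. */
#include <assert.h>
#include <stdlib.h>

struct toy_io_context {
    int pid;                        /* owning process */
    unsigned long nr_requests;      /* example per-process statistic */
    struct toy_io_context *next;
};

static struct toy_io_context *contexts;

/* Return the context for 'pid', allocating one on first use. */
static struct toy_io_context *toy_get_io_context(int pid)
{
    struct toy_io_context *ioc;
    for (ioc = contexts; ioc; ioc = ioc->next)
        if (ioc->pid == pid)
            return ioc;
    ioc = calloc(1, sizeof(*ioc));
    ioc->pid = pid;
    ioc->next = contexts;
    contexts = ioc;
    return ioc;
}
```

The useful property is that repeated lookups by the same process return the
same storage, so a scheduler can accumulate per-process state (stats,
priorities) across requests.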
See *io_context in drivers/block/ll_rw_blk.c, and as-iosched.c for an example
of usage in an i/o scheduler.

5. Scalability related changes

5.1 Granular Locking: io_request_lock replaced by a per-queue lock

The global io_request_lock has been removed as of 2.5, to avoid
the scalability bottleneck it was causing, and has been replaced by more
granular locking. The request queue structure has a pointer to the
lock to be used for that queue. As a result, locking can now be
per-queue, with a provision for sharing a lock across queues if
necessary (e.g. the scsi layer sets the queue lock pointers to the
corresponding adapter lock, which results in a per host locking
granularity). The locking semantics are the same, i.e. locking is
still imposed by the block layer, grabbing the lock before
request_fn execution, which means that lots of older drivers
should still be SMP safe. Drivers are free to drop the queue
lock themselves, if required. Drivers that explicitly used the
io_request_lock for serialization need to be modified accordingly.
Usually it's as easy as adding a global lock:

	static spinlock_t my_driver_lock = SPIN_LOCK_UNLOCKED;

and passing the address to that lock to blk_init_queue().

5.2 64 bit sector numbers (sector_t prepares for 64 bit support)

The sector number used in the bio structure has been changed to sector_t,
which could be defined as 64 bit in preparation for 64 bit sector support.

6. Other Changes/Implications

6.1 Partition re-mapping handled by the generic block layer

In 2.5 some of the gendisk/partition related code has been reorganized.
Now the generic block layer performs partition-remapping early and thus
provides drivers with a sector number relative to whole device, rather than
having to take partition number into account in order to arrive at the true
sector number. The routine blk_partition_remap() is invoked by
generic_make_request even before invoking the queue specific make_request_fn,
so the i/o scheduler also gets to operate on whole disk sector numbers.
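Conceptually the remapping step is just offset arithmetic: add the
partition's starting sector so everything downstream works with whole-disk
sector numbers. The userspace sketch below is illustrative only; the
structure name is invented, and the start value is taken from the /dev/hda5
example in the migration tips section of this document.

```c
/* Userspace sketch of what partition remapping conceptually does:
 * add the partition's starting sector so the i/o scheduler and driver
 * only ever see whole-disk sector numbers. The structure here is an
 * illustrative assumption, not the kernel's partition layout. */
#include <assert.h>

struct toy_partition {
    unsigned long start_sect;   /* first sector of the partition on disk */
};

/* remap a partition-relative sector to a whole-disk sector */
static unsigned long remap_sector(const struct toy_partition *p,
                                  unsigned long rel_sector)
{
    return p->start_sect + rel_sector;
}
```

With a partition starting at sector 123128, partition-relative sector 0 maps
to whole-disk sector 123128, matching the before/after request example given
later for /dev/hda5.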
This should typically not require changes to block drivers; a driver just
never gets to invoke its own partition sector offset calculations since all
bios sent are offset from the beginning of the device.

7. A Few Tips on Migration of older drivers

Old-style drivers that just use CURRENT and ignore clustered requests
may not need much change. The generic layer will automatically handle
clustered requests, multi-page bios, etc for the driver.

For a low performance driver or hardware that is PIO driven or just doesn't
support scatter-gather, changes should be minimal too.

The following are some points to keep in mind when converting old drivers
to bio.

Drivers should use elv_next_request to pick up requests and are no longer
supposed to handle looping directly over the request list.
(struct request->queue has been removed)

Now end_that_request_first takes an additional number_of_sectors argument.
It used to handle always just the first buffer_head in a request, now
it will loop and handle as many sectors (on a bio-segment granularity)
as specified.

Now bh->b_end_io is replaced by bio->bi_end_io, but most of the time the
right thing to use is bio_endio(bio, uptodate) instead.

If the driver is dropping the io_request_lock from its request_fn strategy,
then it just needs to replace that with q->queue_lock instead.

As described in Sec 1.1, drivers can set max sector size, max segment size
etc per queue now. Drivers that used to define their own merge functions to
handle things like this can now just use the blk_queue_* functions at
blk_init_queue time.

Drivers no longer have to map a {partition, sector offset} into the
correct absolute location anymore; this is done by the block layer. So
where a driver received a request ala this before:

	rq->rq_dev = mk_kdev(3, 5);	/* /dev/hda5 */
	rq->sector = 0;			/* first sector on hda5 */

it will now see

	rq->rq_dev = mk_kdev(3, 0);	/* /dev/hda */
	rq->sector = 123128;		/* offset from start of disk */

As mentioned, there is no virtual mapping of a bio.
For DMA, this is
not a problem as the driver probably never will need a virtual mapping.
Instead it needs a bus mapping (pci_map_page for a single segment or
use blk_rq_map_sg for scatter gather) to be able to ship it to the driver.
For PIO drivers (or drivers that need to revert to PIO transfer once in a
while (IDE for example)), where the CPU is doing the actual data
transfer a virtual mapping is needed. If the driver supports highmem I/O,
(Sec 1.1, (ii) ) it needs to use __bio_kmap_atomic and bio_kmap_irq to
temporarily map a bio into the virtual address space. See how IDE handles
this with ide_map_buffer.

8. Prior/Related/Impacted patches

8.1. Earlier kiobuf patches (sct/axboe/chait/hch/mkp)
	- orig kiobuf & raw i/o patches (now in 2.4 tree)
	- direct kiobuf based i/o to devices (no intermediate bh's)
	- page i/o using kiobuf
	- kiobuf splitting for lvm (mkp)
	- elevator support for kiobuf request merging (axboe)
8.2. Zero-copy networking (Dave Miller)
8.3. SGI XFS - pagebuf patches - use of kiobufs
8.4. Multi-page pioent patch for bio (Christoph Hellwig)
8.5. Direct i/o implementation (Andrea Arcangeli) since 2.4.10-pre11
8.6. Async i/o implementation patch (Ben LaHaise)
8.7. EVMS layering design (IBM EVMS team)
8.8. Larger page cache size patch (Ben LaHaise) and
     Large page size (Daniel Phillips)
	=> larger contiguous physical memory buffers
8.9. VM reservations patch (Ben LaHaise)
8.10. Write clustering patches ? (Marcelo/Quintela/Riel ?)
8.11. Block device in page cache patch (Andrea Arcangeli) - now in 2.4.10+
8.12. Multiple block-size transfers for faster raw i/o (Shailabh Nagar,
      Badari)
8.13  Priority based i/o scheduler - prepatches (Arjan van de Ven)
8.14  IDE Taskfile i/o patch (Andre Hedrick)
8.15  Multi-page writeout and readahead patches (Andrew Morton)
8.16  Direct i/o patches for 2.5 using kvec and bio (Badari Pulavarthy)

9. Other References:

9.1 The Splice I/O Model - Larry McVoy (and subsequent discussions on lkml,
and Linus' comments - Jan 2001)
9.2 Discussions about kiobuf and bh design on lkml between sct, linus, alan
et al - Feb-March 2001 (many of the initial thoughts that led to bio were
brought up in this discussion thread)
9.3 Discussions on mempool on lkml - Dec 2001.