Notes on the Generic Block Layer Rewrite in Linux 2.5
=====================================================

Notes Written on Jan 15, 2002:
	Jens Axboe <jens.axboe@oracle.com>
	Suparna Bhattacharya <suparna@in.ibm.com>

Last Updated May 2, 2002
September 2003: Updated I/O Scheduler portions
	Nick Piggin <piggin@cyberone.com.au>

Introduction:

These are some notes describing some aspects of the 2.5 block layer in the
context of the bio rewrite. The idea is to bring out some of the key
changes and a glimpse of the rationale behind those changes.

Please mail corrections & suggestions to suparna@in.ibm.com.

Credits:
---------

2.5 bio rewrite:
	Jens Axboe <jens.axboe@oracle.com>

Many aspects of the generic block layer redesign were driven by and evolved
over discussions, prior patches and the collective experience of several
people. See sections 8 and 9 for a list of some related references.

The following people helped with review comments and inputs for this
document:
	Christoph Hellwig <hch@infradead.org>
	Arjan van de Ven <arjanv@redhat.com>
	Randy Dunlap <rdunlap@xenotime.net>
	Andre Hedrick <andre@linux-ide.org>

The following people helped with fixes/contributions to the bio patches
while it was still work-in-progress:
	David S. Miller <davem@redhat.com>

Description of Contents:
------------------------

1. Scope for tuning of logic to various needs
  1.1 Tuning based on device or low level driver capabilities
	- Per-queue parameters
	- Highmem I/O support
	- I/O scheduler modularization
  1.2 Tuning based on high level requirements/capabilities
	1.2.1 I/O Barriers
	1.2.2 Request Priority/Latency
  1.3 Direct access/bypass to lower layers for diagnostics and special
      device operations
	1.3.1 Pre-built commands
2. New flexible and generic but minimalist i/o structure or descriptor
   (instead of using buffer heads at the i/o layer)
  2.1 Requirements/Goals addressed
  2.2 The bio struct in detail (multi-page io unit)
  2.3 Changes in the request structure
3. Using bios
  3.1 Setup/teardown (allocation, splitting)
  3.2 Generic bio helper routines
    3.2.1 Traversing segments and completion units in a request
    3.2.2 Setting up DMA scatterlists
    3.2.3 I/O completion
    3.2.4 Implications for drivers that do not interpret bios (don't handle
	  multiple segments)
    3.2.5 Request command tagging
  3.3 I/O submission
4. The I/O scheduler
5. Scalability related changes
  5.1 Granular locking: Removal of io_request_lock
  5.2 Prepare for transition to 64 bit sector_t
6. Other Changes/Implications
  6.1 Partition re-mapping handled by the generic block layer
7. A few tips on migration of older drivers
8. A list of prior/related/impacted patches/ideas
9. Other References/Discussion Threads

---------------------------------------------------------------------------

Bio Notes
---------

Let us discuss the changes in the context of how some overall goals for the
block layer are addressed.

1. Scope for tuning the generic logic to satisfy various requirements

The block layer design supports adaptable abstractions to handle common
processing with the ability to tune the logic to an appropriate extent
depending on the nature of the device and the requirements of the caller.
One of the objectives of the rewrite was to increase the degree of tunability
and to enable higher level code to utilize underlying device/driver
capabilities to the maximum extent for better i/o performance. This is
important especially in the light of ever improving hardware capabilities
and application/middleware software designed to take advantage of these
capabilities.

1.1 Tuning based on low level device / driver capabilities

Sophisticated devices with large built-in caches, intelligent i/o scheduling
optimizations, high memory DMA support, etc. may find some of the
generic processing an overhead, while for less capable devices the
generic functionality is essential for performance or correctness reasons.
Knowledge of some of the capabilities or parameters of the device should be
used at the generic block layer to take the right decisions on
behalf of the driver.

How is this achieved?

Tuning at a per-queue level:

i. Per-queue limits/values exported to the generic layer by the driver

Various parameters that the generic i/o scheduler logic uses are set at
a per-queue level (e.g. maximum request size, maximum number of segments in
a scatter-gather list, hardsect size).

Some parameters that were earlier available as global arrays indexed by
major/minor are now directly associated with the queue. Some of these may
move into the block device structure in the future. Some characteristics
have been incorporated into a queue flags field rather than separate fields
in themselves. There are blk_queue_xxx functions to set the parameters,
rather than update the fields directly.

Some new queue property settings:

	blk_queue_bounce_limit(q, u64 dma_address)
		Enable I/O to highmem pages, dma_address being the
		limit. No highmem default.

	blk_queue_max_sectors(q, max_sectors)
		Sets two variables that limit the size of the request.

		- The request queue's max_sectors, which is a soft size in
		units of 512 byte sectors, and could be dynamically varied
		by the core kernel.

		- The request queue's max_hw_sectors, which is a hard limit
		and reflects the maximum size request a driver can handle
		in units of 512 byte sectors.

		The default for both max_sectors and max_hw_sectors is
		255. The upper limit of max_sectors is 1024.

	blk_queue_max_phys_segments(q, max_segments)
		Maximum physical segments you can handle in a request. 128
		default (driver limit). (See 3.2.2)

	blk_queue_max_hw_segments(q, max_segments)
		Maximum dma segments the hardware can handle in a request. 128
		default (host adapter limit, after dma remapping).
		(See 3.2.2)

	blk_queue_max_segment_size(q, max_seg_size)
		Maximum size of a clustered segment, 64kB default.

	blk_queue_hardsect_size(q, hardsect_size)
		Lowest possible sector size that the hardware can operate
		on, 512 bytes default.

New queue flags:

	QUEUE_FLAG_CLUSTER (see 3.2.2)
	QUEUE_FLAG_QUEUED (see 3.2.4)
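As a rough illustration (not part of the original document), a driver's queue
setup using the settings above might look like the sketch below. The
mydev_init_queue() name and the particular limit values are made up for
illustration; the blk_queue_* calls are the setters just listed.

#include <linux/blkdev.h>

/*
 * Hypothetical queue setup for a device that can DMA only below 4GB
 * and handles up to 128 scatter-gather segments.  The values are
 * illustrative, not recommendations.
 */
static void mydev_init_queue(struct request_queue *q)
{
	/* bounce i/o destined for pages above this physical limit */
	blk_queue_bounce_limit(q, 0xffffffffULL);

	/* largest request accepted, in units of 512 byte sectors */
	blk_queue_max_sectors(q, 255);

	/* segment limits before and after dma remapping */
	blk_queue_max_phys_segments(q, 128);
	blk_queue_max_hw_segments(q, 128);

	/* largest clustered segment, in bytes */
	blk_queue_max_segment_size(q, 65536);

	/* smallest sector size the hardware operates on */
	blk_queue_hardsect_size(q, 512);
}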
ii. High-mem i/o capabilities are now considered the default

The generic bounce buffer logic, present in 2.4, where the block layer would
by default copyin/out i/o requests on high-memory buffers to low-memory buffers
assuming that the driver wouldn't be able to handle it directly, has been
changed in 2.5. The bounce logic is now applied only for memory ranges
for which the device cannot handle i/o. A driver can specify this by
setting the queue bounce limit for the request queue for the device
(blk_queue_bounce_limit()). This avoids the inefficiencies of the copyin/out
where a device is capable of handling high memory i/o.

In order to enable high-memory i/o where the device is capable of supporting
it, the pci dma mapping routines and associated data structures have now been
modified to accomplish a direct page -> bus translation, without requiring
a virtual address mapping (unlike the earlier scheme of virtual address
-> bus translation). So this works uniformly for high-memory pages (which
do not have a corresponding kernel virtual address space mapping) and
low-memory pages.

Note: Please refer to DMA-mapping.txt for a discussion on PCI high mem DMA
aspects and mapping of scatter gather lists, and support for 64 bit PCI.

Special handling is required only for cases where i/o needs to happen on
pages at physical memory addresses beyond what the device can support. In these
cases, a bounce bio representing a buffer from the supported memory range
is used for performing the i/o with copyin/copyout as needed depending on
the type of the operation. For example, in case of a read operation, the
data read has to be copied to the original buffer on i/o completion, so a
callback routine is set up to do this, while for write, the data is copied
from the original buffer to the bounce buffer prior to issuing the
operation. Since an original buffer may be in a high memory area that's not
mapped in kernel virtual addr, a kmap operation may be required for
performing the copy, and special care may be needed in the completion path
as it may not be in irq context. Special care is also required (by way of
GFP flags) when allocating bounce buffers, to avoid certain highmem
deadlock possibilities.

It is also possible that a bounce buffer may be allocated from a high-memory
area that's not mapped in kernel virtual addr, but within the range that the
device can use directly; so the bounce page may need to be kmapped during
copy operations. [Note: This does not hold in the current implementation,
though.]

There are some situations when pages from high memory may need to
be kmapped, even if bounce buffers are not necessary. For example a device
may need to abort DMA operations and revert to PIO for the transfer, in
which case a virtual mapping of the page is required. For SCSI it is also
done in some scenarios where the low level driver cannot be trusted to
handle a single sg entry correctly. The driver is expected to perform the
kmaps as needed on such occasions using the __bio_kmap_atomic and bio_kmap_irq
routines as appropriate. A driver could also use the blk_queue_bounce()
routine on its own to bounce highmem i/o to low memory for specific requests
if so desired.
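As a hedged sketch of the PIO fallback just described, assuming the
bio_kmap_irq()/bio_kunmap_irq() signatures of the early 2.6 headers;
pio_write_block() is a made-up hardware accessor used only for illustration:

#include <linux/bio.h>

/* made-up hardware accessor, for illustration only */
extern void pio_write_block(char *buf, unsigned int len);

/*
 * Hypothetical PIO fallback: the DMA transfer was aborted, so the data
 * is accessed through a temporary kernel mapping.  This works even for
 * highmem pages that have no permanent kernel virtual address.
 */
static void mydev_pio_out(struct bio *bio)
{
	unsigned long flags;
	char *buf;

	/* map the current segment of the bio; usable from irq context */
	buf = bio_kmap_irq(bio, &flags);
	pio_write_block(buf, bio_iovec(bio)->bv_len);
	bio_kunmap_irq(buf, &flags);
}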
iii. The i/o scheduler algorithm itself can be replaced/set as appropriate

As in 2.4, it is possible to plug in a brand new i/o scheduler for a particular
queue or pick from (copy) existing generic schedulers and replace/override
certain portions of it. The 2.5 rewrite provides improved modularization
of the i/o scheduler. There are more pluggable callbacks, e.g. for init,
add request, extract request, which makes it possible to abstract specific
i/o scheduling algorithm aspects and details outside of the generic loop.
It also makes it possible to completely hide the implementation details of
the i/o scheduler from block drivers.

I/O scheduler wrappers are to be used instead of accessing the queue directly.
See section 4. The I/O scheduler for details.
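To give a feel for what using the wrappers looks like, here is a minimal,
hypothetical request function in the style of drivers of this era. It obtains
work through elv_next_request() rather than walking the queue's internal
lists; mydev_transfer() is a made-up helper assumed to move the data
synchronously.

#include <linux/blkdev.h>

/* made-up helper, assumed to perform the transfer synchronously */
extern void mydev_transfer(struct request *rq);

/*
 * Hypothetical request function.  The driver never touches the queue's
 * internal lists; whichever i/o scheduler is active hands out the next
 * request through the elv_next_request() wrapper.
 */
static void mydev_request_fn(struct request_queue *q)
{
	struct request *rq;

	while ((rq = elv_next_request(q)) != NULL) {
		if (!blk_fs_request(rq)) {
			/* not a normal read/write request; fail it */
			end_request(rq, 0);
			continue;
		}
		mydev_transfer(rq);
		end_request(rq, 1);	/* complete the current chunk */
	}
}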
1.2 Tuning Based on High level code capabilities

i. Application capabilities for raw i/o

This comes from some of the high-performance database/middleware
requirements where an application prefers to make its own i/o scheduling
decisions based on an understanding of the access patterns and i/o
characteristics.

ii. High performance filesystems or other higher level kernel code's
capabilities

Kernel components like filesystems could also take their own i/o scheduling
decisions for optimizing performance. Journalling filesystems may need
some control over i/o ordering.

What kind of support exists at the generic block layer for this?

The flags and rw fields in the bio structure can be used for some tuning
from above, e.g. indicating that an i/o is just a readahead request, or for
marking barrier requests (discussed next), or priority settings (currently
unused); a small illustrative sketch appears at the end of section 1.3. As
far as user applications are concerned, they would need an additional
mechanism, either via open flags or ioctls or some other upper level
mechanism, to communicate such settings to block.

1.2.1 I/O Barriers

There is a way to enforce strict ordering for i/os through barriers.
All requests before a barrier point must be serviced before the barrier
request, and any other requests arriving after the barrier will not be
serviced until after the barrier has completed. This is useful for higher
level control on write ordering, e.g. flushing a log of committed updates
to disk before the corresponding updates themselves.

A flag in the bio structure, BIO_BARRIER, is used to identify a barrier i/o.
The generic i/o scheduler would make sure that it places the barrier request
and all other requests coming after it after all the previous requests in the
queue. Barriers may be implemented in different ways depending on the
driver. For more details regarding I/O barriers, please read barrier.txt
in this directory.

1.2.2 Request Priority/Latency

Todo/Under discussion:

  Arjan's proposed request priority scheme allows higher levels some broad
  control (high/med/low) over the priority of an i/o request vs other pending
  requests in the queue. For example it allows reads for bringing in an
  executable page on demand to be given a higher priority over pending write
  requests which haven't aged too much on the queue. Potentially this priority
  could even be exposed to applications in some manner, providing higher level
  tunability. Time based aging avoids starvation of lower priority
  requests. Some bits in the bi_rw flags field in the bio structure are
  intended to be used for this priority information.

1.3 Direct Access to Low level Device/Driver Capabilities (Bypass mode)
    (e.g. Diagnostics, Systems Management)

There are situations where high-level code needs to have direct access to
the low level device capabilities or requires the ability to issue commands
to the device, bypassing some of the intermediate i/o layers.
These could, for example, be special control commands issued through ioctl
interfaces, or could be raw read/write commands that stress the drive's
capabilities for certain kinds of fitness tests. Having direct interfaces at
multiple levels without having to pass through upper layers makes
it possible to perform bottom up validation of the i/o path, layer by
layer, starting from the media.
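Finally, tying back to the flag-based tuning of section 1.2, the sketch below
marks one bio as a barrier write and submits another as readahead. The
spellings used here (a BIO_RW_BARRIER bit in bi_rw, READA for submit_bio)
follow early 2.6 block headers; the text above calls the barrier flag
BIO_BARRIER, so treat the exact names as era-dependent.

#include <linux/bio.h>
#include <linux/fs.h>

/*
 * Hypothetical journalling-style use of bio flags.  Each bio is
 * assumed to have been set up (device, start sector, pages) already.
 */
static void submit_log_commit(struct bio *bio)
{
	/* barrier: all earlier requests must finish before this one,
	 * and no later request may be serviced until it completes */
	bio->bi_rw |= (1 << BIO_RW_BARRIER);
	submit_bio(WRITE, bio);
}

static void submit_readahead_bio(struct bio *bio)
{
	/* READA is a hint: readahead i/o may be failed rather than
	 * waited on when the queue is congested */
	submit_bio(READA, bio);
}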
