This makes use of Ingo Molnar's mempool implementation, which enables subsystems like bio to maintain their own reserve memory pools for guaranteed deadlock-free allocations during extreme VM load. For example, the VM subsystem makes use of the block layer to write out dirty pages in order to be able to free up memory space, a case which needs careful handling. The allocation logic draws from the preallocated emergency reserve in situations where it cannot allocate through normal means. If the pool is empty and it can wait, then it would trigger action that would help free up memory or replenish the pool (without deadlocking) and wait for availability in the pool. If it is in IRQ context, and hence not in a position to do this, allocation could fail if the pool is empty. In general mempool always first tries to perform allocation without having to wait, even if it means digging into the pool, as long as the pool is not less than 50% full.

On a free, memory is released to the pool or directly freed depending on the current availability in the pool. The mempool interface lets the subsystem specify the routines to be used for normal alloc and free. In the case of bio, these routines make use of the standard slab allocator.

The caller of bio_alloc is expected to take certain steps to avoid deadlocks, e.g. avoid trying to allocate more memory from the pool while already holding memory obtained from the pool.

[TBD: This is a potential issue, though a rare possibility in the bounce bio allocation that happens in the current code, since it ends up allocating a second bio from the same pool while holding the original bio.]

Memory allocated from the pool should be released back within a limited amount of time (in the case of bio, that would be after the i/o is completed). This ensures that if part of the pool has been used up, some work (in this case i/o) must already be in progress and memory would be available when it is over. If allocating from multiple pools in the same code path, the order or hierarchy of allocation needs to be consistent, just the way one deals with multiple locks.

The bio_alloc routine also needs to allocate the bio_vec_list (bvec_alloc()) for a non-clone bio. There are 6 pools set up for different size biovecs, so bio_alloc(gfp_mask, nr_iovecs) will allocate a vec_list of the given size from these slabs.

The bi_destructor() routine takes into account the possibility of the bio having originated from a different source (see later discussions on n/w to block transfers and kvec_cb).

The bio_get() routine may be used to hold an extra reference on a bio prior to i/o submission, if the bio fields are likely to be accessed after the i/o is issued (since the bio may otherwise get freed in case i/o completion happens in the meantime).

The bio_clone() routine may be used to duplicate a bio, where the clone shares the bio_vec_list with the original bio (i.e. both point to the same bio_vec_list). This would typically be used for splitting i/o requests in lvm or md.

3.2 Generic bio helper Routines

3.2.1 Traversing segments and completion units in a request

The macro rq_for_each_segment() should be used for traversing the bios in the request list (drivers should avoid directly trying to do it themselves). Using these helpers should also make it easier to cope with block changes in the future.

	struct req_iterator iter;
	rq_for_each_segment(bio_vec, rq, iter)
		/* bio_vec is now current segment */

I/O completion callbacks are per-bio rather than per-segment, so drivers that traverse bio chains on completion need to keep that in mind. Drivers which don't make a distinction between segments and completion units would need to be reorganized to support multi-segment bios.
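A minimal sketch of how a driver might walk the segments of a request with this macro, assuming a hypothetical my_dev_transfer() helper that maps the page (it may be in high memory) and moves the data to or from the hardware:

	struct req_iterator iter;
	struct bio_vec *bvec;

	rq_for_each_segment(bvec, rq, iter) {
		/* each segment is a <page, offset, len> triple; transfer
		 * bv_len bytes in the direction given by rq_data_dir()
		 */
		my_dev_transfer(dev, bvec->bv_page, bvec->bv_offset,
				bvec->bv_len, rq_data_dir(rq));
	}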
3.2.2 Setting up DMA scatterlists

The blk_rq_map_sg() helper routine should be used for setting up scatter-gather lists from a request, so a driver need not do it on its own.

	nr_segments = blk_rq_map_sg(q, rq, scatterlist);

The helper routine provides a level of abstraction which makes it easier to modify the internals of request to scatterlist conversion down the line without breaking drivers. The blk_rq_map_sg routine takes care of several things like collapsing physically contiguous segments (if QUEUE_FLAG_CLUSTER is set) and correct segment accounting to avoid exceeding the limits which the i/o hardware can handle, based on various queue properties:

- Prevents a clustered segment from crossing a 4GB mem boundary

- Avoids building segments that would exceed the number of physical memory segments that the driver can handle (phys_segments) and the number that the underlying hardware can handle at once, accounting for DMA remapping (hw_segments), i.e. IOMMU aware limits

Routines which the low level driver can use to set up the segment limits:

blk_queue_max_hw_segments() : Sets an upper limit on the maximum number of hw data segments in a request (i.e. the maximum number of address/length pairs the host adapter can actually hand to the device at once).

blk_queue_max_phys_segments() : Sets an upper limit on the maximum number of physical data segments in a request (i.e. the largest sized scatter list a driver could handle).
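To show how these pieces fit together, a rough sketch of a driver's dispatch path, assuming a hypothetical MY_MAX_SEGMENTS limit that was registered with the blk_queue_max_*_segments() routines above, and with q, rq and dev being the driver's queue, the current request and its DMA-capable device (newer kernels also want sg_init_table() on the array first):

	struct scatterlist sg[MY_MAX_SEGMENTS];
	int nr, i;

	/* collapse the request's segments into a scatterlist */
	nr = blk_rq_map_sg(q, rq, sg);

	/* set up the DMA mappings; an IOMMU may merge entries further */
	nr = dma_map_sg(dev, sg, nr,
			rq_data_dir(rq) == WRITE ? DMA_TO_DEVICE : DMA_FROM_DEVICE);

	for (i = 0; i < nr; i++) {
		/* program sg_dma_address(&sg[i]) and sg_dma_len(&sg[i])
		 * into the controller's scatter/gather descriptors
		 */
	}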
3.2.3 I/O completion

The existing generic block layer helper routines end_request, end_that_request_first and end_that_request_last can be used for i/o completion (and for setting things up so the rest of the i/o or the next request can be kicked off) as before. With the introduction of multi-page bio support, end_that_request_first requires an additional argument indicating the number of sectors completed.

3.2.4 Implications for drivers that do not interpret bios (don't handle multiple segments)

Drivers that do not interpret bios, e.g. those which do not handle multiple segments and do not support i/o into high memory addresses (require bounce buffers) and expect only virtually mapped buffers, can access the rq->buffer field. As before, the driver should use current_nr_sectors to determine the size of remaining data in the current segment (that is the maximum it can transfer in one go unless it interprets segments), and rely on the block layer end_request, or end_that_request_first/last, to take care of all accounting and transparent mapping of the next bio segment when a segment boundary is crossed on completion of a transfer. (The end*request* functions should be used only if the request has come down the block/bio path, not for direct access requests which only specify rq->buffer without a valid rq->bio.)

3.2.5 Generic request command tagging

3.2.5.1 Tag helpers

Block now offers some simple generic functionality to help support command queueing (typically known as tagged command queueing), i.e. manage more than one outstanding command on a queue at any given time.

	blk_queue_init_tags(struct request_queue *q, int depth)

	Initialize internal command tagging structures for a maximum depth of 'depth'.

	blk_queue_free_tags(struct request_queue *q)

	Teardown tag info associated with the queue. This will be done automatically by block if blk_queue_cleanup() is called on a queue that is using tagging.

The above are initialization and exit management; the main helpers during normal operations are:

	blk_queue_start_tag(struct request_queue *q, struct request *rq)

	Start tagged operation for this request. A free tag number between 0 and 'depth' is assigned to the request (rq->tag holds this number), and 'rq' is added to the internal tag management. If the maximum depth for this queue is already achieved (or if the tag wasn't started for some other reason), 1 is returned. Otherwise 0 is returned.

	blk_queue_end_tag(struct request_queue *q, struct request *rq)

	End tagged operation on this request. 'rq' is removed from the internal book keeping structures.

To minimize struct request and queue overhead, the tag helpers utilize some of the same request members that are used for normal request queue management. This means that a request cannot both be an active tag and be on the queue list at the same time. blk_queue_start_tag() will remove the request, but the driver must remember to call blk_queue_end_tag() before signalling completion of the request to the block layer. This means ending tag operations before calling end_that_request_last()! For an example of a user of these helpers, see the IDE tagged command queueing support.

Certain hardware conditions may dictate a need to invalidate the block tag queue. For instance, on IDE any tagged request error needs to clear both the hardware and software block queue and enable the driver to sanely restart all the outstanding requests. There's a third helper to do that:

	blk_queue_invalidate_tags(struct request_queue *q)

	Clear the internal block tag queue and re-add all the pending requests to the request queue. The driver will receive them again on the next request_fn run, just like it did the first time it encountered them.

3.2.5.2 Tag info

Some block functions exist to query current tag status or to go from a tag number to the associated request. These are, in no particular order:

	blk_queue_tagged(q)

	Returns 1 if the queue 'q' is using tagging, 0 if not.

	blk_queue_tag_request(q, tag)

	Returns a pointer to the request associated with tag 'tag'.

	blk_queue_tag_depth(q)

	Returns the current queue depth.

	blk_queue_tag_queue(q)

	Returns 1 if the queue can accept a new queued command, 0 if we are at the maximum depth already.

	blk_queue_rq_tagged(rq)

	Returns 1 if the request 'rq' is tagged.

3.2.5.3 Internal structure

Internally, block manages tags in the blk_queue_tag structure:

	struct blk_queue_tag {
		struct request **tag_index;	/* array of pointers to rq */
		unsigned long *tag_map;		/* bitmap of free tags */
		struct list_head busy_list;	/* fifo list of busy tags */
		int busy;			/* queue depth */
		int max_depth;			/* max queue depth */
	};

Most of the above is simple and straightforward, however busy_list may need a bit of explaining. Normally we don't care too much about request ordering, but in the event of any barrier requests in the tag queue we need to ensure that requests are restarted in the order they were queued. This may happen if the driver needs to use blk_queue_invalidate_tags().

Tagging also defines a new request flag, REQ_QUEUED. This is set whenever a request is currently tagged. You should not use this flag directly; blk_rq_tagged(rq) is the portable way to test it.
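To make the ordering requirement concrete, a rough sketch of how a driver's request_fn and completion path might bracket a tagged command, where my_hw_issue() is a hypothetical routine that sends the command to the hardware:

	/* dispatch side, called with the queue lock held */
	if (blk_queue_start_tag(q, rq))
		return;				/* depth exhausted; rq stays on the queue */

	my_hw_issue(dev, rq, rq->tag);		/* issue the command under tag rq->tag */

	/* completion side: release the tag before completing the request
	 * back to the block layer, i.e. before end_that_request_last()
	 */
	blk_queue_end_tag(q, rq);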
3.3 I/O Submission

The routine submit_bio() is used to submit a single io. Higher level i/o routines make use of this:

(a) Buffered i/o:

The routine submit_bh() invokes submit_bio() on a bio corresponding to the bh, allocating the bio if required. ll_rw_block() uses submit_bh() as before.

(b) Kiobuf i/o (for raw/direct i/o):

The ll_rw_kio() routine breaks up the kiobuf into page sized chunks and maps the array to one or more multi-page bios, issuing submit_bio() to perform the i/o on each of these.

The embedded bh array in the kiobuf structure has been removed and no preallocation of bios is done for kiobufs. [The intent is to remove the blocks array as well, but it's currently in there to kludge around direct i/o.] Thus kiobuf allocation has switched back to using kmalloc rather than vmalloc.

Todo/Observation:

 A single kiobuf structure is assumed to correspond to a contiguous range of data, so brw_kiovec() invokes ll_rw_kio for each kiobuf in a kiovec. So right now it wouldn't work for direct i/o on non-contiguous blocks. This is to be resolved. The eventual direction is to replace kiobuf by kvec's.

 Badari Pulavarty has a patch to implement direct i/o correctly using bio and kvec.

(c) Page i/o:

Todo/Under discussion:

 Andrew Morton's multi-page bio patches attempt to issue multi-page writeouts (and reads) from the page cache, by directly building up large bios for submission, completely bypassing the usage of buffer heads. This work is still in progress.

 Christoph Hellwig had some code that uses bios for page-io (rather than bh). This isn't included in bio as yet. Christoph was also working on a design for representing virtual/real extents as an entity and modifying some of the address space ops interfaces to utilize this abstraction rather than buffer_heads. (This is somewhat along the lines of the SGI XFS pagebuf abstraction, but intended to be as lightweight as possible.)

(d) Direct access i/o:

Direct access requests that do not contain bios would be submitted differently, as discussed earlier in section 1.3.

Aside:

  Kvec i/o:

  Ben LaHaise's aio code uses a slightly different structure instead of kiobufs, called a kvec_cb. This contains an array of <page, offset, len> tuples (very much like the networking code), together with a callback function and data pointer. This is embedded into a brw_cb structure when passed to brw_kvec_async().

  Now it should be possible to directly map these kvecs to a bio. Just as with cloning, in this case rather than using PRE_BUILT bio_vecs, we set the bi_io_vec array pointer to point to the veclet array in kvecs.

  TBD: In order for this to work, some changes are needed in the way multi-page bios are handled today. The values of the tuples in such a vector passed in from higher level code should not be modified by the block layer in the course of its request processing, since that would make it hard for the higher layer to continue to use the vector descriptor (kvec) after i/o completes. Instead, all such transient state should be maintained in the request structure, and passed on in some way to the endio completion routine.
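To close this section, a bare-bones sketch of issuing a single page read through submit_bio(), roughly what submit_bh() does for one buffer_head. The my_end_io() callback and my_private cookie are placeholders, and the field names are as the bio structure stood at the time of this writeup:

	struct bio *bio = bio_alloc(GFP_NOIO, 1);	/* one biovec needed */

	bio->bi_bdev = bdev;				/* target block device */
	bio->bi_sector = sector;			/* starting sector on bdev */
	bio->bi_io_vec[0].bv_page = page;		/* data goes into this page */
	bio->bi_io_vec[0].bv_offset = 0;
	bio->bi_io_vec[0].bv_len = PAGE_SIZE;
	bio->bi_vcnt = 1;				/* number of biovecs in use */
	bio->bi_size = PAGE_SIZE;			/* total i/o size in bytes */
	bio->bi_end_io = my_end_io;			/* completion callback */
	bio->bi_private = my_private;			/* handed back to my_end_io */

	submit_bio(READ, bio);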
4. The I/O scheduler