I/O Barriers
============
Tejun Heo <htejun@gmail.com>, July 22 2005

I/O barrier requests are used to guarantee ordering around the barrier
requests. Unless you're crazy enough to use disk drives for
implementing synchronization constructs (wow, sounds interesting...),
the ordering is meaningful only for write requests for things like
journal checkpoints. All requests queued before a barrier request
must be finished (made it to the physical medium) before the barrier
request is started, and all requests queued after the barrier request
must be started only after the barrier request is finished (again,
made it to the physical medium).

In other words, I/O barrier requests have the following two properties.

1. Request ordering

Requests cannot pass the barrier request. Preceding requests are
processed before the barrier and following requests after.

Depending on what features a drive supports, this can be done in one
of the following three ways.

i. For devices which have queue depth greater than 1 (TCQ devices) and
support ordered tags, block layer can just issue the barrier as an
ordered request and the lower level driver, controller and drive
itself are responsible for making sure that the ordering constraint is
met. Most modern SCSI controllers/drives should support this.

NOTE: SCSI ordered tag isn't currently used due to a limitation in the
      SCSI midlayer, see the 'Random notes/caveats' section below.

ii. For devices which have queue depth greater than 1 but don't
support ordered tags, block layer ensures that the requests preceding
a barrier request finish before issuing the barrier request. Also,
it defers requests following the barrier until the barrier request is
finished. Older SCSI controllers/drives and SATA drives fall in this
category.

iii. Devices which have queue depth of 1. This is a degenerate case
of ii. Just keeping issue order suffices. Ancient SCSI
controllers/drives and IDE drives are in this category.

2. Forced flushing to physical medium

Again, if you're not gonna do synchronization with disk drives (dang,
it sounds even more appealing now!), the reason you use I/O barriers
is mainly to protect filesystem integrity when power failure or some
other event abruptly stops the drive from operating and possibly makes
the drive lose data in its cache. So, I/O barriers need to guarantee
that requests actually get written to non-volatile medium in order.

There are four cases.

i. No write-back cache. Keeping requests ordered is enough.

ii. Write-back cache but no flush operation. There's no way to
guarantee physical-medium commit order. This kind of device can't do
I/O barriers.

iii. Write-back cache and flush operation but no FUA (forced unit
access). We need two cache flushes - before and after the barrier
request.

iv. Write-back cache, flush operation and FUA. We still need one
flush to make sure requests preceding a barrier are written to medium,
but post-barrier flush can be avoided by using FUA write on the
barrier itself.


How to support barrier requests in drivers
------------------------------------------

All barrier handling is done inside block layer proper. All low level
drivers have to do is implement their prepare_flush_fn and use the
following function to indicate what barrier type they support and how
to prepare flush requests. Note that the term 'ordered' is used to
indicate the whole sequence of performing barrier requests including
draining and flushing.

typedef void (prepare_flush_fn)(struct request_queue *q, struct request *rq);

int blk_queue_ordered(struct request_queue *q, unsigned ordered,
		      prepare_flush_fn *prepare_flush_fn);

@q			: the queue in question
@ordered		: the ordered mode the driver/device supports
@prepare_flush_fn	: this function should prepare @rq such that it
			  flushes cache to physical medium when executed

For example, SCSI disk driver's prepare_flush_fn looks like the
following.

static void sd_prepare_flush(struct request_queue *q, struct request *rq)
{
	/* Start from an all-zero command block. */
	memset(rq->cmd, 0, sizeof(rq->cmd));
	rq->cmd_type = REQ_TYPE_BLOCK_PC;
	rq->timeout = SD_TIMEOUT;
	/* 10-byte SYNCHRONIZE CACHE command; other fields stay zero. */
	rq->cmd[0] = SYNCHRONIZE_CACHE;
	rq->cmd_len = 10;
}

The following seven ordered modes are supported. The following table
shows which mode should be used depending on what features a
device/driver supports. In the leftmost column of the table, the
QUEUE_ORDERED_ prefix is omitted from the mode names to save space.

The table is followed by a description of each mode. Note that in the
descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
used for QUEUE_ORDERED_TAG* descriptions. '=>' indicates that the
preceding step must be complete before proceeding to the next step.
'->' indicates that the next step can start as soon as the previous
step is issued.

               write-back cache   ordered tag   flush   FUA
-----------------------------------------------------------------------
NONE           yes/no             N/A           no      N/A
DRAIN          no                 no            N/A     N/A
DRAIN_FLUSH    yes                no            yes     no
DRAIN_FUA      yes                no            yes     yes
TAG            no                 yes           N/A     N/A
TAG_FLUSH      yes                yes           yes     no
TAG_FUA        yes                yes           yes     yes


QUEUE_ORDERED_NONE
	I/O barriers are not needed and/or supported.

	Sequence: N/A

QUEUE_ORDERED_DRAIN
	Requests are ordered by draining the request queue and cache
	flushing isn't needed.

	Sequence: drain => barrier

QUEUE_ORDERED_DRAIN_FLUSH
	Requests are ordered by draining the request queue and both
	pre-barrier and post-barrier cache flushings are needed.

	Sequence: drain => preflush => barrier => postflush

QUEUE_ORDERED_DRAIN_FUA
	Requests are ordered by draining the request queue and
	pre-barrier cache flushing is needed. By using FUA on barrier
	request, post-barrier flushing can be skipped.

	Sequence: drain => preflush => barrier

QUEUE_ORDERED_TAG
	Requests are ordered by ordered tag and cache flushing isn't
	needed.

	Sequence: barrier

QUEUE_ORDERED_TAG_FLUSH
	Requests are ordered by ordered tag and both pre-barrier and
	post-barrier cache flushings are needed.

	Sequence: preflush -> barrier -> postflush

QUEUE_ORDERED_TAG_FUA
	Requests are ordered by ordered tag and pre-barrier cache
	flushing is needed. By using FUA on barrier request,
	post-barrier flushing can be skipped.

	Sequence: preflush -> barrier


Random notes/caveats
--------------------

* SCSI layer currently can't use TAG ordering even if the drive,
controller and driver support it. The problem is that the SCSI
midlayer request dispatch function is not atomic. It releases the
queue lock and switches to the SCSI host lock during issue, so it's
possible, and in time likely, that requests change their relative
positions. Once this problem is solved, TAG ordering can be enabled.

* Currently, no matter which ordered mode is used, there can be only
one barrier request in progress. All I/O barriers are held off by
block layer until the previous I/O barrier is complete. This doesn't
make any difference for DRAIN ordered devices, but, for TAG ordered
devices with very high command latency, passing multiple I/O barriers
to low level *might* be helpful if they are very frequent. Well, this
certainly is a non-issue. I'm writing this just to make clear that no
two I/O barriers are ever passed to the low-level driver at the same
time.

* Completion order. Requests in an ordered sequence are issued in
order but not required to finish in order. Barrier implementation can
handle out-of-order completion of an ordered sequence. IOW, the
requests MUST be processed in order but the hardware/software
completion paths are allowed to reorder completion notifications -
e.g. current SCSI midlayer doesn't preserve completion order during
error handling.

* Requeueing order. Low-level drivers are free to requeue any request
after they removed it from the request queue with
blkdev_dequeue_request(). As a barrier sequence should be kept in
order when requeued, generic elevator code takes care of putting
requests in order around the barrier. See blk_ordered_req_seq() and
ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.

Note that block drivers must not requeue preceding requests while
completing latter requests in an ordered sequence. Currently, no
error checking is done against this.

* Error handling. Currently, block layer will report an error to the
upper layer if any of the requests in an ordered sequence fails.
Unfortunately, this doesn't seem to be enough. Look at the following
request flow. QUEUE_ORDERED_TAG_FLUSH is in use.

 [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
                                          still in elevator

Let's say requests [2] and [3] are write requests to update file
system metadata (journal or whatever) and [barrier] is used to mark
that those updates are valid. Consider the following sequence.

 i.	Requests [0] ~ [post] leave the request queue and enter the
	low-level driver.
 ii.	After a while, unfortunately, something goes wrong and the
	drive fails [2]. Note that any of [0], [1] and [3] could have
	completed by this time, but [pre] couldn't have been finished
	as the drive must process it in order and it failed before
	processing that command.
 iii.	Error handling kicks in and determines that the error is
	unrecoverable and fails [2], and resumes operation.
 iv.	[pre] [barrier] [post] gets processed.
 v.	*BOOM* power fails

The problem here is that the barrier request is *supposed* to indicate
that filesystem update requests [2] and [3] made it safely to the
physical medium and, if the machine crashes after the barrier is
written, filesystem recovery code can depend on that. Sadly, that
isn't true in this case anymore. IOW, the success of an I/O barrier
should also be dependent on the success of some of the preceding
requests, where only the upper layer (filesystem) knows what 'some'
is.

This can be solved by implementing a way to tell the block layer which
requests affect the success of the following barrier request and
making lower level drivers resume operation on error only after
block layer tells them to do so.

As the probability of this happening is very low and the drive should
be faulty, implementing the fix is probably overkill. But, still,
it's there.

* In previous drafts of the barrier implementation, there was a
fallback mechanism such that, if FUA or ordered TAG fails, a less
fancy ordered mode can be selected and the failed barrier request is
retried automatically. The rationale for this feature was that, as
FUA is pretty new in the ATA world and ordered tag was never used
widely, there could be devices which report to support those features
but choke when actually given such requests.

 This was removed for two reasons: 1. it's overkill, 2. it's
impossible to implement properly when TAG ordering is used as low
level drivers resume after an error automatically. If it's ever
needed, adding it back and modifying low level drivers accordingly
shouldn't be difficult.
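
As a recap of the ordered mode table in 'How to support barrier
requests in drivers', the mode selection can be sketched as a small
standalone C function. This is only an illustration under stated
assumptions, not block layer code: enum ordered_mode, struct
dev_features, pick_ordered_mode() and cache_flushes() are hypothetical
names made up for this sketch, and the real constants are the
QUEUE_ORDERED_* values. It also assumes one reading of the table: a
write-back cache without a flush operation maps to NONE, and a device
without a write-back cache never needs flushing. The SCSI midlayer
limitation on TAG ordering noted above is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the QUEUE_ORDERED_* modes in the table. */
enum ordered_mode {
	ORDERED_NONE,
	ORDERED_DRAIN,
	ORDERED_DRAIN_FLUSH,
	ORDERED_DRAIN_FUA,
	ORDERED_TAG,
	ORDERED_TAG_FLUSH,
	ORDERED_TAG_FUA,
};

/* Hypothetical summary of what a device/driver pair supports. */
struct dev_features {
	bool write_back_cache;
	bool ordered_tag;
	bool flush;
	bool fua;
};

/* Map device features to an ordered mode, following the table rows. */
enum ordered_mode pick_ordered_mode(const struct dev_features *f)
{
	assert(f);

	/* No write-back cache: keeping requests ordered is enough. */
	if (!f->write_back_cache)
		return f->ordered_tag ? ORDERED_TAG : ORDERED_DRAIN;

	/* Write-back cache but no flush: barriers can't be supported. */
	if (!f->flush)
		return ORDERED_NONE;

	/* With FUA on the barrier write, the post-barrier flush is skipped. */
	if (f->fua)
		return f->ordered_tag ? ORDERED_TAG_FUA : ORDERED_DRAIN_FUA;

	return f->ordered_tag ? ORDERED_TAG_FLUSH : ORDERED_DRAIN_FLUSH;
}

/* Number of cache flushes in the sequence of each mode. */
int cache_flushes(enum ordered_mode mode)
{
	switch (mode) {
	case ORDERED_DRAIN_FLUSH:
	case ORDERED_TAG_FLUSH:
		return 2;	/* preflush and postflush */
	case ORDERED_DRAIN_FUA:
	case ORDERED_TAG_FUA:
		return 1;	/* preflush only */
	default:
		return 0;	/* NONE, DRAIN, TAG */
	}
}
```

For a drive with a write-back cache, flush and FUA but no ordered
tags, pick_ordered_mode() in this sketch returns ORDERED_DRAIN_FUA and
cache_flushes() reports one flush for it, matching the sequence
'drain => preflush => barrier'.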