SCSI EH
======================================

 This document describes SCSI midlayer error handling infrastructure.
Please refer to Documentation/scsi/scsi_mid_low_api.txt for more
information regarding SCSI midlayer.

TABLE OF CONTENTS

[1] How SCSI commands travel through the midlayer and to EH
    [1-1] struct scsi_cmnd
    [1-2] How do scmd's get completed?
        [1-2-1] Completing a scmd w/ scsi_done
        [1-2-2] Completing a scmd w/ timeout
    [1-3] How EH takes over
[2] How SCSI EH works
    [2-1] EH through fine-grained callbacks
        [2-1-1] Overview
        [2-1-2] Flow of scmds through EH
        [2-1-3] Flow of control
    [2-2] EH through transportt->eh_strategy_handler()
        [2-2-1] Pre transportt->eh_strategy_handler() SCSI midlayer conditions
        [2-2-2] Post transportt->eh_strategy_handler() SCSI midlayer conditions
        [2-2-3] Things to consider


[1] How SCSI commands travel through the midlayer and to EH

[1-1] struct scsi_cmnd

 Each SCSI command is represented with struct scsi_cmnd (== scmd). A
scmd has two list_head's to link itself into lists. The two are
scmd->list and scmd->eh_entry. The former is used for free list or
per-device allocated scmd list and not of much interest to this EH
discussion. The latter is used for completion and EH lists and unless
otherwise stated scmds are always linked using scmd->eh_entry in this
discussion.


[1-2] How do scmd's get completed?

 Once LLDD gets hold of a scmd, either the LLDD will complete the
command by calling scsi_done callback passed from midlayer when
invoking hostt->queuecommand() or SCSI midlayer will time it out.


[1-2-1] Completing a scmd w/ scsi_done

 For all non-EH commands, scsi_done() is the completion callback. It
does the following.

 1. Delete timeout timer. If it fails, it means that timeout timer
    has expired and is going to finish the command. Just return.

 2. Link scmd to per-cpu scsi_done_q using scmd->eh_entry

 3. Raise SCSI_SOFTIRQ

 SCSI_SOFTIRQ handler scsi_softirq calls scsi_decide_disposition() to
determine what to do with the command.
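
 The three scsi_done() steps listed above can be modelled in a few
lines of user-space C. This is only a toy sketch, not kernel code:
struct toy_cmd, struct toy_cpu, toy_delete_timer() and toy_done() are
hypothetical names made up for illustration, and the real function
deals with locking and per-cpu data that are left out here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct scsi_cmnd; only what the sketch needs. */
struct toy_cmd {
    bool timer_pending;     /* true while the timeout timer is armed */
    struct toy_cmd *next;   /* plays the role of scmd->eh_entry */
};

/* Hypothetical per-cpu state: the done queue and a "softirq raised" flag. */
struct toy_cpu {
    struct toy_cmd *done_q;
    bool softirq_raised;
};

/* Returns true if the timer was still armed and has now been cancelled. */
static bool toy_delete_timer(struct toy_cmd *cmd)
{
    bool was_pending = cmd->timer_pending;

    cmd->timer_pending = false;
    return was_pending;
}

/* Models the three steps; returns false when the timeout path owns cmd. */
static bool toy_done(struct toy_cpu *cpu, struct toy_cmd *cmd)
{
    /* 1. Delete the timeout timer; failure means the timer already fired. */
    if (!toy_delete_timer(cmd))
        return false;

    /* 2. Link the command into the per-cpu done queue. */
    cmd->next = cpu->done_q;
    cpu->done_q = cmd;

    /* 3. Raise the softirq so completion processing runs later. */
    cpu->softirq_raised = true;
    return true;
}
```

 Calling toy_done() on a command whose timer_pending is already false
returns without queueing anything, which mirrors the "Just return"
case in step 1.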

 scsi_decide_disposition() looks at the scmd->result value and sense
data to choose one of the following dispositions.

 - SUCCESS
        scsi_finish_command() is invoked for the command. The
        function does some maintenance chores and notifies completion
        by calling scmd->done() callback, which, for fs requests,
        would be HLD completion callback - sd:sd_rw_intr, sr:rw_intr,
        st:st_intr.

 - NEEDS_RETRY
 - ADD_TO_MLQUEUE
        scmd is requeued to blk queue.

 - otherwise
        scsi_eh_scmd_add(scmd, 0) is invoked for the command. See
        [1-3] for details of this function.


[1-2-2] Completing a scmd w/ timeout

 The timeout handler is scsi_times_out(). When a timeout occurs, this
function

 1. invokes optional hostt->eh_timed_out() callback. Return value can
    be one of

    - EH_HANDLED
        This indicates that eh_timed_out() dealt with the timeout. The
        scmd is passed to __scsi_done() and thus linked into per-cpu
        scsi_done_q. Normal command completion described in [1-2-1]
        follows.

    - EH_RESET_TIMER
        This indicates that more time is required to finish the
        command. Timer is restarted. This action is counted as a
        retry and is only allowed scmd->allowed + 1(!) times. Once the
        limit is reached, action for EH_NOT_HANDLED is taken instead.

        *NOTE* This action is racy as the LLDD could finish the scmd
        after the timeout has expired but before it's added back. In
        such cases, scsi_done() would think that timeout has occurred
        and return without doing anything. The completion is lost and
        the command will time out again.

    - EH_NOT_HANDLED
        This is the same as when eh_timed_out() callback doesn't exist.
        Step #2 is taken.

 2. scsi_eh_scmd_add(scmd, SCSI_EH_CANCEL_CMD) is invoked for the
    command. See [1-3] for more information.


[1-3] How EH takes over

 scmds enter EH via scsi_eh_scmd_add(), which does the following.

 1. Turns on scmd->eh_eflags as requested. It's 0 for error
    completions and SCSI_EH_CANCEL_CMD for timeouts.

 2. Links scmd->eh_entry to shost->eh_cmd_q

 3. Sets SHOST_RECOVERY bit in shost->shost_state

 4. Increments shost->host_failed

 5. Wakes up SCSI EH thread if shost->host_busy == shost->host_failed

 As can be seen above, once any scmd is added to shost->eh_cmd_q,
SHOST_RECOVERY shost_state bit is turned on. This prevents any new
scmd from being issued from blk queue to the host; eventually, all
scmds on the host either complete normally, fail and get added to
eh_cmd_q, or time out and get added to shost->eh_cmd_q.

 If all scmds either complete or fail, the number of in-flight scmds
becomes equal to the number of failed scmds - i.e. shost->host_busy ==
shost->host_failed. This wakes up SCSI EH thread. So, once woken up,
SCSI EH thread can expect that all in-flight commands have failed and
are linked on shost->eh_cmd_q.

 Note that this does not mean lower layers are quiescent. If a LLDD
completed a scmd with error status, the LLDD and lower layers are
assumed to forget about the scmd at that point. However, if a scmd
has timed out, unless hostt->eh_timed_out() made lower layers forget
about the scmd, which currently no LLDD does, the command is still
active as far as lower layers are concerned and completion could
occur at any time. Of course, all such completions are ignored as the
timer has already expired.

 We'll talk about how SCSI EH takes actions to abort - make LLDD
forget about - timed out scmds later.


[2] How SCSI EH works

 LLDD's can implement SCSI EH actions in one of the following two
ways.

 - Fine-grained EH callbacks
        LLDD can implement fine-grained EH callbacks and let SCSI
        midlayer drive error handling and call appropriate callbacks.
        This will be discussed further in [2-1].

 - eh_strategy_handler() callback
        This is one big callback which should perform whole error
        handling. As such, it should do all chores SCSI midlayer
        performs during recovery. This will be discussed in [2-2].

 Once recovery is complete, SCSI EH resumes normal operation by
calling scsi_restart_operations(), which

 1. Checks if door locking is needed and locks door.

 2. Clears SHOST_RECOVERY shost_state bit

 3. Wakes up waiters on shost->host_wait.
    This occurs if someone calls scsi_block_when_processing_errors()
    on the host. (*QUESTION* why is it needed? All operations will
    be blocked anyway after it reaches blk queue.)

 4. Kicks the queues of all devices on the host.


[2-1] EH through fine-grained callbacks

[2-1-1] Overview

 If eh_strategy_handler() is not present, SCSI midlayer takes charge
of driving error handling. EH has two goals - make LLDD, host and
device forget about timed out scmds and make them ready for new
commands. A scmd is said to be recovered if the scmd is forgotten by
lower layers and lower layers are ready to process or fail the scmd
again.

 To achieve these goals, EH performs recovery actions with increasing
severity. Some actions are performed by issuing SCSI commands and
others are performed by invoking one of the following fine-grained
hostt EH callbacks. Callbacks may be omitted and omitted ones are
considered to fail always.

int (* eh_abort_handler)(struct scsi_cmnd *);
int (* eh_device_reset_handler)(struct scsi_cmnd *);
int (* eh_bus_reset_handler)(struct scsi_cmnd *);
int (* eh_host_reset_handler)(struct scsi_cmnd *);

 Higher-severity actions are taken only when lower-severity actions
cannot recover some of the failed scmds. Also, note that failure of
the highest-severity action means EH failure and results in offlining
of all unrecovered devices.

 During recovery, the following rules are followed

 - Recovery actions are performed on failed scmds on the to do list,
   eh_work_q. If a recovery action succeeds for a scmd, recovered
   scmds are removed from eh_work_q.

   Note that a single recovery action on a scmd can recover multiple
   scmds. e.g. resetting a device recovers all failed scmds on the
   device.

 - Higher severity actions are taken iff eh_work_q is not empty after
   lower severity actions are complete.

 - EH reuses failed scmds to issue commands for recovery. For
   timed-out scmds, SCSI EH ensures that LLDD forgets about a scmd
   before reusing it for EH commands.
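
 The escalation rules above can be illustrated with a small
user-space model. This is a sketch under stated assumptions, not the
midlayer's code: struct toy_scmd, struct toy_work_q, toy_action,
toy_run_eh() and the two example actions are hypothetical names
invented here. The work queue is a plain array standing in for
eh_work_q, and a NULL action stands in for an omitted callback, which
always fails.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CMDS 8

/* Hypothetical stand-in for a failed scmd: just the device it targets. */
struct toy_scmd {
    int dev;
};

/* Hypothetical stand-in for eh_work_q: failed commands still to recover. */
struct toy_work_q {
    struct toy_scmd cmds[TOY_MAX_CMDS];
    size_t len;
};

/*
 * One recovery action. It removes whatever it recovers from the work
 * queue (possibly several commands at once) and may recover nothing.
 */
typedef void (*toy_action)(struct toy_work_q *q);

/*
 * Run actions in order of increasing severity. A more severe action
 * runs only if the work queue is still non-empty. Returns true when
 * every command was recovered; false models EH failure.
 */
static bool toy_run_eh(struct toy_work_q *q, const toy_action *actions,
                       size_t n)
{
    for (size_t i = 0; i < n && q->len > 0; i++) {
        if (actions[i] != NULL)     /* omitted callback: always fails */
            actions[i](q);
    }
    return q->len == 0;
}

/* Example action: "reset device 0", recovering every command on it. */
static void toy_reset_dev0(struct toy_work_q *q)
{
    size_t kept = 0;

    for (size_t i = 0; i < q->len; i++) {
        if (q->cmds[i].dev != 0)
            q->cmds[kept++] = q->cmds[i];
    }
    q->len = kept;
}

/* Example action: "reset the host", recovering everything that is left. */
static void toy_reset_host(struct toy_work_q *q)
{
    q->len = 0;
}
```

 With two failed commands on device 0 and one on device 1, passing
{ NULL, toy_reset_dev0 } leaves the device 1 command on the queue and
toy_run_eh() reports failure; appending toy_reset_host to the array
empties the queue.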

 When a scmd is recovered, the scmd is moved from eh_work_q to EH
local eh_done_q using scsi_eh_finish_cmd(). After all scmds are
recovered (eh_work_q is empty), scsi_eh_flush_done_q() is invoked to
either retry or error-finish (notify upper layer of failure) recovered
scmds.

 A scmd is retried iff its sdev is still online (not offlined during
EH), REQ_FAILFAST is not set and ++scmd->retries is less than
scmd->allowed.


[2-1-2] Flow of scmds through EH