\documentclass[11pt]{article}
\usepackage{xspace}
\usepackage{times}
\usepackage{psfig}
\usepackage{amsmath}
\newcommand{\module} {{\em module}\@\xspace}
\newcommand{\modules} {{\em modules}\@\xspace}
\newcommand{\finder} {{\em Finder}\@\xspace}
\newcommand{\cm} {{\em CM}\@\xspace}
\textwidth 6.5in
\topmargin 0.0in
\textheight 8.5in
\headheight 0in
\headsep 0in
\oddsidemargin 0in
%\date{}
\title{Error Handling in XORP}
\author{First pass: Mark Handley}
%\twocolumn
\begin{document}
\parsep 0ex
\parskip 1.0ex
\parindent 0em
\noindent
\maketitle

\section{Introduction}

A XORP router consists of a number of user-space modules communicating
via XRLs, a Forwarding Engine Abstraction process (which is
effectively a kernel proxy) and a kernel forwarding path.  State is
communicated between these modules during the normal operation of the
router.  In this document we attempt to map out what the correct
actions should be when XORP modules unexpectedly restart, in terms of
who is responsible for removing the now-obsolete state, and who is
responsible for re-instantiating the state to get back to operational
status.

This document assumes roughly the process structure in
Figure~\ref{modules}.  The RouterManager process is responsible for
starting up all other processes, and for monitoring their health using
periodic XRL status messages.

\begin{figure}[htb]
\centerline{\psfig{figure=processes3.ps,width=0.6\textwidth}}
  \caption{XORP Modules}
\label{modules}
\end{figure}

\section{General Infrastructure}

\subsection{RouterManager Restart}

A RouterManager restart is considered a fatal event; all router
processes will be terminated and restarted.

\subsection{Finder Restart}

The XRL mechanism should include a keepalive mechanism that allows
each process to detect when the Finder has failed.  Each process's XRL
library will then repeatedly attempt to re-connect to the Finder.
When the reconnect succeeds, the state that was registered with the old
Finder instance will then be re-registered, allowing normal router
operation to resume.  This whole process is handled in the XRL library
code, so no explicit action needs to be taken by any process, except
for the RouterManager, which must restart the Finder process.

During the Finder restart process, some services will be temporarily
unavailable if a client does not have a cached resolved copy of an XRL
for a request it wishes to make.  All processes should gracefully
handle a NO\_FINDER error during regular operation and simply
retry the request periodically.  It should be noted that after a
Finder restart, there may be a period of time when the new Finder has
incomplete state, and may return a FINDER\_INITIALIZING error because
the destination module has not yet had a chance to re-register.
Processes should gracefully handle such an event, and generally retry the
request.  Outside of this period, FINDER\_INITIALIZING will not be
returned; instead RESOLVE\_FAILED will be returned, and should be
treated as an error in the normal way.

Since the Finder has no way of knowing on its own when its
initialization period should end, it may be able to use the RouterManager
to tell it all of the processes it should expect to hear from.  The
Finder could then infer the end of the initialization period when each
process has registered its mappings.  The entering of mappings into
the Finder by client processes may need modifying so that initial
mappings are done in one transaction, to facilitate detecting the end of
initialization.

\subsection{CLI Restart}

A CLI restart will terminate all terminal sessions with that CLI
instance.  However, the CLI itself will not generally store
configuration state, so a CLI restart should not cause any process
other than the RouterManager to take any remedial action.

\section{Unicast Routes}

\subsection{FEA Restart}

\subsubsection*{State teardown}

The FEA is stateless with respect to unicast routes.
FEA termination
does not cause the forwarding kernel to remove forwarding state.  The
URIB will detect that the FEA has changed from XRL keepalive message
timeout, but it takes no immediate action on this event.

The RouterManager process will detect the FEA failure,
either from a SIGCHLD, or from XRL keepalive message timeout.  It
should log the event, and attempt to restart the FEA.

\subsubsection*{State re-instantiation}

The URIB will repeatedly attempt to contact the new FEA (sending an XRL
status request) until it is successful.  While the FEA is not
responding, routing changes may occur; these cannot be passed to the
FEA, but this failure should be silent.  Upon detecting successful FEA
restart, the URIB will send a {\tt flush\_all\_routes}, and then dump
its entire forwarding table into the FEA.

If the URIB fails to detect FEA restart after a suitable period of
time (e.g.\ 10 seconds), it should send a {\tt forwarding\_engine\_failure}
message to the routing protocol processes that are registered with
it.  On receipt of such a message, a routing process should send
appropriate messages to its peers withdrawing all routing information
that may cause forwarding through this router.  If the URIB
subsequently succeeds in contacting the FEA, it should send a
{\tt forwarding\_engine\_alive} message to the routing protocol
processes, which re-enables normal operation.

\subsection{URIB Restart}

\subsubsection*{State teardown}

The FEA is not required to monitor the status of the URIB.  Even if it
does detect the URIB failure, it should not take any action.

Each routing protocol will detect URIB failure from XRL keepalive
message timeout.  The default behavior on a routing process detecting
URIB failure should be to immediately send appropriate messages to its
peers withdrawing all routing information that may cause forwarding
through this router.  In general, the routing protocol should not
drop its peerings unless this is the only way to prevent forwarding.
The RouterManager process will detect the URIB failure,
either from a SIGCHLD, or from XRL keepalive message timeout.  It
should log the event, and attempt to restart the URIB.

\subsubsection*{State re-instantiation}

After detecting URIB failure, routing processes should repeatedly
attempt to re-contact the URIB (sending an XRL status request) until it
is successful.  Upon success, they should re-send their routing table
to the URIB, and re-register any nexthop information that they need to
hear from the URIB.  After this is done, they should re-advertise the
router to their peers as being available for forwarding.

\subsection{Routing Protocol Restart}

\subsubsection*{State teardown}

The URIB should detect the failure of a routing protocol from XRL
keepalive message timeout.  It should then delete all the routes that
it has stored from that routing protocol, propagating the deletions to
both the kernel and to other routing protocols that have notification
entries set for these routes.

The RouterManager process will detect the routing process failure,
either from a SIGCHLD, or from XRL keepalive message timeout.  It
should log the event, and attempt to restart the routing process.

\subsubsection*{State re-instantiation}

Routing state re-instantiation after a routing protocol restart is
handled the same as if that routing protocol started up for the first
time; no special action is required.

\section{Multicast Routes}

\subsection{FEA Restart}

\subsubsection*{State teardown}

The kernel should detect that the routing socket has closed.  On
closure of the routing socket, all multicast forwarding state is
flushed from the kernel.

The MFIB process (which may or may not be a part of PIM-SM) will
detect the FEA failure from XRL keepalive message timeout, or from
timeout of an attempt to modify multicast routing state.
The MFIB
must assume that the FEA will be restarted, and so it will attempt to
re-contact the FEA periodically.

PIM-SM should not take any special action.

The RouterManager process will detect the FEA failure, either from a
SIGCHLD, or from XRL keepalive message timeout.  It should log the
event, and attempt to restart the FEA.

\subsubsection*{State re-instantiation}

No special action is required for state re-instantiation after the FEA
has restarted.  When a multicast data packet arrives, it will generate
a CACHEMISS event, which will be signaled to the FEA.  The FEA
doesn't have state for this (S,G) or (*,G), so the CACHEMISS is then
propagated to the MFIB process.

\begin{itemize}
\item If the MFIB has the appropriate forwarding information, this
will be sent to the FEA, which will then send it to the kernel so that
the packet can be forwarded correctly.
\item If the MFIB has state, but the packet arrived on the wrong
interface, the MFIB should still send this forwarding information to the
FEA, and it should also signal WRONGIIF to the appropriate multicast
routing process.
\item If the MFIB has no forwarding state that matches this (S,G),
CACHEMISS is signaled to the appropriate multicast routing process.
\end{itemize}

In this way, as data packets arrive, they cause both the FEA and the
kernel's multicast forwarding state to be re-instantiated.

\section*{XRL Error Handling}

Interprocess communication in XORP is achieved using XRLs.  In this
section we will consider what should be done when an XRL call fails
due to a communication error.

All XRL calls will ultimately get a response.  In the normal case the
response returns the status of the call (good or bad).
In addition to
error responses produced by the application, the XRL library can also
return the following error responses:

\begin{itemize}
\item NO\_FINDER
\item RESOLVE\_FAILED
\item SEND\_FAILED
\item REPLY\_TIMED\_OUT
\end{itemize}

From an application point of view, the first three errors are
equivalent: the XRL was not communicated to the destination.  We will
discuss these below using the generic term {\em XRL send failure}.

However, it is not clear what can be inferred from a timeout
response.  The reasons for a timeout can be: the peer has died, the peer
is slow to respond, or the network cable has been removed.  As in all
network communications, when a timeout occurs we don't know if the last
unacknowledged XRL request was received and processed by the peer.

If the timeout has occurred because the peer has died, we will receive
notification of this explicitly and will deal with it as specified in
Section~\ref{pfailure}.  Thus an XRL transport error SHOULD NOT be
taken as an indication that the peer is dead.  If an application cares
that the peer has died or restarted, it SHOULD register with the
Finder to receive notifications of process restarts.  Thus a process
SHOULD assume that an XRL transport problem will be transient until it
receives an explicit confirmation that the destination has failed.

XRLs can be sent over unreliable transports such as UDP or reliable
transports such as TCP.  The type of transport that should be used will
be specified when defining the interface.  In the case of reliable
transport, the errors above should generally not occur, but in any
event we need general rules about how to handle them should something
fail in an unexpected way.

In addition, the way the application uses an XRL interface can be
pipelined or non-pipelined.
In the pipelined case, multiple requests
can be outstanding simultaneously; in the non-pipelined case at most
one request can be outstanding at a time.

It is useful for us to categorise XRL interfaces along these two axes:
reliable/unreliable and pipelined/non-pipelined.

\subsubsection*{Unreliable, Non-pipelined}

If an XRL send failure occurs, the sending application MAY choose to
retransmit the XRL, or ignore the failure as it sees fit.

If an XRL timeout occurs, the sending application MAY also choose to
retransmit the XRL, or ignore the failure as it sees fit.  However, if
the application chooses to re-send the XRL, the interface MUST be
written in such a way that if this XRL had previously been received,
this will not cause a further failure.

\subsubsection*{Reliable, Non-pipelined}

If an XRL send failure occurs, the sending application SHOULD
retransmit the XRL.  If an XRL timeout occurs, the sending application
SHOULD also retransmit the XRL.  Further requests using this interface
MUST be queued until the XRL has successfully been received.

The interface should be written in such a way that if this XRL had
previously been received, this will not cause a further failure.

An alternative strategy is possible.  If the XRL in question changes
state at the receiving application, the interface may also support a
query mechanism.  If the XRL fails with a timeout, the sending
application may opt not to blindly re-transmit the XRL, but instead
send a query (retransmitted as necessary) to determine whether the
state at the remote system is as it would be if the XRL had been
received.  Only if the query indicates that the state was not
received would the original XRL be retransmitted.

This alternative is more complicated, and so it should only be
preferred when the consequences of receiving the same XRL twice
outweigh the additional complexity.

\subsubsection*{Unreliable, Pipelined}

The same issues apply as with unreliable, non-pipelined, but the
situation is more complicated.
An interface that uses unreliable
transport and pipelining is one that explicitly permits loss and {\em
re-ordering} of requests.  It is up to the application to choose
whether to retransmit XRLs that return an XRL send failure or a timeout,
but the application must only do so if it is certain that the re-ordering
caused by retransmission will not be a problem.

\subsubsection*{Reliable, Pipelined}

Reliable, pipelined interfaces are the most difficult in which to
handle XRL errors.  Three issues need to be considered:

\begin{itemize}
\item The XRL that failed due to a transport error may be followed by
pipelined XRLs that succeeded.
\item The XRL that failed due to a transport error may be followed by
pipelined XRLs that failed at the application level due to the state
caused by the first failed XRL not being instantiated.
\item If a failed XRL was followed by pipelined XRLs that succeeded,
retransmitting that XRL will cause a re-ordering that might leave the
destination in a different state than it would be if the XRLs had
arrived in order.
\end{itemize}

To avoid all these problems we require that a reliable XRL transport
fail to deliver all subsequent XRLs in the pipeline if a single XRL
fails.  Thus the reliable pipelined interface falls back to a
reliable, non-pipelined interface after a failure.  A subsequent XRL
that succeeds then permits pipelined operation to resume.

\bibliographystyle{plain}
\bibliography{xorp}
\end{document}