rfc817.txt
layer boundary occurs when the protocol package is relegated to a front-end processor. In this case the interface to the protocol is some other protocol. It is difficult to imagine how to build close cooperation between layers when they are that far separated. Realistically, one of the prices which must be associated with an implementation so physically modularized is that the performance will suffer as a result. Of course, a separate processor for protocols could be very closely integrated into the mainframe architecture, with interprocessor co-ordination signals, shared memory, and similar features. Such a physical modularity might work very well, but there is little documented experience with this closely coupled architecture for protocol support.

6. Efficiency of Protocol Processing

To this point, this document has considered how a protocol package should be broken into modules, and how those modules should be distributed between free standing machines, the operating system kernel, and one or more user processes. It is now time to consider the other half of the efficiency question, which is what can be done to speed the execution of those programs that actually implement the protocols. We will make some specific observations about TCP and IP, and then conclude with a few generalities.

IP is a simple protocol, especially with respect to the processing of normal packets, so it should be easy to get it to perform efficiently. The only area of any complexity related to actual packet processing has to do with fragmentation and reassembly. The reader is referred to RFC 815, titled "IP Datagram Reassembly Algorithms", for specific consideration of this point.

Most costs in the IP layer come from table look up functions, as opposed to packet processing functions. An outgoing packet requires two translation functions to be performed.
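As an illustration of how such table look ups can be made cheap, here is a minimal sketch in C of a route table organized as a hash table keyed on network number. All names, field layouts, and the hashing scheme are assumptions of this sketch, not part of any particular implementation.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of a route cache: maps a destination network number to a
 * gateway address in O(1) expected time, instead of scanning a
 * thousand-entry table linearly on every packet. */
#define ROUTE_BUCKETS 256

struct route {
    uint32_t net;          /* destination network number */
    uint32_t gateway;      /* next-hop gateway address   */
    struct route *next;    /* chain within one bucket    */
};

static struct route *route_table[ROUTE_BUCKETS];

static unsigned route_hash(uint32_t net)
{
    return (net ^ (net >> 8) ^ (net >> 16)) % ROUTE_BUCKETS;
}

void route_add(struct route *r)
{
    unsigned h = route_hash(r->net);
    r->next = route_table[h];
    route_table[h] = r;
}

/* Returns the gateway for a network, or 0 if no route is known. */
uint32_t route_lookup(uint32_t net)
{
    for (struct route *r = route_table[route_hash(net)]; r; r = r->next)
        if (r->net == net)
            return r->gateway;
    return 0;
}
```

The per-connection hint mentioned below fits naturally on top of this: TCP would simply remember the last `struct route` returned and hand it back to IP, skipping even the hash.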
The internet address must be translated to a target gateway, and a gateway address must be translated to a local network number (if the host is attached to more than one network). It is easy to build a simple implementation of these table look up functions that in fact performs very poorly. The programmer should keep in mind that there may be as many as a thousand network numbers in a typical configuration. Linear searching of a thousand entry table on every packet is extremely unsuitable. In fact, it may be worth asking TCP to cache a hint for each connection, which can be handed down to IP each time a packet is sent, to try to avoid the overhead of a table look up.

TCP is a more complex protocol, and presents many more opportunities for getting things wrong. There is one area which is generally accepted as causing noticeable and substantial overhead as part of TCP processing. This is computation of the checksum. It would be nice if this cost could be avoided somehow, but the idea of an end-to-end checksum is absolutely central to the functioning of TCP. No host implementor should think of omitting the validation of a checksum on incoming data.

Various clever tricks have been used to try to minimize the cost of computing the checksum. If it is possible to add additional microcoded instructions to the machine, a checksum instruction is the most obvious candidate. Since computing the checksum involves picking up every byte of the segment and examining it, it is possible to combine the operation of computing the checksum with the operation of copying the segment from one location to another. Since a number of data copies are probably already required as part of the processing structure, this kind of sharing might conceivably pay off if it didn't cause too much trouble to the modularity of the program. Finally, computation of the checksum seems to be one place where careful attention to the details of the algorithm used can make a drastic difference in the throughput of the program.
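On a conventional byte-addressed machine the algorithm itself is simple: a 16-bit one's complement sum over the segment. The sketch below is a plain C rendering of that algorithm, not a tuned inner loop of the kind discussed here; a production version would unroll the loop and fold carries less often.

```c
#include <stdint.h>
#include <stddef.h>

/* Internet checksum: one's complement sum of the data taken as a
 * sequence of 16-bit big-endian words, with the final sum inverted.
 * An odd trailing byte is treated as if padded with a zero byte. */
uint16_t cksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {
        sum += (uint32_t)data[0] << 8 | data[1];
        data += 2;
        len  -= 2;
    }
    if (len)
        sum += (uint32_t)data[0] << 8;

    /* Fold any carries out of the low 16 bits back in (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```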
The Multics system provides one of the best case studies of this, since Multics is about as poorly organized to perform this function as any machine implementing TCP. Multics is a 36-bit word machine, with four 9-bit bytes per word. The eight-bit bytes of a TCP segment are laid down packed in memory, ignoring word boundaries. This means that when it is necessary to pick up the data as a set of 16-bit units for the purpose of adding them to compute checksums, horrible masking and shifting is required for each 16-bit value. An early version of a program using this strategy required 6 milliseconds to checksum a 576-byte segment. Obviously, at this point, checksum computation was becoming the central bottleneck to throughput. A more careful recoding of this algorithm reduced the checksum processing time to less than one millisecond. The strategy used was extremely dirty. It involved adding up carefully selected words of the area in which the data lay, knowing that for those particular words, the 16-bit values were properly aligned inside the words. Only after the addition had been done were the various sums shifted, and finally added to produce the eventual checksum. This kind of highly specialized programming is probably not acceptable if used everywhere within an operating system. It is clearly appropriate for one highly localized function which can be clearly identified as an extreme performance bottleneck.

Another area of TCP processing which may cause performance problems is the overhead of examining all of the possible flags and options which occur in each incoming packet. One paper, by Bunch and Day [2], asserts that the overhead of packet header processing is actually an important limiting factor in throughput computation. Not all measurement experiments have tended to support this result. To whatever extent it is true, however, there is an obvious strategy which the implementor ought to use in designing his program. He should build his program to optimize the expected case.
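The expected-case test described in what follows can be compressed into a couple of comparisons. Here is one possible sketch in C; the TCB field, flag values, and function names are assumptions of the sketch, not any real system's API.

```c
#include <stdint.h>

/* TCP control flag bits, as laid out in the TCP header. */
#define TCP_FIN 0x01
#define TCP_SYN 0x02
#define TCP_RST 0x04
#define TCP_PSH 0x08
#define TCP_ACK 0x10
#define TCP_URG 0x20

struct tcb {                /* per-connection state (sketch) */
    uint32_t rcv_nxt;       /* next sequence number expected */
};

/* Returns nonzero if the segment is the common case: ACK on, PSH
 * optionally on, no other flags, and exactly the next sequence
 * number expected.  Anything else falls into the slow path. */
int is_expected_segment(const struct tcb *tp, uint32_t seq, uint8_t flags)
{
    /* One test for the flags: force PSH on, then compare against
     * the only acceptable combination. */
    if ((flags | TCP_PSH) != (TCP_ACK | TCP_PSH))
        return 0;
    /* One test for the sequence number. */
    return seq == tp->rcv_nxt;
}
```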
It is easy, especially when first designing a program, to pay equal attention to all of the possible outcomes of every test. In practice, however, few of these will ever happen. A TCP should be built on the assumption that the next packet to arrive will have absolutely nothing special about it, and will be the next one expected in the sequence space. One or two tests are sufficient to determine that the expected set of control flags are on. (The ACK flag should be on; the Push flag may or may not be on. No other flags should be on.) One test is sufficient to determine that the sequence number of the incoming packet is one greater than the last sequence number received. In almost every case, that will be the actual result.

Again, using the Multics system as an example, failure to optimize the case of receiving the expected sequence number had a detectable effect on the performance of the system. The particular problem arose when a number of packets arrived at once. TCP attempted to process all of these packets before awaking the user. As a result, by the time the last packet arrived, there was a threaded list of packets which had several items on it. When a new packet arrived, the list was searched to find the location into which the packet should be inserted. Obviously, the list should be searched from highest sequence number to lowest sequence number, because one is expecting to receive a packet which comes after those already received. By mistake, the list was searched from front to back, starting with the packets with the lowest sequence number. The amount of time spent searching this list backwards was easily detectable in the metering measurements.

Other data structures can be organized to optimize the action which is normally taken on them. For example, the retransmission queue is very seldom actually used for retransmission, so it should not be organized to optimize that action.
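One possible organization for the common action, sketched in C with illustrative names, is to queue small descriptors in ascending sequence order and trim from the front on each acknowledgement; the descriptors point into the user's send buffer rather than holding copies of the packets.

```c
#include <stdint.h>
#include <stddef.h>

/* Retransmission queue entry: records only where the data lies in
 * the user's send buffer, so a segment is rebuilt from that buffer
 * only in the rare case it must actually be retransmitted. */
struct rexmit {
    uint32_t seq;           /* first sequence number covered    */
    uint32_t len;           /* bytes of user data               */
    size_t   buf_off;       /* offset into the user send buffer */
    struct rexmit *next;    /* kept in ascending sequence order */
};

/* Drop every entry fully covered by the cumulative ACK and return
 * the new head.  The common case touches only the front of the
 * list; a real system would also free each discarded entry. */
struct rexmit *rexmit_ack(struct rexmit *head, uint32_t ack)
{
    while (head && head->seq + head->len <= ack)
        head = head->next;
    return head;
}
```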
In fact, it should be organized to optimize the discarding of things from it when the acknowledgement arrives. In many cases, the easiest way to do this is not to save the packet at all, but to reconstruct it only if it needs to be retransmitted, starting from the data as it was originally buffered by the user.

There is another generality, at least as important as optimizing the common case, which is to avoid copying data any more times than necessary. One more result from the Multics TCP may prove enlightening here. Multics takes between two and three milliseconds within the TCP layer to process an incoming packet, depending on its size. For a 576-byte packet, the three milliseconds is used up approximately as follows. One millisecond is used computing the checksum. Six hundred microseconds is spent copying the data. (The data is copied twice, at .3 milliseconds a copy.) One of those copy operations could correctly be included as part of the checksum cost, since it is done to get the data on a known word boundary to optimize the checksum algorithm. However, the copy also performs another necessary transfer at the same time. Header processing and packet resequencing takes .7 milliseconds. The rest of the time is used in miscellaneous processing, such as removing packets from the retransmission queue which are acknowledged by this packet.

Data copying is the second most expensive single operation after data checksumming. Some implementations, often because of an excessively layered modularity, end up copying the data around a great deal. Other implementations end up copying the data because there is no shared memory between processes, and the data must be moved from process to process via a kernel operation. Unless the amount of this activity is kept strictly under control, it will quickly become the major performance bottleneck.
7. Conclusions

This document has addressed two aspects of obtaining performance from a protocol implementation: the way in which the protocol is layered and integrated into the operating system, and the way in which the detailed handling of the packet is optimized. It would be nice if one or the other of these costs would completely dominate, so that all of one's attention could be concentrated there. Regrettably, this is not so. Depending on the particular sort of traffic one is getting, for example, whether Telnet one-byte packets or file transfer maximum size packets at maximum speed, one can expect to see one or the other cost being the major bottleneck to throughput. Most implementors who have studied their programs in an attempt to find out where the time was going have reached the unsatisfactory conclusion that it is going equally to all parts of their program. With the possible exception of checksum processing, very few people have ever found that their performance problems were due to a single, horrible bottleneck which they could fix by a single stroke of inventive programming. Rather, the performance was something which was improved by painstaking tuning of the entire program.

Most discussions of protocols begin by introducing the concept of layering, which tends to suggest that layering is a fundamentally wonderful idea which should be a part of every consideration of protocols. In fact, layering is a mixed blessing. Clearly, a layer interface is necessary whenever more than one client of a particular layer is to be allowed to use that same layer. But an interface, precisely because it is fixed, inevitably leads to a lack of complete understanding as to what one layer wishes to obtain from another. This has to lead to inefficiency. Furthermore, layering is a potential snare in that one is tempted to think that a layer boundary, which was an artifact of the specification procedure, is in fact the proper boundary to use in modularizing the implementation.
Again, in certain cases, an architected layer must correspond to an implemented layer, precisely so that several clients can have access to that layer in a reasonably straightforward manner. In other cases, cunning rearrangement of the implemented module boundaries to match with various functions, such as the demultiplexing of incoming packets, or the sending of asynchronous outgoing packets, can lead to unexpected performance improvements compared to more traditional implementation strategies.

Finally, good performance is something which is difficult to retrofit onto an existing program. Since performance is influenced, not just by the fine detail, but by the gross structure, it is sometimes the case that in order to obtain a substantial performance improvement, it is necessary to completely redo the program from the bottom up. This is a great disappointment to programmers, especially those doing a protocol implementation for the first time. Programmers who are somewhat inexperienced and unfamiliar with protocols are sufficiently concerned with getting their program logically correct that they do not have the capacity to think at the same time about the performance of the structure they are building. Only after they have achieved a logically correct program do they discover that they have done so in a way which has precluded real performance. Clearly, it is more difficult to design a program thinking from the start about both logical correctness and performance. With time, as implementors as a group learn more about the appropriate structures to use for building protocols, it will be possible to proceed with an implementation project having more confidence that the structure is rational, that the program will work, and that the program will work well. Those of us now implementing protocols have the privilege of being on the forefront of this learning process. It should be no surprise that our programs sometimes suffer from the uncertainty we bring to bear on them.
Citations

[1] Cohen and Postel, "On Protocol Multiplexing", Sixth Data Communications Symposium, ACM/IEEE, November 1979.

[2] Bunch and Day, "Control Structure Overhead in TCP", Trends and Applications: Computer Networking, NBS Symposium, May 1980.