rfc817.txt
layer boundary occurs when the protocol package is relegated to a front-end processor. In this case the interface to the protocol is some other protocol. It is difficult to imagine how to build close cooperation between layers when they are that far separated. Realistically, one of the prices which must be associated with an implementation so physically modularized is that the performance will suffer as a result. Of course, a separate processor for protocols could be very closely integrated into the mainframe architecture, with interprocessor co-ordination signals, shared memory, and similar features. Such a physical modularity might work very well, but there is little documented experience with this closely coupled architecture for protocol support.

6. Efficiency of Protocol Processing

To this point, this document has considered how a protocol package should be broken into modules, and how those modules should be distributed between free standing machines, the operating system kernel, and one or more user processes. It is now time to consider the other half of the efficiency question, which is what can be done to speed the execution of those programs that actually implement the protocols. We will make some specific observations about TCP and IP, and then conclude with a few generalities.

IP is a simple protocol, especially with respect to the processing of normal packets, so it should be easy to get it to perform efficiently. The only area of any complexity related to actual packet processing has to do with fragmentation and reassembly. The reader is referred to RFC 815, titled "IP Datagram Reassembly Algorithms", for specific consideration of this point.

Most costs in the IP layer come from table look up functions, as opposed to packet processing functions. An outgoing packet requires two translation functions to be performed.
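As an illustration of how such table look ups can be made cheap, here is a minimal sketch in C of a route table organized as a hash table keyed on network number. All names, field layouts, and the hashing scheme are assumptions of this sketch, not part of any particular implementation.

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of a route cache: maps a destination network number to a
 * gateway address in O(1) expected time, instead of scanning a
 * thousand-entry table linearly on every packet. */
#define ROUTE_BUCKETS 256

struct route {
    uint32_t net;          /* destination network number */
    uint32_t gateway;      /* next-hop gateway address   */
    struct route *next;    /* chain within one bucket    */
};

static struct route *route_table[ROUTE_BUCKETS];

static unsigned route_hash(uint32_t net)
{
    return (net ^ (net >> 8) ^ (net >> 16)) % ROUTE_BUCKETS;
}

void route_add(struct route *r)
{
    unsigned h = route_hash(r->net);
    r->next = route_table[h];
    route_table[h] = r;
}

/* Returns the gateway for a network, or 0 if no route is known. */
uint32_t route_lookup(uint32_t net)
{
    for (struct route *r = route_table[route_hash(net)]; r; r = r->next)
        if (r->net == net)
            return r->gateway;
    return 0;
}
```

The per-connection hint mentioned below fits naturally on top of this: TCP would simply remember the last `struct route` returned and hand it back to IP, skipping even the hash.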
The internet address must be translated to a target gateway, and a gateway address must be translated to a local network number (if the host is attached to more than one network). It is easy to build a simple implementation of these table look up functions that in fact performs very poorly. The programmer should keep in mind that there may be as many as a thousand network numbers in a typical configuration. Linear searching of a thousand entry table on every packet is extremely unsuitable. In fact, it may be worth asking TCP to cache a hint for each connection, which can be handed down to IP each time a packet is sent, to try to avoid the overhead of a table look up.

TCP is a more complex protocol, and presents many more opportunities for getting things wrong. There is one area which is generally accepted as causing noticeable and substantial overhead as part of TCP processing. This is computation of the checksum. It would be nice if this cost could be avoided somehow, but the idea of an end-to-end checksum is absolutely central to the functioning of TCP. No host implementor should think of omitting the validation of a checksum on incoming data.

Various clever tricks have been used to try to minimize the cost of computing the checksum. If it is possible to add additional microcoded instructions to the machine, a checksum instruction is the most obvious candidate. Since computing the checksum involves picking up every byte of the segment and examining it, it is possible to combine the operation of computing the checksum with the operation of copying the segment from one location to another. Since a number of data copies are probably already required as part of the processing structure, this kind of sharing might conceivably pay off if it didn't cause too much trouble to the modularity of the program. Finally, computation of the checksum seems to be one place where careful attention to the details of the algorithm used can make a drastic difference in the throughput of the program.
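On a conventional byte-addressed machine the algorithm itself is simple: a 16-bit one's complement sum over the segment. The sketch below is a plain C rendering of that algorithm, not a tuned inner loop of the kind discussed here; a production version would unroll the loop and fold carries less often.

```c
#include <stdint.h>
#include <stddef.h>

/* Internet checksum: one's complement sum of the data taken as a
 * sequence of 16-bit big-endian words, with the final sum inverted.
 * An odd trailing byte is treated as if padded with a zero byte. */
uint16_t cksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    while (len > 1) {
        sum += (uint32_t)data[0] << 8 | data[1];
        data += 2;
        len  -= 2;
    }
    if (len)
        sum += (uint32_t)data[0] << 8;

    /* Fold any carries out of the low 16 bits back in (end-around carry). */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```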
The Multics system provides one of the best case studies of this, since Multics is about as poorly organized to perform this function as any machine implementing TCP. Multics is a 36-bit word machine, with four 9-bit bytes per word. The eight-bit bytes of a TCP segment are laid down packed in memory, ignoring word boundaries. This means that when it is necessary to pick up the data as a set of 16-bit units for the purpose of adding them to compute checksums, horrible masking and shifting is required for each 16-bit value. An early version of a program using this strategy required 6 milliseconds to checksum a 576-byte segment. Obviously, at this point, checksum computation was becoming the central bottleneck to throughput. A more careful recoding of this algorithm reduced the checksum processing time to less than one millisecond. The strategy used was extremely dirty. It involved adding up carefully selected words of the area in which the data lay, knowing that for those particular words, the 16-bit values were properly aligned inside the words. Only after the addition had been done were the various sums shifted, and finally added to produce the eventual checksum. This kind of highly specialized programming is probably not acceptable if used everywhere within an operating system. It is clearly appropriate for one highly localized function which can be clearly identified as an extreme performance bottleneck.

Another area of TCP processing which may cause performance problems is the overhead of examining all of the possible flags and options which occur in each incoming packet. One paper, by Bunch and Day [2], asserts that the overhead of packet header processing is actually an important limiting factor in throughput computation. Not all measurement experiments have tended to support this result. To whatever extent it is true, however, there is an obvious strategy which the implementor ought to use in designing his program. He should build his program to optimize the expected case.
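The expected-case test described in what follows can be compressed into a couple of comparisons. Here is one possible sketch in C; the TCB field, flag values, and function names are assumptions of the sketch, not any real system's API.

```c
#include <stdint.h>

/* TCP control flag bits, as laid out in the TCP header. */
#define TCP_FIN 0x01
#define TCP_SYN 0x02
#define TCP_RST 0x04
#define TCP_PSH 0x08
#define TCP_ACK 0x10
#define TCP_URG 0x20

struct tcb {                /* per-connection state (sketch) */
    uint32_t rcv_nxt;       /* next sequence number expected */
};

/* Returns nonzero if the segment is the common case: ACK on, PSH
 * optionally on, no other flags, and exactly the next sequence
 * number expected.  Anything else falls into the slow path. */
int is_expected_segment(const struct tcb *tp, uint32_t seq, uint8_t flags)
{
    /* One test for the flags: force PSH on, then compare against
     * the only acceptable combination. */
    if ((flags | TCP_PSH) != (TCP_ACK | TCP_PSH))
        return 0;
    /* One test for the sequence number. */
    return seq == tp->rcv_nxt;
}
```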
It is easy, especially when first designing a program, to pay equal attention to all of the possible outcomes of every test. In practice, however, few of these will ever happen. A TCP should be built on the assumption that the next packet to arrive will have absolutely nothing special about it, and will be the next one expected in the sequence space. One or two tests are sufficient to determine that the expected set of control flags are on. (The ACK flag should be on; the Push flag may or may not be on. No other flags should be on.) One test is sufficient to determine that the sequence number of the incoming packet is one greater than the last sequence number received. In almost every case, that will be the actual result.

Again, using the Multics system as an example, failure to optimize the case of receiving the expected sequence number had a detectable effect on the performance of the system. The particular problem arose when a number of packets arrived at once. TCP attempted to process all of these packets before awaking the user. As a result, by the time the last packet arrived, there was a threaded list of packets which had several items on it. When a new packet arrived, the list was searched to find the location into which the packet should be inserted. Obviously, the list should be searched from highest sequence number to lowest sequence number, because one is expecting to receive a packet which comes after those already received. By mistake, the list was searched from front to back, starting with the packets with the lowest sequence number. The amount of time spent searching this list backwards was easily detectable in the metering measurements.

Other data structures can be organized to optimize the action which is normally taken on them. For example, the retransmission queue is very seldom actually used for retransmission, so it should not be organized to optimize that action.
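One possible organization for the common action, sketched in C with illustrative names, is to queue small descriptors in ascending sequence order and trim from the front on each acknowledgement; the descriptors point into the user's send buffer rather than holding copies of the packets.

```c
#include <stdint.h>
#include <stddef.h>

/* Retransmission queue entry: records only where the data lies in
 * the user's send buffer, so a segment is rebuilt from that buffer
 * only in the rare case it must actually be retransmitted. */
struct rexmit {
    uint32_t seq;           /* first sequence number covered    */
    uint32_t len;           /* bytes of user data               */
    size_t   buf_off;       /* offset into the user send buffer */
    struct rexmit *next;    /* kept in ascending sequence order */
};

/* Drop every entry fully covered by the cumulative ACK and return
 * the new head.  The common case touches only the front of the
 * list; a real system would also free each discarded entry. */
struct rexmit *rexmit_ack(struct rexmit *head, uint32_t ack)
{
    while (head && head->seq + head->len <= ack)
        head = head->next;
    return head;
}
```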
In fact, it should be organized to optimize the discarding of things from it when the acknowledgement arrives. In many cases, the easiest way to do this is not to save the packet at all, but to reconstruct it only if it needs to be retransmitted, starting from the data as it was originally buffered by the user.

There is another generality, at least as important as optimizing the common case, which is to avoid copying data any more times than necessary. One more result from the Multics TCP may prove enlightening here. Multics takes between two and three milliseconds within the TCP layer to process an incoming packet, depending on its size. For a 576-byte packet, the three milliseconds is used up approximately as follows. One millisecond is used computing the checksum. Six hundred microseconds is spent copying the data. (The data is copied twice, at .3 milliseconds a copy.) One of those copy operations could correctly be included as part of the checksum cost, since it is done to get the data on a known word boundary to optimize the checksum algorithm. However, the copy also performs another necessary transfer at the same time. Header processing and packet resequencing takes .7 milliseconds. The rest of the time is used in miscellaneous processing, such as removing packets from the retransmission queue which are acknowledged by this packet.

Data copying is the second most expensive single operation after data checksumming. Some implementations, often because of an excessively layered modularity, end up copying the data around a great deal. Other implementations end up copying the data because there is no shared memory between processes, and the data must be moved from process to process via a kernel operation. Unless the amount of this activity is kept strictly under control, it will quickly become the major performance bottleneck.
7. Conclusions

This document has addressed two aspects of obtaining performance from a protocol implementation: the way in which the protocol is layered and integrated into the operating system, and the way in which the detailed handling of the packet is optimized. It would be nice if one or the other of these costs would completely dominate, so that all of one's attention could be concentrated there. Regrettably, this is not so. Depending on the particular sort of traffic one is getting, for example, whether Telnet one-byte packets or file transfer maximum size packets at maximum speed, one can expect to see one or the other cost being the major bottleneck to throughput. Most implementors who have studied their programs in an attempt to find out where the time was going have reached the unsatisfactory conclusion that it is going equally to all parts of their program. With the possible exception of checksum processing, very few people have ever found that their performance problems were due to a single, horrible bottleneck which they could fix by a single stroke of inventive programming. Rather, the performance was something which was improved by painstaking tuning of the entire program.

Most discussions of protocols begin by introducing the concept of layering, which tends to suggest that layering is a fundamentally wonderful idea which should be a part of every consideration of protocols. In fact, layering is a mixed blessing. Clearly, a layer interface is necessary whenever more than one client of a particular layer is to be allowed to use that same layer. But an interface, precisely because it is fixed, inevitably leads to a lack of complete understanding as to what one layer wishes to obtain from another. This has to lead to inefficiency. Furthermore, layering is a potential snare in that one is tempted to think that a layer boundary, which was an artifact of the specification procedure, is in fact the proper boundary to use in modularizing the implementation.
Again, in certain cases, an architected layer must correspond to an implemented layer, precisely so that several clients can have access to that layer in a reasonably straightforward manner. In other cases, cunning rearrangement of the implemented module boundaries to match with various functions, such as the demultiplexing of incoming packets, or the sending of asynchronous outgoing packets, can lead to unexpected performance improvements compared to more traditional implementation strategies.

Finally, good performance is something which is difficult to retrofit onto an existing program. Since performance is influenced, not just by the fine detail, but by the gross structure, it is sometimes the case that in order to obtain a substantial performance improvement, it is necessary to completely redo the program from the bottom up. This is a great disappointment to programmers, especially those doing a protocol implementation for the first time. Programmers who are somewhat inexperienced and unfamiliar with protocols are sufficiently concerned with getting their program logically correct that they do not have the capacity to think at the same time about the performance of the structure they are building. Only after they have achieved a logically correct program do they discover that they have done so in a way which has precluded real performance. Clearly, it is more difficult to design a program thinking from the start about both logical correctness and performance. With time, as implementors as a group learn more about the appropriate structures to use for building protocols, it will be possible to proceed with an implementation project having more confidence that the structure is rational, that the program will work, and that the program will work well. Those of us now implementing protocols have the privilege of being on the forefront of this learning process. It should be no surprise that our programs sometimes suffer from the uncertainty we bring to bear on them.
Citations

[1] Cohen and Postel, "On Protocol Multiplexing", Sixth Data Communications Symposium, ACM/IEEE, November 1979.

[2] Bunch and Day, "Control Structure Overhead in TCP", Trends and Applications: Computer Networking, NBS Symposium, May 1980.