📄 rfc817.txt

📁 RFC 的详细文档！
💻 TXT
📖 第 1 页 / 共 4 页
字号:
上一页 1 2 34
one  location  to  another.   Since a number of data copies are probably

already required as part of  the  processing  structure,  this  kind  of

sharing might conceivably pay off if it didn't cause too much trouble to

                                   20


the  modularity  of  the  program.  Finally, computation of the checksum

seems to be one place where careful attention  to  the  details  of  the

algorithm  used  can  make a drastic difference in the throughput of the

program.  The Multics system provides one of the best  case  studies  of

this,  since  Multics  is  about  as  poorly  organized  to perform this

function as any machine implementing TCP.   Multics  is  a  36-bit  word

machine,  with  four 9-bit bytes per word.  The eight-bit bytes of a TCP

segment are laid down packed in memory, ignoring word boundaries.   This

means  that  when it is necessary to pick up the data as a set of 16-bit

units for the purpose of adding  them  to  compute  checksums,  horrible

masking  and  shifting  is  required  for  each  16-bit value.  An early

version of a program using this  strategy  required  6  milliseconds  to

checksum  a  576-byte  segment.    Obviously,  at  this  point, checksum

computation was becoming the central bottleneck to throughput.   A  more

careful  recoding of this algorithm reduced the checksum processing time

to less than one millisecond.  The strategy used  was  extremely  dirty.

It  involved adding up carefully selected words of the area in which the

data lay, knowing that for those particular  words,  the  16-bit  values

were  properly  aligned  inside  the words.  Only after the addition had

been done were the various sums shifted, and finally  added  to  produce

the  eventual  checksum.  This kind of highly specialized programming is

probably not acceptable if used everywhere within an  operating  system.

It is clearly appropriate for one highly localized function which can be

clearly identified as an extreme performance bottleneck.


     Another area of TCP processing which may cause performance problems

is the overhead of examining all of the possible flags and options which

                                   21


occur in each incoming packet.  One paper, by Bunch and Day [2], asserts

that  the  overhead of packet header processing is actually an important

limiting  factor  in  throughput  computation.    Not  all   measurement

experiments  have  tended to support this result.  To whatever extent it

is true, however, there is an obvious  strategy  which  the  implementor

ought  to  use in designing his program.  He should build his program to

optimize the expected case.  It is easy, especially when first designing

a program, to pay equal attention to all of  the  possible  outcomes  of

every test.  In practice, however, few of these will ever happen.  A TCP

should  be  built  on the assumption that the next packet to arrive will

have absolutely nothing special about it,  and  will  be  the  next  one

expected  in  the  sequence  space.   One or two tests are sufficient to

determine that the expected set of control flags are on.  (The ACK  flag

should be on; the Push flag may or may not be on.  No other flags should

be on.)  One test is sufficient to determine that the sequence number of

the  incoming  packet  is  one  greater  than  the  last sequence number

received.  In almost every case, that will be the actual result.  Again,

using the Multics system as an example, failure to optimize the case  of

receiving  the  expected  sequence number had a detectable effect on the

performance of the system.  The particular problem arose when  a  number

of  packets  arrived  at  once.    TCP attempted to process all of these

packets before awaking the user.  As a result,  by  the  time  the  last

packet  arrived,  there was a threaded list of packets which had several

items on it.  When a new packet arrived, the list was searched  to  find

the  location  into which the packet should be inserted.  Obviously, the

list should be searched from highest sequence number to lowest  sequence

                                   22


number,  because  one is expecting to receive a packet which comes after

those already received.  By mistake, the list was searched from front to

back, starting with the packets with the lowest sequence  number.    The

amount of time spent searching this list backwards was easily detectable

in the metering measurements.


     Other data structures can be organized to optimize the action which

is  normally  taken  on  them.  For example, the retransmission queue is

very seldom actually used  for  retransmission,  so  it  should  not  be

organized  to  optimize that action.  In fact, it should be organized to

optimized the discarding of things  from  it  when  the  acknowledgement

arrives.    In many cases, the easiest way to do this is not to save the

packet  at  all,  but  to  reconstruct  it  only  if  it  needs  to   be

retransmitted,  starting  from the data as it was originally buffered by

the user.


     There is another generality, at least as  important  as  optimizing

the  common  case,  which  is  to avoid copying data any more times than

necessary.  One more result from the Multics TCP may prove  enlightening

here.    Multics takes between two and three milliseconds within the TCP

layer to process an incoming packet, depending on its size.  For a  576-

byte packet, the three milliseconds is used up approximately as follows.

One   millisecond   is   used  computing  the  checksum.    Six  hundred

microseconds is spent copying the data.  (The data is copied  twice,  at

.3  milliseconds  a copy.)  One of those copy operations could correctly

be included as part of the checksum cost, since it is done  to  get  the

data  on  a  known  word  boundary  to  optimize the checksum algorithm.

                                   23


However,  the  copy also performs another necessary transfer at the same

time.  Header processing and packet resequencing takes .7  milliseconds.

The  rest  of  the  time  is  used  in miscellaneous processing, such as

removing packets from the retransmission queue which are acknowledged by

this packet.  Data copying is the second most expensive single operation

after data checksuming.   Some  implementations,  often  because  of  an

excessively  layered  modularity, end up copying the data around a great

deal.  Other implementations end up copying the data because there is no

shared memory between processes, and the data must be moved from process

to process via a kernel operation.  Unless the amount of  this  activity

is  kept  strictly  under  control,  it  will  quickly  become the major

performance bottleneck.


     7.  Conclusions


     This document has addressed two aspects  of  obtaining  performance

from a protocol implementation, the way in which the protocol is layered

and  integrated  into  the  operating  system,  and the way in which the

detailed handling of the packet is optimized.  It would be nice  if  one

or  the  other  of these costs would completely dominate, so that all of

one's attention could be concentrated there.  Regrettably, this  is  not

so.    Depending  on  the particular sort of traffic one is getting, for

example, whether Telnet one-byte packets or file transfer  maximum  size

packets  at  maximum  speed, one can expect to see one or the other cost

being the major bottleneck to throughput.  Most  implementors  who  have

studied  their  programs  in  an  attempt to find out where the time was

going have reached  the  unsatisfactory  conclusion  that  it  is  going

                                   24


equally  to  all parts of their program.  With the possible exception of

checksum  processing,  very  few  people  have  ever  found  that  their

performance  problems  were  due  to a single, horrible bottleneck which

they could fix by a single stroke of inventive programming.  Rather, the

performance was something which was improved by  painstaking  tuning  of

the entire program.


     Most  discussions  of protocols begin by introducing the concept of

layering, which tends  to  suggest  that  layering  is  a  fundamentally

wonderful  idea  which  should  be  a  part  of  every  consideration of

protocols.  In fact, layering is a mixed blessing.    Clearly,  a  layer

interface  is  necessary  whenever  more than one client of a particular

layer is to be allowed to use  that  same  layer.    But  an  interface,

precisely  because  it  is fixed, inevitably leads to a lack of complete

understanding as to what one layer wishes to obtain from another.   This

has to lead to inefficiency.  Furthermore, layering is a potential snare

in  that  one  is  tempted  to think that a layer boundary, which was an

artifact of the specification procedure, is in fact the proper  boundary

to  use in modularizing the implementation.  Again, in certain cases, an

architected layer must correspond to an implemented layer, precisely  so

that  several  clients  can  have  access  to that layer in a reasonably

straightforward manner.  In other cases, cunning  rearrangement  of  the

implemented  module  boundaries to match with various functions, such as

the demultiplexing of incoming packets, or the sending  of  asynchronous

outgoing  packets,  can  lead  to  unexpected  performance  improvements

compared to more traditional implementation strategies.   Finally,  good

performance is something which is difficult to retrofit onto an existing

                                   25


program.   Since performance is influenced, not just by the fine detail,

but by the gross structure, it is sometimes the case that  in  order  to

obtain  a  substantial  performance  improvement,  it  is  necessary  to

completely redo the program from  the  bottom  up.    This  is  a  great

disappointment   to  programmers,  especially  those  doing  a  protocol

implementation for  the  first  time.    Programmers  who  are  somewhat

inexperienced  and  unfamiliar with protocols are sufficiently concerned

with getting their program logically correct that they do not  have  the

capacity  to  think  at  the  same  time  about  the  performance of the

structure they are building.  Only after they have achieved a  logically

correct  program  do they discover that they have done so in a way which

has precluded real performance.  Clearly, it is more difficult to design

a program thinking from the start about  both  logical  correctness  and

performance.  With time, as implementors as a group learn more about the

appropriate  structures  to  use  for  building  protocols,  it  will be

possible  to  proceed  with  an  implementation  project   having   more

confidence  that  the structure is rational, that the program will work,

and that the program will work well.    Those  of  us  now  implementing

protocols  have the privilege of being on the forefront of this learning

process.  It should be no surprise that our  programs  sometimes  suffer

from the uncertainty we bring to bear on them.

                                   26


Citations


     [1]  Cohen  and  Postel,  "On  Protocol  Multiplexing",  Sixth Data

Communications Symposium, ACM/IEEE, November 1979.


     [2] Bunch and Day, "Control Structure Overhead in TCP", Trends  and

Applications:  Computer Networking, NBS Symposium, May 1980.
上一页 1 2 34
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -