📄 rfc817.txt
字号:
one location to another. Since a number of data copies are probably
already required as part of the processing structure, this kind of
sharing might conceivably pay off if it didn't cause too much trouble to
20
the modularity of the program. Finally, computation of the checksum
seems to be one place where careful attention to the details of the
algorithm used can make a drastic difference in the throughput of the
program. The Multics system provides one of the best case studies of
this, since Multics is about as poorly organized to perform this
function as any machine implementing TCP. Multics is a 36-bit word
machine, with four 9-bit bytes per word. The eight-bit bytes of a TCP
segment are laid down packed in memory, ignoring word boundaries. This
means that when it is necessary to pick up the data as a set of 16-bit
units for the purpose of adding them to compute checksums, horrible
masking and shifting is required for each 16-bit value. An early
version of a program using this strategy required 6 milliseconds to
checksum a 576-byte segment. Obviously, at this point, checksum
computation was becoming the central bottleneck to throughput. A more
careful recoding of this algorithm reduced the checksum processing time
to less than one millisecond. The strategy used was extremely dirty.
It involved adding up carefully selected words of the area in which the
data lay, knowing that for those particular words, the 16-bit values
were properly aligned inside the words. Only after the addition had
been done were the various sums shifted, and finally added to produce
the eventual checksum. This kind of highly specialized programming is
probably not acceptable if used everywhere within an operating system.
It is clearly appropriate for one highly localized function which can be
clearly identified as an extreme performance bottleneck.
Another area of TCP processing which may cause performance problems
is the overhead of examining all of the possible flags and options which
21
occur in each incoming packet. One paper, by Bunch and Day [2], asserts
that the overhead of packet header processing is actually an important
limiting factor in throughput computation. Not all measurement
experiments have tended to support this result. To whatever extent it
is true, however, there is an obvious strategy which the implementor
ought to use in designing his program. He should build his program to
optimize the expected case. It is easy, especially when first designing
a program, to pay equal attention to all of the possible outcomes of
every test. In practice, however, few of these will ever happen. A TCP
should be built on the assumption that the next packet to arrive will
have absolutely nothing special about it, and will be the next one
expected in the sequence space. One or two tests are sufficient to
determine that the expected set of control flags are on. (The ACK flag
should be on; the Push flag may or may not be on. No other flags should
be on.) One test is sufficient to determine that the sequence number of
the incoming packet is one greater than the last sequence number
received. In almost every case, that will be the actual result. Again,
using the Multics system as an example, failure to optimize the case of
receiving the expected sequence number had a detectable effect on the
performance of the system. The particular problem arose when a number
of packets arrived at once. TCP attempted to process all of these
packets before awaking the user. As a result, by the time the last
packet arrived, there was a threaded list of packets which had several
items on it. When a new packet arrived, the list was searched to find
the location into which the packet should be inserted. Obviously, the
list should be searched from highest sequence number to lowest sequence
22
number, because one is expecting to receive a packet which comes after
those already received. By mistake, the list was searched from front to
back, starting with the packets with the lowest sequence number. The
amount of time spent searching this list backwards was easily detectable
in the metering measurements.
Other data structures can be organized to optimize the action which
is normally taken on them. For example, the retransmission queue is
very seldom actually used for retransmission, so it should not be
organized to optimize that action. In fact, it should be organized to
optimized the discarding of things from it when the acknowledgement
arrives. In many cases, the easiest way to do this is not to save the
packet at all, but to reconstruct it only if it needs to be
retransmitted, starting from the data as it was originally buffered by
the user.
There is another generality, at least as important as optimizing
the common case, which is to avoid copying data any more times than
necessary. One more result from the Multics TCP may prove enlightening
here. Multics takes between two and three milliseconds within the TCP
layer to process an incoming packet, depending on its size. For a 576-
byte packet, the three milliseconds is used up approximately as follows.
One millisecond is used computing the checksum. Six hundred
microseconds is spent copying the data. (The data is copied twice, at
.3 milliseconds a copy.) One of those copy operations could correctly
be included as part of the checksum cost, since it is done to get the
data on a known word boundary to optimize the checksum algorithm.
23
However, the copy also performs another necessary transfer at the same
time. Header processing and packet resequencing takes .7 milliseconds.
The rest of the time is used in miscellaneous processing, such as
removing packets from the retransmission queue which are acknowledged by
this packet. Data copying is the second most expensive single operation
after data checksuming. Some implementations, often because of an
excessively layered modularity, end up copying the data around a great
deal. Other implementations end up copying the data because there is no
shared memory between processes, and the data must be moved from process
to process via a kernel operation. Unless the amount of this activity
is kept strictly under control, it will quickly become the major
performance bottleneck.
7. Conclusions
This document has addressed two aspects of obtaining performance
from a protocol implementation, the way in which the protocol is layered
and integrated into the operating system, and the way in which the
detailed handling of the packet is optimized. It would be nice if one
or the other of these costs would completely dominate, so that all of
one's attention could be concentrated there. Regrettably, this is not
so. Depending on the particular sort of traffic one is getting, for
example, whether Telnet one-byte packets or file transfer maximum size
packets at maximum speed, one can expect to see one or the other cost
being the major bottleneck to throughput. Most implementors who have
studied their programs in an attempt to find out where the time was
going have reached the unsatisfactory conclusion that it is going
24
equally to all parts of their program. With the possible exception of
checksum processing, very few people have ever found that their
performance problems were due to a single, horrible bottleneck which
they could fix by a single stroke of inventive programming. Rather, the
performance was something which was improved by painstaking tuning of
the entire program.
Most discussions of protocols begin by introducing the concept of
layering, which tends to suggest that layering is a fundamentally
wonderful idea which should be a part of every consideration of
protocols. In fact, layering is a mixed blessing. Clearly, a layer
interface is necessary whenever more than one client of a particular
layer is to be allowed to use that same layer. But an interface,
precisely because it is fixed, inevitably leads to a lack of complete
understanding as to what one layer wishes to obtain from another. This
has to lead to inefficiency. Furthermore, layering is a potential snare
in that one is tempted to think that a layer boundary, which was an
artifact of the specification procedure, is in fact the proper boundary
to use in modularizing the implementation. Again, in certain cases, an
architected layer must correspond to an implemented layer, precisely so
that several clients can have access to that layer in a reasonably
straightforward manner. In other cases, cunning rearrangement of the
implemented module boundaries to match with various functions, such as
the demultiplexing of incoming packets, or the sending of asynchronous
outgoing packets, can lead to unexpected performance improvements
compared to more traditional implementation strategies. Finally, good
performance is something which is difficult to retrofit onto an existing
25
program. Since performance is influenced, not just by the fine detail,
but by the gross structure, it is sometimes the case that in order to
obtain a substantial performance improvement, it is necessary to
completely redo the program from the bottom up. This is a great
disappointment to programmers, especially those doing a protocol
implementation for the first time. Programmers who are somewhat
inexperienced and unfamiliar with protocols are sufficiently concerned
with getting their program logically correct that they do not have the
capacity to think at the same time about the performance of the
structure they are building. Only after they have achieved a logically
correct program do they discover that they have done so in a way which
has precluded real performance. Clearly, it is more difficult to design
a program thinking from the start about both logical correctness and
performance. With time, as implementors as a group learn more about the
appropriate structures to use for building protocols, it will be
possible to proceed with an implementation project having more
confidence that the structure is rational, that the program will work,
and that the program will work well. Those of us now implementing
protocols have the privilege of being on the forefront of this learning
process. It should be no surprise that our programs sometimes suffer
from the uncertainty we bring to bear on them.
26
Citations
[1] Cohen and Postel, "On Protocol Multiplexing", Sixth Data
Communications Symposium, ACM/IEEE, November 1979.
[2] Bunch and Day, "Control Structure Overhead in TCP", Trends and
Applications: Computer Networking, NBS Symposium, May 1980.
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -