%
% Experiments with zero-copy UDP transmission under Linux
% Joris van Rantwijk, May 2001
% NIKHEF, Amsterdam
%

\documentclass[twoside,a4paper,11pt]{article}
\usepackage[english]{babel}
\usepackage{a4wide}
\usepackage{epsfig}

\parindent 0cm
\parskip 0.2cm
\clubpenalty=1000
\widowpenalty=1000

\title{ Experiments with zero-copy UDP transmission under Linux }
\author{ Joris van Rantwijk }

\begin{document}
\maketitle

\section{Introduction}

The off-shore part of the \textsc{antares} DAQ will be controlled by embedded
boards with a Motorola MPC860 PowerPC.
The design of the DAQ calls for fast transmission of large amounts of
data over Ethernet.
To this end, the PowerPC boards are equipped with a 100 Mbit Ethernet
interface, connected to the processor's on-chip Fast Ethernet Controller (FEC).

The performance of this interface, as we have measured it under Linux, is
well below the theoretical limit of 100 Mbit/sec.
It is believed that the speed of the CPU and the bandwidth of the memory bus
are a bottleneck for the throughput of the FEC.
Experiments in Saclay with the VxWorks OS on the MPC860 indicate that the
use of a zero-copy interface to the network hardware can greatly enhance
the throughput.
Unfortunately, a zero-copy interface is currently not available in the
standard Linux kernel.

To be able to measure the performance impact of zero-copying under Linux,
we implemented an ad-hoc zero-copy interface for UDP transmission.
We also modified the low-level FEC driver to implement the Linux
scatter/gather interface.
These modifications are restricted to transmission of UDP/IP datagrams;
a zero-copy interface for TCP/IP is currently beyond our programming skills.

\section{Copying}

In most operating systems, the standard implementation of single datagram
transmission proceeds as follows:
\begin{enumerate}\vspace{-0.3cm}
 \item The application prepares a buffer in memory, containing the data
       to be transmitted.
 \item The application invokes a system call, passing a pointer to the
       data buffer and an indication of the amount of data.
 \item The kernel allocates a new buffer in a protected address space
       and copies the contents of the application buffer into this new
       buffer, leaving sufficient room for protocol headers.
 \item The kernel writes protocol headers in the new buffer.
 \item The kernel invokes a transmission routine in the low-level network
       driver, passing a pointer to the protected data buffer.
 \item The driver adds the buffer pointer to the transmission queue
       and returns {\bf without guarantee} that the transmission has
       completed.
 \item The kernel returns from the system call, and the application
       can continue its work (fill its buffer with new data, for example).
 \item Some time later, the network hardware fetches the contents of the
       kernel buffer through DMA and transmits it to the network.
       It then raises an interrupt to inform the kernel that the buffer
       may be released.
\end{enumerate}

The extra copy to a new buffer in step 3 imposes a performance penalty.
This penalty can be significant, especially on hardware where memory access
is expensive (like our PowerPC board).
Why, then, does the kernel make this copy?

Copying the data is done for four reasons:
\begin{enumerate}\vspace{-0.3cm}
 \item After the system call returns, the application is free to use the
       data buffer for other purposes (a new data packet, for example).
       This would not be possible without copying, since in general the
       network hardware fetches the data after the return of the syscall.
 \item During copying, the kernel also calculates a checksum over the
       buffer contents.
       This checksum is stored in the protocol header, and can be used
       by the receiver to detect corrupted datagrams.
       Even without copying, the kernel would still need to go through
       the buffer contents just to calculate the checksum.
       Since memory bandwidth is the most important performance factor,
       this would take almost as much time as copying.
 \item The application memory space is mapped onto the physical memory
       space.
       Memory locations that seem contiguous to the application may be
       far apart in physical memory.
       The application buffer is therefore in general not a contiguous
       region of physical memory.
       The kernel buffer, however, is forced into a contiguous region.
 \item Some reserved room is needed before the start of the data to
       store protocol headers.
       The application programmer in general can't or doesn't want
       to deal with this.
\end{enumerate}

\section{Eliminating the copy}

We implemented an ad-hoc interface to provide zero-copy UDP transmission
under Linux.
This is done in the form of a Linux 2.4.4 kernel module which adds
a character device driver to the system.
An application can then call this driver to transmit a datagram;
the normal {\tt socket}-based interface is not used.

To allow eliminating the copy, we needed to find a different way of
addressing the four points which are normally achieved by copying.

The first two points are easy: the semantics of our ad-hoc interface simply
don't guarantee that the buffer can be reused after the system call.
Instead, the application should preserve the buffer contents until it has
transmitted some fixed number of other buffers.
We completely avoid checksum calculation by leaving out the UDP checksum in
the protocol header.
This might reduce the reliability of the protocol.
However, the checksum at the Ethernet level alone should suffice to catch
most damaged packets.
Modern high-speed network interface cards often have facilities to calculate
IP checksums in hardware, allowing a nicer solution to the checksum problem.
Unfortunately, the PowerPC FEC has no such provisions.

We developed two separate methods to deal with the other two points: either
by mapping a kernel buffer in application memory, or through the network
driver's scatter/gather support.

\subsection{The {\tt mmap} method}

After opening the device driver, the application invokes the {\tt mmap}
syscall to map a kernel
buffer into application memory space.
Since this buffer is allocated by the kernel, it is guaranteed to be
contiguous in physical memory.
The application writes data packets directly into the mapped buffer,
taking the responsibility of reserving sufficient room before and
after the data.
Transmission of a packet is requested through an {\tt ioctl} syscall.

This method is much less flexible than the standard {\tt socket}-based
interface: the application is forced to put its data in a pre-arranged area
and to leave room before and after each packet.
Consequently, the division of data into packets must take place at an
early stage, complicating the design of the application.

\subsection{The {\tt kiobuf} method}

Modern network interface cards often support scatter/gather DMA transactions.
This removes the need to store the packet in a single contiguous range
of memory.
Instead, a packet may consist of multiple fragments, which are scattered over
physical memory.
During packet transmission, the network hardware {\bf gathers} the packet
contents from the various memory locations.

The most recent Linux kernels (2.4.4) allow fragmented network packets
{\bf provided that} both the network hardware and the low-level network driver
support scatter/gather DMA.
Unfortunately, the Linux 2.4.4 version of the FEC driver doesn't support
scatter/gather.
However, inspection of the driver code and documentation from Motorola
indicated that the MPC860 hardware can handle scatter/gather through its
regular buffer descriptor mechanism.
We modified the FEC driver code to take advantage of this mechanism
and implemented the fragmented-packet interface.

This feature eliminates the problem with non-contiguous application buffers,
allowing us to map the application buffer to kernel memory space instead of
the other way around.
It also removes the need to explicitly reserve room for protocol headers:
these headers can comfortably be stored in an additional first fragment.
The application is now free to store the packet contents
wherever it wants.
Transmission of a packet is requested through a {\tt write} syscall.
The device driver then uses the new {\tt kiobuf} interface in the Linux
kernel to map the application buffer into kernel memory space.

\section{Performance}

We measured the performance of our extensions on an RPX CLLF board equipped
with a Motorola MPC860T PowerPC processor running at 50 MHz with
a 50 MHz bus, 16 MB DRAM and a QS6612 100baseT Ethernet interface.
The operating system consisted of the standard Linux 2.4.4 kernel with
the runtime environment provided by the MontaVista Hardhat development kit.
To work around a known bug in the MPC860T data cache, we enabled
the {\it CPU6 Silicon Errata} option in the kernel configuration.

Our test setup consisted of the PowerPC board, connected to an Intel-based
Linux PC through 100 Mbit UTP.
The performance of the network interface was measured by transmitting
series of UDP datagrams from the PowerPC to the Intel PC.
Socket-based transmission was done through the {\tt ttcp} program.
In all cases, we used the {\tt ttcp} program on the Intel PC to receive
the datagrams and to obtain the performance results presented below.

A series of 10,000 datagrams was transmitted in each measurement.
All measurements were repeated 3 times and the median values were selected.

\begin{table}[h]
\renewcommand{\arraystretch}{1.2}
\centering
\begin{tabular}{|l|c|c|c|c|}
\hline
{\bf interface} & {\bf checksums} & {\bf dgram size} & {\bf KByte/sec} \footnotemark[1] & {\bf Mbit/sec} \footnotemark[2] \\
\hline
standard socket                 &     & 1400 & 3157 & 25.9 \\
standard socket                 &     & 8192 & 4334 & 35.5 \\
copy extension \footnotemark[3] & NO  & 1400 & 4251 & 34.8 \\
copy extension \footnotemark[3] & YES & 1400 & 3801 & 31.1 \\
mmap extension                  & NO  & 1400 & 6872 & 56.3 \\
mmap extension                  & YES & 1400 & 5120 & 41.9 \\
kiobuf extension                & NO  & 1400 & 4758 & 39.0 \\
\hline
\end{tabular}
\end{table}

\footnotetext[1]{Values obtained from the receiving {\tt ttcp} program.}
\footnotetext[2]{Calculated as Mbit = KByte * 1024 *
8 / 1,000,000; this is not exact, since it leaves out the IP and
Ethernet protocol headers.}
\footnotetext[3]{Copies the data to a kernel buffer but otherwise uses
the same extension interface.}

\section{Discussion}

Our best result (the {\tt mmap} extension without checksums) shows roughly a
factor of 2 improvement in throughput when compared to the standard socket
interface with the same datagram size.

However, it should be noted that the extension interface is considerably less
flexible than the standard socket interface,
especially on the following points:
no UDP checksums (reduced reliability);
limited to UDP (no TCP, no streaming, no automated retransmission);
application data buffers must reside in a fixed address region;
the application is responsible for reserving room at the head and tail
of the buffer.

In its current form, this zero-copy interface doesn't seem suitable
for deployment in the \textsc{antares} DAQ system.
However, our results clearly demonstrate (1) that the MPC860 hardware
is capable of delivering a much better throughput than indicated by
previous measurements
and (2) that copying of data buffers on the MPC860 imposes a significant
performance penalty and should be avoided if possible.

\end{document}