<HTML><HEAD><TITLE>Lossless Gigabit Remote Packet Capture With Linux</TITLE></HEAD>
<BODY BGCOLOR=#f4f0f4>
<Center><H1><A HREF=http://staff.washington.edu/corey/gulp/>Lossless Gigabit Remote Packet Capture With Linux</A></H1>
<H4>Corey Satten<BR>University of Washington Network Systems<BR>
<A HREF=http://staff.washington.edu/corey/>http://staff.washington.edu/corey</A><BR>
August 9, 2007<BR><FONT SIZE=-1>(Updated: March 18, 2008)</FONT></H4></Center>

<H2> Overview </H2>

<P> This paper is about two distinct but related things:
<OL>
<LI> How to achieve lossless gigabit packet capture to disk with unmodified Linux on ordinary/modest PC hardware and
<LI> Capturing packets remotely on a campus network (without connecting a capture box to the remote network).
</OL></P>

My software which does both is <A HREF=#links>freely available</A> and is called Gulp (visualize drinking quickly from the network firehose).

<P> By publishing this paper, I hope to:
<OL TYPE=A>
<LI> efficiently share my code, methods and insight with others interested in doing this and
<LI> shed light on limitations in the Linux code base which hopefully can be fixed so Gulp is no longer needed.
</OL></P>

<H2> Background </H2>

<P> At the University of Washington, we have a large network with many hundreds of subnets and close to 120,000 IP devices on our campus network. Sometimes it is necessary to look at network traffic to diagnose problems. Recently, I began a project to allow us to capture subnet-level traffic remotely (without having to physically connect to the remote network) to make life easier for our Security and Network Operations groups and to help diagnose problems more efficiently. </P>

<P> Our Cisco 7600 routers have the ability to create a limited number of "Encapsulated Remote SPAN ports" (ERSPAN ports) which are similar to mirrored switch ports except the router "GRE" encapsulates the packets and sends them to an arbitrary IP address. (GRE is in quotes because the Cisco GRE header is larger than the standard GRE header (it is 50 bytes) so Linux and/or unmodified <A HREF=http://www.tcpdump.org>tcpdump</A> cannot correctly decapsulate it.) </P>

<P> Because the router will send the "GRE" encapsulated packets without any established state or confirmation on the receiver (as if sending UDP), I don't need to establish a tunnel on Linux to receive the packets. I initially wrote a tiny (30-line) proof-of-concept decapsulator in C which could postprocess a tcpdump capture like this: </P>

<PRE>
tcpdump -i eth1 -s0 -w - proto gre | <A HREF=conv.c>conv</A> > pcapfile
or
tcpdump -i eth1 -s0 -w - proto gre | <A HREF=conv.c>conv</A> | tcpdump -s0 -r - -w pcapfile ...
</PRE>
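<P> For the curious, here is a minimal sketch of what such a decapsulator involves (this is a hypothetical illustration, not the actual <A HREF=conv.c>conv.c</A>). It copies a pcap stream from stdin to stdout, stripping an assumed fixed 50-byte encapsulation (the outer Ethernet, IP and Cisco "GRE"/ERSPAN headers described above) from each packet; since the inner payload is itself an Ethernet frame, the pcap global header can pass through unchanged. It also assumes the stream was written in the local host's byte order and ignores ERSPAN variations a production tool would need to handle: </P>

<PRE>
/* Hypothetical minimal ERSPAN decapsulator (sketch only, not the real
 * conv.c).  Reads a pcap stream on stdin, strips ENCAP bytes of outer
 * headers from each packet, writes the shortened stream to stdout. */
#include &lt;stdio.h&gt;
#include &lt;stdint.h&gt;

#define ENCAP 50                 /* outer Ethernet + IP + GRE/ERSPAN */

struct rechdr {                  /* pcap per-packet record header */
    uint32_t ts_sec, ts_usec, incl_len, orig_len;
};

int main(void)
{
    unsigned char ghdr[24];      /* pcap global header: pass through */
    unsigned char buf[65536 + ENCAP];
    struct rechdr rh;

    if (fread(ghdr, 1, sizeof ghdr, stdin) != sizeof ghdr) return 1;
    fwrite(ghdr, 1, sizeof ghdr, stdout);

    while (fread(&amp;rh, 1, sizeof rh, stdin) == sizeof rh) {
        if (rh.incl_len &gt; sizeof buf) return 1;       /* corrupt stream */
        if (fread(buf, 1, rh.incl_len, stdin) != rh.incl_len) return 1;
        if (rh.incl_len &lt;= ENCAP) continue;           /* too short: drop */
        rh.incl_len -= ENCAP;    /* shorten by the stripped headers */
        rh.orig_len -= ENCAP;
        fwrite(&amp;rh, 1, sizeof rh, stdout);
        fwrite(buf + ENCAP, 1, rh.incl_len, stdout);  /* inner frame */
    }
    return 0;
}
</PRE>

<P> Keeping the decapsulation as a separate pipeline stage is what let the proof of concept stay at a few dozen lines. </P>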
<P> My initial measurements indicated that the percentage of dropped packets and the CPU overhead of writing through the conversion program and then to disk were not significantly higher than writing directly to disk, so I thought this was a reasonable plan. On my old desktop workstation (a 3.2GHz P4 Dell Optiplex 270 with a slow 32-bit PCI bus and a built-in 10/100/1000 Intel 82540EM NIC) running Fedora Core 6 Linux (2.6.19 kernel, ethtool -G eth0 rx 4096), I could capture and save close to 180Mb/s of <A HREF=http://dast.nlanr.net/Projects/Iperf/>iperf</A> traffic with about 1% packet loss, so it seemed worth pursuing. Partly to facilitate this and partly for unrelated reasons, I bought a newer/faster office PC. </P>

<H2> What Did and Didn't Work </H2>

<P> To my surprise, my new office PC (a Dell Precision 690 with a 2.66 GHz quad-core Xeon x5355, a PCI-Express-based Intel Pro-1000-PT NIC, faster RAM and SATA disks) running the same (Fedora Core 6) OS initially dropped more packets than my old P4 system did, even though each of the 4 CPU cores does about 70% more than my old P4 system (according to my benchmarks). I spent a long time trying to tune the OS by changing various parameters in <CODE><B>/proc</B></CODE> and <CODE><B>/sys</B></CODE>, trying to tune the e1000 NIC driver's tunable parameters and fiddling with scheduling priority and processor affinity (for processes, daemons and interrupts). Although the number of combinations and permutations of things to change was high, I gradually made enough progress that I continued down this path for far too long before discovering the right path. </P>

<P> Two things puzzled me: "<A HREF=http://xosview.sourceforge.net/>xosview</A>" (a system load visualization tool) always showed plenty of idle resources when packets were dropped, and writing packets to disk seemed to have a disproportionate impact on packet loss, especially when the system buffer cache was full. </P>

<P> It eventually occurred to me to try to decouple disk writing from packet reading. I tried piping the output of the capturing tcpdump program into an old (circa 1990) <A HREF=http://gd.tuwien.ac.at/utils/archivers/buffer>tape buffering program</A> (written by Lee McLoughlin) which ran as two processes with a small shared-memory ring buffer. Remarkably, piping the output through McLoughlin's buffer program caused tcpdump to drop fewer packets. Piping through "dd" with any write size and/or buffer size, or through "cat", did not provide any improvement. My best guess as to why McLoughlin's buffer helped is that even though the select(2) system call says writes to disk never block, they effectively do. When the writes block, tcpdump can't read packets from the kernel quickly enough to prevent the NIC's buffer from overflowing. </P>

<P> A quick look at the code in McLoughlin's buffer program convinced me I would do better starting from scratch, so I wrote a simple multi-threaded ring-buffer program (which became Gulp). For both simplicity and efficiency under load, I designed it to be completely lock-free. The multi-threaded ring buffer worked remarkably well and considerably increased the rate at which I could capture without loss but, at higher packet rates, it still dropped packets--especially while writing to disk. </P>

<P> I emailed <A HREF=http://luca.ntop.org>Luca Deri</A>, the author of Linux's <A HREF=http://www.ntop.org/PF_RING.html>PF_RING NIC driver</A>, and he (correctly) suggested that it would be easy to incorporate the packet capture into the ring buffer program itself (which I did). This ultimately was a good idea but initially didn't seem to help much. Eventually I figured out why: the Linux scheduler sometimes scheduled both my reader and writer threads on the same CPU/core, which caused them to run alternately instead of simultaneously. When they ran alternately, the packet reader was again starved of CPU cycles and packet loss occurred. The solution was simply to explicitly assign the reader and writer threads to different CPU/cores and to increase the scheduling priority of the packet reading thread. These two changes improved performance so dramatically that dropping any packets on a gigabit capture, written entirely to disk, is now a rare occurrence, and many of the system performance tuning hacks I resorted to earlier have been backed out. (I now suspect they mostly helped by indirectly influencing process scheduling and CPU affinity--something I now control directly--however on systems with more than two CPU cores, the <A HREF=http://staff.washington.edu/corey/tools/inter-core-benchmark.html>inter-core-benchmark</A> I developed may still be helpful to determine which cores work most efficiently together.) </P>
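<P> The following is a minimal sketch of that core idea, not Gulp itself: a lock-free single-producer/single-consumer ring buffer whose reader and writer threads are pinned to different cores with pthread_setaffinity_np, with the reader's scheduling priority raised. To stay short it copies stdin to stdout rather than capturing packets, busy-waits instead of sleeping when the ring is full or empty, and uses buffer sizes and core numbers that are illustrative guesses: </P>

<PRE>
/* Minimal sketch of the idea behind Gulp (not Gulp itself): a lock-free
 * single-producer/single-consumer ring buffer copying stdin to stdout,
 * with reader and writer pinned to different CPU cores.  Sizes, core
 * numbers and busy-waiting are illustrative simplifications. */
#define _GNU_SOURCE
#include &lt;pthread.h&gt;
#include &lt;sched.h&gt;
#include &lt;stdatomic.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;unistd.h&gt;

#define RINGSIZE (64u * 1024 * 1024)  /* 64MB ring (power of two) */
#define CHUNK    (64u * 1024)         /* transfer in 64KB pieces  */

static char ring[RINGSIZE];
static _Atomic size_t head, tail;     /* monotonically increasing */
static _Atomic int eof;

static void pin_to_core(int core)     /* bind calling thread to a core */
{
    cpu_set_t set;
    CPU_ZERO(&amp;set);
    CPU_SET(core, &amp;set);
    pthread_setaffinity_np(pthread_self(), sizeof set, &amp;set);
}

static void *reader(void *unused)     /* producer: must never stall */
{
    struct sched_param sp = { .sched_priority = 1 };
    pin_to_core(0);
    /* raise reader priority as described above; needs privilege */
    pthread_setschedparam(pthread_self(), SCHED_FIFO, &amp;sp);
    for (;;) {
        size_t h = atomic_load(&amp;head), t = atomic_load(&amp;tail);
        if (RINGSIZE - (h - t) &lt; CHUNK) continue;   /* ring full */
        size_t off = h % RINGSIZE;
        size_t n = (RINGSIZE - off &lt; CHUNK) ? RINGSIZE - off : CHUNK;
        ssize_t got = read(0, ring + off, n);
        if (got &lt;= 0) { atomic_store(&amp;eof, 1); return NULL; }
        atomic_store(&amp;head, h + (size_t)got);
    }
}

static void *writer(void *unused)     /* consumer: absorbs disk stalls */
{
    pin_to_core(1);
    for (;;) {
        size_t h = atomic_load(&amp;head), t = atomic_load(&amp;tail);
        if (h == t) {                 /* ring empty: drain before eof */
            if (atomic_load(&amp;eof)) return NULL;
            continue;
        }
        size_t off = t % RINGSIZE;
        size_t n = (RINGSIZE - off &lt; h - t) ? RINGSIZE - off : h - t;
        ssize_t put = write(1, ring + off, n);
        if (put &lt;= 0) exit(1);
        atomic_store(&amp;tail, t + (size_t)put);
    }
}

int main(void)
{
    pthread_t r, w;
    pthread_create(&amp;r, NULL, reader, NULL);
    pthread_create(&amp;w, NULL, writer, NULL);
    pthread_join(r, NULL);
    pthread_join(w, NULL);
    return 0;
}
</PRE>

<P> Because only one thread ever advances <CODE>head</CODE> and only one advances <CODE>tail</CODE>, plain atomic loads and stores are enough and no locks are needed--which is what keeps the reader from ever waiting on the writer. </P>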
<P> On some systems, increasing the default size of receive socket buffers also helps: <BR><CODE>echo 4194304 > /proc/sys/net/core/rmem_max; echo 4194304 > /proc/sys/net/core/rmem_default</CODE></P>

<H2> Performance of Our Production System </H2>

<P> Our (pilot) production system for gigabit remote packet capture is a Dell PowerEdge model 860 with a single Intel Core2Duo CPU (x3070) at 2.66 GHz (hyperthreading disabled) running RedHat Enterprise Linux 5 (RHEL5 2.6.18 kernel). It has 2GB RAM, two WD2500JS 250GB SATA drives in a striped ext2 logical volume (essentially software RAID 0 using LVM) and an Intel Pro1000 PT network interface (NIC) for packet capture. (The builtin BCM5721 Broadcom NICs are unable to capture the slightly jumbo frames required for Cisco ERSPAN--they may work for non-jumbo packet capture but I haven't tested them. The Intel NIC does consume a PCI-e slot but costs only about $40.) </P>

<P> A 2-minute capture of as much <A HREF=http://dast.nlanr.net/Projects/Iperf/>iperf</A> data as I can generate into a 1Gb ERSPAN port (before the ERSPAN link saturates and the router starts dropping packets) results in a nearly 14GB pcap file, usually with no packets dropped by Linux. The packet rate for that traffic is about 96k pps avg. The router port sending the ERSPAN traffic was nearly saturated (900+Mb/s) and the sum of the average iperf throughputs was 818-897Mb/s (but unlike ethernet, I believe iperf reports only payload bits counted in 1024^2 millions, so this translates to 857-940Mb/s in decimal/ethernet millions, not counting packet headers). Telling iperf to use smaller packets, I was able to capture all packets at 170k pps avg, but with the hardware at my disposal I could only 2/3 saturate the gigabit network using iperf and small packets. </P>

<P> A subsequent test using a "SmartBits" packet generator to roughly 84% saturate the net with 300-byte packets indicates I can capture and write to disk 330k pps without dropping any packets. Interestingly, the failure mode at higher packet rates is that there is insufficient CPU