📄 poller_bench.html
字号:
2.4.0-test10-pre4 was slower than 2.2.14 in all cases tested.<p>I should show results for pipes as well as socketpairs.<p>The Linux 2.2.14 /dev/poll driver printed messages to the consolewhen sockets were closed; this should probably be disabled for production.<h3>kqueue()</h3>It looks offhand like kqueue() performs best of all the tested methods.It's even faster than, and scales better than, /dev/poll, at least in this microbenchmark.<h3>/dev/poll vs. poll</h3>In all cases tested involving sockets, /dev/poll was appreciably faster than poll().<p>The 2.2.14 Linux /dev/poll driver was about six times faster than poll()for 1000 fds, but fell down to only 2.7 times faster at 10000 fds.The Solaris /dev/poll driver was about seven times faster than poll() at100 fds, and increased to 40 times faster at 10000 fds.<h3>Scalability of poll() and /dev/poll</h3>Under Solaris 7, when the number of idle sockets was increased from 100 to 10000,the time to check for active sockets with poll() and /dev/poll increased by a factor of only 6.5 (good) and 1.5 (fantastic), respectively. <p>Under Linux 2.2.14, when the number of idle sockets was increased from 100 to 10000,the time to check for active sockets with poll() and /dev/poll increased by a factor of 493 and 224, respectively. This is terribly, horribly bad scaling behavior.<p>Under Linux 2.4.0-test10-pre4, when the number of idle sockets was increased from 100 to 10000,the time to check for active sockets with poll() increased by a factor of 300. This is terribly, horribly bad scaling behavior.<p>There seems to be a scalability problem in poll() under both Linux 2.2.14 and 2.4.0-test10-pre4and in /dev/poll under Linux 2.2.14. <p>poll() is stuck with an interface that dictates O(n) behavior on total pipes;still, Linux's implementation could be improved.The design of the current Linux /dev/poll patch is O(n) in total pipes, in spite of the fact that its interface allows it to be O(1) in total pipesand O(n) only in <i>active</i> pipes.<p>See also the <a href="http://boudicca.tux.org/hypermail/linux-kernel/2000week44/index.html#9">recent discussions on linux-kernel</a>.<h2><a name="profiling">Results - kernel profiling</a></h2>To look for the scalability problem, I added support to thebenchmark to trigger the Linux kernel profiler. A few resultsare shown below. (No smoking gun was found, but then, I wouldn'tknow a smoking gun if it hit me in the face. Perhaps real kernelhackers can pick up the hunt from here.)<p>If you run the above test on a Linux system booted with 'profile=2',Poller_bench will output one kernel profiling data file per test condition.Poller_bench.sh does a gross analysis using 'readprofile | sort -rn | head > bench%d%c.top'to find the kernel functions with the highest CPU usage,where %d is the number of socketpairs, and %c is p for poll, d for /dev/poll, etc.<p>'more bench10000*.top' shows the results for 10000 socketpairs.On 2.2.14, it shows:<pre>::::::::::::::bench10000d.dat.top:::::::::::::: 901 total 0.0008 833 dp_poll 1.4875 27 do_bottom_half 0.1688 7 __get_request_wait 0.0139 4 startup_32 0.0244 3 unix_poll 0.0203::::::::::::::bench10000p.dat.top:::::::::::::: 584 total 0.0005 236 unix_poll 1.5946 162 sock_poll 4.5000 148 do_poll 0.6727 24 sys_poll 0.0659 7 __generic_copy_from_user 0.1167</pre>This seems to indicate that /dev/poll spends nearly all of its time in dp_poll(),and poll spends a fair bit of time in three routines: unix_poll, sock_poll, and do_poll.<p>On 2.4.0-test10-pre4 smp, 'more bench10000*.top' shows:<pre>::::::::::::::2.4/bench10000p.dat.top:::::::::::::: 1507 total 0.0011 748 default_idle 14.3846 253 unix_poll 1.9167 209 fget 2.4881 195 sock_poll 5.4167 29 sys_poll 0.0342 29 fput 0.1272 29 do_pollfd 0.1648</pre>It seems curious that the idle routine should show up so much,but it's probably just the second CPU doing nothing.<p>Poller_bench.sh will also try to do a fine analysis of dp_poll() using the 'profile' tool (source included), which is a variant of readprofile that shows hotspots within kernel functions. Looking at its output forthe run on 2.2.14, the three four-byte regions that take up the mostCPU time in dp_poll() in the 10000 socketpair case are<pre> c01d9158 39.135654% 326 c01d9174 11.404561% 95 c01d91a0 27.250900% 227</pre>Looking at the output of 'objdump -d /usr/src/linux/vmlinux', thatregion corresponds to the object code:<pre>c01d9158: c7 44 24 14 00 00 00 movl $0x0,0x14(%esp,1)c01d915f: 00 c01d9160: 8b 74 24 24 mov 0x24(%esp,1),%esic01d9164: 8b 86 8c 04 00 00 mov 0x48c(%esi),%eaxc01d916a: 3b 50 04 cmp 0x4(%eax),%edxc01d916d: 73 0a jae c01d9179 <dp_poll+0xc5>c01d916f: 8b 40 10 mov 0x10(%eax),%eaxc01d9172: 8b 14 90 mov (%eax,%edx,4),%edxc01d9175: 89 54 24 14 mov %edx,0x14(%esp,1)c01d9179: 83 7c 24 14 00 cmpl $0x0,0x14(%esp,1)c01d917e: 75 12 jne c01d9192 <dp_poll+0xde>c01d9180: 53 push %ebxc01d9181: ff 74 24 3c pushl 0x3c(%esp,1)c01d9185: e8 5a fc ff ff call c01d8de4 <dp_delete>c01d918a: 83 c4 08 add $0x8,%espc01d918d: e9 d1 00 00 00 jmp c01d9263 <dp_poll+0x1af>c01d9192: 8b 7c 24 10 mov 0x10(%esp,1),%edic01d9196: 0f bf 4f 06 movswl 0x6(%edi),%ecxc01d919a: 31 c0 xor %eax,%eaxc01d919c: f0 0f b3 43 10 lock btr %eax,0x10(%ebx)c01d91a1: 19 c0 sbb %eax,%eax</pre>I'm not yet familiar enough with kernel hacker tools to associate thosewith lines of code in /usr/src/linux/drivers/char/devpoll.c, butthat 'lock btr' hotspot appears to be the call to test_and_clear_bit().<h2><a name="lmbench">lmbench results</a></h2>lmbench results are presented here to help people trying to comparethe Intel and Sparc parts of the results shown above.<p>The source used was lmbench-2alpha10 from bitmover.com.I did not check into why the TCP test failed on the linux box.<pre> L M B E N C H 1 . 9 S U M M A R Y ------------------------------------ (Alpha software, do not distribute) Processor, Processes - times in microseconds - smaller is better----------------------------------------------------------------Host OS Mhz null null open selct sig sig fork exec sh call I/O stat clos inst hndl proc proc proc--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----sparc-sun SunOS 5.7 167 2.9 12. 48 55 0.40K 6.6 81 3.8K 15K 32Ki686-linu Linux 2.2.14d 651 0.5 0.8 4 5 0.03K 1.4 2 0.3K 1K 6K Context switching - times in microseconds - smaller is better-------------------------------------------------------------Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw--------- ------------- ----- ------ ------ ------ ------ ------- -------sparc-sun SunOS 5.7 19 69 235 114 349 116 367i686-linu Linux 2.2.14d 1 5 17 5 129 30 129 *Local* Communication latencies in microseconds - smaller is better-------------------------------------------------------------------Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP ctxsw UNIX UDP TCP conn--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----sparc-sun SunOS 5.7 19 60 120 197 215 1148i686-linu Linux 2.2.14d 1 7 13 31 80 File & VM system latencies in microseconds - smaller is better--------------------------------------------------------------Host OS 0K File 10K File Mmap Prot Page Create Delete Create Delete Latency Fault Fault--------- ------------- ------ ------ ------ ------ ------- ----- -----sparc-sun SunOS 5.7 6605 15 5.2Ki686-linu Linux 2.2.14d 10 0 19 1 5968 1 0.5K *Local* Communication bandwidths in MB/s - bigger is better-----------------------------------------------------------Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem UNIX reread reread (libc) (hand) read write--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----sparc-sun SunOS 5.7 60 55 54 84 122 177 89 122 141i686-linu Linux 2.2.14d 528 366 -1 357 451 150 138 451 171 Memory latencies in nanoseconds - smaller is better (WARNING - may not be correct, check graphs)---------------------------------------------------Host OS Mhz L1 $ L2 $ Main mem Guesses--------- ------------- --- ---- ---- -------- -------sparc-sun SunOS 5.7 167 12 59 273 i686-linu Linux 2.2.14d 651 4 10 131</pre><hr><i>Dan Kegel</i><br><a href="http://www.kegel.com/">www.kegel.com</a></body></html>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -