📄 performancelog
字号:
From mk Sat Jun 30 13:06:33 2001Received: from mast.cwi.nl (mast.cwi.nl [192.16.196.89]) by hera.cwi.nl with ESMTP id NAA15285 for <martin.kersten@mail.cwi.nl>; Sat, 30 Jun 2001 13:06:32 +0200 (MEST)Received: from echolood.cwi.nl (IDENT:root@echolood.cwi.nl [192.16.196.65]) by mast.cwi.nl (8.11.2/8.9.3/FLW-3.11M) with ESMTP id f5UB6Wq13328 for <Martin.Kersten@cwi.nl>; Sat, 30 Jun 2001 13:06:32 +0200From: Martin Kersten <Martin.Kersten@cwi.nl>Received: (from mk@localhost) by echolood.cwi.nl (8.11.0/8.9.3/FLW-3.2C) id f5UB6WI24560 for Martin.Kersten@cwi.nl; Sat, 30 Jun 2001 13:06:32 +0200Message-Id: <200106301106.f5UB6WI24560@echolood.cwi.nl>Subject: Re: performanceTo: Martin.Kersten@cwi.nl (Martin Kersten)Date: Sat, 30 Jun 2001 13:06:31 +0200 (CEST)In-Reply-To: <200106292337.f5TNbhl29637@echolood.cwi.nl> from "Martin Kersten" at Jun 30, 2001 01:37:43 AMX-Mailer: ELM [version 2.5 PL3]MIME-Version: 1.0Content-Type: text/plain; charset=us-asciiContent-Transfer-Encoding: 7bitStatus: ROForwarded message:> > here the results of the little experiment of this afternoon,> after trying to answer the question "why does it take 0.8 sec> to start Mserver on Stefan's machine compared to 2.7 sec on> Martin's"> > On Stefan's machine (dual 550Mhz)> V4-3 -O6 V4.3 -O2 V5.0 -O2> base 0.8/0.2/0.6 0.8/0.2/0.5 0.8/0.2/0.5> tst400a 7.7/6.5/0.6 7.0/6.4/0.5 1.8/1.2/0.5> tst400b 3.3/2/5/0.8 3.0/2.4/0.6 1.6/0.7/0.8> tst400c 14.2/13.3/0.9 14.5/13/9/0.5 1.9/0.7/0.9> tst400d 9.2/8.6/0.5 10.1/9.3/0.7 0.8/0.1/0.6> > On Martin's machine (450 Mhz)> V4.3 -O2 V5.0 -O2> base 2.7/0.3/2.4 2.6/0.25/2.3> tst400a 8.7/6.3/2.4 3.5/1.2/2.3> tst400b 5.0/2.5/2.5 3.2/0.7/2.5> tst400c 5.6/3.0/2.6 3.2/0.6/2.6> tst400d 3.0/0.5/2.5 2.6/0.2/2.4> > Observations:> - original question not answered> - tst c+d involve b.insert(1,1), which uses a simplified> reference counting scheme for bats (no BBPfix() in module interface)> - synchronization seems really expensive on the dual> - handling empty 1M loop in M4 doesn't align with tst400d loop> - new parser is quite effective> The situation on June 30, 2001, calling from homeOn Martin's machine (450 Mhz) V5.0 -O2base 2.6/0.2/2.4 tst400b 2.9/0.5/2.4 10k counttst400c 3.0/0.5/2.5 10k inserttst400d 2.6/0.2/2.4 100K insert tupletst400e 7.0/3.6/2.4 1M mal call increment functionTimes at 25/8/01 echolood(450Mz load 1.5) Version 4.3tst400a 3.3/0.9/2.4 1M loop 8.7/ 6.2/2.4tst400bHuge 13 /7.3/5.8 100k count (io) 24.8/11.9/4.1tst400cHuge 12.3/6.5/5.6 100k insert(io) 22.0/16.0/3.6tst400d 4.9/2.3/2.6 1M insert tuple 31.8/26.5/4.8tst400e 5.2/2.8/2.4 1M incr fcn 16.2/12.2/2.4 gaffel (2x400Mhz Celeron load 0.0) Version 4.3tst400a 1.5/0.9/0.6 1M loop 7.1/ 6.6/0.5tst400bHuge 13 /8.2/4.6 100k count (io) 14.6/13.0/1.5tst400cHuge 12 /8.2/4.8 100k insert(io) 17.6/15.5/2.1tst400d 3.4/2.5/0.9 1M insert tuple 33.0/29.4/3.6tst400e 3.7/3.0/0.6 1M incr fcn 13.1/12.6/0.5 bezaan ( Athlon load 1.5) Version 4.3base 11.8/0.4/5.6 0m13/0.9/5.7tst400a 22.5/5.1/6.0 1M loop 1m21/38/5.6tst400bHuge 1m28/32/15 100k count (io) 2m02/56/8.1tst400cHuge 1m30/31/15 100k insert(io) 2m12/1m01/9.4tst400d 0m38/13/7 1M insert tuple 4m02/1m53/13.9tst400e 0m36/16/6 1M incr fcn 2m20/1m06/5.7 thor (XEON load 0.0) Version 4.3base 0.5/0.2/0.3 0.9/0.3/0.4tst400a 1.1/0.7/0.4 1M loop 5.1/4.7/0.4tst400bHuge 6.4/4.3/1.9 100k count (io) 10.5/9.6/0.9tst400cHuge 6.4/4.1/2.0 100k insert(io) 12.8/11.4/1.4tst400d 2.4/1.7/0.6 1M insert tuple 25.0/22.4/2.5tst400e 2.7/2.2/0.4 1M incr fcn 9.6/9.2/0.4Times as of 17-09-01 echolood(450Mz load 0) Version 4.3base 2.5/0.2/2.3tst400a 3.0/0.7/2.3 1M loop 8.7/ 6.2/2.4tst400a2 3.4/1.0/2.3 1M loop inctst400bHuge 4.8/2.4/2.4 100k count (io) 24.8/11.9/4.1tst400cHuge 5.4/2.9/2.5 100k insert(io) 22.0/16.0/3.6tst400d 4.5/1.9/2.5 1M insert tuple 31.8/26.5/4.8tst400e 4.9/2.6/2.3 1M incr fcn 16.2/12.2/2.4Multiplex operationsWe assume a bat[lng,lng] with 1M elements in increasing order.tst901a contains a fast implementation assuming aligned batsin both V5 and V4.3tst901b uses the MAL/MIL generic loop to achieve the same resultFor V5 a C-version was also constructed, which assumed acode generation technique based on translation of each individualMAL statement to the relevant C-statements. No inter-statementanalysis was performed. It represents what would be achievedby a 'naive' expansion to a C-function.The running time for this function was 3.6 seconds.Compiled version as part of batcalc runs in 3.6sectst901a 9.4/5.6/3.6 [+] call 35.7/30.4/5.0 make 5.7 32.4 multiplex 0.31 0.35tst901b 20.6/16.3/4.3 explicit loop 82.8/68.5/7.3 make 5.5 41.5 multiplex 14.6 59.6Multiplex library calls,Assume bats with 1M unique elements, all results in milliseconds Monet V5.0 Monet V4.3tst901d [:+=] [dbl] [log] [==] [:+=] [dbl] [log] [==][void,lng] 110 240 630 144 960 720 220 110 258 650 234 130 967 727 224 108 253 657 240 130 965 726 225[lng,lng] 140 280 710 143 1020 880 345 142 300 696 381 164 1136 974 402The operation [+](lng,lng) costs 180 and 346 micro seconds, respectively.The BATcopy underlying this operation takes: 141 and 269 ms.Although [+] == batcopy followed by [:+=], this is not cost effective(180 ms versus 141+140)C- implementation of a trivial iterator, compile time 0.2 sec[ 1, 639855.00] profiler.setAll();[ 1, 99.00] b := mal.new(lng,lng);[ 1, 26.00] t0 := system.usec();[ 1, 3.00] i := -1;[ 1, 3.00] t0 := system.usec();[1000001, 2.56] barrier v := mal.nextElement(i,0,1000000);[1000000, 6.97] bat.insert(b,i,i);[1000000, 2.13] redo v ;[ 1, 5.00] t1 := system.usec();[ 1, 118475.00] c := algebra.copy(b);[ 1, 7.00] t2 := system.usec();[ 1, 111.00] d := mal.new(lng,lng);[ 1, 5.00] h := 0;[ 1, 3.00] t := 0;[1000001, 4.55] barrier mloop := mal.bunStream(b,h,t);[1000000, 5.55] B2 := algebra.find(b,h);[1000000, 4.83] B3 := algebra.find(c,h);[1000000, 2.38] cr := calc.+(B2,B3);[1000000, 7.85] bat.insert(d,h,cr);[1000000, 2.24] catch GDKerror;[1000000, 2.30] redo mloop ;[ 1, 5.00] t3 := system.usec();[ 1, 3714520.00] batcalc.CMD000(b,b);[ 1, 8.00] t4 := system.usec();[ 1, 5.00] d1 := calc.-(t1,t0);[ 1, 3.00] d2 := calc.-(t3,t2);[ 1, 3.00] d3 := calc.-(t4,t3);[ 1, 39.00] cnt := aggr.count(d);[ 1, 123.00] system.printf("cnt %d ",cnt);[ 1, 19.00] system.printf("make %d ",d1);[ 1, 79.00] system.printf("multiplex %d\n",d2);[ 1, 87.00] system.printf("compiled multiplex %d\n",d3);[ 1, 2.00] end function;--------------- MARCH 2002 ---------------------------Just before we are moving to the Athlon desktopsCalling convention with -monetrc .../monet.confBoth servers use a 'hot' measurements on quite machineTimes as of 22-3-03 echolood(450Mz load 0) Version 4.3base 0.7/0.3/0.4 0.7/ 0.3/0.4tst400a 1.3/0.9/0.4 1M loop 8.2/ 7.9/0.4tst400a2 1.3/1.3/0.4 1M loop inctst400bHuge 3.7/3.0/0.5 100k count (io) 17.6/15.8/1.3tst400cHuge 3.8/3.2/0.4 100k insert(io) 22.0/18.0/1.6tst400d 6.5/5.1/1.4 1M insert tuple 38.2/34.7/2.8tst400e 3.8/3.4/0.4 1M incr fcn 15.6/15.2/0.4tst901a 8.3/6.7/1.5 [+] call 39.4/36.3/2.9tst901b 27.4/23.4/3.6 [+] call 119.9/112.4/6.3mk@gaffel::~/monet_5-0/src/mal/Tests> time Mserver -monetrc ../../../monet.conf tst400a </dev/null >/dev/null gaffel(2x400Mhz Celeron load 0.5) Version 4.3base 0.97/0.4/0.53 1.0/ 0.4/0.57tst400a 1.6/1.0/0.6 1M loop 9.7/ 9.2/0.5tst400a2 2.1/1.6/0.5 1M loop inctst400bHuge 4.3/3.5/0.7 100k count (io) 18.7/16.9/1.7tst400cHuge 4.6/3.7/0.8 100k insert(io) 22.6/20.5/2.1tst400d 7.7/5.9/1.7 1M insert tuple 39.2/36.2/3.5tst400e 4.5/3.8/0.5 1M incr fcn 17.7/17.2/0.5tst901a 9.2/6.8/2.4 [+] call 40.4/36.6/3.7tst901b 27.0/23.4/3.5 [+] call 118.8/110.5/7.4Overall picture is a loss of performance of about 25% onboth sides compared to Sept 2001. Since V5 and V4.3 use the sameGDK stuff, I suspect this to be a result of Linux kernel changes.Outlier are cHuge and tst400d. Both test insertion in a BAT.The overall loss is in the order of 10%. Furthermore, tst400don V5 illustrates unexpected behavior. tst400d: changing the BBPfix in CMDinsert into a direct update of the reference count saved 0.7 usr sec. In combination with BBPunfix part of the garbagecollector being called after before each instruction this means that 25% of the time the test is involved in locking activity. It seems that locking has become significantly more expensive. (this might also explain the slow start of gdb on Mserver)The chains in the symbol table have become quite large(mdb->scope)Forexample, reaching +() requires passing 15 modules and 35 signature teston average.If you identify the module calc.+() you skip one module for inspectionand save around 300 ms/100K calls in tst400bHuge.If you could 'jump' to the correct signature within the module thenyou would save another 300ms. This means that type resolution in tst400bHugetakes about 20% of the resources.insert() requires 16 module passes and 63 (!) signature comparisons on avg.This leads to a difference in tst400cHuge of 970 ms/100calls about 25% of testMinor update to source (before/after/final- hooks)and better stack preparation in mal_interpetertst400a 1.2/0.8/0.4 1M loop tst400a2 1.6/1.2/0.4 1M loop inctst400bHuge 3.3/2.8/0.5 100k count (io) tst400cHuge 3.4/2.8/0.6 100k insert(io) tst400d 6.6/5.4/1.2 1M insert tuple tst400e 3.0/2.6/0.4 1M incr fcn tst901a 7.6/5.9/1.7 [+] call tst901b 24.2/21.1/3.0 [+] call The first hour of the dual Athlon and V4.3 compiled with -O2!!! orion(2x1.4Mh load 0) base 0.3/0.1/0.2 0.29/0.11/0.18tst400a 0.45/0.3/0.15 1M loop 2.0/1.89/0.14tst400a2 0.6/0.4/0.2 1M loop inc tst400bHuge 1.26/1.0/0.22 100k count (io) 4.1/3.90/0.27tst400cHuge 1.3/1.0/0.28 100k insert(io) 4.5/4.11/0.47tst400d 1.4/1.0/0.4 1M insert tuple 7.0/6.1/0.88tst400e 1.1/0.9/0.2 1M incr fcn 3.8/3.63/0.21tst901a 1.9/1.5/0.4 [+] call 7.1/6.19/0.91tst901b 5.9/5.4/0.5 [+] call 19/17.8/1.64Alignment of the compilation flags illustrates that V5 is about 4 times faster on the tests(previous tests indicate 3.3-7 (17/9) and 3.9-5 (22/3)======================== 15 March 2003 ===================The compilation mode for both V5 and V4.3.7 is CFLAGS=-O2 command: time Mserver -c perf.conf TST </dev/null >/dev/null orion(2x1.4Mh load 0) base 0.33/0.13/0.20 0.28/0.8/0.20tst400a 0.49/0.29/0.19 1M loop 2.0/1.84/0.18tst400a2 0.62/0.44/0.18 1M loop inc tst400bHuge 1.47/1.2/0.24 100k count (io) 5.3/4.90/0.44tst400cHuge 1.55/1.3/0.24 100k insert(io) 6.0/5.60/0.42tst400d 1.4/1.1/0.3 1M insert tuple 7.3/6.6/0.68tst400e 1.2/1.0/0.2 1M incr fcn 3.7/3.6/0.19tst901a 1.9/1.5/0.4 [+] call 7.4/6.7/0.7tst901b 5.5/5.0/0.5 [+] call 20/18.5/1.64======================== 13 April 2003 ===================The compilation mode for both V5 and V4.3.7 is CFLAGS=-O2 command: time Mserver -c perf.conf TST </dev/null >/dev/null orion(2x1.4Mh load 0) base 0.33/0.13/0.20 0.28/0.8/0.20tst400a 0.52/0.31/0.20 1M loop 2.0/1.84/0.18tst400a2 0.65/0.44/0.19 1M loop inc tst400bHuge 1.10/0.85/0.24 100k count (io) 5.3/4.90/0.44tst400cHuge 1.32/1.08/0.24 100k insert(io) 6.0/5.60/0.42tst400d 1.36/1.07/0.29 1M insert tuple 7.3/6.6/0.68tst400e 1.32/1.12/0.2 1M incr fcn 3.7/3.6/0.19tst901a 1.9/1.3/0.6 [+] call 7.4/6.7/0.7tst901b 4.3/4.0/0.26 [+] call 20/18.5/1.64
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -