|(uninit)|      (padding)     (1 byte)
|i_addr2 |      instr_addr    (4 bytes)
|0       |      I.a           (8 bytes)
|0       |      I.m1          (8 bytes)
|0       |      I.m2          (8 bytes)
|0       |      D.a           (8 bytes)
|0       |      D.m1          (8 bytes)
|0       |      D.m2          (8 bytes)</pre>
<p>(Note that this step is not performed if a basic block is re-translated; see <a href="cg-tech-docs.html#cg-tech-docs.retranslations">Handling basic block retranslations</a> for more information.)</p>
<p>GCC inserts padding before the <code class="computeroutput">instr_size</code> field so that it is word aligned.</p>
<p>The instrumentation added to call the cache simulation function looks like this (instrumentation is indented to distinguish it from the original UCode):</p>
<pre class="programlisting">MOVL      $0x0, t20
PUTL      t20, %EAX
  PUSHL     %eax
  PUSHL     %ecx
  PUSHL     %edx
  MOVL      $0x4091F8A4, t46  # address of 1st CC
  PUSHL     t46
  CALLMo    $0x12             # second cachesim function
  CLEARo    $0x4
  POPL      %edx
  POPL      %ecx
  POPL      %eax
INCEIPo   $5
LEA1L     -4(t4), t14
MOVL      $0x99, t18
  MOVL      t14, t42
STL       t18, (t14)
  PUSHL     %eax
  PUSHL     %ecx
  PUSHL     %edx
  PUSHL     t42
  MOVL      $0x4091F8C4, t44  # address of 2nd CC
  PUSHL     t44
  CALLMo    $0x13             # second cachesim function
  CLEARo    $0x8
  POPL      %edx
  POPL      %ecx
  POPL      %eax
INCEIPo   $7</pre>
<p>Consider the first instruction's UCode. Each call is surrounded by three <code class="computeroutput">PUSHL</code> and <code class="computeroutput">POPL</code> instructions to save and restore the caller-save registers. Then the address of the instruction's cost centre is pushed onto the stack, to be the first argument to the cache simulation function. The address is known at this point because we are doing a simultaneous pass through the cost centre array. This means the cost centre lookup for each instruction is almost free (just the cost of pushing an argument for a function call).
Then the call to the cachesimulation function for non-memory-reference instructions is made(note that the <code class="computeroutput">CALLMo</code>UInstruction takes an offset into a table of predefinedfunctions; it is not an absolute address), and the singleargument is <code class="computeroutput">CLEAR</code>ed from thestack.</p><p>The second instruction's UCode is similar. The onlydifference is that, as mentioned before, we have to pass theaddress of the data item referenced to the cache simulationfunction too. This explains the <code class="computeroutput">MOVL t14,t42</code> and <code class="computeroutput">PUSHLt42</code> UInstructions. (Note that the seeminglyredundant <code class="computeroutput">MOV</code>ing will probablybe optimised away during register allocation.)</p><p>Note that instead of storing unchanging information abouteach instruction (instruction size, data size, etc) in its costcentre, we could have passed in these arguments to the simulationfunction. But this would slow the calls down (two or three extraarguments pushed onto the stack). Also it would bloat the UCodeinstrumentation by amounts similar to the space required for themin the cost centre; bloated UCode would also fill the translationcache more quickly, requiring more translations for largeprograms and slowing them down more.</p></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="cg-tech-docs.retranslations"></a>2.5.燞andling basic block retranslations</h2></div></div></div><p>The above description ignores one complication. Valgrindhas a limited size cache for basic block translations; if itfills up, old translations are discarded. If a discarded basicblock is executed again, it must be re-translated.</p><p>However, we can't use this approach for profiling -- wecan't throw away cost centres for instructions in the middle ofexecution! 
So when a basic block is translated, we first look for its cost centre array in the hash table. If there is no cost centre array, it must be the first translation, so we proceed as described above. But if there is a cost centre array already, it must be a retranslation. In this case, we skip the cost centre allocation and initialisation steps, but still do the UCode instrumentation step.</p></div>
<div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="cg-tech-docs.cachesim"></a>2.6.&nbsp;The cache simulation</h2></div></div></div>
<p>The cache simulation is fairly straightforward. It just tracks which memory blocks are in the cache at the moment (it doesn't track the contents, since that is irrelevant).</p>
<p>The interface to the simulation is quite clean. The functions called from the UCode contain calls to the simulation functions in the files <code class="filename">vg_cachesim_{I1,D1,L2}.c</code>; these calls are inlined so that only one function call is done per simulated x86 instruction. The file <code class="filename">vg_cachesim.c</code> simply <code class="computeroutput">#include</code>s the three files containing the simulation, which makes plugging in new cache simulations very easy -- you just replace the three files and recompile.</p></div>
<div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="cg-tech-docs.output"></a>2.7.&nbsp;Output</h2></div></div></div>
<p>Output is fairly straightforward, basically printing the cost centre for every instruction, grouped by files and functions. Total counts (eg.
total cache accesses, total L1 misses) are calculated when traversing this structure rather than during execution, to save time; the cache simulation functions are called so often that even one or two extra adds can make a sizeable difference.</p>
<p>The input file has the following format:</p>
<pre class="programlisting">file         ::= desc_line* cmd_line events_line data_line+ summary_line
desc_line    ::= "desc:" ws? non_nl_string
cmd_line     ::= "cmd:" ws? cmd
events_line  ::= "events:" ws? (event ws)+
data_line    ::= file_line | fn_line | count_line
file_line    ::= ("fl=" | "fi=" | "fe=") filename
fn_line      ::= "fn=" fn_name
count_line   ::= line_num ws? (count ws)+
summary_line ::= "summary:" ws? (count ws)+
count        ::= num | "."</pre>
<p>Where:</p>
<div class="itemizedlist"><ul type="disc">
<li><p><code class="computeroutput">non_nl_string</code> is any string not containing a newline.</p></li>
<li><p><code class="computeroutput">cmd</code> is a command line invocation.</p></li>
<li><p><code class="computeroutput">filename</code> and <code class="computeroutput">fn_name</code> can be anything.</p></li>
<li><p><code class="computeroutput">num</code> and <code class="computeroutput">line_num</code> are decimal numbers.</p></li>
<li><p><code class="computeroutput">ws</code> is whitespace.</p></li>
<li><p><code class="computeroutput">nl</code> is a newline.</p></li>
</ul></div>
<p>The contents of the "desc:" lines are printed out at the top of the summary. This is a generic way of providing simulation-specific information, eg. for giving the cache configuration for cache simulation.</p>
<p>Counts can be "." to represent "N/A", eg. the number of write misses for an instruction that doesn't write to memory.</p>
<p>The number of counts in each <code class="computeroutput">count_line</code> and the <code class="computeroutput">summary_line</code> should not exceed the number of events in the <code class="computeroutput">events_line</code>.
If the number in each <code class="computeroutput">count_line</code> is less, cg_annotate treats those missing as though they were a "." entry.</p>
<p>A <code class="computeroutput">file_line</code> changes the current file name. A <code class="computeroutput">fn_line</code> changes the current function name. A <code class="computeroutput">count_line</code> contains counts that pertain to the current filename/fn_name. A "fl=" <code class="computeroutput">file_line</code> and a <code class="computeroutput">fn_line</code> must appear before any <code class="computeroutput">count_line</code>s to give the context of the first <code class="computeroutput">count_line</code>s.</p>
<p>Each <code class="computeroutput">file_line</code> should be immediately followed by a <code class="computeroutput">fn_line</code>. "fi=" <code class="computeroutput">file_lines</code> are used to switch filenames for inlined functions; "fe=" <code class="computeroutput">file_lines</code> are similar, but are put at the end of a basic block in which the file name hasn't been switched back to the original file name. (fi and fe lines behave the same; they are only distinguished to help debugging.)</p></div>
<div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="cg-tech-docs.summary"></a>2.8.&nbsp;Summary of performance features</h2></div></div></div>
<p>Quite a lot of work has gone into making the profiling as fast as possible.
This is a summary of the important features:</p>
<div class="itemizedlist"><ul type="disc">
<li><p>The basic block-level cost centre storage allows almost free cost centre lookup.</p></li>
<li><p>Only one function call is made per instruction simulated; even this accounts for a sizeable percentage of execution time, but it seems unavoidable if we want flexibility in the cache simulator.</p></li>
<li><p>Unchanging information about an instruction is stored in its cost centre, avoiding unnecessary argument pushing, and minimising UCode instrumentation bloat.</p></li>
<li><p>Summary counts are calculated at the end, rather than during execution.</p></li>
<li><p>The <code class="computeroutput">cachegrind.out</code> output files can contain huge amounts of information; the file format was carefully chosen to minimise file sizes.</p></li>
</ul></div></div>
<div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="cg-tech-docs.annotate"></a>2.9.&nbsp;Annotation</h2></div></div></div>
<p>Annotation is done by cg_annotate. It is a fairly straightforward Perl script that slurps up all the cost centres, and then runs through all the chosen source files, printing out cost centres with them. It too has been carefully optimised.</p></div>
<div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="cg-tech-docs.extensions"></a>2.10.&nbsp;Similar work, extensions</h2></div></div></div>
<p>It would be relatively straightforward to do other simulations and obtain line-by-line information about interesting events. A good example would be branch prediction -- all branches could be instrumented to interact with a branch prediction simulator, using very similar techniques to those described above.</p>
<p>In particular, cg_annotate would not need to change -- the file format is such that it is not specific to the cache simulation, but could be used for any kind of line-by-line information.
The only part of cg_annotate that is specific to the cache simulation is the name of the input file (<code class="computeroutput">cachegrind.out</code>), although it would be very simple to add an option to control this.</p></div></div><div><br><table class="nav" width="100%" cellspacing="3" cellpadding="2" border="0" summary="Navigation footer"><tr><td rowspan="2" width="40%" align="left"><a accesskey="p" href="mc-tech-docs.html"><<