📄 cg-tech-docs.html
字号:
<html xmlns:cf="http://docbook.sourceforge.net/xmlns/chunkfast/1.0"><head><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"><title>2.燞ow Cachegrind works</title><link rel="stylesheet" href="vg_basic.css" type="text/css"><meta name="generator" content="DocBook XSL Stylesheets V1.69.0"><link rel="start" href="index.html" title="Valgrind Documentation"><link rel="up" href="tech-docs.html" title="Valgrind Technical Documentation"><link rel="prev" href="mc-tech-docs.html" title="1.燭he Design and Implementation of Valgrind"><link rel="next" href="cl-format.html" title="3.燙allgrind Format Specification"></head><body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF"><div><table class="nav" width="100%" cellspacing="3" cellpadding="3" border="0" summary="Navigation header"><tr><td width="22px" align="center" valign="middle"><a accesskey="p" href="mc-tech-docs.html"><img src="images/prev.png" width="18" height="21" border="0" alt="Prev"></a></td><td width="25px" align="center" valign="middle"><a accesskey="u" href="tech-docs.html"><img src="images/up.png" width="21" height="18" border="0" alt="Up"></a></td><td width="31px" align="center" valign="middle"><a accesskey="h" href="index.html"><img src="images/home.png" width="27" height="20" border="0" alt="Up"></a></td><th align="center" valign="middle">Valgrind Technical Documentation</th><td width="22px" align="center" valign="middle"><a accesskey="n" href="cl-format.html"><img src="images/next.png" width="18" height="21" border="0" alt="Next"></a></td></tr></table></div><div class="chapter" lang="en"><div class="titlepage"><div><div><h2 class="title"><a name="cg-tech-docs"></a>2.燞ow Cachegrind works</h2></div></div></div><div class="toc"><p><b>Table of Contents</b></p><dl><dt><span class="sect1"><a href="cg-tech-docs.html#cg-tech-docs.profiling">2.1. Cache profiling</a></span></dt><dt><span class="sect1"><a href="cg-tech-docs.html#cg-tech-docs.costcentres">2.2. Cost centres</a></span></dt><dt><span class="sect1"><a href="cg-tech-docs.html#cg-tech-docs.ccstore">2.3. Storing cost-centres</a></span></dt><dt><span class="sect1"><a href="cg-tech-docs.html#cg-tech-docs.instrum">2.4. Instrumentation</a></span></dt><dt><span class="sect1"><a href="cg-tech-docs.html#cg-tech-docs.retranslations">2.5. Handling basic block retranslations</a></span></dt><dt><span class="sect1"><a href="cg-tech-docs.html#cg-tech-docs.cachesim">2.6. The cache simulation</a></span></dt><dt><span class="sect1"><a href="cg-tech-docs.html#cg-tech-docs.output">2.7. Output</a></span></dt><dt><span class="sect1"><a href="cg-tech-docs.html#cg-tech-docs.summary">2.8. Summary of performance features</a></span></dt><dt><span class="sect1"><a href="cg-tech-docs.html#cg-tech-docs.annotate">2.9. Annotation</a></span></dt><dt><span class="sect1"><a href="cg-tech-docs.html#cg-tech-docs.extensions">2.10. Similar work, extensions</a></span></dt></dl></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="cg-tech-docs.profiling"></a>2.1.燙ache profiling</h2></div></div></div><p>[Note: this document is now very old, and a lot of its contents are outof date, and misleading.]</p><p>Valgrind is a very nice platform for doing cache profilingand other kinds of simulation, because it converts horrible x86instructions into nice clean RISC-like UCode. For example, forcache profiling we are interested in instructions that read andwrite memory; in UCode there are only four instructions that dothis: <code class="computeroutput">LOAD</code>,<code class="computeroutput">STORE</code>,<code class="computeroutput">FPU_R</code> and<code class="computeroutput">FPU_W</code>. By contrast, because ofthe x86 addressing modes, almost every instruction can read orwrite memory.</p><p>Most of the cache profiling machinery is in the file<code class="filename">vg_cachesim.c</code>.</p><p>These notes are a somewhat haphazard guide to howValgrind's cache profiling works.</p></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="cg-tech-docs.costcentres"></a>2.2.燙ost centres</h2></div></div></div><p>Valgrind gathers cache profiling about every instructionexecuted, individually. Each instruction has a <span><strong class="command">costcentre</strong></span> associated with it. There are two kinds of costcentre: one for instructions that don't reference memory(<code class="computeroutput">iCC</code>), and one for instructionsthat do (<code class="computeroutput">idCC</code>):</p><pre class="programlisting">typedef struct _CC { ULong a; ULong m1; ULong m2;} CC;typedef struct _iCC { /* word 1 */ UChar tag; UChar instr_size; /* words 2+ */ Addr instr_addr; CC I;} iCC; typedef struct _idCC { /* word 1 */ UChar tag; UChar instr_size; UChar data_size; /* words 2+ */ Addr instr_addr; CC I; CC D; } idCC; </pre><p>Each <code class="computeroutput">CC</code> has three fields<code class="computeroutput">a</code>,<code class="computeroutput">m1</code>,<code class="computeroutput">m2</code> for recording references,level 1 misses and level 2 misses. Each of these is a 64-bit<code class="computeroutput">ULong</code> -- the numbers can getvery large, ie. greater than 4.2 billion allowed by a 32-bitunsigned int.</p><p>A <code class="computeroutput">iCC</code> has one<code class="computeroutput">CC</code> for instruction cacheaccesses. A <code class="computeroutput">idCC</code> has two, onefor instruction cache accesses, and one for data cacheaccesses.</p><p>The <code class="computeroutput">iCC</code> and<code class="computeroutput">dCC</code> structs also storeunchanging information about the instruction:</p><div class="itemizedlist"><ul type="disc"><li><p>An instruction-type identification tag (explained below)</p></li><li><p>Instruction size</p></li><li><p>Data reference size (<code class="computeroutput">idCC</code> only)</p></li><li><p>Instruction address</p></li></ul></div><p>Note that data address is not one of the fields for<code class="computeroutput">idCC</code>. This is because for manymemory-referencing instructions the data address can change eachtime it's executed (eg. if it uses register-offset addressing).We have to give this item to the cache simulation in a differentway (see Instrumentation section below). Some memory-referencinginstructions do always reference the same address, but we don'ttry to treat them specialy in order to keep things simple.</p><p>Also note that there is only room for recording info aboutone data cache access in an<code class="computeroutput">idCC</code>. So what aboutinstructions that do a read then a write, such as:</p><pre class="programlisting">inc %(esi)</pre><p>In a write-allocate cache, as simulated by Valgrind, thewrite cannot miss, since it immediately follows the read whichwill drag the block into the cache if it's not already there. Sothe write access isn't really interesting, and Valgrind doesn'trecord it. This means that Valgrind doesn't measure memoryreferences, but rather memory references that could miss in thecache. This behaviour is the same as that used by the AMD Athlonhardware counters. It also has the benefit of simplifying theimplementation -- instructions that read and write memory can betreated like instructions that read memory.</p></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="cg-tech-docs.ccstore"></a>2.3.燬toring cost-centres</h2></div></div></div><p>Cost centres are stored in a way that makes them very cheapto lookup, which is important since one is looked up for everyoriginal x86 instruction executed.</p><p>Valgrind does JIT translations at the basic block level,and cost centres are also setup and stored at the basic blocklevel. By doing things carefully, we store all the cost centresfor a basic block in a contiguous array, and lookup comes almostfor free.</p><p>Consider this part of a basic block (for expositionpurposes, pretend it's an entire basic block):</p><pre class="programlisting">movl $0x0,%eaxmovl $0x99, -4(%ebp)</pre><p>The translation to UCode looks like this:</p><pre class="programlisting">MOVL $0x0, t20PUTL t20, %EAXINCEIPo $5LEA1L -4(t4), t14MOVL $0x99, t18STL t18, (t14)INCEIPo $7</pre><p>The first step is to allocate the cost centres. Thisrequires a preliminary pass to count how many x86 instructionswere in the basic block, and their types (and thus sizes). UCodetranslations for single x86 instructions are delimited by the<code class="computeroutput">INCEIPo</code> instruction, theargument of which gives the byte size of the instruction (notethat lazy INCEIP updating is turned off to allow this).</p><p>We can tell if an x86 instruction references memory bylooking for <code class="computeroutput">LDL</code> and<code class="computeroutput">STL</code> UCode instructions, and thuswhat kind of cost centre is required. From this we can determinehow many cost centres we need for the basic block, and theirsizes. We can then allocate them in a single array.</p><p>Consider the example code above. After the preliminarypass, we know we need two cost centres, one<code class="computeroutput">iCC</code> and one<code class="computeroutput">dCC</code>. So we allocate an array tostore these which looks like this:</p><pre class="programlisting">|(uninit)| tag (1 byte)|(uninit)| instr_size (1 bytes)|(uninit)| (padding) (2 bytes)|(uninit)| instr_addr (4 bytes)|(uninit)| I.a (8 bytes)|(uninit)| I.m1 (8 bytes)|(uninit)| I.m2 (8 bytes)|(uninit)| tag (1 byte)|(uninit)| instr_size (1 byte)|(uninit)| data_size (1 byte)|(uninit)| (padding) (1 byte)|(uninit)| instr_addr (4 bytes)|(uninit)| I.a (8 bytes)|(uninit)| I.m1 (8 bytes)|(uninit)| I.m2 (8 bytes)|(uninit)| D.a (8 bytes)|(uninit)| D.m1 (8 bytes)|(uninit)| D.m2 (8 bytes)</pre><p>(We can see now why we need tags to distinguish between thetwo types of cost centres.)</p><p>We also record the size of the array. We look up the debuginfo of the first instruction in the basic block, and then stickthe array into a table indexed by filename and function name.This makes it easy to dump the information quickly to file at theend.</p></div><div class="sect1" lang="en"><div class="titlepage"><div><div><h2 class="title" style="clear: both"><a name="cg-tech-docs.instrum"></a>2.4.營nstrumentation</h2></div></div></div><p>The instrumentation pass has two main jobs:</p><div class="orderedlist"><ol type="1"><li><p>Fill in the gaps in the allocated cost centres.</p></li><li><p>Add UCode to call the cache simulator for each instruction.</p></li></ol></div><p>The instrumentation pass steps through the UCode and thecost centres in tandem. As each original x86 instruction's UCodeis processed, the appropriate gaps in the instructions costcentre are filled in, for example:</p><pre class="programlisting">|INSTR_CC| tag (1 byte)|5 | instr_size (1 bytes)|(uninit)| (padding) (2 bytes)|i_addr1 | instr_addr (4 bytes)|0 | I.a (8 bytes)|0 | I.m1 (8 bytes)|0 | I.m2 (8 bytes)|WRITE_CC| tag (1 byte)|7 | instr_size (1 byte)|4 | data_size (1 byte)
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -