<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head> <title></title> <link rel="stylesheet" media="screen" type="text/css" href="./style.css" /> <link rel="stylesheet" media="screen" type="text/css" href="./design.css" /> <link rel="stylesheet" media="print" type="text/css" href="./print.css" /> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body><a href="start.html">start</a><br /><div class="toc"><div class="tocheader toctoggle" id="toc__header">Table of Contents</div><div id="toc__inside"><ul class="toc"><li class="level1"><div class="li"><span class="li"><a href="#cache_management" class="toc">Cache Management</a></span></div><ul class="toc"><li class="level2"><div class="li"><span class="li"><a href="#memory_overview" class="toc">Memory Overview</a></span></div></li><li class="level2"><div class="li"><span class="li"><a href="#cache_overview" class="toc">Cache Overview</a></span></div><ul class="toc"><li class="level3"><div class="li"><span class="li"><a href="#cache_operation" class="toc">Cache Operation</a></span></div></li><li class="level3"><div class="li"><span class="li"><a href="#dma_and_cache_considerations" class="toc">DMA and Cache Considerations</a></span></div></li><li class="level3"><div class="li"><span class="li"><a href="#instruction_cache" class="toc">Instruction Cache</a></span></div></li><li class="level3"><div class="li"><span class="li"><a href="#cache_setup_and_control" class="toc">Cache Setup and Control</a></span></div></li></ul></li><li class="level2"><div class="li"><span class="li"><a href="#using_l2_memory" class="toc">Using L2 Memory</a></span></div></li></ul></li></ul></div></div><h1><a name="cache_management" id="cache_management">Cache Management</a></h1><div class="level1"></div><!-- SECTION [1-33] --><h2><a name="memory_overview" id="memory_overview">Memory Overview</a></h2><div class="level2"><p> Blackfin processors have a modified Harvard architecture, which implements a unified 4 GB address range that spans a combination of on-chip and off-chip memory and memory-mapped I/O resources. Some of this address space is dedicated to internal, on-chip resources. The memory resources available to the processor include: </p><ul><li class="level1"><div class="li"> L1 Static Random Access Memory (SRAM)</div></li><li class="level1"><div class="li"> L2 Static Random Access Memory (SRAM) (not all Blackfin processors have L2)</div></li><li class="level1"><div class="li"> L3 external Synchronous Dynamic Random Access Memory (SDRAM)</div></li><li class="level1"><div class="li"> A set of memory-mapped registers (MMRs)</div></li><li class="level1"><div class="li"> A boot Read-Only Memory (ROM)</div></li></ul><p> Different Blackfin processors have different amounts of L1 and L2, while the amount of L3 depends on the board design. The table below shows the internal memory layout of some Blackfin processors. 
</p><table class="inline"> <tr> <th class="rightalign"> </th><th class="centeralign"> BF536 </th><th class="centeralign"> BF537 </th><th class="centeralign"> BF531 </th><th class="centeralign"> BF532 </th><th class="centeralign"> BF533 </th><th class="centeralign"> BF561<a href="#fn__1" name="fnt__1" id="fnt__1" class="fn_top" onmouseover="fnt('1', this, event);">1)</a> </th> </tr> <tr> <td class="leftalign"> Maximum performance (<acronym title="Megahertz">MHz</acronym>) </td><td class="centeralign"> 400 </td><td class="centeralign"> 600 </td><td class="centeralign"> 400 </td><td class="centeralign"> 400 </td><td class="centeralign"> 600 </td><td class="centeralign"> 600 </td> </tr> <tr> <td class="leftalign"> Instruction SRAM/Cache <a href="#fn__2" name="fnt__2" id="fnt__2" class="fn_top" onmouseover="fnt('2', this, event);">2)</a> (bytes) </td><td class="centeralign"> 16K </td><td class="centeralign"> 16K </td><td class="centeralign"> 16K </td><td class="centeralign"> 16K </td><td class="centeralign"> 16K </td><td class="centeralign"> 16K </td> </tr> <tr> <td class="leftalign"> Instruction SRAM <a href="#fn__3" name="fnt__3" id="fnt__3" class="fn_top" onmouseover="fnt('3', this, event);">3)</a> (bytes) </td><td class="centeralign"> 48K </td><td class="centeralign"> 48K </td><td class="centeralign"> 16K </td><td class="centeralign"> 32K </td><td class="centeralign"> 64K </td><td class="centeralign"> 16K </td> </tr> <tr> <td class="leftalign"> Data SRAM/Cache <a href="#fn__4" name="fnt__4" id="fnt__4" class="fn_top" onmouseover="fnt('4', this, event);">4)</a> (bytes) </td><td class="centeralign"> 16K </td><td class="centeralign"> 32K </td><td class="centeralign"> 16K </td><td class="centeralign"> 32K </td><td class="centeralign"> 32K </td><td class="centeralign"> 32K </td> </tr> <tr> <td class="leftalign"> Data SRAM <a href="#fn__5" name="fnt__5" id="fnt__5" class="fn_top" onmouseover="fnt('5', this, event);">5)</a> (bytes) </td><td class="centeralign"> 16K </td><td class="centeralign"> 32K </td><td class="rightalign"> </td><td class="rightalign"> </td><td class="centeralign"> 32K </td><td class="centeralign"> 32K </td> </tr> <tr> <td class="leftalign"> Scratchpad <a href="#fn__6" name="fnt__6" id="fnt__6" class="fn_top" onmouseover="fnt('6', this, event);">6)</a> (bytes) </td><td class="centeralign"> 4K </td><td class="centeralign"> 4K </td><td class="centeralign"> 4K </td><td class="centeralign"> 4K </td><td class="centeralign"> 4K </td><td class="centeralign"> 4K </td> </tr> <tr> <td class="leftalign"> L2 <a href="#fn__7" name="fnt__7" id="fnt__7" class="fn_top" onmouseover="fnt('7', this, event);">7)</a> (bytes) </td><td class="rightalign"> </td><td class="rightalign"> </td><td class="rightalign"> </td><td class="rightalign"> </td><td class="rightalign"> </td><td class="centeralign"> 128K </td> </tr></table><br /><p> The L3 off-chip SDRAM has a maximum bus speed of 133 MHz, but it is cheap and you can have lots of it (128 MB on the BF533-STAMP and 64 MB on the BF537-STAMP boards). 
Since the SDRAM access time is much slower than the maximum core processor speed (600 MHz or more), if all memory accesses were limited to SDRAM timing the processor could not take advantage of its own higher clock speed, because it would be waiting for data to be read from or written to L3 (SDRAM).</p><p><a href="media/buses.png" class="media" target="_blank" title="buses.png"><img src="media/buses.png" class="mediaright" title="Blackfin memory architecture" alt="Blackfin memory architecture" width="400" /></a></p><p>To reduce this problem, the on-chip L1 runs at the full core speed, giving roughly a 4 to 6 times speed advantage over L3 (SDRAM). If the processor has L2, it runs at half the core speed, giving a 2 to 3 times speed advantage. The trade-off is that there is far less L1 and L2, and it has to be split between instructions and data.</p><p>Since the Blackfin processor has multiple memory buses and treats L1 separately from L3 (see the figure), the core is able to do up to four core memory accesses per core clock cycle (one 64-bit instruction fetch, two 32-bit data loads, and one pipelined 32-bit data store). This also allows simultaneous system DMA, cache maintenance, and core accesses. Together these increase the throughput and performance of the Blackfin processor, provided everything is configured properly.</p><p>The following sections describe different strategies for increasing performance: how the cache works, how to ensure it is configured properly, and how to place different parts of the kernel or your application into L1 instruction or data SRAM (see the short example below).</p>
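<p>As a concrete (and hedged) illustration of the last point: recent Blackfin GCC ports document l1_text and l1_data attributes that place a function or a variable into the .l1.text / .l1.data output sections. The sketch below is not taken from this page; the function and buffer names are invented, and you should confirm that your particular toolchain and linker script support these attributes before relying on them.</p><pre class="code c">/*
 * Sketch only: put a hot inner loop and its coefficient table into L1 SRAM
 * using the Blackfin GCC attributes. "fir_coeffs" and "dot_product" are
 * made-up names for this example.
 */
static short fir_coeffs[128] __attribute__((l1_data));      /* L1 data SRAM */

static __attribute__((l1_text)) int dot_product(const short *x, int n)
{
    int i, sum = 0;

    /* The code is fetched from L1 instruction SRAM and the coefficient table
     * sits in L1 data SRAM, so this loop never waits on L3 (SDRAM) timing
     * for instructions or coefficients. */
    for (i = 0; i < n; i++)
        sum += x[i] * fir_coeffs[i];
    return sum;
}</pre><p>The kernel sources use their own section annotations for the same purpose; the mechanism, a dedicated .l1 section that the linker script maps to on-chip SRAM, is the same.</p>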
</div><!-- SECTION [34-3245] --><h2><a name="cache_overview" id="cache_overview">Cache Overview</a></h2><div class="level2"><p> The use of cache can help speed up SDRAM accesses by storing a 32-byte cache line at a time in the much faster on-chip SRAM. This SRAM acts as a sort of mirror or copy of the slower external SDRAM. When you read a byte of SDRAM that has been tagged as cacheable, you actually get a 32-byte cache line (aligned to a 32-byte boundary) read into cache at the same time. So if the first byte is on a 32-byte boundary, it takes a short while to fetch that byte, but the remaining 31 bytes then behave as if they were stored in the on-chip SRAM rather than the slower external SDRAM.</p><p>So the first byte costs a bit of time, but later bytes have a much faster access time.</p><p>This works well when you are simply reading data, but a read/modify/write cycle introduces some additional complexity.</p><p>If you write to cached memory, the data in the cache is updated, so subsequent reads see the new value, but the actual SDRAM may not yet contain it. Similarly, if you read from cached memory, the read is satisfied from cache, so if something like a DMA operation has updated the SDRAM, the read will not reflect the proper value.</p><p>This may not be a problem for CPU-based memory accesses, but as soon as you start to use DMA to access this data, problems start to arise (see DMA and Cache Considerations below).</p><p>The possibility of a memory location having two values at the same time is called a coherency problem.</p><p>If the new value is in cache, some process must be used to transfer the new value to the actual SDRAM.</p><p>To help reduce these problems, the cache can be configured in three different modes, selected by the DCPLB descriptors (a sketch of the corresponding descriptor bits follows at the end of this section):</p><ul><li class="level1"><div class="li"> Write-through with cache line allocation only on reads</div></li><li class="level1"><div class="li"> Write-through with cache line allocation on both reads and writes</div></li><li class="level1"><div class="li"> Write-back, which allocates cache lines on both reads and writes</div></li></ul><p> For each store operation, write-through caches initiate a write to external memory immediately upon the write to cache. If the cache line is replaced or explicitly flushed by software, the contents of the cache line are invalidated rather than written back to external memory. Writes are a bit slower, but reads are as fast as they can be.</p><p>A write-back cache does not write to external memory until the line is replaced by a load operation that needs the line, so you have no control over when the external memory update will happen.</p><p>Whether the cache is in write-back or write-through mode, software can flush a cache line, which causes that area of cache to be written back to external memory.</p>
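<p>To make the three modes more concrete, the sketch below expresses them as DCPLB data-descriptor properties. The bit names follow the Blackfin Hardware Reference; the header path and the exact CPLB_* macro spellings are assumptions about the kernel sources, and the valid/lock/permission and page-size bits that a real descriptor also needs are left out, so treat this strictly as a sketch.</p><pre class="code c">/*
 * Sketch only: the three data-cache modes as DCPLB descriptor bit
 * combinations. Bit names follow the Blackfin Hardware Reference Manual;
 * the header location and macro spellings are assumptions, and the
 * valid/lock/permission and page-size bits are omitted.
 */
#include <asm/cplb.h>            /* assumed home of the CPLB_* definitions */

/* Write-through, allocate a cache line on reads only */
#define DCACHE_MODE_WT_RD        (CPLB_L1_CHBL | CPLB_WT)

/* Write-through, allocate a cache line on both reads and writes */
#define DCACHE_MODE_WT_RDWR      (CPLB_L1_CHBL | CPLB_WT | CPLB_L1_AOW)

/* Write-back: allocate on reads and writes; SDRAM is only updated when a
 * dirty line is evicted or explicitly flushed */
#define DCACHE_MODE_WB           (CPLB_L1_CHBL)</pre>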
</div><!-- SECTION [3246-5847] --><h3><a name="cache_operation" id="cache_operation">Cache Operation</a></h3><div class="level3"><p> This is a very complex topic, but here is a simplified version of how it works.</p><p>When an SDRAM address is marked as cacheable, the hardware does a special search of the cache memory to see whether this address is already in cache. This is called finding a cache tag. Any given data memory address has a choice of two different cache memory locations (called ways). If a match is found, the data is in cache and no costly SDRAM access is needed. The instruction cache can have up to four ways.</p></div><h4><a name="dirty_data" id="dirty_data">Dirty Data</a></h4><div class="level4"><p> If there is no tag match, the system looks at one of the two ways, or cache blocks, available for the given address. Not every possible SDRAM address gets its own cache location; only a few address bits are used to identify the possible cache locations for any given address. If none of those is free, an existing cache slot is reused for the new SDRAM access. The selected cache slot is called a <strong>victim</strong>. If this slot contains modified data that has not yet been written back to SDRAM, the data is called <strong>dirty</strong> and must be written back to SDRAM before the <strong>victim</strong> slot is made available for the new cache line. If the cache is marked as <strong>write-through</strong>, the SDRAM will already have been updated, so the <strong>dirty</strong> bit can be ignored.</p><p>Once the <strong>victim</strong> cache slot is freed, the new data can be loaded into it. The new data is actually loaded into a temporary buffer before being transferred to the real cache memory. As soon as the contents of the requested address arrive in the temporary buffer, the read operation completes, even before the rest of the cache line has been read from SDRAM and before the data is moved from the temporary buffer to the true cache location. You can see why hardware designers have some real challenges here.</p><p>The cache ways can be <strong>locked</strong> to prevent data from being swapped out, so that a small section of SDRAM can be moved into cache and kept there while it is being used. The cost of doing this is increased pressure on the remaining ways for other data operations.</p></div><h4><a name="making_it_all_work" id="making_it_all_work">Making it all work</a></h4><div class="level4"><p> The secret to using this well is careful code design with very tight data operation loops. Once you reach a given data item, it is better to do all the operations needed on that item before moving on to the next one.</p><p>Loops like this</p><pre class="code c">char a[4096], b[4096], c[4096];
int i;

for (i = 0; i < 4096; i++) {
    a[i] = b[i];        /* pass 1: pulls a[] and b[] through the cache */
}
for (i = 0; i < 4096; i++) {
    a[i] += c[i];       /* pass 2: pulls a[] through the cache again */
}</pre><p>would not be as efficient as</p><pre class="code c">for (i = 0; i < 4096; i++) {
    a[i] = b[i] + c[i]; /* one pass: each cache line of a, b and c is fetched once */
}</pre><p>When a[0], b[0] and c[0] are read into cache, you get a[1] to a[31], b[1] to b[31] and c[1] to c[31] at the same time.</p><p> Another secret is to keep data structures on a cache line boundary (32 bytes). If a data structure is smaller than 32 bytes and starts on such a boundary, then when you fetch the first byte the whole structure is contained in a single cache line; keeping it aligned stops the structure from needing two cache lines. Keep all commonly used data items in a structure close together to minimize the number of cache lines needed to use the structure. Some architectures define </p><pre class="code">__cacheline_aligned </pre><p> to help with this.</p><p>In general, keep your most frequently used code and data items close together in your applications.</p></div><!-- SECTION [5848-9315] --><h3><a name="dma_and_cache_considerations" id="dma_and_cache_considerations">DMA and Cache Considerations</a></h3><div class="level3"><p>All of this works really well as long as only the CPU is looking at the data. The CPU has the cache hardware at its disposal, so it can take best advantage of the cache system. The Blackfin, however, has many peripherals (SPORT, PPI, ...) that rely on DMA to move high-throughput I/O data.</p><p>A DMA process knows nothing about the data caches. It reads and writes SDRAM locations, or the selected peripherals, directly.</p><p>This means that a memory location can be sitting in cache while the real SDRAM location has been modified by a DMA transfer and now holds different data from the cached copy. You can read that location as many times as you like, but you will not normally see the updated SDRAM contents written by the DMA operation.</p>
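<p>In the Linux kernel, the usual way to keep the CPU's cached view and a DMA buffer consistent is the generic DMA-mapping API, which performs the required flush or invalidate for you on architectures with non-coherent caches such as the Blackfin. The fragment below is only a sketch: the device pointer and buffer belong to a hypothetical driver, error handling is omitted, and exactly which cache operation runs underneath depends on the architecture code. The low-level invalidate and flush operations it relies on are described next.</p><pre class="code c">/*
 * Sketch only: a hypothetical driver receiving a block of data into "buf"
 * by DMA. The streaming DMA-mapping calls tell the kernel who currently
 * owns the buffer, so the architecture code can flush or invalidate the
 * cached copy at the right moments.
 */
#include <linux/dma-mapping.h>

static void receive_block(struct device *dev, void *buf, size_t len)
{
    dma_addr_t handle;

    /* Hand the buffer to the device. For DMA_FROM_DEVICE the cached copy is
     * invalidated so stale cache lines cannot mask what the device writes. */
    handle = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

    /* ... program the DMA channel with 'handle' and wait for completion ... */

    /* Give the buffer back to the CPU before reading the received data. */
    dma_unmap_single(dev, handle, len, DMA_FROM_DEVICE);
}</pre>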
</div><h4><a name="invalidate" id="invalidate">INVALIDATE</a></h4><div class="level4"><p> The <strong>invalidate</strong> operation causes the cache entry to be refilled from the SDRAM memory area the next time an attempt is made to read that address. It forces a re-read of the SDRAM even though the contents of that SDRAM address may already be held in cache memory.</p><p>The whole cache may be invalidated, or just an address range.</p><p>The <strong>invalidate</strong> function can be used on a range of addresses even if those addresses are not actually in the cache.</p><p> Here are the results from a simple test:</p><pre class="code">DMA Test Results
insmod /lib/modules/2.6.12.1/kernel/drivers/char/test/tdma.ko
buf 1 before copy
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15
DMA_D0 dma irq 28 status 1
DMA status 0x0
buf 2 after dma copy where is the data ??
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
buf 2 after invalidate OK an invalidate was needed
00 01 02 03 04 05 06 07 08 09 10 11 12 13 14 15
root:~> cat /proc/interrupts
 6: 14033 BFIN Timer Tick
14: 0 rtc
21: 0 BFIN_UART_RX
22: 2281 BFIN_UART_TX
28: 1 MEM DMA0
40: 45 eth0
Err: 0</pre></div><h4><a name="flushing_cache" id="flushing_cache">Flushing Cache</a></h4><div class="level4"><p> With the kernel configuration option