📄 making_the_blackfin_perform.html

📁 User documentation for the Analog Devices (ADI) Blackfin series.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head>  <title></title>  <link rel="stylesheet" media="screen" type="text/css" href="./style.css" />  <link rel="stylesheet" media="screen" type="text/css" href="./design.css" />  <link rel="stylesheet" media="print" type="text/css" href="./print.css" />  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body><a href="start.html">start</a><br/><div class="toc"><div class="tocheader toctoggle" id="toc__header">Table of Contents</div><div id="toc__inside"><ul class="toc"><li class="clear"><ul class="toc"><li class="level2"><div class="li"><span class="li"><a href="#making_the_blackfin_perform" class="toc">Making the Blackfin Perform</a></span></div></li><li class="level2"><div class="li"><span class="li"><a href="#measuring_performance" class="toc">Measuring Performance</a></span></div></li><li class="level2"><div class="li"><span class="li"><a href="#example_-_dot_product" class="toc">Example - Dot Product</a></span></div></li><li class="level2"><div class="li"><span class="li"><a href="#example_-_fft" class="toc">Example - FFT</a></span></div></li><li class="level2"><div class="li"><span class="li"><a href="#conclusion" class="toc">Conclusion</a></span></div></li><li class="level2"><div class="li"><span class="li"><a href="#acknowledgements" class="toc">Acknowledgements</a></span></div></li></ul></li></ul></div></div><h2><a name="making_the_blackfin_perform" id="making_the_blackfin_perform">Making the Blackfin Perform</a></h2><div class="level2"><p> Optimizing source code to run on a specific platform can be challenging. While code creation tools are improving, they have not kept up with the rapidly increasing functionality and complexity of hardware. The more complex the hardware architecture is, the harder it is to program in assembly language.  
This creates the need for abstraction via a robust C compiler or operating system. However, since not all compilers handle source code in the same manner, it can take many iterations of rewriting your C source to achieve efficient output.</p><p>While trial and test is a valuable method of optimizing source code, other methods are available. Two specific methods are letting the compiler use built-in functions and calling C-callable, core-specific libraries. Combining the built-in functions with the core-specific libraries improves code performance without requiring you to get involved in the complexity of the architecture.</p><p>The Analog Devices Blackfin processor is supported by open source tools including a gcc compiler and the uClinux kernel. Recent efforts to add built-in functions to the compiler and to port several signal processing libraries have greatly improved code efficiency.</p><p>Two code examples show how to accomplish this. First comes a <a href="#measuring_performance" title=":making_the_blackfin_perform.txt &crarr;" class="wikilink1">demonstration</a> of how easy it is to measure performance. Then a short <a href="#example_-_dot_product" title=":making_the_blackfin_perform.txt &crarr;" class="wikilink1">Dot-Product</a> example is reviewed, going over the trial and test methodology that many software developers still use. Finally, a short example of using a C-callable library is given, using the <a href="#example_-_fft" title=":making_the_blackfin_perform.txt &crarr;" class="wikilink1">Fast Fourier Transform</a> to demonstrate the relative ease of including someone else&rsquo;s library functions in your code. 
These few examples demonstrate the relative simplicity of finding C-callable function libraries and using them in your application.</p></div><!-- SECTION [1-1971] --><h2><a name="measuring_performance" id="measuring_performance">Measuring Performance</a></h2><div class="level2"><p>Blackfin, like many modern processor architectures, has a complex <a href="cache_management.html" class="wikilink1" title="cache_management.html">memory</a> architecture which uses L1 (cache and non-cache), L2, and L3. How memory is configured, and where code is running from, will impact execution speed more than any other parameter in the system. In a typical system, cache is turned on, and the code in question is run a few times first so that it is pre-loaded into cache and the time spent loading the cache is not measured. The basic method is:</p><pre class="code c">#include &lt;stdio.h&gt;
#include &lt;time.h&gt;
&nbsp;
void foo<span class="br0">&#40;</span>void<span class="br0">&#41;</span> <span class="br0">&#123;</span>
  <span class="coMULTI">/* code to be measured */</span>
<span class="br0">&#125;</span>
&nbsp;
int main<span class="br0">&#40;</span>void<span class="br0">&#41;</span> <span class="br0">&#123;</span>
  clock_t start, stop, interval;
&nbsp;
  foo<span class="br0">&#40;</span><span class="br0">&#41;</span>;                   <span class="coMULTI">/* run once - load into cache */</span>
  start = clock<span class="br0">&#40;</span><span class="br0">&#41;</span>;         <span class="coMULTI">/* capture clock              */</span>
  foo<span class="br0">&#40;</span><span class="br0">&#41;</span>;                   <span class="coMULTI">/* take measurement           */</span>
  stop = clock<span class="br0">&#40;</span><span class="br0">&#41;</span>;          <span class="coMULTI">/* capture clock              */</span>
  interval = stop - start; <span class="coMULTI">/* find the time              */</span>
  printf<span class="br0">&#40;</span>&quot;%ld\n&quot;, <span class="br0">&#40;</span>long<span class="br0">&#41;</span>interval<span class="br0">&#41;</span>; <span class="coMULTI">/* print the results (clock ticks) */</span>
  return 0;
<span class="br0">&#125;</span></pre><p>This is done with:</p><p>and is used 
like:</p></div><!-- SECTION [1972-3121] --><h2><a name="example_-_dot_product" id="example_-_dot_product">Example - Dot Product</a></h2><div class="level2"><p> A <a href="http://en.wikipedia.org/wiki/Dot_Product" class="interwiki iw_wp" title="http://en.wikipedia.org/wiki/Dot_Product">dot product</a> is a very common mathematical operation in signal processing algorithms; it takes two vectors and returns a scalar quantity. For example:</p><p>If <em>a</em> = [a<sub>1</sub>, a<sub>2</sub>, &hellip;, a<sub>n</sub>] and <em>b</em> = [b<sub>1</sub>, b<sub>2</sub>, &hellip;, b<sub>n</sub>]<br/></p><p>then the dot product would be:</p><p><em>a</em> &middot; <em>b</em> = a<sub>1</sub>b<sub>1</sub> + a<sub>2</sub>b<sub>2</sub> + &hellip; + a<sub>n</sub>b<sub>n</sub><br/></p><p>This is represented in generic, portable C code as: </p><p>The performance of this is close to the following (in each case, a lower number of cycles is better): </p><p> Notice that the first time the application is run, it takes more cycles than normal, since the function and data are not yet in the instruction or data cache. Once things are in cache, the data and instructions are reached at CCLK rates.</p><p>Many signal processing applications rely on a native multiply-accumulate (MAC) instruction, but simply using the assembly instructions to perform the dot product may not approach the theoretical limit of the processor (which in the Blackfin case is two MACs per cycle, that is, half a cycle per MAC).</p><p>Changing things to assembly provides:</p><p>This provides an output of: </p><p>This is more than an 8x improvement in performance! However, this code is now completely optimized for the Blackfin and cannot be run on any other architecture, and it still takes over 3x the theoretical minimum number of cycles. Since this function is using both the instruction and data cache, some optimizations can still be done.</p><p>The next step is to add the cycle measurement into the same function that actually does the dot product:</p><p>This version avoids the overhead of a function call when measuring the cycles of the multiply-accumulate loop that does the dot product. This has an output of:</p><p>While this gets closer to the theoretical maximum, it is still 2x what it should be. 
This is due to the memory structure of the processor. The main part of the hardware loop is a parallel instruction <code>A1 += R0.H * R1.H, A0 += R0.L * R1.L || R0 = [I0++] || R1 = [I1++];</code> in which the two loads use <code>I0</code> and <code>I1</code>. If both point to the same memory bank (see <a href="cache_management.html#memory_overview" class="wikilink1" title="cache_management.html">memory overview</a>), the access is stalled for one instruction, because the two load data buses come from L1 Bank A and L1 Bank B. The way that this code is written is <strong>not</strong> the way it should be done; however, at the time of writing there are no L1 malloc functions for obtaining blocks of internal memory.</p><p>This can be shown with:</p><p>which ensures that the two vectors are in different data banks. This provides an output close to the theoretical maximum.</p><p>A 100-point dot product should take 50 clock cycles on a Blackfin. The code runs 4 test cases, and manages to reduce the execution time from 3838 cycles to 53 cycles through various tricks. </p><p>Each test runs 10 times; in several of the tests you can see the number of cycles dropping as the instruction and data caches are loaded over successive runs. </p><p>This example showed that even if you hand-write functions in assembly, it is easy to slow your application by a factor of 2x to 4x by not understanding the chip&rsquo;s implementation of the memory structure, or simply by calling a function. With a little optimisation and some hand-coded assembler, it is possible to get full performance from the chip. </p><p>Writing hand-optimised assembler normally sounds terrible, and most people do not enjoy it much. However, for the few &ldquo;inner loop&rdquo;