\item When a function needs certain data which is being communicated
  at the moment, it can either wait for the end of the communication
  phase or it can immediately access the part of the data which is
  stored in one of the transfer buffers. To allow such sophisticated
  programming, the transfer buffer must remember which data is stored
  in which of the two buffers. It does this by storing the pointer of
  where the data is being sent to by the communication network.
\end{itemize}

\noindent The transfer buffer structure should not be accessed
directly by user programs. Three functions, defined in {\tt
basic/neurolib.h}, provide an interface to it.

\begin{fctitem}{tbuf\_sync(pcons)}
\index{tbuf_sync@{\tt tbuf\_sync()}}
This function must be called before accessing or modifying any
parallel data set. {\tt pcons} points to the beginning of the data
set. {\tt tbuf\_sync()} does two things:
  \begin{enumerate}
  \item It waits until the data is really available, \ie if the data
    is being communicated at the moment it waits for the end of the
    communication by calling the {\tt Wait\_data()} function of the
    \MUSIC\ library.
  \item It informs the transfer buffer that the data may now be
    modified and that any synchronization between one of the transfer
    buffers and the data no longer exists.
  \end{enumerate}
If the function is called with {\tt NULL} as argument, {\tt
tbuf\_sync(NULL)}, it waits for the end of any currently ongoing
communication. In this case the function is equivalent to {\tt
Wait\_data(ALL\_DATA)} (see \MUSIC\ Tutorial).
\end{fctitem}

\begin{fctitem}{tbuf\_getfree(pcons)}
\index{tbuf_getfree@{\tt tbuf\_getfree()}}
This function returns a pointer to one of the transfer buffers which
is currently free to use. {\tt pcons} points to the beginning of the
final destination of the data after communication, or to {\tt NULL}
(see the {\tt tbuf\_getdata()} function).
It is used to keep the information about data synchronization in the
transfer buffer up to date. Each {\tt tbuf\_getfree()} call must be
followed by the communication of the data in the transfer buffer (see
{\tt Init\_comm()} and {\tt Data\_ready()} in the \MUSIC\ tutorial)
before any of the three {\tt tbuf\_()} functions is called again. This
is because {\tt tbuf\_getfree()} modifies the transfer buffer as if
the communication had already taken place.
\end{fctitem}

\begin{fctitem}{tbuf\_getdata(pcons)}
\index{tbuf_getdata@{\tt tbuf\_getdata()}}
This function is meant to be used only for writing sophisticated
high-performance code. It returns a pointer to a transfer buffer if
one of them stores temporary data which is currently being
communicated to the final destination {\tt pcons}; otherwise it
returns the {\tt NULL} pointer.

Because the function which uses the data does not know which part of
it is stored in the transfer buffer, it assumes that it is the
standard part which was assigned to the processing element by the {\tt
Complete\_prod\_window()} function (see \MUSIC\ tutorial). If a
function uses a nonstandard data distribution in conjunction with the
transfer buffer, it must call the {\tt tbuf\_getfree()} function with
the {\tt NULL} pointer ({\tt tbuf\_getfree(NULL)}). This way the
transfer buffer does not consider the data as being synchronized with
one of its buffers. The function {\tt tbuf\_sync()} still works
correctly, because if it finds a link to {\tt NULL} in the currently
used transfer buffer it waits for the end of any communication.
\end{fctitem}

\noindent It is very important that every function that uses the
transfer buffer or modifies a neuro-object obeys all these rules, even
if it does not use all the features of the transfer buffer. Otherwise
it may cause unpredictable results in other neuro-functions which do
use those features.
See the source files of the {\em example\/} package for detailed
examples.
\index{transfer buffer|)}

\section{Getting the Performance}
\index{optimizing code|(}
%================================
The computing elements of the \MUSIC\ system are Motorola digital
signal processors ({\small DSP}s) 96002. \index{DSP 96002@{\small
DSP} 96002} They can carry out two memory transfers, a multiply, an
add and a subtract operation in a single instruction (the instruction
clock rate is 20\thinspace{}MHz). This results in a peak performance
of 60~Mflops. Getting there requires assembly programming,\footnote{To
write assembly code we recommend the {\em DSP96002 IEEE Floating-Point
Dual-Port Processor User's Manual\/} from Motorola.} but a C
programmer also has much influence on the performance of the code. A
factor of 10 between the speed of a ``normal'' and a \MUSIC-adequate C
program is typical! This section describes some important issues of
how to write optimal parallel C code for the \MUSIC\ system. And
remember, if it isn't fast it's not worth the effort!

\subsection{Amdahl's Law}
\index{Amdahl's law|(}
%------------------------
This is the most important point. A simple way to look at programs is
to divide them into an overhead part and loops which carry out the
actual work. In parallel programming we call the overhead part the
{\em sequential\/} \index{sequential part} and the loops the {\em
parallel part\/} \index{parallel part} of the program. Sequential,
because the overhead of a program typically cannot be parallelized,
which means that it does not run faster if more processing elements
are added. The speed of the parallel part, on the other hand,
increases with the number of processing elements.

On a sequential computer the parallel part of a program normally
clearly dominates the computing time. We therefore learned that it is
important to optimize this part, whereas the overhead part is not
important. On parallel computers this rule is almost reversed.
The parallel part of the program can be improved simply by adding more
processors to the system, but the sequential part remains constant in
execution time and represents a fundamental limit in
performance. Suppose the sequential part of a program is only
2\thinspace\%. Now we take 50 processing elements to run this
program. The 98\thinspace\% of the parallel part will be reduced to
about 2\thinspace\% of the initial computing time. This means that now
half of the total computing time is spent in the sequential (overhead)
part! The speedup, compared to one processor, is 25 and the maximum
possible speedup (for an infinite number of processing elements) is
limited to 50. Optimizing the parallel part does not help in this
situation, but reducing the sequential part, let's say to
1\thinspace\%, will increase the maximum speedup to a factor of
100. This is known as {\em Amdahl's law}.\footnote{Gene
M. Amdahl. Validity of the Single Processor Approach to Achieving
Large Scale Computing Capabilities. AFIPS Spring Joint Computer
Conference, pp.~483--485, April 1967, Atlantic City, USA.}

The conclusion is that,
\begin{quote}
{\em on parallel systems, optimizing the overhead part is at least as
important as optimizing the loops.}
\end{quote}
\noindent Sequential parts are, for example, the communication and the
interpretation of Basic programs. Because communication and
computation can overlap on the \MUSIC\ system, these and other
overhead parts can be carried out at the same time, which helps
cutting down the overall sequential part. This is the reason for the
introduction of the transfer buffer.
\index{Amdahl's law|)}

\subsection{Memory Types}
%------------------------
{\small DSP}s typically have separate memory spaces for programs and
data (for speed reasons). The processing elements of the \MUSIC\
system additionally have subdivided memory types, illustrated in
Table~\ref{tab_memtypes}.
The memory sizes, indicated for each type, are for the second hardware
version of the \MUSIC\ boards and may vary from system to system.

\begin{table}[htb]
\hfil
\begin{tabular}{llr}
\hline
Class           & Type          & Size\\
\hline\hline
                & internal      & 0.5 Kword\\
Data memory     & static        & 128 Kword\\
                & producer      & 256 Kword\\
                & consumer      & 256 Kword\\
\hline
Program memory  & internal      & 1 Kword\\
                & static        & 128 Kword\\
\hline
\end{tabular}
\caption{Memory types of a MUSIC processing element.}
\label{tab_memtypes}
\end{table}

Producer and consumer memory are both dynamic ({\small
VRAM}). \index{dynamic memory} \index{DRAM@{\small DRAM}} They have
an access time of 3 clock cycles, or 7 clock cycles if a page fault
occurs. Data from the producer part \index{producer memory} can be
sent directly to the communication network, and the consumer part
\index{consumer memory} can receive data directly from the
communication network (see {\tt init\_comm()} in the \MUSIC\
Tutorial).

The access of static ({\small SRAM}) \index{static memory}
\index{SRAM@{\small SRAM}} or internal memory \index{internal memory}
is faster (2 clock cycles), but communication to or from such memory
is slow. The internal memory is allocated by NeuroBasic at the start
of the program. It is intended to be used locally by highly optimized
neuro-functions like the assembler versions of the back-propagation
functions ({\tt aprop()} and {\tt abackprop()}). The global variable
{\tt piram} points to the beginning and the macro {\tt IRAM\_SIZE}
contains the size of the internal memory.
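\noindent As an illustration of this local use, a neuro-function might
stage a small, heavily reused working set through {\tt piram} before
its inner loop. In the sketch below {\tt piram} and {\tt IRAM\_SIZE}
are stubbed with ordinary C objects so that the code can run on a
host; on \MUSIC\ they are provided by NeuroBasic as described above,
and {\tt sum\_staged()} is a hypothetical example function.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for the NeuroBasic globals: on MUSIC, piram points to the
 * 0.5 Kword internal data memory and IRAM_SIZE is its size in words. */
#define IRAM_SIZE 512
static float  iram_stub[IRAM_SIZE];
static float *piram = iram_stub;

/* Copy a time-critical working set into internal memory once (2-cycle
 * access) instead of touching dynamic memory (3-7 cycles) repeatedly. */
static float sum_staged(const float *src, int n)
{
    float s = 0.0f;
    int i;

    assert(n <= IRAM_SIZE);              /* the working set must fit   */
    memcpy(piram, src, n * sizeof *src); /* stage into fast memory     */
    for (i = 0; i < n; i++)              /* hot loop runs on fast RAM  */
        s += piram[i];
    return s;
}
```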
The different memory types are allocated with the {\tt dmalloc()}
function (see \MUSIC\ Tutorial). The program memory is described in
the following section.

\subsection{Internal Program Memory}
\index{program memory} \index{internal program memory}
%-----------------------------------------------------
The internal program memory itself is not faster than the other
program memory, but an instruction fetch from it does not occupy the
processor port. Code in the internal memory therefore still executes
faster. The {\em order of linking\/} \index{link order} determines
which part of the program is placed in the internal program memory.

It is a good idea to put time-critical functions into individual small
source modules and to place these first in the module list of the
local makefile. Furthermore, the corresponding package has to be put
in the first place of the package list in the file {\tt
macros.mk}. Otherwise the code of other packages will occupy the
internal memory. Note that 512 words is not much: it is important to
put only the really time-critical functions into the first places of
the link order.

\subsection{Parallel Operations}
\label{sec_parop}
\index{parallel operations}
%-------------------------------
The \MUSIC\ C compiler (g96k) \index{g96k compiler} is not capable of
using all parallel features of the {\small DSP}96002. However, it can
produce code which carries out one arithmetical operation and one data
move in parallel, if they follow each other in the C source code and
are independent of each other.

For illustration we look at the computation of the dot (inner) product
of two vectors. Suppose the variables {\tt pa} and {\tt pb} point to
the two vectors and {\tt len} is their length.
The ``normal'' C code to compute the dot product would be
\begin{verbatim}
  sum = 0.0;
  for (i = 0; i < len; i++)
    sum += *pa++ * *pb++;
\end{verbatim}
\noindent In the optimizing mode the compiler produces the following
assembly code for the loop:
\begin{verbatim}
        do      d0.l,L7         ; hardware do-loop
        move    y:(r1)+,d1.s    ; get element from vector a
        move    y:(r2)+,d0.s    ; get element from vector b
        fmpy.s  d1,d0           ; multiply the two elements
        fadd.s  d0,d2           ; add the result to sum
  L7                            ; end of the loop
\end{verbatim}
\noindent The compiler can't do any better with the given C source
code. The hardware {\tt do}-loop is used and critical variables are
automatically placed in processor registers. However, four
instructions are needed to load the two input values, to multiply and
to add them.

The processor, on the other hand, is capable of moving new input
values into registers while carrying out the multiply and add
operations on the current values. To make this possible, we have to
formulate the C source code differently. The current elements have to
be multiplied and accumulated first, before the new elements are
loaded.
\begin{verbatim}
  sum = 0.0;
  a = *pa++;
  b = *pb++;
  for (i = 0; i < len; i++)
  {
    sum += a * b;
    a = *pa++;
    b = *pb++;
  }
\end{verbatim}
\noindent (Note that this version reads one element beyond the end of
each vector in the last iteration, which must be harmless for the
given memory layout.) This allows the compiler to produce code which
carries out an arithmetical operation on the current values and moves
a new value into a processor register in the same instruction. The
result is the following two-instruction loop, which is twice as fast
as the first version.
\begin{verbatim}
        do      d0.l,L7                   ; hardware do-loop
        fmpy.s  d0,d1,d0  y:(r2)+,d1.s    ; multiply and move
        fadd.s  d0,d2     y:(r1)+,d0.s    ; add and move
  L7                                      ; end of the loop
\end{verbatim}
\noindent To see the assembler code produced by the g96k compiler use
the {\tt -S} option.
The compiler will then produce an assembler source file which contains
the original C source lines as comments.

\subsection{Global and Local Variables}
\index{global variables} \index{local variables} \index{variables}
%-----------------------------------------------------------------
The g96k compiler \index{g96k compiler} does not put global variables
into registers. Therefore, make sure to use only local variables for
time-critical arithmetical operations and loop counters. Parameters
passed to a function are considered local.

\subsection{Debugging the Performance}
\index{debugging performance}
%-------------------------------------
Once a program runs correctly on the \MUSIC\ computer it should be
debugged for errors in performance. A performance error is something
which prevents the system from running as fast as it should, for
example, unnecessary address calculations inside a loop. This has to
be taken seriously because parallel systems have many more
possibilities for losses than a sequential computer. The following are
some hints for finding {\em speed bugs}. \index{speed bugs}
\begin{itemize}
\item Determine the execution time of individual functions, for
  instance, by calling the same function many times from the Basic
  interpreter. The {\small CPU} time \index{CPU@{\small{}CPU}~time}
  can be measured with the function {\tt clock()}.
  \index{clock@{\tt clock()}}
\item Estimate the performance by counting the number of operations in
  a function. Compare the estimate with the measured performance.
\item Find out how much time is lost in the sequential overhead and in
  the communication. This can be done by executing the same function
  with and without communication and with and without the loops.
  Observing the \index{LED bars@{\small LED} bars} {\small LED}
  bars also gives a good hint of the balance between computation and
  communication and of the distribution of the load.
\end{itemize}
\noindent After collecting all this information, determine the reasons
for the three or four most significant losses and try to eliminate
them.
\index{optimizing code|)}
