\documentstyle[11pt]{article}
\renewcommand{\baselinestretch}{1.25}
\oddsidemargin 0.5cm
\evensidemargin 0.5cm
\topmargin .5in
\headheight 0pt
\headsep 0pt
\textwidth 5.7in
\textheight 8.3in
\title{Converting EISPACK to Run Efficiently on a Vector Processor}
\author{Alan Kaylor Cline \\ James Meyering \\[.2in]
        Pleasant Valley Software, \\ 8603 Altus Cove, \\ Austin, Texas 78759 \\[.2in]}
\date{\today}
\begin{document}
\maketitle

\section{Introduction}

This report documents the production, performance evaluation, and testing
of a version of the single precision FORTRAN software package (known as
EISPACK) that takes advantage of the high-speed vector registers and cache
of the IBM 3090-VF.
In addition, the results of testing and timing a carefully crafted and tuned
double precision version of the same package help put the single precision
results into perspective.
We explain how design objectives shaped the single precision end-product.
Timing data are then presented and we describe how various low-level
modifications to the FORTRAN source affected the performance of individual
routines.
Our testing methodology is examined and the results of our tests are
presented and analyzed.

To facilitate the discussion of the three versions of EISPACK mentioned above,
we assign them names.
NATS will refer to the original code (dated 1983) that was obtained from the
netlib archive.
Both single and double precision versions are referenced; if the precision of
the version being discussed is not stated explicitly, it should be apparent
from the context.
The carefully crafted double precision code will be called PASC.
It was written at the IBM Palo Alto Scientific Center by Augustin Dubrulle
and is based on the collection of ALGOL procedures from which EISPACK evolved.
The third version, the single precision code that is the major topic of this
report, will be referred to as PVS.

\section{Design Considerations}

Several factors influenced the ultimate form of our version of single
precision EISPACK (PVS).
One of the most important of these was the requirement that each subroutine
and function retain the calling sequence and functionality of the respective
module in the original package.
A related objective was to keep the code of PVS as close as possible to that
of NATS, modifying NATS just enough to obtain the desired degree of
vectorization.
The intent was to produce a version of EISPACK that would take advantage of
the environment (hardware, compiler, operating system) peculiar to the
3090-VF, and yet would remain portable.
However, a single non-portable construct was introduced into the PVS code:
many subroutines call a routine from the system library, {\tt XUFLOW}, with
a flag indicating that floating point underflows are silently to be set to
zero.
The default behavior, which involves the generation of a system interrupt and
much diagnostic output, can significantly decrease performance.
The effort to maintain portability should be kept in mind when comparing the
times of our routines to those that were carefully tuned for the 3090.

A key assumption was that since the package would use single precision, it
would be used almost exclusively for small problems ($N \leq 150$).
Accordingly, our aim was to produce software that would be fast for suitably
small problems, recognizing that performance might degrade as $N$ increased
beyond the $N = 150$ threshold.

\section{High Performance Hardware of the IBM 3090-VF}

In the following discussion, it is important that the reader bear in mind
some key features of the 3090-VF and its vector compiler.
The 3090-VF has a fairly standard virtual memory hierarchy: secondary
storage, primary storage, cache, and vector and scalar registers.
The cache and vector registers deserve special attention, since the
performance of any software run on this system depends most heavily on the
efficiency with which these resources are used.

The cache can be viewed simply as a very fast extension of primary memory,
but for our purposes, we must elaborate.
We note that the cache has a capacity of $64$K bytes and is divided into
$4$K lines of $16$ bytes each.
A cache line is analogous to a page in primary storage.
Whenever a memory access is performed and the desired element is not found
in the cache (called a cache miss), an entire line (probably containing one
or more additional elements) is copied from primary storage into the cache.

Adding the optional VF hardware to a 3090 gives it a $128$-element, off-chip
vector processing unit and several general purpose vector registers.
The vector processor pipelines arithmetic operations such as elementary
vector transformations and inner products, and can achieve much higher
throughput than a conventional, scalar processor.
The efficient use of the vector processor depends heavily upon the compiler
and upon effective use of the cache.
For example, the vectorization of an isolated inner product involving a
large stride will not normally be efficient, since memory accesses will be
widely separated.
A greater familiarity with the equipment and the computing environment can
be obtained in [2], [3], [4], [5], and [6].

\section{Classes of Program Transformations}

In this section, several classes of program transformations are presented.
With each is an example or two illustrating how it is typically applied.
Most of the transformations enable the 3090's vectorizing compiler to produce
more efficient code.
Using these techniques, the reader should be able to recognize how nearly
all of PVS was obtained from the original code, NATS.
Notably exceptional is the new eigenvalue convergence test used in both
{\tt bisect} and {\tt tsturm}.
The test used in NATS is architecture- and compiler-dependent when
intermediate results may be stored in extended precision registers.
The new test is more portable and generally yields results at least as
accurate as the NATS version.

\begin{enumerate}

\item {\bf Avoid the use of critical temporary variables as parameters to
subroutines.}

In the code fragment below, the variable {\tt xxr} is used as a temporary
in the column exchange and is also a parameter to the subroutine {\tt cdiv}.
The last two arguments to {\tt cdiv} are not input parameters, however, and
since the compiler does not perform inter-procedural analysis it is unable
to determine that the last value computed in the loop is not used as input
to the subroutine.
The compiler must then assume that the value assigned to {\tt xxr} in the
last iteration of the loop may be used in the subroutine.
This assumption prevents the vectorization of the {\tt do 100} loop, since
the vectorization of such a loop requires that the temporary be expanded to
a vector, and there is no way to extract from the temporary vector the final
value of {\tt xxr}.
\begin{verbatim}
      do 100 i = 1, n
         xxr = a(i, j)
         a(i, j) = a(i, k)
         a(i, k) = xxr
  100 continue
      .
      .
      call cdiv(ar, ai, br, bi, xxr, xxi)
\end{verbatim}
A solution that permits the desired vectorization is to replace the two
occurrences of {\tt xxr} in the loop with a variable (e.g.\ {\tt foo}) that
is never used as anything but a compiler-recognizable temporary.
\begin{verbatim}
      do 100 i = 1, n
         foo = a(i, j)
         a(i, j) = a(i, k)
         a(i, k) = foo
  100 continue
      .
      .
      call cdiv(ar, ai, br, bi, xxr, xxi)
\end{verbatim}
Another, slightly less time-efficient option would be simply to assign to
{\tt xxr} some constant value just after the loop, thus:
\begin{verbatim}
      do 100 i = 1, n
         xxr = a(i, j)
         a(i, j) = a(i, k)
         a(i, k) = xxr
  100 continue
      xxr = 0.0e0
      .
      .
      call cdiv(ar, ai, br, bi, xxr, xxi)
\end{verbatim}

\item {\bf Reorder simple loops to traverse arrays by column.}

For example, in a few EISPACK routines, an input array is traversed by row
to initialize it to an identity matrix.
Another common instance of unnecessary row operations was found in the
computation of matrix norms.
Such small, simple loops are easily converted to a form more amenable to
effective vectorization.
More time-consuming is the task of correctly converting even so
straightforward an algorithm as that for reducing a real matrix to
Hessenberg form using stabilized elementary vector transformations (as is
done in {\tt elmhes}).
The following example was taken verbatim from the NATS routine {\tt minfit}.
\begin{verbatim}
      do 500 i = m1, n
      do 500 j = 1, ip
         b(i,j) = 0.0e0
  500 continue
\end{verbatim}
This code is trivially transformed into its column-major counterpart, which
appears in the PVS version of {\tt minfit}.
Column-major traversal of the array {\tt b} makes much more efficient use
of the cache.
\begin{verbatim}
      do 500 j = 1, ip
      do 500 i = m1, n
         b(i,j) = 0.0e0
  500 continue
\end{verbatim}
The following segment of code (here, slightly modified for clarity) is used
in the NATS routine {\tt hqr} to compute the norm of the Hessenberg matrix
{\tt h(n,n)}.
Note that the elements of {\tt h} are accessed by row, and that the
auxiliary variable, {\tt k}, is needed to conform to the ANSI 66 requirement
that initial and final {\tt do}-loop indices be simple variables or
constants (expressions, like {\tt max0(1,i-1)}, were not permitted).
\begin{verbatim}
      k = 1
      do 50 i = 1, n
         do 40 j = k, n
   40    norm = norm + dabs(h(i,j))
         k = i
   50 continue
\end{verbatim}
Code with the same functionality appears in the PVS version of {\tt hqr}.
However, that code (below) is much more readable and, since the array is
traversed by column, it is much more efficient as well.
\begin{verbatim}
      do 50 j = 1, n
      do 50 i = 1, min0(n,j+1)
   50 norm = norm + dabs(h(i,j))
\end{verbatim}

\item {\bf Force vectorization of operations on rows and diagonals.}

When it is not possible to rearrange a computation (i.e.\ converting row
operations to equivalent column operations), one should specify with a
compiler directive in the FORTRAN source code that the row (or diagonal)
operation should, if possible, be performed using the vector hardware.
Here we take advantage of the assumption that $N$ is small.
When $N$ is small enough that a large portion of the data to be accessed
will fit in the 3090's cache, the disadvantages of row access are outweighed
by the advantages of vectorization.
When $N$ is this small, the archetypical problem associated with row
operations, page faulting, is insignificant.
On the 3090-VF, there is generally a penalty for performing vector
operations on rows because (with single precision) the attempt to access
each element in at least one row out of four will cause a cache fault.
The problem is even more pronounced when the data are double precision or
complex, since only two elements are copied into the cache after each fault.
Another problem with row vector operations arises when $N$ is very small.
If $N$ is a sufficiently small fraction of the vector section length, the
overhead of setting up for the vector operation is not offset by the few
pipelined arithmetic operations.
However, there is a middle ground, where $N$ is large enough to gain more
from vector operation than is lost to page and cache faults, and yet not so
small that pipelining overhead dominates.
For most routines that middle ground seems to be in a range from $N = 60$
to about $150$.

To illustrate this transformation, consider the following code taken from
the NATS routine {\tt cinvit}.
The inner loop is eligible to be vectorized; that is, it has no
vectorization-prohibiting characteristics (e.g.\ recursive dependencies,
references to user functions or subroutines, etc.).
However, the two arrays {\tt rm1} and {\tt rm2} are traversed by row, and
that fact leads the compiler to estimate that the code would run faster in
scalar mode than in vector mode.
Since we are targeting this code for relatively small problems, we believe
that estimate is inapplicable, and think that these operations should be
vectorized.
One possibility is that the compiler would interchange the two loops,
effectively converting the row operations to column operations.
Unfortunately, there is a dependency of the inner loop index, {\tt j}, on
the outer loop index, {\tt i}, that prevents the interchange.
To force the compiler to vectorize an eligible loop, one need simply insert
a comment of a prespecified form just before that loop.
\begin{verbatim}
      do 400 i = 2, uk
         do 380 j = i, uk
            rm1(i,j) = rm1(i,j) - x * rm1(mp,j) + y * rm2(mp,j)
            rm2(i,j) = rm2(i,j) - x * rm2(mp,j) - y * rm1(mp,j)
  380    continue
  400 continue
\end{verbatim}
The comment below instructs the compiler that we would like it to generate
vector code for a loop that (in this case) it has estimated would run faster
in scalar mode.
\begin{verbatim}
      do 400 i = 2, uk
c" (prefer vector
         do 380 j = i, uk
            rm1(i,j) = rm1(i,j) - x * rm1(mp,j) + y * rm2(mp,j)
            rm2(i,j) = rm2(i,j) - x * rm2(mp,j) - y * rm1(mp,j)
  380    continue
  400 continue
\end{verbatim}

\item {\bf Replace references to the function {\tt pythag} by an in-line