equivalent using the {\tt sqrt} function.}  For example, the statement
\begin{verbatim}
      X = pythag(Y,Z)
\end{verbatim}
would become the code segment:
\begin{verbatim}
      if (abs(Y) .gt. abs(Z)) then
         X = abs(Y)*sqrt(1.0e0 + (Z/Y)**2)
      else if (Z .ne. 0.0e0) then
         X = abs(Z)*sqrt((Y/Z)**2 + 1.0e0)
      else
         X = abs(Y)
      endif
\end{verbatim}
If, for example, it is known that {\tt Z} is non-zero, or if either of {\tt Y}
or {\tt Z} is a constant, the above section of code may be appropriately
simplified.  It is important to code in-line any small auxiliary routines
that appear in critical inner loops, since their presence will prohibit the
vectorization of that or any containing loop.  In the case of {\tt pythag},
nearly all occurrences were replaced because the code using the hardware
{\tt sqrt} was much more efficient.

\item {\bf Convert ANSI 66 FORTRAN-style, zero-pass loops to FORTRAN 77.}
ANSI 66 FORTRAN did not specify whether the body of a {\tt do}-loop should be
executed when the initial index exceeded the final index.  This forced the
writers of portable code to insert, just before any {\tt do}-loop whose
initial index might exceed its final index, a logical {\tt if} that would
cause a jump around the loop.  In FORTRAN 77, it is guaranteed that the
bodies of such loops will not be executed, so removing the associated logical
{\tt if} statements will not change the semantics of a FORTRAN 77 program.
The removal of the offending {\tt if} statement may permit a vector
compiler to rearrange and vectorize loops that would otherwise have
been executed in scalar mode.

In the example below, the test and branch before the {\tt do 140} loop
prohibit the ``loop interchange'' that would have permitted vectorization
of the outer loop.  By default, the inner loop is not vectorized because
it performs a row operation, and scalar code would execute more quickly in
general.  Simply removing the {\tt if} statement permits vectorization of the
outer loop, effectively yielding the more efficient {\em column} vector
operations.
\begin{verbatim}
      do 160 i = 1, n
         if (k.gt.m) go to 160
         do 140 j = k, m
  140    a(i,j) = a(i,j) - y * a(m,j)
  160 continue
\end{verbatim}
The following code is very similar to a part of the NATS routine, {\tt elmhes}.
The test and jump are used mainly to avoid the execution of the inner loop
when it would have no net effect.  Unfortunately, such ``optimizations''
impair the compiler's ability to recognize opportunities for vectorization.
As in the previous example, removing the {\tt if} statement permits
vectorization of the outer loop without changing the semantics of the code.
\begin{verbatim}
      do 160 i = 1, n
         y = a(i,mm1)
         if (y .eq. 0.0e0) go to 160
         y = y / x
         a(i,mm1) = y
         do 140 j = m, n
  140    a(i,j) = a(i,j) - y * a(m,j)
  160 continue
\end{verbatim}
Here, there is a lesson more generally applicable than ``eliminate explicit
branches to enable vectorization.''  Unless the above code is targeted for
use in an application involving fairly sparse matrices, the scale factor,
{\tt y}, will seldom be $0$.  Clearly, executing the inner loop with
${\tt y} = 0$ is a waste of time, but if {\tt y} is rarely $0$, the
degradation (admittedly slight) of average performance is not worth the
occasional ``best case'' improvements.  Another argument in favor of removing
the {\tt go to} is that for most modern compilers, structured code is more
easily and efficiently optimized.
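For concreteness, here is a sketch of how the {\tt elmhes}-like loop above
might read once the test on {\tt y} is removed (the actual PVS code may
differ in detail):
\begin{verbatim}
      do 160 i = 1, n
         y = a(i,mm1) / x
         a(i,mm1) = y
         do 140 j = m, n
  140    a(i,j) = a(i,j) - y * a(m,j)
  160 continue
\end{verbatim}
When {\tt y} is zero, the store into {\tt a(i,mm1)} and the scaled
subtractions merely rewrite values that are already there, which is why the
branch can be dropped without changing the result.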
A final example demonstrates dramatically how the improper use of explicit
branches can inhibit vectorization.  The EISPACK subroutine, {\tt figi}, does
not perform many operations (only $O(n)$).  It is unusual in that it performs
nearly as many operations testing the integrity of the input data as it
performs on those data.  The NATS version of {\tt figi} interleaves the error
testing and the computation.  Since one of the error branches (to the
{\tt 1000} label) exits the loop, the entire loop must be executed in scalar
mode.  Here is the body of the NATS code.
\begin{verbatim}
      do 100 i = 1, n
         if (i .eq. 1) go to 90
         e2(i) = t(i,1) * t(i-1,3)
         if (e2(i)) 1000, 60, 80
   60    if (t(i,1).eq.0.0e0 .and. t(i-1,3).eq.0.0e0) go to 80
         ierr = -(3 * n + i)
   80    e(i) = sqrt(e2(i))
   90    d(i) = t(i,2)
  100 continue
\end{verbatim}
It is possible, in this case, to perform the computation and the testing in
separate loops without duplicating any computation.  Thus, at least the
computation loop may be vectorized.  The only expense is that when bad input
is detected, more computation will have been performed before returning,
since the computation loop must precede the loop for input validation.
(Note that in the rewritten code {\tt abs} guards the {\tt sqrt}, since the
sign of {\tt e2(i)} is not examined until the second loop.)  The net
performance gain for this transformation averages about 50\% for $N$ in the
range $50$ to $150$.
\begin{verbatim}
      d(1) = t(1,2)
      do 100 i = 2, n
         e2(i) = t(i,1) * t(i-1,3)
         e(i) = sqrt(abs(e2(i)))
         d(i) = t(i,2)
  100 continue
      do 200 i = 2, n
         if (e2(i) .gt. 0.e0) go to 200
         if (e2(i) .lt. 0.e0) go to 1000
         if (t(i,1).eq.0.e0.and.t(i-1,3).eq.0.e0) go to 200
         ierr = -(3 * n + i)
  200 continue
\end{verbatim}

\item {\bf When performing elementary vector operations, use the form
$Y = Y + sX$ rather than $Y = Y - sX$.}  If necessary, the sign of the scale
factor $s$ is changed before entering the loop.  A related transformation is
to convert inner products with negative stride to (mathematically) equivalent
inner products with positive stride.\footnote{However, using a positive
stride when forming inner products of vectors from a graded matrix may result
in an unacceptable loss of accuracy.  See [Wilkinson, pp.\ 339-358] for an
in-depth discussion of this topic.}  In general, negative-stride loops should
be avoided, since they are less efficiently vectorized.  Both of these
transformations were central to producing the PVS routine, {\tt orthes}.
For example, the NATS code for {\tt orthes}:
\begin{verbatim}
      do 130 j = m, n
         f = 0.0d0
         do 110 ii = m, igh
            i = mp - ii
            f = f + ort(i) * a(i,j)
  110    continue
         f = f / h
         do 120 i = m, igh
  120    a(i,j) = a(i,j) - f * ort(i)
  130 continue
\end{verbatim}
was transformed into the PVS code:
\begin{verbatim}
      do 130 j = m, n
         f = 0.0d0
         do 110 i = m, igh
            f = f + ort(i) * a(i,j)
  110    continue
         f = -f * h
         do 120 i = m, igh
  120    a(i,j) = a(i,j) + f * ort(i)
  130 continue
\end{verbatim}
Note that in the NATS version the inner product is computed with a negative
stride, and in the elementary transformation the product {\tt f * ort(i)} is
subtracted from {\tt a(i,j)}.  Making the two small changes to obtain the PVS
code yielded more than a 10\% performance improvement for $N$ as small as
$100$.\footnote{Note that there was a third change: the division by {\tt h}
in NATS was replaced in PVS by a multiplication by its inverse.  Although
multiplies are cheaper than divides on the 3090-vf, such a change does not
yield a discernible performance improvement in this case.}

\item {\bf Use judiciously the ``ignore recrdeps'' compiler directive.}
Let us examine a section of code taken in part from the PVS routine,
{\tt elmhes}.  At first glance, the code would seem to be well suited to
vectorization, since the inner loop performs elementary transformations on
columns.  Unfortunately, the compiler reports that there {\em may} be a
recursive dependency in the innermost loop, and does not generate vector
instructions for it.  We examine the innermost loop and note that there
cannot possibly be a dependence on the scale factor, {\tt a(j,m-1)}, so the
problem must lie with the term {\tt a(i,j)}.  Studying the inner loop in
isolation, we see that, depending on the relationship between the indices
{\tt m} and {\tt j}, there may indeed be a recursive dependency.\footnote{If
a loop on $i$ computes $a_i$ as a function of $a_{i-1}$, then $a_i$ is said
to be recursively dependent on $a_{i-1}$.}
Now, stepping back to consider both loops, it is easy to determine that
{\tt j} will always be at least one greater than {\tt m}, so actually there
is no dependence.
\begin{verbatim}
      do 140 j = m+1, igh
         do 139 i = 1, igh
  139    a(i,m) = a(i,m) + a(j,m-1)*a(i,j)
  140 continue
\end{verbatim}
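For contrast, a loop that genuinely carries a recursive dependence might look
like the following schematic fragment (illustrative names, not taken from the
library):
\begin{verbatim}
c     a first-order recurrence: each trip needs the value that the
c     previous trip has just stored into x(i-1)
      do 10 i = 2, n
   10 x(i) = x(i) + s * x(i-1)
\end{verbatim}
No such pattern occurs in the loop nest above; the difficulty is only that
the compiler cannot establish that fact by itself.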
The vector directive to ignore recursive dependencies in the code below
serves to inform the compiler that {\tt a(i,m)} does {\em not} depend
recursively upon {\tt a(i,j)}.  Thus, the inner loop is vectorized.
Caution must be used when specifying `ignore' directives, since if improperly
applied, they (unlike `prefer' directives) may profoundly change the semantics
of affected subroutines.
\begin{verbatim}
      do 140 j = m+1, igh
c" ( ignore recrdeps
         do 139 i = 1, igh
  139    a(i,m) = a(i,m) + a(j,m-1)*a(i,j)
  140 continue
\end{verbatim}
\end{enumerate}

\section{Timings of individual PVS routines}

In this section, we shall discuss the individual PVS routines, relating
which transformations (from section 4) were applied and the effects
they had on vectorization.
There are $75$ separate subroutines and functions in the NATS package.
Of those, $13$ are simply driver subroutines and $4$ are the utility
subprograms {\tt cdiv}, {\tt csroot}, {\tt epslon}, and {\tt pythag}.
No drivers or utilities were modified.
In nearly every other routine, some of the following general transformations
were used to convert NATS code into PVS code.
\begin{itemize}
 \item ANSI 66 style loops with negative stride were converted to FORTRAN 77
 {\tt do}-loops with negative increments (a schematic illustration appears
 just after this list).
 \item When a temporary variable (often with a name like {\tt ip1} or
 {\tt mm1}) was defined only to be used a single time (e.g.\ the initial or
 terminal index of a {\tt do}-loop), it was replaced by its definition.
 Although efficiency was not appreciably affected by such changes, the
 resultant code was more readable.
 \item Many spurious ANSI 66 style zero-pass branches were removed.  The
 removal of such branches is not mentioned below unless it was critical to
 loop rearrangement and/or vectorization, since only in that case is
 performance significantly affected.
 \item Nearly every reference to the function {\tt pythag} was mechanically
 replaced by an in-line equivalent.  For the few cases in which the
 replacement code was in a critical path, further refinements were made to
 decrease the number of floating-point operations in an inner loop.
 Those cases are noted below.
\end{itemize}
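To make the first of these transformations concrete, here is a schematic
sketch (the names are illustrative and the fragment is not taken from the
library) of an ANSI 66 style descending loop and a FORTRAN 77 rewriting that
uses a negative increment:
\begin{verbatim}
c     ANSI 66 style: the loop index ascends while i is computed
c     so that the array references descend
      do 10 ii = m, igh
         i = igh + m - ii
   10 s = s + b(i) * c(i)

c     FORTRAN 77: the same loop written with a negative increment
      do 20 i = igh, m, -1
   20 s = s + b(i) * c(i)
\end{verbatim}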
Of the $58$ remaining routines, several were not substantially modified.
Most unmodified routines were not timed, and none are discussed below.
That is, if none of the transformations that were applied to a routine
affected its vectorization, it was neither timed nor is it discussed below.
Neither of the first two transformations affects vectorization, and in only
a few cases was the removal of a branch required to achieve vectorization.
\begin{itemize}
\item {\bf invit} - One ``prefer vector'' directive was used on a row
 summation in the computation of the infinity norm.  A second was used to
 initialize a row to all zeros.  The renaming of a temporary variable
 involved in a column exchange permitted the vectorization of that operation.
\item {\bf cinvit} - Renaming a temporary variable removed a compiler-induced
 dependence on a row exchange.  Then, ``prefer vector'' directives forced the
 vectorization of the associated loop and of one performing a complex row
 saxpy.  Finally, transformation \#6 was applied to the saxpy.
\item {\bf tinvit} - The single significant transformation applied to
 {\tt tinvit} was to issue a ``prefer vector'' directive for the sole
 eligible loop.
\item {\bf combak, elmbak} - The application of an ``ignore recrdeps''
 directive and the removal of a branch around an inner loop permitted a loop
 interchange and subsequent vectorization.
\item {\bf comhes, elmhes} - {\tt Elmhes} was rewritten to perform all column
 operations and to make only one pass over the input matrices.  One ``ignore
 recrdeps'' directive was given.  {\tt Comhes}, the complex analog of
 {\tt elmhes}, was modified in exactly the same manner with one exception: an
 auxiliary routine, {\tt cdiv}, was coded in-line to obtain the same degree
 of vectorization.
\item {\bf comlr, comlr2} - The transformations applied to {\tt comlr} and
 {\tt comlr2} were very similar.  Both used ``prefer vector'' directives to
 vectorize row exchanges, elementary row transformations, and a diagonal
 shift.  Numerous applications of the ``rename overused temporaries'' rule
 permitted vectorization of several row and column exchanges.  Two additional
 transformations affected {\tt comlr2} alone: a matrix norm was reformulated
 to permit its vectorization, and a ``prefer vector'' directive was applied
 to a mixed row/column inner product.
\item {\bf comqr, comqr2} - Much like {\tt comlr} and {\tt comlr2}, these two
 routines are similar enough in functionality that most transformations
 successfully applied to one are applied to the other as well.