     They shared a directive to encourage vectorization of a
     diagonal shift and the effective equivalent of an ``ignore recrdeps''
     directive to permit vector operation in a pair of complex row saxpys.
     In {\tt comqr2} alone, a matrix norm was reformulated to permit
     its vectorization, and a ``prefer vector'' directive was applied
     to a mixed row/column inner product.
\item {\bf corth} -
     Two transformations were applied to {\tt corth}:
     a complex inner product was modified to use a positive stride, and
     a pair of complex elementary transformations (one row-, the other
     column-oriented) were converted according to transformation \#6.
\item {\bf eltran, htribk} -
     Here are two cases in which initial, injudicious application of
     the ``prefer vector'' directive resulted in noticeably anomalous
     performance.
     In {\tt eltran}'s case, the compiler was instructed to generate
     vector instructions for the innermost loop, which performs a
     row copy (not an interchange).  The timing data (the PVS version
     executed from $10$\% to $20$\% more slowly than the NATS code)
     made it quite clear that, in this case, the scalar code is much
     more efficient.  Removal of the offending directive yielded code
     effectively identical to NATS {\tt eltran}.  The moral here is,
     of course, that ``copy'' loops (in contrast to those performing
     vector exchanges) generally execute more efficiently in scalar mode.
     In the case of {\tt htribk}, the ``prefer vector'' directive was first
     applied to a loop that the compiler would have vectorized in any case.
     That loop was at nesting level $2$.  The timings of the code with the
     directives applied to the more deeply nested (level $3$) loops show
     an increase in performance of $10$ to $15$\% over the NATS {\tt htribk}.
     It should be noted that the loops at nesting level $3$ performed
     {\em mixed} row and column complex inner products and elementary
     transformations.
\item {\bf figi, figi2} -
     See the discussion in section $4$.
\item {\bf hqr} -
     ``Prefer vector'' directives were used to vectorize a diagonal
     shift and an initialization of a diagonal to zero.  The use of
     ``prefer vector'' directives and
     the renaming of temporary variables permitted vectorization of
     many additional row and column operations.
\item {\bf hqr2} -
     The transformations that were applied to {\tt hqr} applied
     equally well here.  Some transformations specific to {\tt hqr2}
     involved additional renamings and the application of a
     ``prefer vector'' directive to a matrix multiply.
\item {\bf htrib3} -
     The sole transformation consisted of changing the signs of
     scale factors in a complex row/column saxpy.
\item {\bf htridi, htrid3} -
     In {\tt htridi}, ``prefer vector'' directives were used on the following
     complex, row-oriented operations: sum, saxpy, sdot, and scale.  To achieve
     the same results in {\tt htrid3}, ``ignore recrdeps'' directives had
     to be used in each case as well.
\item {\bf imtql1, imtql2, imtqlv, tql1, tql2} -
     The same transformation was
     performed for each of these routines: invocations of the function
     {\tt pythag} were replaced by equivalent in-line code.  The major
     benefit of making such a replacement is that it permitted the
     vectorization of a loop at nesting level $2$.
\item {\bf minfit, svd} -
     In two places in each routine, an ``ignore recrdeps'' directive
     was given to vectorize a doubly nested loop which performed at
     each iteration a column inner product and a column saxpy.
     A ``prefer vector''
     directive was needed to vectorize a row scale, a row sum, and,
     with the help of an ``ignore recrdeps'' directive, a combined
     column and row initialization.
     The only transformation not performed in both
     cases affected {\tt minfit}: a matrix initialization loop was
     modified so that the array was traversed by column rather than
     by row.
\item {\bf orthes} -
     As mentioned in section $4$, the transformations in {\tt orthes}
     involved negating scale factors in elementary transformations and
     modifying inner products to use positive strides.
     Here, both loops at nesting level $2$ were vectorized, in spite of the
     fact that in one all operations were column oriented.  (This, the
     authors believe, was a mistake and probably led to a degradation in
     performance.)
\item {\bf qzhes} -
     Two elementary row transformations were vectorized.
\item {\bf qzit} -
     As in {\tt qzhes}, two elementary row transformations
     were vectorized.  A ``prefer vector'' directive was applied
     to a row summation, vectorizing it as well.
\item {\bf qzvec} -
     A ``prefer vector'' directive was applied
     to a single row-oriented inner product.
\item {\bf reduc, reduc2} -
     A single row inner product was vectorized.
\item {\bf trbak1} -
     A ``prefer vector'' directive
     and an ``ignore recrdeps'' directive were applied to a loop
     at nesting level $2$.  The innermost loops are mixed row and
     column operations.
\item {\bf trbak3, tred2} -
     The same problem experienced in the development of
     {\tt trbak1} was
     noted for each of these routines.  The same solution was applied in both
     cases.  In the NATS {\tt trbak3}, all operations were column oriented,
     so no compiler directives were needed.  For {\tt tred2}, transformation
     \#6 was also applied.
\item {\bf tred1, tred3} -
     In each case, transformation \#6 was applied.
     In {\tt tred1},
     a section of code resembling a $3$-way row exchange was vectorized.
\end{itemize}

\section{Timings}

Comparison testing was done on PASC code versus NATS (i.e.\ the
double precision versions) and PVS code versus NATS (i.e.\ the single
precision versions).
The tables are found in Appendices 1 and 2.
Times are recorded in seconds, and for each table, the number of samples
per value of $N$ is in parentheses following the name of the subroutine.

\subsection{PASC vs.\ NATS (double precision)}

The first four routines listed in the table
({\tt balanc}, {\tt balbak}, {\tt cbak2}, and {\tt cbal}) were
not part of PASC: the timings are actually NATS vs.\ NATS and are
included only for purposes of calibration.
It should also be noted that seven of the routines to be compared
to the PASC routines were
actually NATS routines that had been modified to run more efficiently
on the 3090-VF.  They were {\tt comhes}, {\tt elmhes}, {\tt orthes},
{\tt qzhes}, {\tt qzit}, {\tt tql1}, and {\tt tred1}.
These seven were involved in an experiment to assess how productive
a moderate amount of recoding effort would be.  In the experiment,
no more than a day was spent per routine in recoding the NATS original
and in testing.  These routines were timed against the PASC routines
that had been carefully crafted for the 3090-VF from the original
Algol source.  The corresponding PASC codes were not examined until
after the tests.
The seven routines were selected simply to represent the spectrum of
EISPACK capabilities.
The testing involved matrices of dimensions $50$, $60$, $70$, $80$,
$100$, $150$, $200$, and $300$.
What was seen was that $26$ of the PASC routines outperformed
the original NATS by at least $10$\% at dimension $300$
(although, because of the large standard deviation associated with the
timings for {\tt qzhes}, that routine should possibly be excluded).
The PASC versions of the important procedures
{\tt hqr} and {\tt hqr2} achieved about $130$\%
and $90$\% speedup over the NATS codes, respectively.
However, the timings of the
similarly important {\tt ortran} showed no significant differences.
Such was also the case for {\tt orthes}, where PASC was compared to
the recoded NATS.  In general, the PASC code was about $40$\% faster
than the recoded NATS {\tt qzit} and $20$\% faster than the recoded
NATS {\tt tql1}.
Ten of the other tests produced timings within $10$\% of each
other at dimension $300$, and one, the original NATS {\tt qzval},
actually ran about $10$\% faster than the PASC code.

\subsection{PVS vs.\ NATS (single precision)}

There were two separate runs, one for PVS and one for NATS.
The same drivers and timing routines were used for each run.
The test matrices had dimensions $10$ through $200$ in steps of $10$,
and their elements were uniformly distributed over the interval
$[-1, 1]$.
Both versions were compiled by VS FORTRAN Version 2, Release 3.
All code was compiled with flags to enable the production of
vector code at the highest level of optimization.
Each routine was run several times for each value of $N$ to
obtain more reliable results.
Although the CPU times were obtained during off-peak hours,
some routines (e.g.\ {\tt figi}, {\tt figi2}, and {\tt qzval}) required
time on the order of the clock granularity.
As a result, the distributions of the times for such routines were
characterized by unusually high standard deviations.
A table entry with a ``z'' appended to the
name of a subroutine (e.g.\ {\tt qzitz}) documents the performance
of that routine when it was called with a flag indicating that
transformations should be accumulated.
The corresponding entry
without the trailing ``z'' in the name contains times for the
calls when no accumulation was performed.
PVS routines that performed outstandingly well did so because
the PVS routine had been converted from the NATS stride-$N$ code
to all stride-one code, either through conventional means (recoding)
or through judicious application of compiler directives.
The best examples of the latter technique are {\tt elmbak} and {\tt combak}.
With only very minor modifications and the application of a few
vector directives, the code ran approximately $6$ and $7$ times faster,
respectively, for $N = 200$.
Notice that the important subroutines {\tt hqr} and {\tt hqr2}
performed $70$\% and $90$\% better, respectively, at dimension $100$
with the PVS versions than with the NATS.  By contrast, PVS {\tt orthes}
performed only $20$\% better at dimension $100$, and {\tt ortran}
performed no better than its NATS version.
Since improvements for small dimension matrices were realized
with the PASC double precision versions of {\tt eltran}, {\tt htribk},
and {\tt ortran} but not with the PVS versions, we are led
to believe that the significant stride-one reorientation
that was employed for the PASC codes is the correct approach
to employ even for single precision.

\section{Correctness Testing}

Both PASC and PVS codes were subjected to tests to ensure
their correctness.  These tests were created by modifying
the thirteen NATS-supplied drivers.
These programs originally read
a large set of test matrices, called a sequence of EISPACK
modules to produce eigenvalues and/or eigenvectors, compared
results across alternative paths, and computed residuals.
Each test produces a figure of merit that is essentially
the residual norm scaled by the product of ten, the
machine precision, the matrix dimension, and the matrix norm.
These tests have been employed since 1972 to determine the
correctness of versions of EISPACK on various machines.
The modifications employed allowed for a set of random perturbations
to be applied to each of the input matrices.  For certain
situations (e.g.\ symmetric matrices, tridiagonal matrices)
the perturbations had to reflect the character of the
original matrix.  A large and varied set
of tests could be performed with this procedure.
Although the codes were targeted
for the 3090-VF, they were also run (employing simulation) on
Sun 3 and Sun 4 workstations and on a VAX 11/780.  Testing on the
alternative equipment, with different floating point precisions
and ranges, as well as on the 3090-VF, was intended to reveal
subtle errors.  In fact, it did, since several of the original
PASC codes (since recoded) showed difficulties with
overflows that had not previously been uncovered.
The final versions of all PASC and PVS routines perform
within the acceptable tolerances on all tests.

\section{Conclusions}

Two new versions of the EISPACK subroutine package have been
tailored for the 3090-VF.  One was produced by Augustin Dubrulle
of the IBM Palo Alto Scientific Center
in double precision and is intended to be most efficient for
extremely large problems.  The other was produced by Pleasant
Valley Software in single precision with the purpose of
attaining efficiency for smaller problems.
The encoding
of the PVS version has revealed several transformations
that can be important in making similar programs more
efficient for the 3090-VF.

\newpage
\section{References}

\begin{enumerate}
  \item
    J.~J.~Dongarra, F.~G.~Gustavson, and A.~Karp (January 1984),
    ``Implementing Linear Algebra Algorithms for Dense Matrices
    on a Pipeline Machine,''
    SIAM Review, Vol.\ 26, No.\ 1, pp.\ 91--113.
  \item
    A.~A.~Dubrulle, H.~G.~Kolsky, and R.~G.~Scarborough (1985),
    ``How to Write Good Vectorizable FORTRAN,'' Technical Report G320-3396,
    IBM Scientific Center, Palo Alto, CA.
  \item
    A.~A.~Dubrulle (June 1988),
    ``A Version of Eispack for the IBM 3090VF,'' Technical Report G320-3510,
    IBM Scientific Center, Palo Alto, CA.
  \item
    A.~Padegs, B.~B.~Moore, R.~M.~Smith, and W.~Buchholz (1988),
    ``The IBM System/370 Vector Architecture: Design Considerations,''
    IEEE Trans.\ Comp., Vol.\ 37, No.\ 5.
  \item
    S.~G.~Tucker (1986),
    ``The IBM 3090 System: An Overview,''
    IBM Systems Journal, Vol.\ 25, No.\ 1, pp.\ 4--19.
  \item
    H.~H.~Wang (March 1986),
    ``Introduction to Vectorizing Techniques on the IBM 3090 Vector Facility,''
    Technical Report G320-3489, IBM Scientific Center, Palo Alto, CA.
  \item
    J.~H.~Wilkinson and C.~Reinsch (1971),
    ``Handbook for Automatic Computation, Vol.\ II, Linear Algebra,''
    Springer-Verlag, New York.
\end{enumerate}

\newpage
\vfil
\hfil {\Large Appendix 1} \hfil
\vfil
\newpage
\vfil
\hfil {\Large Appendix 2} \hfil
\vfil

\end{document}
