
📄 http://www.cs.utexas.edu/users/gunnels/transpose/index.html

📁 This data set contains WWW-pages collected from computer science departments of various universities
MIME-Version: 1.0
Server: CERN/3.0
Date: Monday, 06-Jan-97 21:20:59 GMT
Content-Type: text/html
Content-Length: 4310
Last-Modified: Thursday, 28-Mar-96 20:17:09 GMT

<TITLE>Transpose Case</TITLE>

<CENTER><H1>John A. Gunnels</H1></CENTER>
<H4><i>Department of Computer Science <br>
       University of Texas at Austin <br>
       <a HREF="mailto:gunnels@cs.utexas.edu">gunnels@cs.utexas.edu</a></i></H4>

<CENTER><H1>Markus Dale</H1></CENTER>
<H4><i>Department of Computer Science <br>
       University of Texas at Austin <br>
       <a HREF="mailto:mdale@cs.utexas.edu">mdale@cs.utexas.edu</a></i></H4>

<H3>Notes: In a lesson on not putting this stuff off (waiting for llano or eureka to become available), I ran into a weird bug on Sunday (well, since we really started on Sunday, I guess that is kind of redundant). For some reason the permutation routine "chokes" when the pieces being sent around climb above a certain size. For example -- I can do a 1600x1600 multiply on 16 processors with a block size of 10 but NOT 20. I checked for malloc failures, but that wasn't it (*). So my results here are very sketchy. Actually, I would rather get some timings on eureka or llano before I try to tune this thing.</H3>
<HR>
<H3>(*) Well, there is no problem on eureka in this respect. Some preliminary data is also more encouraging as far as performance on eureka goes. Although it is not what we are aiming for, a 4x4 grid on eureka achieves 35.xx MFLOPS when the local size hits 400x400 or so.</H3>
<HR>
<H3>(**) The problem has been fixed -- however, it was a matter of using Irecv instead of receive (basically) in implementing the collect (perhaps I should have just used MPI_Allgather). However, these messages were not very big -- I would need to read about the SP2 architecture, I guess, but this seems pretty fragile to me.</H3>
<HR>
<H3>BTW -- the code does handle non-square matrices and meshes; I just need to collect timings for those cases.</H3>
<IMG ALIGN="middle" SRC="trnspse.gif">
<IMG ALIGN="middle" SRC="tchart.gif">
<H3>NOTE: I have tweaked the code a little. Basically, add 2 MFLOPS to everything here. I will post the chart soon (3/27/96).</H3>
<HR>
<H3>The code enclosed does not perform accuracy testing -- it was removed to make the runs because it creates the global matrix on each processor. We do have a version (on both spice and eureka) that does the testing.</H3>
<H3><A HREF="main.c">main.c</A></H3>
<H3><A HREF="csmmult1.c">csmmult1.c</A></H3>
<H3><A HREF="colrow1.c">colrow1.c</A></H3>
<H3><A HREF="globals.h">globals.h</A></H3>
<H3><A HREF="rand.c">rand.c</A></H3>
<BR>
Note: There are at least 4 simple things we could do to improve the performance of this code.
<HR>
1. Instead of scattering followed by permuting, we could simply permute (from 1-to-many). That is, send the blocks immediately to the processor that they will arrive at after both the scatter and the permute step. To really make the code unreadable, we believe that you could use non-blocking sends to overlap the copying to the send buffer with the sending of blocks. (1 hour to recode and test; the first sketch after this list illustrates the idea.)
<HR>
2. Use MPI_Allgather instead of our hand-written bucket-collect. (30 minutes to re-code and test; the second sketch after this list illustrates the idea.)
<HR>
3. There is a simple test to see if you are sending to yourself for all of these routines. This might improve performance a great deal on grids that are far from square (although this does partially void #2 -- or perhaps not, if MPI is "smart" enough to figure this out itself). The first sketch after this list includes such a check.
The only "drawback" would be that it really wouldn't be implementing the same code on the 1x1 mesh and on a small machine like eureka this MIGHT make the scalability appear either "bad" or hard to figure an alpha, beta, gamma equation for. <HR>4. On a square mesh there is a very simple trick that would REALLY speed thingsup (I think).   In the scatter step of A's rows simply send to the same rownumber that you are the col number of (that is if you are in column 0 you sendto row 0 within column 0, 1->1 etc.).  Then perform a broadcast within columns.Then you do the analogous thing for the columns of B.  I am pretty sure that Prof. van de Geijn discussed this in class as one of the shortcuts.  I really mention it just to point out that that is NOT what we did (because we are presenting data for the square mesh case).<UL><img src="/pub/cgi/Count.cgi?ft=0&dd=A|df=gunnels2.dat" align=absmiddle>
