
📄 http://www.cs.utexas.edu/users/gunnels/transpose/index.html

📁 This data set contains WWW-pages collected from computer science departments of various universities
MIME-Version: 1.0
Server: CERN/3.0
Date: Monday, 06-Jan-97 21:20:59 GMT
Content-Type: text/html
Content-Length: 4310
Last-Modified: Thursday, 28-Mar-96 20:17:09 GMT

<TITLE>Transpose Case</TITLE>

<CENTER><H1>John A. Gunnels</H1></CENTER>
<H4><i>Department of Computer Science <br>
       University of Texas at Austin <br>
       <a HREF="mailto:gunnels@cs.utexas.edu">gunnels@cs.utexas.edu</a></i></H4>

<CENTER><H1>Markus Dale</H1></CENTER>
<H4><i>Department of Computer Science <br>
       University of Texas at Austin <br>
       <a HREF="mailto:mdale@cs.utexas.edu">mdale@cs.utexas.edu</a></i></H4>

<H3>Notes: In a lesson on not putting this stuff off (waiting for llano or eureka to become available), I ran into a weird bug on Sunday (well, since we really started on Sunday, I guess that is kind of redundant). For some reason the permutation routine "chokes" when the pieces being sent around climb above a certain size. For example -- I can do a 1600x1600 multiply on 16 processors with a block size of 10 but NOT 20. I checked for malloc failures, but that wasn't it (*). So my results here are very sketchy. Actually, I would rather get some timings on eureka or llano before I try to tune this thing.</H3>
<HR>
<H3>(*) Well, there is no problem on eureka in this respect. Some preliminary data is also more encouraging as far as performance on eureka goes. Although it is not what we are aiming for, a 4x4 grid on eureka achieves 35.xx MFLOPS when the local size hits 400x400 or so.</H3>
<HR>
<H3>(**) The problem has been fixed -- however, it was a matter of using Irecv instead of receive (basically) in implementing the collect (perhaps I should have just used MPI_Allgather). However, these messages were not very big -- I would need to read about the SP2 architecture, I guess, but this seems pretty fragile to me.</H3>
<HR>
<H3>BTW -- the code does handle non-square matrices and meshes; I just need to collect timings for those cases.</H3>
<IMG ALIGN="middle" SRC="trnspse.gif">
<IMG ALIGN="middle" SRC="tchart.gif">
<H3>NOTE: I have tweaked the code a little. Basically, add 2 MFLOPS to everything here. I will post the chart soon (3/27/96).</H3>
<HR>
<H3>The code enclosed does not perform accuracy testing -- it was removed to make the runs because it creates the global matrix on each processor. We do have a version (on both spice and eureka) that does the testing.</H3>
<H3><A HREF="main.c">main.c</A></H3>
<H3><A HREF="csmmult1.c">csmmult1.c</A></H3>
<H3><A HREF="colrow1.c">colrow1.c</A></H3>
<H3><A HREF="globals.h">globals.h</A></H3>
<H3><A HREF="rand.c">rand.c</A></H3>
<BR>
Note: There are at least 4 simple things we could do to improve the performance of this code.
<HR>
1. Instead of scattering followed by permuting, we could simply permute (from 1-to-many). That is, send the blocks immediately to the processor that they will arrive at after both the scatter and the permute step. To really make the code unreadable, we believe that you could use non-blocking sends to overlap the copying to the send buffer with the sending of blocks. (1 hour to recode and test; the first sketch after this list illustrates the idea.)
<HR>
2. Use MPI_Allgather instead of our hand-written bucket-collect. (30 minutes to re-code and test; the second sketch after this list illustrates the idea.)
<HR>
3. There is a simple test to see if you are sending to yourself for all of these routines. This might improve performance a great deal on grids that are far from square (although this does partially void #2 -- or perhaps not, if MPI is "smart" enough to figure this out itself). The first sketch after this list includes such a check.
The only "drawback" would be that it really wouldn't be implementing the same code on the 1x1 mesh and on a small machine like eureka this MIGHT make the scalability appear either "bad" or hard to figure an alpha, beta, gamma equation for. <HR>4. On a square mesh there is a very simple trick that would REALLY speed thingsup (I think).   In the scatter step of A's rows simply send to the same rownumber that you are the col number of (that is if you are in column 0 you sendto row 0 within column 0, 1->1 etc.).  Then perform a broadcast within columns.Then you do the analogous thing for the columns of B.  I am pretty sure that Prof. van de Geijn discussed this in class as one of the shortcuts.  I really mention it just to point out that that is NOT what we did (because we are presenting data for the square mesh case).<UL><img src="/pub/cgi/Count.cgi?ft=0&dd=A|df=gunnels2.dat" align=absmiddle>
