📄 chap30.htm
字号:
<h1><a name="0954_19dd">30.2 CRCW algorithms versus EREW algorithms<a name="0954_19dd"></h1><P>
<a name="0954_19d9"><a name="0954_19da">The debate about whether or not concurrent memory accesses should be provided by the hardware of a parallel computer is a messy one. Some argue that hardware mechanisms to support CRCW algorithms are too expensive and used too infrequenly to be justified. Others complain that EREW PRAM's provide too restrictive a programming model. The answer to this debate probably lies somewhere in the middle, and various compromise models have been proposed. Nevertheless, it is instructive to examine what algorithmic advantage is provided by concurrent accesses to memory.<P>
<a name="0954_19db"><a name="0954_19dc">In this section, we shall show that there are problems on which a CRCW algorithm outperforms the best possible EREW algorithm. For the problem of finding the identities of the roots of trees in a forest, concurrent reads allow for a faster algorithm. For the problem of finding the maximum element in an array, concurrent writes permit a faster algorithm.<P>
<h2>A problem in which concurrent reads help</h2><P>
Suppose we are given a forest of binary trees in which each node <I>i</I> has a pointer <I>parent</I>[<I>i</I>] to its parent, and we wish each node to find the identity of the root of its tree. Associating processor <I>i </I>with each node <I>i</I> in a forest <I>F</I>, the following pointer-jumping algorithm stores the identity of the root of each node <I>i</I>'s tree in <I>root</I>[<I>i</I>].<P>
<pre><a name="0955_19dd">FIND-ROOTS(<I>F</I>)</sub></sup></pre><P>
<pre>1.<B> for</B> each processor <I>i</I>, in parallel</sub></sup></pre><P>
<pre>2.<B> do if</B> <I>parent</I>[<I>i</I>] = NIL</sub></sup></pre><P>
<pre>3.<B> then</B> <I>root</I>[i]<img src="../images/arrlt12.gif"> <I>i</I></sub></sup></pre><P>
<pre>4.<B> while</B> there exists a node <I>i</I> such that <I>parent</I>[<I>i</I>] <img src="../images/noteq.gif"> NIL</sub></sup></pre><P>
<pre>5.<B> do for</B> each processor <I>i</I>, in parallel</sub></sup></pre><P>
<pre>6.<B> do if</B> <I>parent</I>[<I>i</I>] <img src="../images/noteq.gif"> NIL</sub></sup></pre><P>
<pre>7.<B> then</B> <I>root</I>[<I>i</I>] <img src="../images/arrlt12.gif"><I> root</I>[<I>parent</I>[<I>i</I>]]</sub></sup></pre><P>
<pre>8. <I>parent</I>[<I>i</I>] <img src="../images/arrlt12.gif"> <I>parent</I>[<I>parent</I>[<I>i</I>]]</sub></sup></pre><P>
Figure 30.5 illustrates the operation of this algorithm. After the initialization performed by lines 1-3, shown in Figure 30.5(a), the only nodes that know the identities of their roots are the roots themselves. The <B>while </B>loop of lines 4-8 performs the pointer jumping and fills in the <I>root</I> fields. Figures 30.5(b)-(d) show the state of the forest after the first, second, and third iterations of the loop. As you can see, the algorithm maintains the invariant that if <I>parent</I>[<I>i</I>] = <FONT FACE="Courier New" SIZE=2>NIL</FONT>, then <I>root</I>[<I>i</I>] has been assigned the identity of the node's root.<P>
We claim that <FONT FACE="Courier New" SIZE=2>FIND</FONT>-<FONT FACE="Courier New" SIZE=2>ROOTS</FONT> is a CREW algorithm that runs in <I>O</I>(lg <I>d</I>) time, where <I>d</I> is the depth of the maximum-depth tree in the forest. The only writes occur on lines 3, 7, and 8, and these are all exclusive because in each one, processor <I>i</I> writes into only node <I>i</I>. The reads in lines 7-8 are concurrent, however, because several nodes may have pointers to the same node. In Figure 30.5(b), for example, we see that during the second iteration of the <B>while</B> loop, <I>root</I>[4] and <I>parent</I>[4] are read by processors 18, 2, and 7.<P>
The running time of <FONT FACE="Courier New" SIZE=2>FIND</FONT>-<FONT FACE="Courier New" SIZE=2>ROOTS</FONT> <I>O</I>(lg <I>d</I>) for essentially the same reason as for <FONT FACE="Courier New" SIZE=2>LIST</FONT>-<FONT FACE="Courier New" SIZE=2>RANK</FONT>: the length of each path is halved in each iteration. Figure 30.5 shows this characteristic plainly.<P>
How fast can <I>n</I> nodes in a forest determine the roots of their binary trees using only exclusive reads? A simple argument shows that <img src="../images/omega12.gif"><I></I>(lg <I>n)</I> time is required. The key observation is that when reads are exclusive, each step of the PRAM allows a given piece of information to be copied to at most one other memory location; thus the number of locations that can contain a given piece of information at most doubles with each step. Looking at a single tree, we have initially that at most 1 memory location stores the identity of the root. After 1 step, at most 2 locations can contain the identity of the root; after <I>k</I> steps, at most 2<I><SUP>k</I>-1</SUP> locations can contain the identity of the root. If the size of the tree is <img src="../images/bound.gif">(<I>n</I>), we need <img src="../images/bound.gif">(<I>n</I>) locations to contain the root's identity when the algorithm terminates; thus, <img src="../images/omega12.gif"><I></I>(lg <I>n</I>)<I> </I>steps are required in all.<P>
Whenever the depth <I>d</I> of the maximum-depth tree in the forest is 2<I><SUP>o</I>(lg <I>n</I>)</SUP>, the CREW algorithm <FONT FACE="Courier New" SIZE=2>FIND</FONT>-<FONT FACE="Courier New" SIZE=2>ROOTS</FONT> asymptotically outperforms any EREW algorithm. Specifically, for any <I>n</I>-node forest whose maximum-depth tree is a balanced binary tree with <img src="../images/bound.gif">(<I>n</I>) nodes, <I>d = O</I>(lg <I>n</I>), in which case <FONT FACE="Courier New" SIZE=2>FIND</FONT>-<FONT FACE="Courier New" SIZE=2>ROOTS </FONT>runs in <I>O</I>(lg lg <I>n</I>) time. Any EREW algorithm for this problem must run in <img src="../images/omega12.gif">(lg <I>n</I>) time, which is asymptotically slower. Thus, concurrent reads help for this problem. Exercise 30.2-1 gives a simpler scenario in which concurrent reads help.<P>
<img src="703_a.gif"><P>
<h4><a name="0955_19de">Figure 30.5 Finding the roots in a forest of binary trees on a CREW PRAM. Node numbers are next to the nodes, and stored root fields appear within nodes. The links represent parent pointers. (a)-(d) The state of the trees in the forest each time line 4 of <FONT FACE="Courier New" SIZE=2>FIND-ROOTS</FONT> is executed. Note that path lengths are halved in each iteration.<a name="0955_19de"></sub></sup></h4><P>
<P>
<h2>A problem in which concurrent writes help</h2><P>
<a name="0956_19de"><a name="0956_19df">To demonstrate that concurrent writes offer a performance advantage over exclusive writes, we examine the problem of finding the the maximum element in an array of real numbers. We shall see that any EREW algorithm for this problem takes <img src="../images/omega12.gif">(lg <I>n</I>) time and that no CREW algorithm does any better. The problem can be solved in <I>O</I>(1) time using a common-CRCW algorithm, in which when several processors write to the same location, they all write the same value.<P>
The CRCW algorithm that finds the maximum of <I>n</I> array elements assumes that the input array is <I>A</I>[0 . . <I>n</I>-1]. The algorithm uses <I>n</I><SUP>2</SUP> processors, with each processor comparing <I>A</I>[<I>i</I>] and <I>A</I>[<I>j</I>] for some <I>i</I> and <I>j</I> in the range 0 <img src="../images/lteq12.gif"> <I>i, j</I> <img src="../images/lteq12.gif"> <I>n</I> - 1. In effect, the algorithm performs a matrix of comparisons, and so we can view each of the <I>n</I><SUP>2</SUP> processors as having not only a one-dimensional index in the PRAM, but also a two-dimensional index (<I>i, j</I>).<P>
<pre><a name="0956_19e0">FAST-MAX(<I>A</I>)</sub></sup></pre><P>
<pre>1 <I>n</I> <img src="../images/arrlt12.gif"> <I>length</I>[<I>A</I>]</sub></sup></pre><P>
<pre>2<B> for</B> <I>i</I> <img src="../images/arrlt12.gif"> 0 <B>to</B> <I>n</I> - 1, in parallel</sub></sup></pre><P>
<pre>3<B> do</B> <I>m</I>[<I>i</I>] <img src="../images/arrlt12.gif"> TRUE</sub></sup></pre><P>
<pre>4<B> for</B><I> i</I> <img src="../images/arrlt12.gif"> 0 <B>to</B> <I>n</I> -1 and <I>j</I> <img src="../images/arrlt12.gif"> 0 <B>to</B> <I>n</I> - 1, in parallel</sub></sup></pre><P>
<pre>5<B> do if</B> <I>A</I>[<I>i</I>] < <I>A</I>[<I>j</I>]</sub></sup></pre><P>
<pre>6<B> then</B> <I>m</I>[<I>i</I>] <img src="../images/arrlt12.gif"> FALSE</sub></sup></pre><P>
<pre>7<B> for</B> <I>i</I> <img src="../images/arrlt12.gif"> 0 <B>to</B> <I>n</I> - 1, in parallel</sub></sup></pre><P>
<pre>8<B> do if</B> <I>m</I>[<I>i</I>] = TRUE</sub></sup></pre><P>
<pre>9<B> then</B> <I>max</I> <img src="../images/arrlt12.gif"> <I>A</I>[<I>i</I>]</sub></sup></pre><P>
<pre>10<B> return</B><I> max</I></sub></sup></pre><P>
Line 1 simply determines the length of the array <I>A</I>; it only needs to be executed on one processor, say processor 0. We use an array <I>m</I>[0 . . <I>n</I> -1], where processor <I>i</I> is responsible for <I>m</I>[<I>i</I>]. We want <I>m</I>[<I>i</I>] = <FONT FACE="Courier New" SIZE=2>TRUE</FONT> if and only if <I>A</I>[<I>i</I>] is the maximum value in array <I>A</I>. We start (lines 2-3) by believing that each array element is possibly the maximum, and we rely on comparisons in line 5 to determine which array elements are not the maximum.<P>
Figure 30.6 illustrates the remainder of the algorithm. In the loop of lines 4-6, we check each ordered pair of elements of array <I>A</I>. For each pair <I>A</I>[<I>i</I>] and <I>A</I>[<I>j</I>], line 5 checks whether <I>A</I>[<I>i</I>] < <I>A</I>[<I>j</I>]. If this comparison is <FONT FACE="Courier New" SIZE=2>TRUE</FONT>, we know that <I>A</I>[<I>i</I>] cannot be the maximum, and line 6 sets <I>m</I>[<I>i</I>] <img src="../images/arrlt12.gif"> <FONT FACE="Courier New" SIZE=2>FALSE</FONT> to record this fact. Several (<I>i, j</I>) pairs may be writing to <I>m</I>[<I>i</I>] simultaneously, but th ey all write the same value: <FONT FACE="Courier New" SIZE=2>FALSE</FONT>.<P>
<img src="705_a.gif"><P>
<h4><a name="0956_19e4">Figure 30.6 Finding the maximum of n values in O(1) time by the CRCW algorithm <FONT FACE="Courier New" SIZE=2>FAST-MAX<FONT FACE="Times New Roman" SIZE=2> </FONT></FONT>for each ordered pair of the elements in the input array A = <img src="../images/lftwdchv.gif">5, 6, 9, 2, 9<img src="../images/wdrtchv.gif">, the result of the comparison A[i] < A[j] is shown in the matrix, abbreviated T for <FONT FACE="Courier New" SIZE=2>TRUE</FONT> and F for <FONT FACE="Courier New" SIZE=2>FALSE</FONT>. For any row that contains a <FONT FACE="Courier New" SIZE=2>TRUE</FONT> value, the corresponding element of m, shown at the right, is set to <FONT FACE="Courier New" SIZE=2>FALSE</FONT>. Elements of m that contain <FONT FACE="Courier New" SIZE=2>TRUE</FONT> correspond to the maximum-valued elements of A. In this case, the value 9 is written into the variable max.<a name="0956_19e4"></sub></sup></h4><P>
After line 6 is executed, therefore, <I>m</I>[<I>i</I>] = <FONT FACE="Courier New" SIZE=2>TRUE</FONT> for exactly the indices <I>i </I>such that <I>A</I>[<I>i</I>] achieves the maximum. Lines 7-9 then put the maximum value into the variable <I>max</I>, which is returned in line 10. Several processors may write into the variable <I>max</I>, but if they do, they all write the same value, as is consistent with the common-CRCW PRAM model.<P>
Since all three "loops" in the algorithm are executed in parallel, <FONT FACE="Courier New" SIZE=2>FAST</FONT>-<FONT FACE="Courier New" SIZE=2>MAX</FONT> runs in <I>O</I>(1) time. Of course, it is not work-efficient, since it requires <I>n</I><SUP>2</SUP> processors, and the problem of finding the maximum number in an array can be solved by a <img src="../images/bound.gif">(<I>n</I>)-time serial algorithm. We can come closer to a work-efficient algorithm, however, as Exercise 30.2-6 asks you to show.<P>
<a name="0956_19e1"><a name="0956_19e2"><a name="0956_19e3">In a sense, the key to <FONT FACE="Courier New" SIZE=2>FAST</FONT>-<FONT FACE="Courier New" SIZE=2>MAX</FONT> is that a CRCW PRAM is capable of performing a boolean AND of <I>n</I> variables in <I>O</I>(1) time with <I>n</I> processors. (Since this capability holds in the common-CRCW model, it holds in the more powerful CRCW PRAM models as well.) The code actually performs several AND's at once, computing for <I>i</I> = 0, 1, . . . , <I>n</I> - 1,<P>
<img src="705_b.gif"><P>
which can be derived from DeMorgan's laws (5.2). This powerful AND capability can be used in other ways. For example, the capability of a CRCW PRAM to perform an AND in <I>O</I>(1) time obviates the need for a separate control network to test whether all processors are finished iterating a loop, such as we have assumed for EREW algorithms. The decision to finish the loop is simply the AND of all processors<FONT FACE="CG Times (W1)" SIZE=2>'</FONT> desires to finish the loop.<P>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -