<HTML><HEAD>
<TITLE>Intro to Algorithms: CHAPTER 30: ALGORITHMS FOR PARALLEL COMPUTERS</TITLE></HEAD><BODY BGCOLOR="#FFFFFF">
<a href="chap31.htm"><img align=right src="../../images/next.gif" alt="Next Chapter" border=0></A>
<a href="toc.htm"><img align=right src="../../images/toc.gif" alt="Return to Table of Contents" border=0></A>
<a href="chap29.htm"><img align=right src="../../images/prev.gif" alt="Previous Chapter" border=0></A>
<h1><a name="094b_19c7">CHAPTER 30: ALGORITHMS FOR PARALLEL COMPUTERS<a name="094b_19c7"></h1><P>
<a name="094b_19b9">As parallel-processing computers have proliferated, interest has increased in <I><B>parallel algorithms</I></B><I>:</I> algorithms that perform more than one operation at a time. The study of parallel algorithms has now developed into a research area in its own right. Indeed, parallel algorithms have been developed for many of the problems we have solved in this text using ordinary serial algorithms. In this chapter, we shall describe a few simple parallel algorithms that illustrate fundamental issues and techniques.<P>
In order to study parallel algorithms, we must choose an appropriate model for parallel computing. The random-access machine, or RAM, which we have used throughout most of this book, is, of course, serial rather than parallel. The parallel models we have studied--sorting networks (Chapter 28) and circuits (Chapter 29)--are too restrictive for investigating, for example, algorithms on data structures.<P>
The parallel algorithms in this chapter are presented in terms of one popular theoretical model: the parallel random-access machine, or PRAM (pronounced "PEE-ram"). Many parallel algorithms for arrays, lists, trees, and graphs can be easily described in the PRAM model. Although the PRAM ignores many important aspects of real parallel machines, the essential attributes of parallel algorithms tend to transcend the models for which they are designed. If one PRAM algorithm outperforms another PRAM algorithm, the relative performance is not likely to change substantially when both algorithms are adapted to run on a real parallel computer.<P>
<h3>The PRAM model</h3><P>
<a name="094b_19ba"><a name="094b_19bb"><a name="094b_19bc">Figure 30.1 shows the basic architecture of the <I><B>parallel random-access machine (PRAM)</I></B>. There are <I>p</I> ordinary (serial) processors <I>P</I><SUB>0</SUB>,<I> P</I><SUB>1</SUB>,<I> . . . </I>,<I> P<SUB>p</I>-1</SUB> that have as storage a shared, global memory. All processors can read from or write to the global memory "in parallel" (at the same time). The processors can also perform various arithmetic and logical operations in parallel.<P>
The key assumption regarding algorithmic performance in the PRAM model is that running time can be measured as the number of parallel memory accesses an algorithm performs. This assumption is a straightforward generalization of the ordinary RAM model, in which the number of memory accesses is asymptotically as good as any other measure of running time. This simple assumption will serve us well in our survey of parallel algorithms, even though real parallel computers cannot perform parallel accesses to global memory in unit time: the time for a memory access grows with the number of processors in the parallel computer.<P>
<img src="689_a.gif"><P>
<h4><a name="094b_19c8">Figure 30.1 The basic architecture of the PRAM. There are p processors P<SUB>0</SUB>, P<SUB>1</SUB>, . . ., P<SUB>p </SUB>-<SUB> 1</SUB> connected to a shared memory. Each processor can access an arbitrary word of shared memory in unit time.<a name="094b_19c8"></sub></sup></h4><P>
Nevertheless, for parallel algorithms that access data in an arbitrary fashion, the assumption of unit-time memory operations can be justified. Real parallel machines typically have a communication network that can support the abstraction of a global memory. Accessing data through the network is a relatively slow operation in comparison with arithmetic and other operations. Thus, counting the number of parallel memory accesses executed by two parallel algorithms does, in fact, yield a fairly accurate estimate of their relative performances. The principal way in which real machines violate the unit-time abstraction of the PRAM is that some memory-access patterns are faster than others. As a first approximation, however, the unit-time assumption in the PRAM model is quite reasonable.<P>
The running time of a parallel algorithm depends on the number of processors executing the algorithm as well as the size of the problem input. Generally, therefore, we must discuss both time and processor count when analyzing PRAM algorithms; this contrasts with serial algorithms, in whose analysis we have focused mainly on time. Typically, there is a trade-off between the number of processors used by an algorithm and its running time. Section 30.3 discusses these trade-offs.<P>
<h3>Concurrent versus exclusive memory accesses</h3><P>
<a name="094b_19bd"><a name="094b_19be"><a name="094b_19bf"><a name="094b_19c0">A <I><B>concurrent-read</I></B> algorithm is a PRAM algorithm during whose execution multiple processors can read from the same location of shared memory at the same time. An <I><B>exclusive-read</I></B> algorithm is a PRAM algorithm in which no two processors ever read the same memory location at the same time. We make a similar distinction with respect to whether or not multiple processors can write into the same memory location at the same time, dividing PRAM algorithms into <I><B>concurrent-write</I></B> and <I><B>exclusive-write</I></B> algorithms. Commonly used abbreviations for the types of algorithms we encounter are<P>
<a name="094b_19c1"><FONT FACE="Courier New" SIZE=2><img src="../images/dot12.gif"> <I><B>EREW</I></B></FONT>: exclusive read and exclusive write,<P>
<a name="094b_19c2"><FONT FACE="Courier New" SIZE=2><img src="../images/dot12.gif"> <I><B>CREW</I></B></FONT>: concurrent read and exclusive write,<P>
<a name="094b_19c3"><FONT FACE="Courier New" SIZE=2><img src="../images/dot12.gif"> <I><B>ERCW</I></B></FONT>: exclusive read and concurrent write, and<P>
<a name="094b_19c4"><a name="094b_19c5"><FONT FACE="Courier New" SIZE=2><img src="../images/dot12.gif"> <I><B>CRCW</I></B></FONT>: concurrent read and concurrent write.<P>
(These abbreviations are usually pronounced not as words but rather as strings of letters.)<P>
Of these types of algorithms, the extremes--EREW and CRCW--are the most popular. A PRAM that supports only EREW algorithms is called an <I><B>EREW PRAM</I></B>, and one that supports CRCW algorithms is called a <I><B>CRCW PRAM</I></B>. A CRCW PRAM can, of course, execute EREW algorithms, but an EREW PRAM cannot directly support the concurrent memory accesses required in CRCW algorithms. The underlying hardware of an EREW PRAM is relatively simple, and therefore fast, because it needn't handle conflicting memory reads and writes. A CRCW PRAM requires more hardware support if the unit-time assumption is to provide a reasonably accurate measure of algorithmic performance, but it provides a programming model that is arguably more straightforward than that of an EREW PRAM.<P>
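To see why the CRCW programming model can be simpler, consider broadcasting one value to all <I>p</I> processors (an illustration, not an example from the text): with concurrent reads, every processor reads the value in a single step, whereas an exclusive-read algorithm must fan the value out by repeated doubling, using about lg <I>p</I> steps. The sketch below simulates both disciplines serially; the function names and step bookkeeping are assumptions made only for illustration.<P>
<PRE>
# A minimal serial simulation (illustration only, not from the text) of how
# p processors each obtain a copy of one shared value x under two PRAM
# disciplines.  One pass of the inner loop stands for one parallel step.

def broadcast_concurrent_read(x, p):
    """CREW/CRCW style: in a single parallel step every processor reads x."""
    local = [None] * p
    for i in range(p):              # one parallel step, simulated by a loop
        local[i] = x
    return local, 1                 # 1 parallel step

def broadcast_exclusive_read(x, p):
    """EREW style: fan x out by repeated doubling; no cell is read twice."""
    shared = [None] * p
    shared[0] = x
    steps = 0
    filled = 1                      # number of cells that already hold x
    while filled != p:
        # One parallel step: processor i (for i = 0 ... filled-1) copies
        # shared[i] into the distinct cell shared[i + filled].
        for i in range(min(filled, p - filled)):
            shared[i + filled] = shared[i]
        filled = min(2 * filled, p)
        steps += 1
    return shared, steps            # about ceil(lg p) parallel steps

if __name__ == "__main__":
    _, t1 = broadcast_concurrent_read(42, 8)
    _, t2 = broadcast_exclusive_read(42, 8)
    print(t1, t2)                   # prints: 1 3
</PRE><P>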
Of the remaining two algorithmic types--CREW and ERCW--more attention has been paid in the literature to the CREW. From a practical point of view, however, supporting concurrency for writes is no harder than supporting concurrency for reads. In this chapter, we shall generally treat an algorithm as being CRCW if it contains either concurrent reads or concurrent writes, without making further distinctions. We discuss the finer points of this distinction in Section 30.2.<P>
<a name="094b_19c6">When multiple processors write to the same location in a CRCW algorithm, the effect of the parallel write is not well defined without additional elaboration. In this chapter, we shall use the <I><B>common-CRCW</I></B> model: when several processors write into the same memory location, they must all write a common (the same) value. There are several alternative types of PRAM's in the literature that handle this problem with a different assumption. Other choices include<P>
<FONT FACE="Courier New" SIZE=2><img src="../images/dot12.gif"> <I><B>arbitrary</I></B><I>:</I></FONT> an arbitrary value from among those written is actually stored,<P>
<FONT FACE="Courier New" SIZE=2><img src="../images/dot12.gif"> <I><B>priority</I></B><I>:</I></FONT> the value written by the lowest-indexed processor is stored, and<P>
<FONT FACE="Courier New" SIZE=2><img src="../images/dot12.gif"> <I><B>combining</I></B><I>:</I></FONT> the value stored is some specified combination of the values written.<P>
In the last case, the specified combination is typically some associative and commutative function such as addition (store the sum of all the values written) or maximum (store only the maximum value written).<P>
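As an illustration (not part of the text), the following sketch shows how each of these conflict-resolution rules determines the value stored in a cell when several processors write to it in the same step; the combining rule is shown with a caller-supplied function such as sum or max.<P>
<PRE>
# Illustration only: resolving simultaneous writes, given as
# (processor_index, value) pairs aimed at one memory cell, under the
# CRCW variants described above.

def resolve_writes(writes, model, combine=sum):
    """writes is a list of (processor_index, value) pairs for one cell."""
    values = [v for _, v in writes]
    if model == "common":
        # All processors must write the same value; otherwise the program
        # is illegal in the common-CRCW model.
        assert len(set(values)) == 1, "common-CRCW requires a common value"
        return values[0]
    if model == "arbitrary":
        return values[0]            # any one written value may survive
    if model == "priority":
        return min(writes)[1]       # value from the lowest-indexed processor
    if model == "combining":
        return combine(values)      # e.g. sum or max of all written values
    raise ValueError("unknown CRCW model")

if __name__ == "__main__":
    writes = [(3, 7), (0, 2), (5, 9)]
    print(resolve_writes(writes, "priority"))           # 2
    print(resolve_writes(writes, "combining", max))     # 9
    print(resolve_writes([(1, 4), (2, 4)], "common"))   # 4
</PRE><P>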
<h3>Synchronization and control</h3><P>
PRAM algorithms must be highly synchronized to work correctly. How is this synchronization achieved? Also, the processors in PRAM algorithms must often detect termination of loop conditions that depend on the state of all processors. How is this control function implemented?<P>
We won't discuss these issues extensively. Many real parallel computers employ a control network connecting the processors that helps with synchronization and termination conditions. Typically, the control network can implement these functions as fast as a routing network can implement global memory references.<P>
For our purposes, it suffices to assume that the processors are inherently tightly synchronized. All processors execute the same statements at the same time. No processor races ahead while others are further back in the code. As we go through our first parallel algorithm, we shall point out where we assume that processors are synchronized.<P>
For detecting the termination of a parallel loop that depends on the state of all processors, we shall assume that a parallel termination condition can be tested through the control network in <I>O</I>(1) time. Some EREW PRAM models in the literature do not make this assumption, and the (logarithmic) time for testing the loop condition must be included in the overall running time (see Exercise 30.1-8). As we shall see in Section 30.2, CRCW PRAM's do not need a control network to test termination: they can detect termination of a parallel loop in <I>O</I>(1) time through the use of concurrent writes.<P>
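To make the concurrent-write trick concrete, here is an illustrative serial simulation (not the text's code) of an <I>O</I>(1) termination test: a shared flag is cleared, every processor that still wants to iterate writes the common value 1 into it, and all processors then read the flag.<P>
<PRE>
# Illustration only: an O(1) termination test for a parallel loop on a
# common-CRCW PRAM, simulated serially.  done[i] is True once processor i
# has met its local stopping condition.

def loop_should_continue(done):
    flag = 0                        # one shared cell, cleared in one step
    for i in range(len(done)):      # one parallel step of concurrent writes
        if not done[i]:
            flag = 1                # every writer writes the common value 1
    return flag == 1                # one more step: all processors read flag

if __name__ == "__main__":
    print(loop_should_continue([True, False, True]))    # True: keep looping
    print(loop_should_continue([True, True, True]))     # False: terminate
</PRE><P>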
<h3>Chapter outline</h3><P>
Section 30.1 introduces the technique of pointer jumping, which provides a fast way to manipulate lists in parallel. We show how pointer jumping can be used to perform prefix computations on lists and how fast algorithms on lists can be adapted for use on trees. Section 30.2 discusses the relative power of CRCW and EREW algorithms and shows that concurrent memory accessing provides increased power.<P>
Section 30.3 presents Brent's theorem, which shows how combinational circuits can be efficiently simulated by PRAM's. The section also discusses the important issue of work efficiency and gives conditions under which a <I>p</I>-processor PRAM algorithm can be efficiently translated into a <I>p</I>'-processor PRAM algorithm for any <I>p</I>' < <I>p</I>. Section 30.4 reprises the problem of performing a prefix computation on a linked list and shows how a randomized algorithm can perform the computation in a work-efficient fashion. Finally, Section 30.5 shows how symmetry can be broken in parallel in much less than logarithmic time using a deterministic algorithm.<P>
The parallel algorithms in this chapter have been drawn principally from the area of graph theory. They represent only a scant selection of the present array of parallel algorithms. The techniques introduced in this chapter, however, are quite representative of the techniques used for parallel algorithms in other areas of computer science.<P>
<h1><a name="094d_19c9">30.1 Pointer jumping<a name="094d_19c9"></h1><P>
<a name="094d_19c7"><a name="094d_19c8">Among the more interesting PRAM algorithms are those that involve pointers. In this section, we investigate a powerful technique called pointer jumping, which yields fast algorithms for operating on lists. Specifically, we introduce an <I>O</I>(lg <I>n</I>)-time algorithm that computes the distance to the end of the list for each object in an <I>n</I>-object list. We then modify this algorithm to perform a "parallel prefix" computation on an <I>n</I>-object list in <I>O</I>(lg <I>n</I>) time. Finally, we investigate a technique that allows many problems on trees to be converted to list problems, which can then be solved by pointer jumping. All of the algorithms in this section are EREW algorithms: no concurrent accesses to global memory are required.<P>
<h2><a name="094e_19cd">30.1.1 List ranking<a name="094e_19cd"></h2><P>
<a name="094e_19c9"><a name="094e_19ca"><a name="094e_19cb">Our first parallel algorithm operates on lists. We can store a list in a PRAM much as we store lists in an ordinary RAM. To operate on list objects in parallel, however, it is convenient to assign a "responsible" processor to each object. We shall assume that there are as many processors as list objects, and that the <I>i</I>th processor is responsible for the <I>i</I>th object. Figure 30.2(a), for example, shows a linked list consisting of the sequence of objects <img src="../images/lftwdchv.gif">3,4,6,1,0,5<img src="../images/wdrtchv.gif">. Since there is one processor per list object, every object in the list can be operated on by its responsible processor in <I>O</I>(1) time.<P>
Suppose that we are given a singly linked list <I>L</I> with <I>n</I> objects and wish to compute, for each object in <I>L</I>, its distance from the end of the list. More formally, if <I>next</I> is the pointer field, we wish to compute a value <I>d</I>[<I>i</I>] for each object <I>i</I> in the list such that<P>
<img src="692_a.gif"><P>
We call the problem of computing the <I>d</I> values the <I><B>list-ranking problem.</I></B><P>
One solution to the list-ranking problem is simply to propagate distances back from the end of the list. This method takes <img src="../images/bound.gif">(<I>n</I>) time, since the <I>k</I>th object from the end must wait for the <I>k</I> - 1 objects following it to determine their distances from the end before it can determine its own. This solution is essentially a serial algorithm.<P>
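To fix ideas, here is a sketch of that serial solution. It assumes an array representation in which <I>next</I>[<I>i</I>] holds the index of the object following object <I>i</I> (NIL is represented by None), and it treats the numbers in the sequence of Figure 30.2(a) as array indices; these representation details are assumptions made only for illustration.<P>
<PRE>
# Illustration only: the Theta(n) serial solution, which walks the list from
# the head and then assigns distances from the tail back toward the head.

NIL = None

def serial_list_rank(next_ptr, head):
    """next_ptr[i] is the index of the object after object i, or NIL."""
    # Record the objects in list order.
    order = []
    i = head
    while i is not NIL:
        order.append(i)
        i = next_ptr[i]
    # The last object gets distance 0, the one before it 1, and so on.
    d = [0] * len(next_ptr)
    for rank, obj in enumerate(reversed(order)):
        d[obj] = rank
    return d

if __name__ == "__main__":
    # Assumed encoding of the list 3 -> 4 -> 6 -> 1 -> 0 -> 5 of Figure
    # 30.2(a): object 3 is the head and object 5 is the tail.
    next_ptr = [5, 0, NIL, 4, 6, NIL, 1]
    print(serial_list_rank(next_ptr, head=3))    # d[3]=5, d[4]=4, ..., d[5]=0
</PRE><P>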
<img src="693_a.gif"><P>
<h4><a name="094e_19ce">Figure 30.2 Finding the distance from each object in an n-object list to the end of the list in O(lg n) time using pointer jumping. (a) A linked list represented in a PRAM with d values initialized. At the end of the algorithm, each d value holds the distance of its object from the end of the list. Each object's responsible processor appears above the object. (b)-(d) The pointers and d values after each iteration of the while loop in the algorithm <FONT FACE="Courier New" SIZE=2>LIST<FONT FACE="Times New Roman" SIZE=2>-R<FONT FACE="Courier New" SIZE=2>ANK.<a name="094e_19ce"></FONT></FONT></FONT></sub></sup></h4><P>