⭐ 欢迎来到虫虫下载站! | 📦 资源下载 📁 资源专辑 ℹ️ 关于我们
⭐ 虫虫下载站

📄 s_ext.htm

📁 Data Structure Ebook
💻 HTM
字号:
<html>
<head>
<title>External Sorting</title>
</head>
<body bgcolor="#ffffff">

<p align=right>
<A HREF="s_man.htm" tppabs="http://www.ee.uwa.edu.au/~plsd210/ds/niemann/s_man.htm" target="_top"><IMG SRC="c_man.gif" tppabs="http://www.ee.uwa.edu.au/~plsd210/ds/niemann/c_man.gif" WIDTH=74 HEIGHT=19 BORDER=0></A>
</p>

<h1>External Sorting</h1>

One method for sorting a file is to load the file into memory, sort the data
in memory, then write the results.  When the file cannot be loaded into
memory due to resource limitations, an <em>external sort</em> applicable.  
We will
implement an external sort using <em>replacement selection</em> to establish initial 
runs, followed by a <em>polyphase merge sort</em> to merge the runs into
one sorted file.  I highly recommend you consult 
<a href="javascript:if(confirm('http://www.amazon.com/exec/obidos/ISBN=0201896850/none01A/  \n\nThis file was not retrieved by Teleport Pro, because it is addressed on a domain or path outside the boundaries set for its Starting Address.  \n\nDo you want to open it from the server?'))window.location='http://www.amazon.com/exec/obidos/ISBN=0201896850/none01A/'" tppabs="http://www.amazon.com/exec/obidos/ISBN=0201896850/none01A/" target="_top">Knuth [1998]</a>,
as many details have been
omitted.

<h2>Theory</h2>

For clarity, I'll assume that data is on one or more reels of magnetic tape.
Figure 4-1 illustrates a 3-way polyphase merge.  Initially, in phase A,
all data is on 
tapes T1 and T2.  Assume that the beginning of each tape is at the bottom of
the frame.  There are two sequential runs of data on T1: 4-8, and 6-7.
Tape T2 has one run: 5-9.  At phase B, we've merged the first run from tapes
T1 (4-8) and T2 (5-9) into a longer run on tape T3 (4-5-8-9).  Phase C is
simply renames the tapes, so we may repeat the merge again.  In phase D we
repeat the merge, with the final output on tape T3.
<p>

<center>
<p><a name="Fig4-1"></a>
<table border cellspacing=0 cellpadding=3>
<caption align=bottom><h3>Figure 4-1:  Merge Sort</h3></caption>
<tr><th>Phase</th><th>T1</th><th>T2</th><th>T3</th></tr>
<tr align=center>
  <td>A</td>
  <td valign=bottom>7<br>6<br>8<br>4</td>
  <td valign=bottom>9<br>5</td> 
  <td valign=bottom><br></td>
</tr>
<tr align=center>
  <td>B</td>
  <td valign=bottom>7<br>6</td>
  <td valign=bottom><br></td> 
  <td valign=bottom>9<br>8<br>5<br>4</td>
</tr>
<tr align=center>
  <td>C</td>
  <td valign=bottom>9<br>8<br>5<br>4</td>
  <td valign=bottom>7<br>6</td> 
  <td valign=bottom><br></td>
</tr>
<tr align=center>
  <td>D</td>
  <td valign=bottom><br></td>
  <td valign=bottom><br></td> 
  <td valign=bottom>9<br>8<br>7<br>6<br>5<br>4</td>
</tr>
</table>
</center>

<p>
Several interesting details have been omitted from the previous illustration.
For example, how were the initial runs created?  And, did you notice that 
they merged <em>perfectly</em>,
with no extra runs on any tapes?  Before I explain the 
method used for constructing initial runs, let me digress for a bit.

<p>
In 1202, Leonardo Fibonacci presented the following exercise in his
<em>Liber Abbaci</em> (Book of the Abacus):
"How many pairs of rabbits can be produced 
from a single pair in a year's time?"  We may assume that each pair produces 
a new pair of offspring every month, each pair becomes fertile at the age of
one month, and that rabbits never die.  After one month, there will be 2 pairs
of rabbits; after two months there will be 3; the following month the original
pair and the pair born during the first month will both usher in a new pair, 
and there will be 5 in all; and so on.  This series, where each number is the
sum of the two preceeding numbers, is known as the Fibonacci sequence:
<blockquote>
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, ... .
</blockquote>

Curiously, the Fibonacci series has found wide-spread application to 
everything from the arrangement of flowers on plants to studying the
efficiency of Euclid's algorithm.  There's even a 
<em>Fibonacci Quarterly</em> journal. 
And, as you might suspect, the Fibonacci series has something to do with 
establishing initial runs for external sorts.
<p>

Recall that we initially had one run on tape T2, and 2 runs on tape T1.
Note that the numbers {1,2} are two sequential numbers in the Fibonacci series.
After our first merge, we had one run on T1 and one run on T2.  Note that the 
numbers {1,1} are two sequential numbers in the Fibonacci series, only one 
<em>notch</em> down.  We could predict, in fact, that if we had 13 runs on T2, and 21
runs on T1 {13,21}, we would be left with 8 runs on T1 and 13 runs on 
T3 {8,13} after one pass.  Successive passes would result in
run counts of {5,8}, {3,5}, {2,3}, {1,1}, and
{0,1}, for a total of 7 passes.  This arrangement is ideal, and will result 
in the minimum number of passes.  Should data actually be on tape, this is 
a big savings, as tapes must be mounted and rewound for each pass.  For more
than 2 tapes, <em>higher-order</em> Fibonacci numbers are used.

<p>
Initially, all the data is on one tape.  The tape is read, and runs are
distributed to other tapes in the system.  After the initial runs are
created, they are merged as described above.  One method we could use
to create initial runs is to read a batch of records into memory, sort the 
records, and write them out.  This process would continue until we had 
exhausted the input tape.  An alternative algorithm,
<em>replacement selection</em>,
allows for longer runs.  A buffer is allocated in memory to act as a holding 
place for several records.  Initially, the buffer is filled.  Then, the 
following steps are repeated until the input is exhausted:
<ul>
<li>
Select the record with the smallest key that is >= the key of the 
last record written.
<li>
If all keys are smaller than the key of the last record written, 
then we have reached the end of a run.  Select the record with the
smallest key for the first record of the next run.
<li>
Write the selected record.
<li>
Replace the selected record with a new record from input.
</ul>

<p>
Figure 4-2 illustrates replacement selection for a small file.  The
beginning of the file is to the right of each frame.  To keep things 
simple, I've allocated a 2-record buffer.  Typically, such a buffer would
hold thousands of records.  We load the buffer in step B, and write the
record with the smallest key (6) in step C.  This is replaced with the 
next record (key 8).  We select the smallest key >= 6 in step D.  This is
key 7.  After writing key 7, we replace it with key 4.  This process repeats
until step F, where our last key written was 8, and all keys are less than 8.
At this point, we terminate the run, and start another.
<p>

<center>
<p><a name="Fig4-2"></a>
<table border cellspacing=0 cellpadding=3>
<caption align=bottom><h3>Figure 4-2:  Replacement Selection</h3></caption>
<tr><th>Step</th><th>Input</th><th>Buffer</th><th>Output</th></tr>
<tr align=center>
  <td align=center>A</td>
  <td align=left>5-3-4-8-6-7</td>
  <td align=left><br></td> 
  <td align=left><br></td>
</tr>
<tr align=center>
  <td align=center>B</td>
  <td align=left>5-3-4-8</td>
  <td align=left>6-7</td> 
  <td align=left><br></td>
</tr>
<tr align=center>
  <td align=center>C</td>
  <td align=left>5-3-4</td>
  <td align=left>8-7</td> 
  <td align=left>6</td>
</tr>
<tr align=center>
  <td align=center>D</td>
  <td align=left>5-3</td>
  <td align=left>8-4</td> 
  <td align=left>7-6</td>
</tr>
<tr align=center>
  <td align=center>E</td>
  <td align=left>5</td>
  <td align=left>3-4</td> 
  <td align=left>8-7-6</td>
</tr>
<tr align=center>
  <td align=center>F</td>
  <td align=left><br></td>
  <td align=left>5-4</td> 
  <td align=left>3 | 8-7-6</td>
</tr>
<tr align=center>
  <td align=center>G</td>
  <td align=left><br></td>
  <td align=left>5</td> 
  <td align=left>4-3 | 8-7-6</td>
</tr>
<tr align=center>
  <td align=center>H</td>
  <td align=left><br></td>
  <td align=left><br></td> 
  <td align=left>5-4-3 | 8-7-6</td>
</tr>
</table>
</center>

This strategy simply utilizes an intermediate buffer to hold values until
the appropriate time for output.  Using random numbers as input, the average 
length of a run is twice the length of the buffer.  However, if the data is
somewhat ordered, runs can be extremely long.  Thus, this method is more 
effective than doing partial sorts.
<p>
When selecting the next output record, we need to find the smallest key >=
the last key written.
One way to do this is to scan the entire list, searching for the 
appropriate key.  However, when the buffer holds thousands of records, 
execution time becomes prohibitive.  An alternative method is to use a 
binary tree structure, so that we only compare lg <i>n</i> items.

<h2>Implementation</h2>
An
<a href="javascript:if(confirm('http://www.ee.uwa.edu.au/~plsd210/ds/niemann/s_ext.txt  \n\nThis file was not retrieved by Teleport Pro, because it is linked too far away from its Starting Address. If you increase the in-domain depth setting for the Starting Address, this file will be queued for retrieval.  \n\nDo you want to open it from the server?'))window.location='http://www.ee.uwa.edu.au/~plsd210/ds/niemann/s_ext.txt'" tppabs="http://www.ee.uwa.edu.au/~plsd210/ds/niemann/s_ext.txt">ANSI-C implementation</a>
of an external sort is included.
Function <b>makeRuns</b> calls <b>readRec</b> to read the next record. 
Function <b>readRec</b> employs the replacement selection algorithm
(utilizing a binary tree) to fetch the next record, and <b>makeRuns</b>
distributes
the records in a Fibonacci distribution.  If the number of runs is not a
perfect Fibonacci number, <em>dummy</em> runs are simulated at the 
beginning of each
file.  Function <b>mergeSort</b> is then called to do a polyphase
merge sort on the runs.

</body>
</html>

⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -