tuplesort.c

来自「postgresql8.3.4源码,开源数据库」· C语言 代码 · 共 1,982 行 · 第 1/5 页

C
1,982
字号
/*------------------------------------------------------------------------- * * tuplesort.c *	  Generalized tuple sorting routines. * * This module handles sorting of heap tuples, index tuples, or single * Datums (and could easily support other kinds of sortable objects, * if necessary).  It works efficiently for both small and large amounts * of data.  Small amounts are sorted in-memory using qsort().	Large * amounts are sorted using temporary files and a standard external sort * algorithm. * * See Knuth, volume 3, for more than you want to know about the external * sorting algorithm.  We divide the input into sorted runs using replacement * selection, in the form of a priority tree implemented as a heap * (essentially his Algorithm 5.2.3H), then merge the runs using polyphase * merge, Knuth's Algorithm 5.4.2D.  The logical "tapes" used by Algorithm D * are implemented by logtape.c, which avoids space wastage by recycling * disk space as soon as each block is read from its "tape". * * We do not form the initial runs using Knuth's recommended replacement * selection data structure (Algorithm 5.4.1R), because it uses a fixed * number of records in memory at all times.  Since we are dealing with * tuples that may vary considerably in size, we want to be able to vary * the number of records kept in memory to ensure full utilization of the * allowed sort memory space.  So, we keep the tuples in a variable-size * heap, with the next record to go out at the top of the heap.  Like * Algorithm 5.4.1R, each record is stored with the run number that it * must go into, and we use (run number, key) as the ordering key for the * heap.  When the run number at the top of the heap changes, we know that * no more records of the prior run are left in the heap. * * The approximate amount of memory allowed for any one sort operation * is specified in kilobytes by the caller (most pass work_mem).  Initially, * we absorb tuples and simply store them in an unsorted array as long as * we haven't exceeded workMem.  If we reach the end of the input without * exceeding workMem, we sort the array using qsort() and subsequently return * tuples just by scanning the tuple array sequentially.  If we do exceed * workMem, we construct a heap using Algorithm H and begin to emit tuples * into sorted runs in temporary tapes, emitting just enough tuples at each * step to get back within the workMem limit.  Whenever the run number at * the top of the heap changes, we begin a new run with a new output tape * (selected per Algorithm D).	After the end of the input is reached, * we dump out remaining tuples in memory into a final run (or two), * then merge the runs using Algorithm D. * * When merging runs, we use a heap containing just the frontmost tuple from * each source run; we repeatedly output the smallest tuple and insert the * next tuple from its source tape (if any).  When the heap empties, the merge * is complete.  The basic merge algorithm thus needs very little memory --- * only M tuples for an M-way merge, and M is constrained to a small number. * However, we can still make good use of our full workMem allocation by * pre-reading additional tuples from each source tape.  Without prereading, * our access pattern to the temporary file would be very erratic; on average * we'd read one block from each of M source tapes during the same time that * we're writing M blocks to the output tape, so there is no sequentiality of * access at all, defeating the read-ahead methods used by most Unix kernels. * Worse, the output tape gets written into a very random sequence of blocks * of the temp file, ensuring that things will be even worse when it comes * time to read that tape.	A straightforward merge pass thus ends up doing a * lot of waiting for disk seeks.  We can improve matters by prereading from * each source tape sequentially, loading about workMem/M bytes from each tape * in turn.  Then we run the merge algorithm, writing but not reading until * one of the preloaded tuple series runs out.	Then we switch back to preread * mode, fill memory again, and repeat.  This approach helps to localize both * read and write accesses. * * When the caller requests random access to the sort result, we form * the final sorted run on a logical tape which is then "frozen", so * that we can access it randomly.	When the caller does not need random * access, we return from tuplesort_performsort() as soon as we are down * to one run per logical tape.  The final merge is then performed * on-the-fly as the caller repeatedly calls tuplesort_getXXX; this * saves one cycle of writing all the data out to disk and reading it in. * * Before Postgres 8.2, we always used a seven-tape polyphase merge, on the * grounds that 7 is the "sweet spot" on the tapes-to-passes curve according * to Knuth's figure 70 (section 5.4.2).  However, Knuth is assuming that * tape drives are expensive beasts, and in particular that there will always * be many more runs than tape drives.	In our implementation a "tape drive" * doesn't cost much more than a few Kb of memory buffers, so we can afford * to have lots of them.  In particular, if we can have as many tape drives * as sorted runs, we can eliminate any repeated I/O at all.  In the current * code we determine the number of tapes M on the basis of workMem: we want * workMem/M to be large enough that we read a fair amount of data each time * we preread from a tape, so as to maintain the locality of access described * above.  Nonetheless, with large workMem we can have many tapes. * * * Portions Copyright (c) 1996-2008, PostgreSQL Global Development Group * Portions Copyright (c) 1994, Regents of the University of California * * IDENTIFICATION *	  $PostgreSQL: pgsql/src/backend/utils/sort/tuplesort.c,v 1.81 2008/01/01 19:45:55 momjian Exp $ * *------------------------------------------------------------------------- */#include "postgres.h"#include <limits.h>#include "access/heapam.h"#include "access/nbtree.h"#include "catalog/pg_amop.h"#include "catalog/pg_operator.h"#include "commands/tablespace.h"#include "miscadmin.h"#include "utils/datum.h"#include "utils/logtape.h"#include "utils/lsyscache.h"#include "utils/memutils.h"#include "utils/pg_rusage.h"#include "utils/syscache.h"#include "utils/tuplesort.h"/* GUC variables */#ifdef TRACE_SORTbool		trace_sort = false;#endif#ifdef DEBUG_BOUNDED_SORTbool		optimize_bounded_sort = true;#endif/* * The objects we actually sort are SortTuple structs.	These contain * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple), * which is a separate palloc chunk --- we assume it is just one chunk and * can be freed by a simple pfree().  SortTuples also contain the tuple's * first key column in Datum/nullflag format, and an index integer. * * Storing the first key column lets us save heap_getattr or index_getattr * calls during tuple comparisons.	We could extract and save all the key * columns not just the first, but this would increase code complexity and * overhead, and wouldn't actually save any comparison cycles in the common * case where the first key determines the comparison result.  Note that * for a pass-by-reference datatype, datum1 points into the "tuple" storage. * * When sorting single Datums, the data value is represented directly by * datum1/isnull1.	If the datatype is pass-by-reference and isnull1 is false, * then datum1 points to a separately palloc'd data value that is also pointed * to by the "tuple" pointer; otherwise "tuple" is NULL. * * While building initial runs, tupindex holds the tuple's run number.  During * merge passes, we re-use it to hold the input tape number that each tuple in * the heap was read from, or to hold the index of the next tuple pre-read * from the same tape in the case of pre-read entries.	tupindex goes unused * if the sort occurs entirely in memory. */typedef struct{	void	   *tuple;			/* the tuple proper */	Datum		datum1;			/* value of first key column */	bool		isnull1;		/* is first key column NULL? */	int			tupindex;		/* see notes above */} SortTuple;/* * Possible states of a Tuplesort object.  These denote the states that * persist between calls of Tuplesort routines. */typedef enum{	TSS_INITIAL,				/* Loading tuples; still within memory limit */	TSS_BOUNDED,				/* Loading tuples into bounded-size heap */	TSS_BUILDRUNS,				/* Loading tuples; writing to tape */	TSS_SORTEDINMEM,			/* Sort completed entirely in memory */	TSS_SORTEDONTAPE,			/* Sort completed, final run is on tape */	TSS_FINALMERGE				/* Performing final merge on-the-fly */} TupSortStatus;/* * Parameters for calculation of number of tapes to use --- see inittapes() * and tuplesort_merge_order(). * * In this calculation we assume that each tape will cost us about 3 blocks * worth of buffer space (which is an underestimate for very large data * volumes, but it's probably close enough --- see logtape.c). * * MERGE_BUFFER_SIZE is how much data we'd like to read from each input * tape during a preread cycle (see discussion at top of file). */#define MINORDER		6		/* minimum merge order */#define TAPE_BUFFER_OVERHEAD		(BLCKSZ * 3)#define MERGE_BUFFER_SIZE			(BLCKSZ * 32)/* * Private state of a Tuplesort operation. */struct Tuplesortstate{	TupSortStatus status;		/* enumerated value as shown above */	int			nKeys;			/* number of columns in sort key */	bool		randomAccess;	/* did caller request random access? */	bool		bounded;		/* did caller specify a maximum number of								 * tuples to return? */	bool		boundUsed;		/* true if we made use of a bounded heap */	int			bound;			/* if bounded, the maximum number of tuples */	long		availMem;		/* remaining memory available, in bytes */	long		allowedMem;		/* total memory allowed, in bytes */	int			maxTapes;		/* number of tapes (Knuth's T) */	int			tapeRange;		/* maxTapes-1 (Knuth's P) */	MemoryContext sortcontext;	/* memory context holding all sort data */	LogicalTapeSet *tapeset;	/* logtape.c object for tapes in a temp file */	/*	 * These function pointers decouple the routines that must know what kind	 * of tuple we are sorting from the routines that don't need to know it.	 * They are set up by the tuplesort_begin_xxx routines.	 *	 * Function to compare two tuples; result is per qsort() convention, ie:	 * <0, 0, >0 according as a<b, a=b, a>b.  The API must match	 * qsort_arg_comparator.	 */	int			(*comparetup) (const SortTuple *a, const SortTuple *b,										   Tuplesortstate *state);	/*	 * Function to copy a supplied input tuple into palloc'd space and set up	 * its SortTuple representation (ie, set tuple/datum1/isnull1).  Also,	 * state->availMem must be decreased by the amount of space used for the	 * tuple copy (note the SortTuple struct itself is not counted).	 */	void		(*copytup) (Tuplesortstate *state, SortTuple *stup, void *tup);	/*	 * Function to write a stored tuple onto tape.	The representation of the	 * tuple on tape need not be the same as it is in memory; requirements on	 * the tape representation are given below.  After writing the tuple,	 * pfree() the out-of-line data (not the SortTuple struct!), and increase	 * state->availMem by the amount of memory space thereby released.	 */	void		(*writetup) (Tuplesortstate *state, int tapenum,										 SortTuple *stup);	/*	 * Function to read a stored tuple from tape back into memory. 'len' is	 * the already-read length of the stored tuple.  Create a palloc'd copy,	 * initialize tuple/datum1/isnull1 in the target SortTuple struct, and	 * decrease state->availMem by the amount of memory space consumed.	 */	void		(*readtup) (Tuplesortstate *state, SortTuple *stup,										int tapenum, unsigned int len);	/*	 * Function to reverse the sort direction from its current state. (We	 * could dispense with this if we wanted to enforce that all variants	 * represent the sort key information alike.)	 */	void		(*reversedirection) (Tuplesortstate *state);	/*	 * This array holds the tuples now in sort memory.	If we are in state	 * INITIAL, the tuples are in no particular order; if we are in state	 * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS	 * and FINALMERGE, the tuples are organized in "heap" order per Algorithm	 * H.  (Note that memtupcount only counts the tuples that are part of the	 * heap --- during merge passes, memtuples[] entries beyond tapeRange are	 * never in the heap and are used to hold pre-read tuples.)  In state	 * SORTEDONTAPE, the array is not used.	 */	SortTuple  *memtuples;		/* array of SortTuple structs */	int			memtupcount;	/* number of tuples currently present */	int			memtupsize;		/* allocated length of memtuples array */	/*	 * While building initial runs, this is the current output run number	 * (starting at 0).  Afterwards, it is the number of initial runs we made.	 */	int			currentRun;	/*	 * Unless otherwise noted, all pointer variables below are pointers to	 * arrays of length maxTapes, holding per-tape data.	 */	/*	 * These variables are only used during merge passes.  mergeactive[i] is	 * true if we are reading an input run from (actual) tape number i and	 * have not yet exhausted that run.  mergenext[i] is the memtuples index	 * of the next pre-read tuple (next to be loaded into the heap) for tape	 * i, or 0 if we are out of pre-read tuples.  mergelast[i] similarly	 * points to the last pre-read tuple from each tape.  mergeavailslots[i]	 * is the number of unused memtuples[] slots reserved for tape i, and	 * mergeavailmem[i] is the amount of unused space allocated for tape i.	 * mergefreelist and mergefirstfree keep track of unused locations in the	 * memtuples[] array.  The memtuples[].tupindex fields link together	 * pre-read tuples for each tape as well as recycled locations in	 * mergefreelist. It is OK to use 0 as a null link in these lists, because	 * memtuples[0] is part of the merge heap and is never a pre-read tuple.	 */	bool	   *mergeactive;	/* active input run source? */	int		   *mergenext;		/* first preread tuple for each source */	int		   *mergelast;		/* last preread tuple for each source */	int		   *mergeavailslots;	/* slots left for prereading each tape */	long	   *mergeavailmem;	/* availMem for prereading each tape */	int			mergefreelist;	/* head of freelist of recycled slots */	int			mergefirstfree; /* first slot never used in this merge */	/*	 * Variables for Algorithm D.  Note that destTape is a "logical" tape	 * number, ie, an index into the tp_xxx[] arrays.  Be careful to keep	 * "logical" and "actual" tape numbers straight!	 */	int			Level;			/* Knuth's l */	int			destTape;		/* current output tape (Knuth's j, less 1) */	int		   *tp_fib;			/* Target Fibonacci run counts (A[]) */	int		   *tp_runs;		/* # of real runs on each tape */	int		   *tp_dummy;		/* # of dummy runs for each tape (D[]) */	int		   *tp_tapenum;		/* Actual tape numbers (TAPE[]) */	int			activeTapes;	/* # of active input tapes in merge pass */	/*	 * These variables are used after completion of sorting to keep track of	 * the next tuple to return.  (In the tape case, the tape's current read	 * position is also critical state.)	 */	int			result_tape;	/* actual tape number of finished output */	int			current;		/* array index (only used if SORTEDINMEM) */	bool		eof_reached;	/* reached EOF (needed for cursors) */	/* markpos_xxx holds marked position for mark and restore */	long		markpos_block;	/* tape block# (only used if SORTEDONTAPE) */	int			markpos_offset; /* saved "current", or offset in tape block */	bool		markpos_eof;	/* saved "eof_reached" */	/*	 * These variables are specific to the MinimalTuple case; they are set by	 * tuplesort_begin_heap and used only by the MinimalTuple routines.	 */	TupleDesc	tupDesc;	ScanKey		scanKeys;		/* array of length nKeys */	/*	 * These variables are specific to the IndexTuple case; they are set by	 * tuplesort_begin_index and used only by the IndexTuple routines.	 */	Relation	indexRel;	ScanKey		indexScanKey;	bool		enforceUnique;	/* complain if we find duplicate tuples */	/*	 * These variables are specific to the Datum case; they are set by	 * tuplesort_begin_datum and used only by the DatumTuple routines.	 */	Oid			datumType;	FmgrInfo	sortOpFn;		/* cached lookup data for sortOperator */	int			sortFnFlags;	/* equivalent to sk_flags */	/* we need typelen and byval in order to know how to copy the Datums. */	int			datumTypeLen;	bool		datumTypeByVal;	/*	 * Resource snapshot for time of sort start.	 */#ifdef TRACE_SORT	PGRUsage	ru_start;#endif};#define COMPARETUP(state,a,b)	((*(state)->comparetup) (a, b, state))#define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup))#define WRITETUP(state,tape,stup)	((*(state)->writetup) (state, tape, stup))#define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len))#define REVERSEDIRECTION(state) ((*(state)->reversedirection) (state))#define LACKMEM(state)		((state)->availMem < 0)#define USEMEM(state,amt)	((state)->availMem -= (amt))#define FREEMEM(state,amt)	((state)->availMem += (amt))/* * NOTES about on-tape representation of tuples: * * We require the first "unsigned int" of a stored tuple to be the total size * on-tape of the tuple, including itself (so it is never zero; an all-zero * unsigned int is used to delimit runs).  The remainder of the stored tuple * may or may not match the in-memory representation of the tuple --- * any conversion needed is the job of the writetup and readtup routines. * * If state->randomAccess is true, then the stored representation of the * tuple must be followed by another "unsigned int" that is a copy of the * length --- so the total tape space used is actually sizeof(unsigned int) * more than the stored length value.  This allows read-backwards.	When * randomAccess is not true, the write/read routines may omit the extra * length word. * * writetup is expected to write both length words as well as the tuple * data.  When readtup is called, the tape is positioned just after the * front length word; readtup must read the tuple data and advance past * the back length word (if present). * * The write/read routines can make use of the tuple description data * stored in the Tuplesortstate record, if needed.	They are also expected * to adjust state->availMem by the amount of memory space (not tape space!) * released or consumed.  There is no error return from either writetup

⌨️ 快捷键说明

复制代码Ctrl + C
搜索代码Ctrl + F
全屏模式F11
增大字号Ctrl + =
减小字号Ctrl + -
显示快捷键?