📄 ropeimpl.html
字号:
<HTML><HEAD> <TITLE> Rope Implementation Overview</TITLE></HEAD><BODY TEXT="#000000" LINK="#006600" ALINK="#003300" VLINK="#7C7F87" BGCOLOR="#FFFFFF"><A HREF="/"><IMG SRC="/images/common/sgilogo_small.gif" ALT="SGI Logo" WIDTH="80" HEIGHT="72" BORDER="0"></A><P><!--end header--><H1> Rope Implementation Overview</h1><P>The rope container type included in SGI's version of the STL is basedloosely on the ropes in the Xerox Cedar environment or C "cords", asdescribed in Boehm, Atkinson, and Plass, "Ropes: An Alternative to Strings",<I>Software Practice and Experience 25</i>, 12 (Dec 1995), pp. 1315 - 1330.<P>A rope is represented as a pointer to <TT>_Rope_RopeRep</tt>structure, which represents a tree node.Every tree node corresponds to a piece of a rope.Although we refer to "tree nodes", each such piece can beshared between different ropes, or can even be reused in thesame rope if the corresponding substring is repeated.Thus ropes are really represented as directed acyclic graphs.Nonetheless, we will continue to refer to trees, sincethat is both the usual case, and more intuitive.<P>Each tree node contains a size field giving the length of the rope piece,a depth field specifying the depth (or height) of the tree rooted at the node, a boolean field indicating whether the subtree has been balanced,and a tag field indicating which of the four variants or subclassesof <TT>_Rope_RopeRep</tt> is used to represent the list.(The balanced bit is really of interest only for concatenationtree nodes, see below.)<P>It would have been possible to use virtual functions and/or RTTIto replace the use of the tag field. We chose not to pursue that route,since the tag field can be much smaller than a vtable pointer, and thetag based code is probably also faster in this case.<P>The 4 subclasses of <TT>_Rope_RopeRep</tt> are:<OL><LI> (<TT>_Rope_RopeLeaf</tt>)Leaves containing string characters. Shortropes are usually represented as a single such node.In the case of the standard character type, the actual arrayof characters is <TT>NULL</tt>-terminated to allow fast generationof an equivalent C string.<LI> (<TT>_Rope_RopeConcatenation</tt>)Concatenation nodes. These have two children <I>left</i>and <I>right</i>. They represent the concatenation of the twostrings represented by the <I>left</i> and <I>right</i>subtrees. Concatenation of two longer ropes usuallyallocates a new concatenation node which references the tworopes to be concatenated.<LI> (<TT>_Rope_RopeFunction</tt>)Function nodes. These contain a pointer to a function objectthat can be used to compute sections of the string. This facilitymakes it possible to manipulate a rope that is computed lazilyas the pieces are needed. For example, it is possible totreat a file as a rope withoutactually reading in the entire file. Thus a text editor can represent evena 100 MB file being edited as a rope, updating it with standard rope operations,while still consuming only very small amount of memory.<LI> (<TT>_Rope_RopeSubstring</tt>)Substring nodes.These contain a pointer to a base rope tree node,and a starting position within that rope.They denote a substring of the base rope.These are generated onlyto represent substrings of ropes that are expensive to compute explicitly.The base field never points to a concatenation tree node.If the substring operation is applied to either avery large leaf node (which can be built by converting a very longC string to a rope) or to a function node representing a long string,then it produces a substring node. Substring nodes also contain a pointerto a function object that performs the appropriate character extraction.They are a subclassof function nodes, and a number of operations treat them simply asfunction nodes. Many uses of ropes will never result in the generationof a substring node. They are however essential for applications thatuse function nodes to lazily evaluate strings.</ol><P>Only concatenation nodes have nonzero depth fields. Depth fields areguaranteed to fit into a byte, since we impose a static maximum on ropedepth.<H2>Reference Counting and Synchronization</h2><P>The rope implementation can be compiled in two different ways.Normally __GC will not be defined. In this case each tree node willalso contain a reference count field. This keeps track of thenumber of rope variables, concatenation nodes, or substring nodesthat reference the tree node. (We'll see later that referencesfrom some iterators are also included.)When the reference count of a tree node becomes zero, thetree node is deallocated, and reference counts of any subtreesare correspondingly decremented.<P>In a few cases, the reference counts are also used to allowin-place updates of ropes. If the reference counts of all tree nodeson the path from a rope <I>R</i>'s root node to the leaf node <I>L</i> holdinga specific character are 1, then <I>L</i> occurs exactly once in <I>R</i>,and in no other rope. Thus <I>R</i> can safely be updated by updating<I>L</i> in place.<P>If the rope implementation is compiled with <TT>__GC</tt> defined,it will assume that there is an underlying garbage collector andinaccessible tree nodes will be automatically reclaimed.In this case rope must be instantiated with a suitable garbage-collectingallocator, and no reference count is maintained. Thus the above optimizationfor in-place updates is also not implemented. Since even non-destructiveupdates copy only portions of a rope, and since many rope clients willuse them purely as immutable strings, this is often not a serious loss.But it may be for some applications.<P>The remainder of this section assumes that __GC is not defined, and thatreference counts are used.<P>Since rope nodes can be shared by different ropes, which can be concurrentlycopied, updated, or destroyed by different threads, reference counts must beupdated atomically. This is the only explicit synchronization performedby the implementation, since the reference count is the only part of apotentially shared data structure that is updated.<P>The synchronization required for reference count updates may consume asignificant fraction of the time required for rope operations.Reference count updates should be implemented in terms of an atomicadd operation whenever such an operation is available.It is important that the reference count decrement operation not onlyatomically decrement the count, but also return the result as part ofthe atomic operation. If the zero test is performed outside the atomicpart of the operation, the same tree node may be deallocated twice.<P>On Irix and win32 platforms, the current implementation maintainsreference counts using an atomic add operation. A more genericimplementation based on PThread mutexes is also provided. But it isunlikely to provide optimal performance for applications that useropes extensively.<H2>Allocator Use</h2>The rope implementation can use either standard-conforming allocators(compiler permitting) or SGI-style simplified allocators.In the former case and if there are distinct allocator instances ofa given allocator type, the allocator instance is stored in each ropetree node, as well as in the rope itself.It is illegal to concatenate ropes built with differentallocator instances.<P>This representation was chosen because it keeps the implementationcomparatively clean, and the instance-less case reasonably efficient.The alternative of storing the allocator instance only in the ropewould have added additional allocator arguments to many internalfunctions. It would have been difficult to eliminate this overheadfor allocator types that do not have distinct instances.<H2>Basic Algorithms and Rope Balancing</h2>Concatenation is normally implemented by allocating a new concatenationnode, and having it refer to the two arguments to the concatenationoperation. Thus in most cases its execution time is independentof the length of strings.<P>The case in which a short rope consisting of a single leaf is concatenatedonto the right of a rope which is either a leaf, or a concatenation nodewhose right child is a leaf, is handled specially. In this case, if theleaves in question are sufficiently short, we mayeither allocate a new leaf holding the combined contents of the twoleaves or, under the right circumstances, even update the left operandin place. In order to allow the destructive update,the actual arrays holding leaf characters are grown in increments of 8.<P>For example, the rope "abcedefghigklmnopqrstuvwxy" might be concatenatedto "z" as shown in the following figure:<IMG SRC="whitespace.gif"><P>Handling this case specially guarantees that ropes builtby repeatedly concatenating short strings onto the right will be composedof leaves of a minimum size, and thus can be stored and processedefficiently. It has a similar effect on repeated character insertionsin the same position.<P>Although concatenation is efficient independent of the shape of thetree, some other operations such as retrieving the <I>i</i>thcharacter, are more efficient if the tree representing the rope isapproximately balanced. Ropes can be rebalanced either via anexplicit call to the <TT>balance</tt> member function, orimplicitly as the result of a concatenation.Ropes are implicitly rebalanced whenever the depth is in danger of overflowing,or the rope both exceeds a smaller depth threshold (currently 20)and is below a minimum length (currently 1000).<P>The balance operation proceeds as described in the paper cited above.The operation is non-destructive; rebalanced pieces formerly shared with otherropes are no longer shared after the operation.As a rope is being balanced, the balanced bit is set in each concatenationnode that has sufficiently small depth for its length.Tree nodes with the balanced bit set are not examined by furtherbalancing operations. Thus the balanceoperation tends to not rebalance the same substring.<P>The worst-case cost ofrebalancing is nonetheless linear in the string.However, the observed behavior is that rebalancing typically consumesa small fraction of the running time. Indeed, all natural ways of buildinga rope of length <I>N</i> by repeated concatenation require total timelinear in <I>N</i>. It is possible, but nontrivial, to design concatenationsequences that violate this.<P>The substring operation is performed differently depending on the treenode representing the root of the rope. The operation is recursive:<OL><LI> For a leaf node, we either return a leaf node with the substring,or if that would be too long, we return a subscript node.<LI> For a concatenation node, we return the concatenation of the appropriatesubstrings of the left and right subtrees. We stop if we find that we needa zero length substring. Similarly, we simply return a pointer to theentire rope when that's appropriate.<LI> For a substring node, we either return a short leaf node, or a newsubstring node. referring to the rope from which the original substringwas obtained. Thus we do not build up nested substring nodes.<LI> Any other function node is treated roughly as a leaf.</ol>Note that this process requires time proportional to the rope depth, anddoesn't directly depend on the length. The algorithm only copies piecesof leaf nodes at the beginning and end of the substring, and it builds newconcatenation nodes along the paths from the root to either end of thesubstring. The interior of the substring can remain shared with theoriginal rope.<H2>Iterators</h2><P>Iterators generated by the normal <TT>begin()</tt> and <TT>end()</tt>operations generate const_iterators, i.e. iterators that do notpermit updates to the underlying rope. Such iterators basicallycontain three kinds of information:<OL><LI> A pointer to the root node of the rope with which the iteratoris associated. This pointer is not included in the reference count,Const_iterators become invalid if the underlying rope is modifiedor destroyed. (In the garbage collected case they remain valid andcontinue to refer to the original pre-modification rope.)<LI> The position inside the rope.<LI> Cached information used to speed up access to sections of therope close to the current position.</ol>We maintain two kinds of cache information in the iterator:<OL><LI> The limits of a contiguous block of storage holding the characterssurrounding the current character position, and the current offset withinthat block. Most iterator references can be resolved with access to onlythis part of the cache. The referenced block of storage can either be partof a leaf in the rope representation, or it can be a small block of charactersreserved inside the iterator itself. The latter is used when the iteratorrefers to a rope section represented by a function node.<LI> The bottom few rope nodes on the path from the root node to the leafor function node currently referenced by the iterator. This is used toquickly increment or decrement the iterator across node boundaries.We do not cache the entire path, since that would make rope iteratorsunpleasantly large.</ol> <P>This implementation differs significantly from that used in the C "cord"package. We do not reserve space inside iterators for a complete pathfrom the root to the current node. This change was made to accommodateSTL code that assumes small iterators that can be cheaply passed by value.We try to aid such code further by providing an iterator assignmentoperation which does not copy the cache part of the iterator unless it hasbeen initialized.<P>The character buffer in the iterator is needed since there is not another place to cache characters returned for the evaluation ofa function node. (This is less of an issue with a garbage collector,since it turns out to be reasonably efficient to maintain such caches inthe function object itself. Without a garbage collector, this requireslocking around cache accesses, which is usually unacceptable.)<P>Mutable iterators differ primarily in that they also contain a pointerto the rope itself (i.e. a pointer to the tree node pointer), so that theycan be used to perform updates, and in that the rope root pointer isincluded in the reference count. In this case the rope root pointeris redundant, except that it us used to detect changes in the rope, so thatthe cached information in the iterator can be invalidated. The rope has changediff the rope itself no longer points to the same node as the rope rootpointer in the iterator. The fact that the iterator is included in thereference count ensures that the root node cannot be dropped and reused.It also prevents the rope from being destructively updated while theiterator must remain valid.</body>
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -