📄 skip-deltas
字号:
Skip-Deltas in Subversion=========================To keep repositories at a manageable size, essentially all versioncontrol systems use some kind of relative compression technique suchthat two similar versions of the same file don't take up much morespace than just one version of that file. The two most commontechniques are the SCCS "weave", which represents all revisions of afile as a single data stream with the moral equivalent of #ifdefs, andthe technique of storing deltas (differences) between relatedrevisions of files. Subversion uses deltas.Subversion uses a technique called "skip-deltas" to ensure that only areasonable number of deltas need to be composed in order to retrieveany revisions of a file. The concept of skip-deltas is inspired bythe concept of skip-lists, but they aren't all that similar.For the purposes of this document, we will pretend that revisions of afile are numbered starting from 0. In reality, this numbercorresponds to the "change count" field of a node-revision in eachfilesystem back end.Skip-Deltas in the FSFS Back End================================In the FSFS back end, each revision of a file is represented as adelta against an older revision of the file. The first revision isrepresented as a delta against the empty stream (i.e. it isself-compressed). To reconstruct a revision of a file, the filesystemcode determines the chain of deltas leading back to revision 0,composes them all together using the delta combiner, and applies theresulting super-delta to the empty stream in order to reconstruct thefile contents.The most obvious strategy would be to choose revision N-1 as the deltabase for revision N. But even with the delta combiner, it wouldbecome very slow to retrieve revision 1000 of a file if we had topiece together 1000 deltas. So, to choose the delta base for revisionN, we write out N in binary and flip the rightmost '1' bit. Forinstance, if we are storing 54, we write it out in binary as 110110,flip the last 1 bit to get 110100, and thus pick revision 52 of thefile as the delta base. A file with ten versions (numbered 0-9) wouldhave those versions represented as follows: 0 <- 1 2 <- 3 4 <- 5 6 <- 7 0 <------ 2 4 <------ 6 0 <---------------- 4 0 <------------------------------------ 8 <- 9where "0 <- 1" means that the delta base for revision 1 is revision 0.Because we flip the rightmost 1 bit each time we pick a delta base, atmost lg(N) deltas are necessary to reconstruct revision N of a file.Skip-deltas in the BDB Back End===============================In the BDB back end, each revision of a file is represented as a deltaagainst a newer revision of the file--the opposite of FSFS. Thenewest version of a file is stored in plain text. Whereas in FSFS, weadd a new version of a file simply by picking a delta base and writingout a delta, in BDB the process is more complicated: we write out thenew version of the file in plain text; then, after the commit isconfirmed, we go back and "redeltify" one or more older versions ofthe file against the new one.The goal of the redeltification process is to produce the reverse ofthe FSFS diagram: 0 ------------------------------------> 8 -> 9 4 ----------------> 8 2 ------> 4 6 ------> 8 1 -> 2 3 -> 4 5 -> 6 7 -> 8To accomplish this, we write out the number in binary, count thenumber of trailing zeros, and redeltify that number of ancestorrevisions plus 1. For instance, when we see revision 8, we write itout as "1000", note that there are three trailing 0s, and resolve toredeltify four ancestor revisions: the revisions one back, two back,four back, and eight back.As it turns out, the above diagram is a fiction. To reduce overhead,the BDB back end makes three compromises to the skip-delta scheme: * When storing file revisions 0-31, only the immediate predecessor is redeltified. * We don't redeltify the ancestor revision which is two back from the one we are storing. * We never redeltify revision 0 of a file.Despite these compromises, the asymptotic behavior of the BDBskip-delta scheme is the same as the simpler FSFS one: O(lg(N)) deltasare necessary to reconstruct any revision of a file which has had Nrevisions.Skip-Deltas and Branches========================If a file's history diverges because it is copied and the modified onboth branches, the behavior is as follows: * In FSFS, we choose delta bases just as we would if each branch were an isolated linear path leading back to revision 0. For instance, if a file has six revisions (0-5), then branches into revisions 6-7 and revisions 6'-8', they would look like: 0 <- 1 2 <- 3 4 <- 5 6 <- 7 0 <------ 2 4 <------ 6 6' <- 7' 0 <-------------------------------------- 8' * In BDB, we redeltify ancestor revisions just as we would if each branch were an isolated linear path leading back to revision 0. The result depends on the order of commits. If a file has four revisions (0-3), then branches into revisions 4 and 4', then if 4 was committed first and 4' was committed second, the result would look like: 4 0 --------------------> 4' 2 --------> 4' 1 --> 2 3 --> 4' but if instead, 4 was committed second, the result would look like: 4' 0 --------------------> 4 2 --------> 4 1 --> 2 3 --> 4 Although this order dependency may be confusing to think about, it causes no complexity in the code, and the O(lg(N)) asymptotic behavior is clearly preserved.Note that in the BDB back end, a branched file has a separateplain-text representation for each branch head, while in the FSFS backend, that is not the case.Costs of Skip-Deltas====================In most workloads, the delta for a file revision becomes larger if thedelta base is farther away--in terms of the diagrams, longer arrowstake up more space. In the worst case, where all changes to the fileare orthogonal to each other, a delta across N file revisions may be Ntimes as expensive as a delta across one revision.In either back end, the average number of revisions crossed by a deltaarrow is O(lg(N)), if the file has had N revisions. So we may assumethat in the worst case, skip-deltas incur an O(lg(N)) space penaltywhile providing an O(N/lg(N)) time benefit. The practical spacepenalty appears to be substantially less than O(lg(N)), because manyfiles have short histories and many changes are not orthogonal to eachother.
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -