hgbook
diff en/concepts.tex @ 110:75c076c7a374
More concepts stuff.
author | Bryan O'Sullivan <bos@serpentine.com> |
---|---|
date | Fri Nov 10 15:09:49 2006 -0800 (2006-11-10) |
parents | 1b67dc96f27a |
children | 34b8b7a15ea1 |
line diff
1.1 --- a/en/concepts.tex Fri Nov 10 12:42:00 2006 -0800 1.2 +++ b/en/concepts.tex Fri Nov 10 15:09:49 2006 -0800 1.3 @@ -77,15 +77,15 @@ 1.4 \label{fig:concepts:metadata} 1.5 \end{figure} 1.6 1.7 -Note that there is not a ``one to one'' relationship between revisions 1.8 -in these different metadata files. If the manifest hasn't changed 1.9 -between two changesets, their changelog entries will point to the same 1.10 -revision of the manifest. If a file that Mercurial tracks hasn't 1.11 -changed between two changesets, the entry for that file in the two 1.12 -revisions of the manifest will point to the same revision of its 1.13 -filelog. 1.14 - 1.15 -\section{An efficient, unified, safe storage mechanism} 1.16 +As the illustration shows, there is \emph{not} a ``one to one'' 1.17 +relationship between revisions in the changelog, manifest, or filelog. 1.18 +If the manifest hasn't changed between two changesets, the changelog 1.19 +entries for those changesets will point to the same revision of the 1.20 +manifest. If a file that Mercurial tracks hasn't changed between two 1.21 +changesets, the entry for that file in the two revisions of the 1.22 +manifest will point to the same revision of its filelog. 1.23 + 1.24 +\section{Safe, efficient storage} 1.25 1.26 The underpinnings of changelogs, manifests, and filelogs are provided 1.27 by a single structure called the \emph{revlog}. 1.28 @@ -136,13 +136,24 @@ 1.29 revisions you must read, hence the longer it takes to reconstruct a 1.30 particular revision. 1.31 1.32 +\begin{figure}[ht] 1.33 + \centering 1.34 + \grafix{snapshot} 1.35 + \caption{Snapshot of a revlog, with incremental deltas} 1.36 + \label{fig:concepts:snapshot} 1.37 +\end{figure} 1.38 + 1.39 The innovation that Mercurial applies to this problem is simple but 1.40 effective. Once the cumulative amount of delta information stored 1.41 since the last snapshot exceeds a fixed threshold, it stores a new 1.42 snapshot (compressed, of course), instead of another delta. This 1.43 makes it possible to reconstruct \emph{any} revision of a file 1.44 -quickly. This approach works so well that it has subsequently been 1.45 -copied by several other revision control systems. 1.46 +quickly. This approach works so well that it has since been copied by 1.47 +several other revision control systems. 1.48 + 1.49 +Figure~\ref{fig:concepts:snapshot} illustrates the idea. In an entry 1.50 +in a revlog's index file, Mercurial stores the range of entries from 1.51 +the data file that it must read to reconstruct a particular revision. 1.52 1.53 \subsubsection{Aside: the influence of video compression} 1.54 1.55 @@ -163,26 +174,6 @@ 1.56 the next key frame is received. Also, the accumulation of encoding 1.57 errors restarts anew with each key frame. 1.58 1.59 -\subsection{Clever compression} 1.60 - 1.61 -When appropriate, Mercurial will store both snapshots and deltas in 1.62 -compressed form. It does this by always \emph{trying to} compress a 1.63 -snapshot or delta, but only storing the compressed version if it's 1.64 -smaller than the uncompressed version. 1.65 - 1.66 -This means that Mercurial does ``the right thing'' when storing a file 1.67 -whose native form is compressed, such as a \texttt{zip} archive or a 1.68 -JPEG image. When these types of files are compressed a second time, 1.69 -the resulting file is usually bigger than the once-compressed form, 1.70 -and so Mercurial will store the plain \texttt{zip} or JPEG. 1.71 - 1.72 -Deltas between revisions of a compressed file are usually larger than 1.73 -snapshots of the file, and Mercurial again does ``the right thing'' in 1.74 -these cases. It finds that such a delta exceeds the threshold at 1.75 -which it should store a complete snapshot of the file, so it stores 1.76 -the snapshot, again saving space compared to a naive delta-only 1.77 -approach. 1.78 - 1.79 \subsection{Strong integrity} 1.80 1.81 Along with delta or snapshot information, a revlog entry contains a 1.82 @@ -202,6 +193,83 @@ 1.83 after the corrupted section. This would not be possible with a 1.84 delta-only storage model. 1.85 1.86 +\section{The working directory} 1.87 + 1.88 +Mercurial's good ideas are not confined to the repository; it also 1.89 +needs to manage the working directory. The \emph{dirstate} contains 1.90 +Mercurial's knowledge of the working directory. This details which 1.91 +revision(s) the working directory is updated to, and all files that 1.92 +Mercurial is tracking in the working directory. 1.93 + 1.94 +Because Mercurial doesn't force you to tell it when you're modifying a 1.95 +file, it uses the dirstate to store some extra information so it can 1.96 +determine efficiently whether you have modified a file. For each file 1.97 +in the working directory, it stores the time that it last modified the 1.98 +file itself, and the size of the file at that time. 1.99 + 1.100 +When Mercurial is checking the states of files in the working 1.101 +directory, it first checks a file's modification time. If that has 1.102 +not changed, the file must not have been modified. If the file's size 1.103 +has changed, the file must have been modified. If the modification 1.104 +time has changed, but the size has not, only then does Mercurial need 1.105 +to read the actual contents of the file to see if they've changed. 1.106 +Storing these few extra pieces of information dramatically reduces the 1.107 +amount of data that Mercurial needs to read, which yields large 1.108 +performance improvements compared to other revision control systems. 1.109 + 1.110 +\section{Other interesting design features} 1.111 + 1.112 +In the sections above, I've tried to highlight some of the most 1.113 +important aspects of Mercurial's design, to illustrate that it pays 1.114 +careful attention to reliability and performance. However, the 1.115 +attention to detail doesn't stop there. There are a number of other 1.116 +aspects of Mercurial's construction that I personally find 1.117 +interesting. I'll detail a few of them here, separate from the ``big 1.118 +ticket'' items above, so that if you're interested, you can gain a 1.119 +better idea of the amount of thinking that goes into a well-designed 1.120 +system. 1.121 + 1.122 +\subsection{Clever compression} 1.123 + 1.124 +When appropriate, Mercurial will store both snapshots and deltas in 1.125 +compressed form. It does this by always \emph{trying to} compress a 1.126 +snapshot or delta, but only storing the compressed version if it's 1.127 +smaller than the uncompressed version. 1.128 + 1.129 +This means that Mercurial does ``the right thing'' when storing a file 1.130 +whose native form is compressed, such as a \texttt{zip} archive or a 1.131 +JPEG image. When these types of files are compressed a second time, 1.132 +the resulting file is usually bigger than the once-compressed form, 1.133 +and so Mercurial will store the plain \texttt{zip} or JPEG. 1.134 + 1.135 +Deltas between revisions of a compressed file are usually larger than 1.136 +snapshots of the file, and Mercurial again does ``the right thing'' in 1.137 +these cases. It finds that such a delta exceeds the threshold at 1.138 +which it should store a complete snapshot of the file, so it stores 1.139 +the snapshot, again saving space compared to a naive delta-only 1.140 +approach. 1.141 + 1.142 +\subsubsection{Network recompression} 1.143 + 1.144 +When storing revisions on disk, Mercurial uses the ``deflate'' 1.145 +compression algorithm (the same one used by the popular \texttt{zip} 1.146 +archive format), which balances good speed with a respectable 1.147 +compression ratio. However, when transmitting revision data over a 1.148 +network connection, Mercurial uncompresses the compressed revision 1.149 +data. 1.150 + 1.151 +If the connection is over HTTP, Mercurial recompresses the entire 1.152 +stream of data using a compression algorithm that gives a etter 1.153 +compression ratio (the Burrows-Wheeler algorithm from the widely used 1.154 +\texttt{bzip2} compression package). This combination of algorithm 1.155 +and compression of the entire stream (instead of a revision at a time) 1.156 +substantially reduces the number of bytes to be transferred, yielding 1.157 +better network performance over almost all kinds of network. 1.158 + 1.159 +(If the connection is over \command{ssh}, Mercurial \emph{doesn't} 1.160 +recompress the stream, because \command{ssh} can already do this 1.161 +itself.) 1.162 + 1.163 \subsection{Read/write ordering and atomicity} 1.164 1.165 Appending to files isn't the whole story when it comes to guaranteeing 1.166 @@ -241,15 +309,25 @@ 1.167 makes for all kinds of nasty and annoying security and administrative 1.168 problems.) 1.169 1.170 -Mercurial uses a locking mechanism to ensure that only one process can 1.171 -write to a repository at a time. This locking mechanism is safe even 1.172 -over filesystems that are notoriously unsafe for locking, such as NFS. 1.173 -If a repository is locked, a writer will wait for a while to retry if 1.174 -the repository becomes unlocked, but if the repository remains locked 1.175 -for too long, the process attempting to write will time out after a 1.176 -while. This means that your daily automated scripts won't get stuck 1.177 -forever and pile up if a system crashes unnoticed, for example. (Yes, 1.178 -the timeout is configurable, from zero to infinity.) 1.179 +Mercurial uses locks to ensure that only one process can write to a 1.180 +repository at a time (the locking mechanism is safe even over 1.181 +filesystems that are notoriously hostile to locking, such as NFS). If 1.182 +a repository is locked, a writer will wait for a while to retry if the 1.183 +repository becomes unlocked, but if the repository remains locked for 1.184 +too long, the process attempting to write will time out after a while. 1.185 +This means that your daily automated scripts won't get stuck forever 1.186 +and pile up if a system crashes unnoticed, for example. (Yes, the 1.187 +timeout is configurable, from zero to infinity.) 1.188 + 1.189 +\subsubsection{Safe dirstate access} 1.190 + 1.191 +As with revision data, Mercurial doesn't take a lock to read the 1.192 +dirstate file; it does acquire a lock to write it. To avoid the 1.193 +possibility of reading a partially written copy of the dirstate file, 1.194 +Mercurial writes to a file with a unique name in the same directory as 1.195 +the dirstate file, then renames the temporary file atomically to 1.196 +\filename{dirstate}. The file named \filename{dirstate} is thus 1.197 +guaranteed to be complete, not partially written. 1.198 1.199 1.200