hgbook

diff en/concepts.tex @ 110:75c076c7a374

More concepts stuff.
author Bryan O'Sullivan <bos@serpentine.com>
date Fri Nov 10 15:09:49 2006 -0800 (2006-11-10)
parents 1b67dc96f27a
children 34b8b7a15ea1
line diff
     1.1 --- a/en/concepts.tex	Fri Nov 10 12:42:00 2006 -0800
     1.2 +++ b/en/concepts.tex	Fri Nov 10 15:09:49 2006 -0800
     1.3 @@ -77,15 +77,15 @@
     1.4    \label{fig:concepts:metadata}
     1.5  \end{figure}
     1.6  
     1.7 -Note that there is not a ``one to one'' relationship between revisions
     1.8 -in these different metadata files.  If the manifest hasn't changed
     1.9 -between two changesets, their changelog entries will point to the same
    1.10 -revision of the manifest.  If a file that Mercurial tracks hasn't
    1.11 -changed between two changesets, the entry for that file in the two
    1.12 -revisions of the manifest will point to the same revision of its
    1.13 -filelog.
    1.14 -
    1.15 -\section{An efficient, unified, safe storage mechanism}
    1.16 +As the illustration shows, there is \emph{not} a ``one to one''
    1.17 +relationship between revisions in the changelog, manifest, or filelog.
    1.18 +If the manifest hasn't changed between two changesets, the changelog
    1.19 +entries for those changesets will point to the same revision of the
    1.20 +manifest.  If a file that Mercurial tracks hasn't changed between two
    1.21 +changesets, the entry for that file in the two revisions of the
    1.22 +manifest will point to the same revision of its filelog.
    1.23 +
    1.24 +\section{Safe, efficient storage}
    1.25  
    1.26  The underpinnings of changelogs, manifests, and filelogs are provided
    1.27  by a single structure called the \emph{revlog}.
    1.28 @@ -136,13 +136,24 @@
    1.29  revisions you must read, hence the longer it takes to reconstruct a
    1.30  particular revision.
    1.31  
    1.32 +\begin{figure}[ht]
    1.33 +  \centering
    1.34 +  \grafix{snapshot}
    1.35 +  \caption{Snapshot of a revlog, with incremental deltas}
    1.36 +  \label{fig:concepts:snapshot}
    1.37 +\end{figure}
    1.38 +
    1.39  The innovation that Mercurial applies to this problem is simple but
    1.40  effective.  Once the cumulative amount of delta information stored
    1.41  since the last snapshot exceeds a fixed threshold, it stores a new
    1.42  snapshot (compressed, of course), instead of another delta.  This
    1.43  makes it possible to reconstruct \emph{any} revision of a file
    1.44 -quickly.  This approach works so well that it has subsequently been
    1.45 -copied by several other revision control systems.
    1.46 +quickly.  This approach works so well that it has since been copied by
    1.47 +several other revision control systems.
    1.48 +
    1.49 +Figure~\ref{fig:concepts:snapshot} illustrates the idea.  In an entry
    1.50 +in a revlog's index file, Mercurial stores the range of entries from
    1.51 +the data file that it must read to reconstruct a particular revision.
    1.52  
    1.53  \subsubsection{Aside: the influence of video compression}
    1.54  
    1.55 @@ -163,26 +174,6 @@
    1.56  the next key frame is received.  Also, the accumulation of encoding
    1.57  errors restarts anew with each key frame.
    1.58  
    1.59 -\subsection{Clever compression}
    1.60 -
    1.61 -When appropriate, Mercurial will store both snapshots and deltas in
    1.62 -compressed form.  It does this by always \emph{trying to} compress a
    1.63 -snapshot or delta, but only storing the compressed version if it's
    1.64 -smaller than the uncompressed version.
    1.65 -
    1.66 -This means that Mercurial does ``the right thing'' when storing a file
    1.67 -whose native form is compressed, such as a \texttt{zip} archive or a
    1.68 -JPEG image.  When these types of files are compressed a second time,
    1.69 -the resulting file is usually bigger than the once-compressed form,
    1.70 -and so Mercurial will store the plain \texttt{zip} or JPEG.
    1.71 -
    1.72 -Deltas between revisions of a compressed file are usually larger than
    1.73 -snapshots of the file, and Mercurial again does ``the right thing'' in
    1.74 -these cases.  It finds that such a delta exceeds the threshold at
    1.75 -which it should store a complete snapshot of the file, so it stores
    1.76 -the snapshot, again saving space compared to a naive delta-only
    1.77 -approach.
    1.78 -
    1.79  \subsection{Strong integrity}
    1.80  
    1.81  Along with delta or snapshot information, a revlog entry contains a
    1.82 @@ -202,6 +193,83 @@
    1.83  after the corrupted section.  This would not be possible with a
    1.84  delta-only storage model.
    1.85  
    1.86 +\section{The working directory}
    1.87 +
    1.88 +Mercurial's good ideas are not confined to the repository; it also
    1.89 +needs to manage the working directory.  The \emph{dirstate} contains
    1.90 +Mercurial's knowledge of the working directory.  This details which
    1.91 +revision(s) the working directory is updated to, and all files that
    1.92 +Mercurial is tracking in the working directory.
    1.93 +
    1.94 +Because Mercurial doesn't force you to tell it when you're modifying a
    1.95 +file, it uses the dirstate to store some extra information so it can
    1.96 +determine efficiently whether you have modified a file.  For each file
    1.97 +in the working directory, it stores the time that it last modified the
    1.98 +file itself, and the size of the file at that time.  
    1.99 +
   1.100 +When Mercurial is checking the states of files in the working
   1.101 +directory, it first checks a file's modification time.  If that has
   1.102 +not changed, the file must not have been modified.  If the file's size
   1.103 +has changed, the file must have been modified.  If the modification
   1.104 +time has changed, but the size has not, only then does Mercurial need
   1.105 +to read the actual contents of the file to see if they've changed.
   1.106 +Storing these few extra pieces of information dramatically reduces the
   1.107 +amount of data that Mercurial needs to read, which yields large
   1.108 +performance improvements compared to other revision control systems.
   1.109 +
   1.110 +\section{Other interesting design features}
   1.111 +
   1.112 +In the sections above, I've tried to highlight some of the most
   1.113 +important aspects of Mercurial's design, to illustrate that it pays
   1.114 +careful attention to reliability and performance.  However, the
   1.115 +attention to detail doesn't stop there.  There are a number of other
   1.116 +aspects of Mercurial's construction that I personally find
   1.117 +interesting.  I'll detail a few of them here, separate from the ``big
   1.118 +ticket'' items above, so that if you're interested, you can gain a
   1.119 +better idea of the amount of thinking that goes into a well-designed
   1.120 +system.
   1.121 +
   1.122 +\subsection{Clever compression}
   1.123 +
   1.124 +When appropriate, Mercurial will store both snapshots and deltas in
   1.125 +compressed form.  It does this by always \emph{trying to} compress a
   1.126 +snapshot or delta, but only storing the compressed version if it's
   1.127 +smaller than the uncompressed version.
   1.128 +
   1.129 +This means that Mercurial does ``the right thing'' when storing a file
   1.130 +whose native form is compressed, such as a \texttt{zip} archive or a
   1.131 +JPEG image.  When these types of files are compressed a second time,
   1.132 +the resulting file is usually bigger than the once-compressed form,
   1.133 +and so Mercurial will store the plain \texttt{zip} or JPEG.
   1.134 +
   1.135 +Deltas between revisions of a compressed file are usually larger than
   1.136 +snapshots of the file, and Mercurial again does ``the right thing'' in
   1.137 +these cases.  It finds that such a delta exceeds the threshold at
   1.138 +which it should store a complete snapshot of the file, so it stores
   1.139 +the snapshot, again saving space compared to a naive delta-only
   1.140 +approach.
   1.141 +
   1.142 +\subsubsection{Network recompression}
   1.143 +
   1.144 +When storing revisions on disk, Mercurial uses the ``deflate''
   1.145 +compression algorithm (the same one used by the popular \texttt{zip}
   1.146 +archive format), which balances good speed with a respectable
   1.147 +compression ratio.  However, when transmitting revision data over a
   1.148 +network connection, Mercurial uncompresses the compressed revision
   1.149 +data.
   1.150 +
   1.151 +If the connection is over HTTP, Mercurial recompresses the entire
   1.152 +stream of data using a compression algorithm that gives a etter
   1.153 +compression ratio (the Burrows-Wheeler algorithm from the widely used
   1.154 +\texttt{bzip2} compression package).  This combination of algorithm
   1.155 +and compression of the entire stream (instead of a revision at a time)
   1.156 +substantially reduces the number of bytes to be transferred, yielding
   1.157 +better network performance over almost all kinds of network.
   1.158 +
   1.159 +(If the connection is over \command{ssh}, Mercurial \emph{doesn't}
   1.160 +recompress the stream, because \command{ssh} can already do this
   1.161 +itself.)
   1.162 +
   1.163  \subsection{Read/write ordering and atomicity}
   1.164  
   1.165  Appending to files isn't the whole story when it comes to guaranteeing
   1.166 @@ -241,15 +309,25 @@
   1.167  makes for all kinds of nasty and annoying security and administrative
   1.168  problems.)
   1.169  
   1.170 -Mercurial uses a locking mechanism to ensure that only one process can
   1.171 -write to a repository at a time.  This locking mechanism is safe even
   1.172 -over filesystems that are notoriously unsafe for locking, such as NFS.
   1.173 -If a repository is locked, a writer will wait for a while to retry if
   1.174 -the repository becomes unlocked, but if the repository remains locked
   1.175 -for too long, the process attempting to write will time out after a
   1.176 -while.  This means that your daily automated scripts won't get stuck
   1.177 -forever and pile up if a system crashes unnoticed, for example.  (Yes,
   1.178 -the timeout is configurable, from zero to infinity.)
   1.179 +Mercurial uses locks to ensure that only one process can write to a
   1.180 +repository at a time (the locking mechanism is safe even over
   1.181 +filesystems that are notoriously hostile to locking, such as NFS).  If
   1.182 +a repository is locked, a writer will wait for a while to retry if the
   1.183 +repository becomes unlocked, but if the repository remains locked for
   1.184 +too long, the process attempting to write will time out after a while.
   1.185 +This means that your daily automated scripts won't get stuck forever
   1.186 +and pile up if a system crashes unnoticed, for example.  (Yes, the
   1.187 +timeout is configurable, from zero to infinity.)
   1.188 +
   1.189 +\subsubsection{Safe dirstate access}
   1.190 +
   1.191 +As with revision data, Mercurial doesn't take a lock to read the
   1.192 +dirstate file; it does acquire a lock to write it.  To avoid the
   1.193 +possibility of reading a partially written copy of the dirstate file,
   1.194 +Mercurial writes to a file with a unique name in the same directory as
   1.195 +the dirstate file, then renames the temporary file atomically to
   1.196 +\filename{dirstate}.  The file named \filename{dirstate} is thus
   1.197 +guaranteed to be complete, not partially written.
   1.198  
   1.199  
   1.200