hgbook

diff en/concepts.tex @ 109:1b67dc96f27a

Snapshot of concepts chapter.
author Bryan O'Sullivan <bos@serpentine.com>
date Fri Nov 10 12:42:00 2006 -0800 (2006-11-10)
parents e0b961975c5e
children 75c076c7a374
line diff
     1.1 --- a/en/concepts.tex	Thu Nov 09 10:11:31 2006 -0800
     1.2 +++ b/en/concepts.tex	Fri Nov 10 12:42:00 2006 -0800
     1.3 @@ -6,7 +6,15 @@
     1.4  the software really works.  Knowing this certainly isn't necessary,
     1.5  but I find it useful to have a ``mental model'' of what's going on.
     1.6  
     1.7 -\section{Tracking the history of a single file}
     1.8 +This understanding gives me confidence that Mercurial has been
     1.9 +carefully designed to be both \emph{safe} and \emph{efficient}.  And
    1.10 +just as importantly, if I have a good idea what the software is doing
    1.11 +when I perform a revision control task, I'm less likely to be
    1.12 +surprised by its behaviour.
    1.13 +
    1.14 +\section{Mercurial's historical record}
    1.15 +
    1.16 +\subsection{Tracking the history of a single file}
    1.17  
    1.18  When Mercurial tracks modifications to a file, it stores the history
    1.19  of that file in a metadata object called a \emph{filelog}.  Each entry
    1.20 @@ -16,13 +24,13 @@
    1.21  information: revision data, and an index to help Mercurial to find a
    1.22  revision efficiently.  
    1.23  
    1.24 -For small files without much history, the revision data and index are
    1.25 -combined in a single file (with a ``\texttt{.i}'' suffix).  A file
    1.26 -that is large, or has a lot of history, has its filelog stored as
    1.27 -separate data (``\texttt{.d}'' suffix) and index (``\texttt{.i}''
    1.28 -suffix) files.  The correspondence between a file in the working
    1.29 -directory and the filelog that tracks its history in the repository is
    1.30 -illustrated in figure~\ref{fig:concepts:filelog}.
    1.31 +A file that is large, or has a lot of history, has its filelog stored
    1.32 +in separate data (``\texttt{.d}'' suffix) and index (``\texttt{.i}''
    1.33 +suffix) files.  For small files without much history, the revision
    1.34 +data and index are combined in a single ``\texttt{.i}'' file.  The
    1.35 +correspondence between a file in the working directory and the filelog
    1.36 +that tracks its history in the repository is illustrated in
    1.37 +figure~\ref{fig:concepts:filelog}.
    1.38  
    1.39  \begin{figure}[ht]
    1.40    \centering
    1.41 @@ -32,6 +40,219 @@
    1.42    \label{fig:concepts:filelog}
    1.43  \end{figure}
    1.44  
    1.45 +\subsection{Managing tracked files}
    1.46 +
    1.47 +Mercurial uses a structure called a \emph{manifest} to collect
    1.48 +together information about the files that it tracks.  Each entry in
    1.49 +the manifest contains information about the files present in a single
    1.50 +changeset.  An entry records which files are present in the changeset,
    1.51 +the revision of each file, and a few other pieces of file metadata.
    1.52 +
    1.53 +\subsection{Recording changeset information}
    1.54 +
    1.55 +The \emph{changelog} contains information about each changeset.  Each
    1.56 +revision records who committed a change, the changeset comment, other
    1.57 +pieces of changeset-related information, and the revision of the
    1.58 +manifest to use.
    1.59 +
    1.60 +\subsection{Relationships between revisions}
    1.61 +
    1.62 +Within a changelog, a manifest, or a filelog, each revision stores a
    1.63 +pointer to its immediate parent (or to its two parents, if it's a
    1.64 +merge revision).  As I mentioned above, there are also relationships
    1.65 +between revisions \emph{across} these structures, and they are
    1.66 +hierarchical in nature.
    1.67 +
    1.68 +For every changeset in a repository, there is exactly one revision
    1.69 +stored in the changelog.  Each revision of the changelog contains a
    1.70 +pointer to a single revision of the manifest.  A revision of the
    1.71 +manifest stores a pointer to a single revision of each filelog tracked
    1.72 +when that changeset was created.  These relationships are illustrated
    1.73 +in figure~\ref{fig:concepts:metadata}.
    1.74 +
    1.75 +\begin{figure}[ht]
    1.76 +  \centering
    1.77 +  \grafix{metadata}
    1.78 +  \caption{Metadata relationships}
    1.79 +  \label{fig:concepts:metadata}
    1.80 +\end{figure}
    1.81 +
    1.82 +Note that there is not a ``one to one'' relationship between revisions
    1.83 +in these different metadata files.  If the manifest hasn't changed
    1.84 +between two changesets, their changelog entries will point to the same
    1.85 +revision of the manifest.  If a file that Mercurial tracks hasn't
    1.86 +changed between two changesets, the entry for that file in the two
    1.87 +revisions of the manifest will point to the same revision of its
    1.88 +filelog.
    1.89 +
    1.90 +\section{An efficient, unified, safe storage mechanism}
    1.91 +
    1.92 +The underpinnings of changelogs, manifests, and filelogs are provided
    1.93 +by a single structure called the \emph{revlog}.
    1.94 +
    1.95 +\subsection{Efficient storage}
    1.96 +
    1.97 +The revlog provides efficient storage of revisions using a
    1.98 +\emph{delta} mechanism.  Instead of storing a complete copy of a file
    1.99 +for each revision, it stores the changes needed to transform an older
   1.100 +revision into the new revision.  For many kinds of file data, these
   1.101 +deltas are typically a fraction of a percent of the size of a full
   1.102 +copy of a file.
   1.103 +
   1.104 +Some obsolete revision control systems can only work with deltas of
   1.105 +text files.  They must either store binary files as complete snapshots
   1.106 +or encoded into a text representation, both of which are wasteful
   1.107 +approaches.  Mercurial can efficiently handle deltas of files with
   1.108 +arbitrary binary contents; it doesn't need to treat text as special.
   1.109 +
   1.110 +\subsection{Safe operation}
   1.111 +
   1.112 +Mercurial only ever \emph{appends} data to the end of a revlog file.
   1.113 +It never modifies a section of a file after it has written it.  This
   1.114 +is both more robust and efficient than schemes that need to modify or
   1.115 +rewrite data.
   1.116 +
   1.117 +In addition, Mercurial treats every write as part of a
   1.118 +\emph{transaction} that can span a number of files.  A transaction is
   1.119 +\emph{atomic}: either the entire transaction succeeds and its effects
   1.120 +are all visible to readers in one go, or the whole thing is undone.
   1.121 +This guarantee of atomicity means that if you're running two copies of
   1.122 +Mercurial, where one is reading data and one is writing it, the reader
   1.123 +will never see a partially written result that might confuse it.
   1.124 +
   1.125 +The fact that Mercurial only appends to files makes it easier to
   1.126 +provide this transactional guarantee.  The easier it is to do stuff
   1.127 +like this, the more confident you should be that it's done correctly.
   1.128 +
   1.129 +\subsection{Fast retrieval}
   1.130 +
   1.131 +Mercurial cleverly avoids a pitfall common to all earlier
   1.132 +revision control systems: the problem of \emph{inefficient retrieval}.
   1.133 +Most revision control systems store the contents of a revision as an
   1.134 +incremental series of modifications against a ``snapshot''.  To
   1.135 +reconstruct a specific revision, you must first read the snapshot, and
   1.136 +then every one of the revisions between the snapshot and your target
   1.137 +revision.  The more history that a file accumulates, the more
   1.138 +revisions you must read, hence the longer it takes to reconstruct a
   1.139 +particular revision.
   1.140 +
   1.141 +The innovation that Mercurial applies to this problem is simple but
   1.142 +effective.  Once the cumulative amount of delta information stored
   1.143 +since the last snapshot exceeds a fixed threshold, it stores a new
   1.144 +snapshot (compressed, of course), instead of another delta.  This
   1.145 +makes it possible to reconstruct \emph{any} revision of a file
   1.146 +quickly.  This approach works so well that it has subsequently been
   1.147 +copied by several other revision control systems.
   1.148 +
   1.149 +\subsubsection{Aside: the influence of video compression}
   1.150 +
   1.151 +If you're familiar with video compression or have ever watched a TV
   1.152 +feed through a digital cable or satellite service, you may know that
   1.153 +most video compression schemes store each frame of video as a delta
   1.154 +against its predecessor frame.  In addition, these schemes use
   1.155 +``lossy'' compression techniques to increase the compression ratio, so
   1.156 +visual errors accumulate over the course of a number of inter-frame
   1.157 +deltas.
   1.158 +
   1.159 +Because it's possible for a video stream to ``drop out'' occasionally
   1.160 +due to signal glitches, and to limit the accumulation of artefacts
   1.161 +introduced by the lossy compression process, video encoders
   1.162 +periodically insert a complete frame (called a ``key frame'') into the
   1.163 +video stream; the next delta is generated against that frame.  This
   1.164 +means that if the video signal gets interrupted, it will resume once
   1.165 +the next key frame is received.  Also, the accumulation of encoding
   1.166 +errors restarts anew with each key frame.
   1.167 +
   1.168 +\subsection{Clever compression}
   1.169 +
   1.170 +When appropriate, Mercurial will store both snapshots and deltas in
   1.171 +compressed form.  It does this by always \emph{trying to} compress a
   1.172 +snapshot or delta, but only storing the compressed version if it's
   1.173 +smaller than the uncompressed version.
   1.174 +
   1.175 +This means that Mercurial does ``the right thing'' when storing a file
   1.176 +whose native form is compressed, such as a \texttt{zip} archive or a
   1.177 +JPEG image.  When these types of files are compressed a second time,
   1.178 +the resulting file is usually bigger than the once-compressed form,
   1.179 +and so Mercurial will store the plain \texttt{zip} or JPEG.
   1.180 +
   1.181 +Deltas between revisions of a compressed file are usually larger than
   1.182 +snapshots of the file, and Mercurial again does ``the right thing'' in
   1.183 +these cases.  It finds that such a delta exceeds the threshold at
   1.184 +which it should store a complete snapshot of the file, so it stores
   1.185 +the snapshot, again saving space compared to a naive delta-only
   1.186 +approach.
   1.187 +
   1.188 +\subsection{Strong integrity}
   1.189 +
   1.190 +Along with delta or snapshot information, a revlog entry contains a
   1.191 +cryptographic hash of the data that it represents.  This makes it
   1.192 +difficult to forge the contents of a revision, and easy to detect
   1.193 +accidental corruption.
   1.194 +
   1.195 +Mercurial checks these hashes when retrieving file revisions and when
   1.196 +pulling changes from a repository.  If it encounters an integrity
   1.197 +problem, it will complain and stop whatever it's doing.
   1.198 +
   1.199 +In addition to the effect it has on retrieval efficiency, Mercurial's
   1.200 +use of periodic snapshots makes it more robust against partial data
   1.201 +corruption.  If a revlog becomes partly corrupted due to a hardware
   1.202 +error or system bug, it's often possible to reconstruct some or most
   1.203 +revisions from the uncorrupted sections of the revlog, both before and
   1.204 +after the corrupted section.  This would not be possible with a
   1.205 +delta-only storage model.
   1.206 +
   1.207 +\subsection{Read/write ordering and atomicity}
   1.208 +
   1.209 +Appending to files isn't the whole story when it comes to guaranteeing
   1.210 +that a reader won't see a partial write.  If you recall
   1.211 +figure~\ref{fig:concepts:metadata}, revisions in the changelog point to
   1.212 +revisions in the manifest, and revisions in the manifest point to
   1.213 +revisions in filelogs.  This hierarchy is deliberate.
   1.214 +
   1.215 +A writer starts a transaction by writing filelog and manifest data,
   1.216 +and doesn't write any changelog data until those are finished.  A
   1.217 +reader starts by reading changelog data, then manifest data, followed
   1.218 +by filelog data.
   1.219 +
   1.220 +Since the writer has always finished writing filelog and manifest data
   1.221 +before it writes to the changelog, a reader will never read a pointer
   1.222 +to a partially written manifest revision from the changelog, and it will
   1.223 +never read a pointer to a partially written filelog revision from the
   1.224 +manifest.
   1.225 +
   1.226 +\subsection{Concurrent access}
   1.227 +
   1.228 +The read/write ordering and atomicity guarantees mean that Mercurial
   1.229 +never needs to \emph{lock} a repository when it's reading data, even
   1.230 +if the repository is being written to while the read is occurring.
   1.231 +This has a big effect on scalability; you can have an arbitrary number
   1.232 +of Mercurial processes safely reading data from a repository safely
   1.233 +all at once, no matter whether it's being written to or not.
   1.234 +
   1.235 +The lockless nature of reading means that if you're sharing a
   1.236 +repository on a multi-user system, you don't need to grant other local
   1.237 +users permission to \emph{write} to your repository in order for them
   1.238 +to be able to clone it or pull changes from it; they only need
   1.239 +\emph{read} permission.  (This is \emph{not} a common feature among
   1.240 +revision control systems, so don't take it for granted!  Most require
   1.241 +readers to be able to lock a repository to access it safely, and this
   1.242 +requires write permission on at least one directory, which of course
   1.243 +makes for all kinds of nasty and annoying security and administrative
   1.244 +problems.)
   1.245 +
   1.246 +Mercurial uses a locking mechanism to ensure that only one process can
   1.247 +write to a repository at a time.  This locking mechanism is safe even
   1.248 +over filesystems that are notoriously unsafe for locking, such as NFS.
   1.249 +If a repository is locked, a writer will wait for a while to retry if
   1.250 +the repository becomes unlocked, but if the repository remains locked
   1.251 +for too long, the process attempting to write will time out after a
   1.252 +while.  This means that your daily automated scripts won't get stuck
   1.253 +forever and pile up if a system crashes unnoticed, for example.  (Yes,
   1.254 +the timeout is configurable, from zero to infinity.)
   1.255 +
   1.256 +
   1.257 +
   1.258  %%% Local Variables: 
   1.259  %%% mode: latex
   1.260  %%% TeX-master: "00book"