hgbook

diff en/ch04-concepts.xml @ 826:c8e6c34901ff

Minor changes to Ch.4.
author Giulio@puck
date Sun Aug 16 09:47:25 2009 +0200 (2009-08-16)
parents 18131160f7ee
children
line diff
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/en/ch04-concepts.xml	Sun Aug 16 09:47:25 2009 +0200
     1.3 @@ -0,0 +1,778 @@
     1.4 +<!-- vim: set filetype=docbkxml shiftwidth=2 autoindent expandtab tw=77 : -->
     1.5 +
     1.6 +<chapter id="chap:concepts">
     1.7 +  <?dbhtml filename="behind-the-scenes.html"?>
     1.8 +  <title>Behind the scenes</title>
     1.9 +
    1.10 +  <para id="x_2e8">Unlike many revision control systems, the concepts
    1.11 +    upon which Mercurial is built are simple enough that it's easy to
    1.12 +    understand how the software really works.  Knowing these details
    1.13 +    isn't necessary, so it is safe to skip this
    1.14 +    chapter.  However, I think you will get more out of the software
    1.15 +    with a <quote>mental model</quote> of what's going on.</para>
    1.16 +
    1.17 +  <para id="x_2e9">Being able to understand what's going on behind the
    1.18 +    scenes gives me confidence that Mercurial has been carefully
    1.19 +    designed to be both <emphasis>safe</emphasis> and
    1.20 +    <emphasis>efficient</emphasis>.  And just as importantly, if it's
    1.21 +    easy for me to retain a good idea of what the software is doing
    1.22 +    when I perform a revision control task, I'm less likely to be
    1.23 +    surprised by its behavior.</para>
    1.24 +
    1.25 +  <para id="x_2ea">In this chapter, we'll initially cover the core concepts
    1.26 +    behind Mercurial's design, then continue to discuss some of the
    1.27 +    interesting details of its implementation.</para>
    1.28 +
    1.29 +  <sect1>
    1.30 +    <title>Mercurial's historical record</title>
    1.31 +
    1.32 +    <sect2>
    1.33 +      <title>Tracking the history of a single file</title>
    1.34 +
    1.35 +      <para id="x_2eb">When Mercurial tracks modifications to a file, it stores
    1.36 +	the history of that file in a metadata object called a
    1.37 +	<emphasis>filelog</emphasis>.  Each entry in the filelog
    1.38 +	contains enough information to reconstruct one revision of the
    1.39 +	file that is being tracked.  Filelogs are stored as files in
    1.40 +	the <filename role="special"
    1.41 +	  class="directory">.hg/store/data</filename> directory.  A
    1.42 +	filelog contains two kinds of information: revision data, and
    1.43 +	an index to help Mercurial to find a revision
    1.44 +	efficiently.</para>
    1.45 +
    1.46 +      <para id="x_2ec">A file that is large, or has a lot of history, has its
    1.47 +	filelog stored in separate data
    1.48 +	(<quote><literal>.d</literal></quote> suffix) and index
    1.49 +	(<quote><literal>.i</literal></quote> suffix) files.  For
    1.50 +	small files without much history, the revision data and index
    1.51 +	are combined in a single <quote><literal>.i</literal></quote>
    1.52 +	file.  The correspondence between a file in the working
    1.53 +	directory and the filelog that tracks its history in the
    1.54 +	repository is illustrated in <xref
    1.55 +	  linkend="fig:concepts:filelog"/>.</para>
    1.56 +
    1.57 +      <figure id="fig:concepts:filelog">
    1.58 +	<title>Relationships between files in working directory and
    1.59 +	  filelogs in repository</title>
    1.60 +	<mediaobject>
    1.61 +	  <imageobject><imagedata fileref="figs/filelog.png"/></imageobject>
    1.62 +	  <textobject><phrase>XXX add text</phrase></textobject>
    1.63 +	</mediaobject>
    1.64 +      </figure>
    1.65 +
    1.66 +    </sect2>
    1.67 +    <sect2>
    1.68 +      <title>Managing tracked files</title>
    1.69 +
    1.70 +      <para id="x_2ee">Mercurial uses a structure called a
    1.71 +	<emphasis>manifest</emphasis> to collect together information
    1.72 +	about the files that it tracks.  Each entry in the manifest
    1.73 +	contains information about the files present in a single
    1.74 +	changeset.  An entry records which files are present in the
    1.75 +	changeset, the revision of each file, and a few other pieces
    1.76 +	of file metadata.</para>
    1.77 +
    1.78 +    </sect2>
    1.79 +    <sect2>
    1.80 +      <title>Recording changeset information</title>
    1.81 +
    1.82 +      <para id="x_2ef">The <emphasis>changelog</emphasis> contains information
    1.83 +	about each changeset.  Each revision records who committed a
    1.84 +	change, the changeset comment, other pieces of
    1.85 +	changeset-related information, and the revision of the
    1.86 +	manifest to use.</para>
    1.87 +
    1.88 +    </sect2>
    1.89 +    <sect2>
    1.90 +      <title>Relationships between revisions</title>
    1.91 +
    1.92 +      <para id="x_2f0">Within a changelog, a manifest, or a filelog, each
    1.93 +	revision stores a pointer to its immediate parent (or to its
    1.94 +	two parents, if it's a merge revision).  As I mentioned above,
    1.95 +	there are also relationships between revisions
    1.96 +	<emphasis>across</emphasis> these structures, and they are
    1.97 +	hierarchical in nature.</para>
    1.98 +
    1.99 +      <para id="x_2f1">For every changeset in a repository, there is exactly one
   1.100 +	revision stored in the changelog.  Each revision of the
   1.101 +	changelog contains a pointer to a single revision of the
   1.102 +	manifest.  A revision of the manifest stores a pointer to a
   1.103 +	single revision of each filelog tracked when that changeset
   1.104 +	was created.  These relationships are illustrated in
   1.105 +	<xref linkend="fig:concepts:metadata"/>.</para>
   1.106 +
   1.107 +      <figure id="fig:concepts:metadata">
   1.108 +	<title>Metadata relationships</title>
   1.109 +	<mediaobject>
   1.110 +	  <imageobject><imagedata fileref="figs/metadata.png"/></imageobject>
   1.111 +	  <textobject><phrase>XXX add text</phrase></textobject>
   1.112 +	</mediaobject>
   1.113 +      </figure>
   1.114 +
   1.115 +      <para id="x_2f3">As the illustration shows, there is
   1.116 +	<emphasis>not</emphasis> a <quote>one to one</quote>
   1.117 +	relationship between revisions in the changelog, manifest, or
   1.118 +	filelog. If a file that
   1.119 +	Mercurial tracks hasn't changed between two changesets, the
   1.120 +	entry for that file in the two revisions of the manifest will
   1.121 +	point to the same revision of its filelog<footnote>
   1.122 +	  <para id="x_725">It is possible (though unusual) for the manifest to
   1.123 +	    remain the same between two changesets, in which case the
   1.124 +	    changelog entries for those changesets will point to the
   1.125 +	    same revision of the manifest.</para>
   1.126 +	</footnote>.</para>
   1.127 +
   1.128 +    </sect2>
   1.129 +  </sect1>
   1.130 +  <sect1>
   1.131 +    <title>Safe, efficient storage</title>
   1.132 +
   1.133 +    <para id="x_2f4">The underpinnings of changelogs, manifests, and filelogs are
   1.134 +      provided by a single structure called the
   1.135 +      <emphasis>revlog</emphasis>.</para>
   1.136 +
   1.137 +    <sect2>
   1.138 +      <title>Efficient storage</title>
   1.139 +
   1.140 +      <para id="x_2f5">The revlog provides efficient storage of revisions using a
   1.141 +	<emphasis>delta</emphasis> mechanism.  Instead of storing a
   1.142 +	complete copy of a file for each revision, it stores the
   1.143 +	changes needed to transform an older revision into the new
   1.144 +	revision.  For many kinds of file data, these deltas are
   1.145 +	typically a fraction of a percent of the size of a full copy
   1.146 +	of a file.</para>
   1.147 +
   1.148 +      <para id="x_2f6">Some obsolete revision control systems can only work with
   1.149 +	deltas of text files.  They must either store binary files as
   1.150 +	complete snapshots or encoded into a text representation, both
   1.151 +	of which are wasteful approaches.  Mercurial can efficiently
   1.152 +	handle deltas of files with arbitrary binary contents; it
   1.153 +	doesn't need to treat text as special.</para>
   1.154 +
   1.155 +    </sect2>
   1.156 +    <sect2 id="sec:concepts:txn">
   1.157 +      <title>Safe operation</title>
   1.158 +
   1.159 +      <para id="x_2f7">Mercurial only ever <emphasis>appends</emphasis> data to
   1.160 +	the end of a revlog file. It never modifies a section of a
   1.161 +	file after it has written it.  This is both more robust and
   1.162 +	more efficient than schemes that need to modify or rewrite
   1.163 +	data.</para>
   1.164 +
   1.165 +      <para id="x_2f8">In addition, Mercurial treats every write as part of a
   1.166 +	<emphasis>transaction</emphasis> that can span a number of
   1.167 +	files.  A transaction is <emphasis>atomic</emphasis>: either
   1.168 +	the entire transaction succeeds and its effects are all
   1.169 +	visible to readers in one go, or the whole thing is undone.
   1.170 +	This guarantee of atomicity means that if you're running two
   1.171 +	copies of Mercurial, where one is reading data and one is
   1.172 +	writing it, the reader will never see a partially written
   1.173 +	result that might confuse it.</para>
   1.174 +
   1.175 +      <para id="x_2f9">The fact that Mercurial only appends to files makes it
   1.176 +	easier to provide this transactional guarantee.  The easier it
   1.177 +	is to do stuff like this, the more confident you should be
   1.178 +	that it's done correctly.</para>
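The link between append-only files and transactions can be sketched in a few lines. This is a minimal illustration, not Mercurial's actual code: the `Transaction` class and its method names are hypothetical, and the sketch only covers undoing a failed write, not what concurrent readers see.

```python
import os
import tempfile


class Transaction:
    """Toy transaction: remember each file's length before the first append."""

    def __init__(self):
        self._original_sizes = {}

    def append(self, path, data):
        # Record the pre-transaction size once per file.
        if path not in self._original_sizes:
            size = os.path.getsize(path) if os.path.exists(path) else 0
            self._original_sizes[path] = size
        with open(path, "ab") as f:
            f.write(data)

    def commit(self):
        # Nothing to rewrite: the appended data simply stays in place.
        self._original_sizes.clear()

    def rollback(self):
        # Undo is a truncation back to the recorded sizes.
        for path, size in self._original_sizes.items():
            with open(path, "r+b") as f:
                f.truncate(size)
        self._original_sizes.clear()


with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "example.i")  # hypothetical file name
    txn = Transaction()
    txn.append(log, b"revision 0\n")
    txn.commit()
    txn = Transaction()
    txn.append(log, b"half-written revision")
    txn.rollback()  # the file again holds only "revision 0\n"
```

Because existing bytes are never modified, undoing a transaction that spans several files needs only one remembered number per file.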
   1.179 +
   1.180 +    </sect2>
   1.181 +    <sect2>
   1.182 +      <title>Fast retrieval</title>
   1.183 +
   1.184 +      <para id="x_2fa">Mercurial cleverly avoids a pitfall common to
   1.185 +	many earlier revision control systems: the problem of
   1.186 +	<emphasis>inefficient retrieval</emphasis>. Most revision
   1.187 +	control systems store the contents of a revision as an
   1.188 +	incremental series of modifications against a
   1.189 +	<quote>snapshot</quote>.  (Some base the snapshot on the
   1.190 +	oldest revision, others on the newest.)  To reconstruct a
   1.191 +	specific revision, you must first read the snapshot, and then
   1.192 +	every one of the revisions between the snapshot and your
   1.193 +	target revision.  The more history that a file accumulates,
   1.194 +	the more revisions you must read, hence the longer it takes to
   1.195 +	reconstruct a particular revision.</para>
   1.196 +
   1.197 +      <figure id="fig:concepts:snapshot">
   1.198 +	<title>Snapshot of a revlog, with incremental deltas</title>
   1.199 +	<mediaobject>
   1.200 +	  <imageobject><imagedata fileref="figs/snapshot.png"/></imageobject>
   1.201 +	  <textobject><phrase>XXX add text</phrase></textobject>
   1.202 +	</mediaobject>
   1.203 +      </figure>
   1.204 +
   1.205 +      <para id="x_2fc">The innovation that Mercurial applies to this problem is
   1.206 +	simple but effective.  Once the cumulative amount of delta
   1.207 +	information stored since the last snapshot exceeds a fixed
   1.208 +	threshold, it stores a new snapshot (compressed, of course),
   1.209 +	instead of another delta.  This makes it possible to
   1.210 +	reconstruct <emphasis>any</emphasis> revision of a file
   1.211 +	quickly.  This approach works so well that it has since been
   1.212 +	copied by several other revision control systems.</para>
   1.213 +
   1.214 +      <para id="x_2fd"><xref linkend="fig:concepts:snapshot"/> illustrates
   1.215 +	the idea.  In an entry in a revlog's index file, Mercurial
   1.216 +	stores the range of entries from the data file that it must
   1.217 +	read to reconstruct a particular revision.</para>
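The snapshot-plus-deltas scheme described above can be sketched as a toy store. Everything here is an assumption made for illustration: the names `ToyRevlog`, `make_delta`, and `apply_delta`, the threshold value, and the very simple prefix/suffix delta format are not Mercurial's own. Each entry records the base snapshot it depends on, which plays the role of the "range of entries to read" kept in the index.

```python
SNAPSHOT_THRESHOLD = 64  # hypothetical limit on delta bytes since the last snapshot


def make_delta(old, new):
    """Describe new as (shared prefix length, shared suffix length, changed middle)."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return prefix, suffix, new[prefix:len(new) - suffix]


def apply_delta(old, delta):
    prefix, suffix, middle = delta
    return old[:prefix] + middle + old[len(old) - suffix:]


class ToyRevlog:
    def __init__(self):
        # Each entry: (kind, payload, base revision, delta bytes since base).
        self.entries = []

    def add(self, text):
        rev = len(self.entries)
        if rev > 0:
            delta = make_delta(self.get(rev - 1), text)
            _, _, base, chain_bytes = self.entries[-1]
            chain_bytes += len(delta[2])
            if chain_bytes <= SNAPSHOT_THRESHOLD:
                self.entries.append(("delta", delta, base, chain_bytes))
                return rev
        # First revision, or the delta chain grew too long: store a full copy.
        self.entries.append(("snapshot", text, rev, 0))
        return rev

    def get(self, rev):
        kind, payload, base, _ = self.entries[rev]
        if kind == "snapshot":
            return payload
        # Read the base snapshot, then apply every delta up to the target.
        text = self.entries[base][1]
        for r in range(base + 1, rev + 1):
            text = apply_delta(text, self.entries[r][1])
        return text
```

However long the history grows, `get` never applies more deltas than fit under the threshold, which is the property the text attributes to periodic snapshots.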
   1.218 +
   1.219 +      <sect3>
   1.220 +	<title>Aside: the influence of video compression</title>
   1.221 +
   1.222 +	<para id="x_2fe">If you're familiar with video compression or
   1.223 +	  have ever watched a TV feed through a digital cable or
   1.224 +	  satellite service, you may know that most video compression
   1.225 +	  schemes store each frame of video as a delta against its
   1.226 +	  predecessor frame.</para>
   1.227 +
   1.228 +	<para id="x_2ff">Such schemes also insert periodic complete
   1.229 +	  frames; Mercurial borrows this idea to reconstruct a revision
   1.230 +	  from a snapshot and a small number of deltas.</para>
   1.231 +
   1.232 +      </sect3>
   1.233 +    </sect2>
   1.234 +    <sect2>
   1.235 +      <title>Identification and strong integrity</title>
   1.236 +
   1.237 +      <para id="x_300">Along with delta or snapshot information, a revlog entry
   1.238 +	contains a cryptographic hash of the data that it represents.
   1.239 +	This makes it difficult to forge the contents of a revision,
   1.240 +	and easy to detect accidental corruption.</para>
   1.241 +
   1.242 +      <para id="x_301">Hashes provide more than a mere check against corruption;
   1.243 +	they are used as the identifiers for revisions.  The changeset
   1.244 +	identification hashes that you see as an end user are from
   1.245 +	revisions of the changelog.  Although filelogs and the
   1.246 +	manifest also use hashes, Mercurial only uses these behind the
   1.247 +	scenes.</para>
   1.248 +
   1.249 +      <para id="x_302">Mercurial verifies that hashes are correct when it
   1.250 +	retrieves file revisions and when it pulls changes from
   1.251 +	another repository.  If it encounters an integrity problem, it
   1.252 +	will complain and stop whatever it's doing.</para>
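A minimal sketch of hashes doing double duty as identifiers and integrity checks follows. It assumes SHA-1 over the revision data alone, and the names `revision_id` and `CheckedStore` are hypothetical; the text above does not spell out exactly what Mercurial feeds into its hash, so treat this as an illustration of the idea rather than the real computation.

```python
import hashlib


def revision_id(data):
    """Identify a revision by the hash of its contents (SHA-1 assumed here)."""
    return hashlib.sha1(data).hexdigest()


class CheckedStore:
    """Toy store that names each blob by its hash and re-checks it on every read."""

    def __init__(self):
        self._blobs = {}

    def put(self, data):
        rid = revision_id(data)
        self._blobs[rid] = data
        return rid

    def get(self, rid):
        data = self._blobs[rid]
        if revision_id(data) != rid:
            # Complain and stop, rather than hand back damaged data.
            raise ValueError("integrity check failed for " + rid)
        return data
```

Since the identifier is derived from the content, a corrupted blob can no longer answer to its own name, and the mismatch is caught at retrieval time.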
   1.253 +
   1.254 +      <para id="x_303">In addition to the effect it has on retrieval efficiency,
   1.255 +	Mercurial's use of periodic snapshots makes it more robust
   1.256 +	against partial data corruption.  If a revlog becomes partly
   1.257 +	corrupted due to a hardware error or system bug, it's often
   1.258 +	possible to reconstruct some or most revisions from the
   1.259 +	uncorrupted sections of the revlog, both before and after the
   1.260 +	corrupted section.  This would not be possible with a
   1.261 +	delta-only storage model.</para>
   1.262 +    </sect2>
   1.263 +  </sect1>
   1.264 +
   1.265 +  <sect1>
   1.266 +    <title>Revision history, branching, and merging</title>
   1.267 +
   1.268 +    <para id="x_304">Every entry in a Mercurial revlog knows the identity of its
   1.269 +      immediate ancestor revision, usually referred to as its
   1.270 +      <emphasis>parent</emphasis>.  In fact, a revision contains room
   1.271 +      for not one parent, but two.  Mercurial uses a special hash,
   1.272 +      called the <quote>null ID</quote>, to represent the idea
   1.273 +      <quote>there is no parent here</quote>.  This hash is simply a
   1.274 +      string of zeroes.</para>
   1.275 +
   1.276 +    <para id="x_305">In <xref linkend="fig:concepts:revlog"/>, you can see
   1.277 +      an example of the conceptual structure of a revlog.  Filelogs,
   1.278 +      manifests, and changelogs all have this same structure; they
   1.279 +      differ only in the kind of data stored in each delta or
   1.280 +      snapshot.</para>
   1.281 +
   1.282 +    <para id="x_306">The first revision in a revlog (at the bottom of the image)
   1.283 +      has the null ID in both of its parent slots.  For a
   1.284 +      <quote>normal</quote> revision, its first parent slot contains
   1.285 +      the ID of its parent revision, and its second contains the null
   1.286 +      ID, indicating that the revision has only one real parent.  Any
   1.287 +      two revisions that have the same parent ID are branches.  A
   1.288 +      revision that represents a merge between branches has two normal
   1.289 +      revision IDs in its parent slots.</para>
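The two parent slots can be modelled with a small data structure. This is a sketch under stated assumptions: the `Revision` class, the `kind` and `children_of` helpers, and the use of 40 zero characters for the null ID are choices made for the example.

```python
from dataclasses import dataclass

NULL_ID = "0" * 40  # "a string of zeroes"; the length is an assumption of this sketch


@dataclass(frozen=True)
class Revision:
    node: str
    p1: str = NULL_ID
    p2: str = NULL_ID


def kind(rev):
    """Classify a revision by how many of its parent slots are filled."""
    if rev.p1 == NULL_ID and rev.p2 == NULL_ID:
        return "first"
    if rev.p2 == NULL_ID:
        return "normal"
    return "merge"


def children_of(revisions):
    """Map each parent ID to the revisions that name it as a parent."""
    children = {}
    for rev in revisions:
        for parent in (rev.p1, rev.p2):
            if parent != NULL_ID:
                children.setdefault(parent, []).append(rev.node)
    return children


history = [
    Revision("a"),
    Revision("b", p1="a"),
    Revision("c", p1="a"),          # b and c share a parent: two branches
    Revision("d", p1="b", p2="c"),  # d merges them
]
```

In this example, any parent that maps to more than one child in `children_of(history)` marks a point where the history branched.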
   1.290 +
   1.291 +    <figure id="fig:concepts:revlog">
   1.292 +      <title>The conceptual structure of a revlog</title>
   1.293 +      <mediaobject>
   1.294 +	<imageobject><imagedata fileref="figs/revlog.png"/></imageobject>
   1.295 +	<textobject><phrase>XXX add text</phrase></textobject>
   1.296 +      </mediaobject>
   1.297 +    </figure>
   1.298 +
   1.299 +  </sect1>
   1.300 +  <sect1>
   1.301 +    <title>The working directory</title>
   1.302 +
   1.303 +    <para id="x_307">In the working directory, Mercurial stores a snapshot of the
   1.304 +      files from the repository as of a particular changeset.</para>
   1.305 +
   1.306 +    <para id="x_308">The working directory <quote>knows</quote> which changeset
   1.307 +      it contains.  When you update the working directory to contain a
   1.308 +      particular changeset, Mercurial looks up the appropriate
   1.309 +      revision of the manifest to find out which files it was tracking
   1.310 +      at the time that changeset was committed, and which revision of
   1.311 +      each file was then current.  It then recreates a copy of each of
   1.312 +      those files, with the same contents it had when the changeset
   1.313 +      was committed.</para>
   1.314 +
   1.315 +    <para id="x_309">The <emphasis>dirstate</emphasis> is a special
   1.316 +      structure that contains Mercurial's knowledge of the working
   1.317 +      directory.  It is maintained as a file named
   1.318 +      <filename>.hg/dirstate</filename> inside a repository.  The
   1.319 +      dirstate details which changeset the working directory is
   1.320 +      updated to, and all of the files that Mercurial is tracking in
   1.321 +      the working directory. It also lets Mercurial quickly notice
   1.322 +      changed files, by recording their checkout times and
   1.323 +      sizes.</para>
   1.324 +
   1.325 +    <para id="x_30a">Just as a revision of a revlog has room for two parents, so
   1.326 +      that it can represent either a normal revision (with one parent)
   1.327 +      or a merge of two earlier revisions, the dirstate has slots for
   1.328 +      two parents.  When you use the <command role="hg-cmd">hg
   1.329 +	update</command> command, the changeset that you update to is
   1.330 +      stored in the <quote>first parent</quote> slot, and the null ID
   1.331 +      in the second. When you <command role="hg-cmd">hg
   1.332 +	merge</command> with another changeset, the first parent
   1.333 +      remains unchanged, and the second parent is filled in with the
   1.334 +      changeset you're merging with.  The <command role="hg-cmd">hg
   1.335 +	parents</command> command tells you what the parents of the
   1.336 +      dirstate are.</para>
   1.337 +
   1.338 +    <sect2>
   1.339 +      <title>What happens when you commit</title>
   1.340 +
   1.341 +      <para id="x_30b">The dirstate stores parent information for more than just
   1.342 +	book-keeping purposes.  Mercurial uses the parents of the
   1.343 +	dirstate as <emphasis>the parents of a new
   1.344 +	  changeset</emphasis> when you perform a commit.</para>
   1.345 +
   1.346 +      <figure id="fig:concepts:wdir">
   1.347 +	<title>The working directory can have two parents</title>
   1.348 +	<mediaobject>
   1.349 +	  <imageobject><imagedata fileref="figs/wdir.png"/></imageobject>
   1.350 +	  <textobject><phrase>XXX add text</phrase></textobject>
   1.351 +	</mediaobject>
   1.352 +      </figure>
   1.353 +
   1.354 +      <para id="x_30d"><xref linkend="fig:concepts:wdir"/> shows the
   1.355 +	normal state of the working directory, where it has a single
   1.356 +	changeset as parent.  That changeset is the
   1.357 +	<emphasis>tip</emphasis>, the newest changeset in the
   1.358 +	repository that has no children.</para>
   1.359 +
   1.360 +      <figure id="fig:concepts:wdir-after-commit">
   1.361 +	<title>The working directory gains new parents after a
   1.362 +	  commit</title>
   1.363 +	<mediaobject>
   1.364 +	  <imageobject><imagedata fileref="figs/wdir-after-commit.png"/></imageobject>
   1.365 +	  <textobject><phrase>XXX add text</phrase></textobject>
   1.366 +	</mediaobject>
   1.367 +      </figure>
   1.368 +
   1.369 +      <para id="x_30f">It's useful to think of the working directory as
   1.370 +	<quote>the changeset I'm about to commit</quote>.  Any files
   1.371 +	that you tell Mercurial that you've added, removed, renamed,
   1.372 +	or copied will be reflected in that changeset, as will
   1.373 +	modifications to any files that Mercurial is already tracking;
   1.374 +	the new changeset will have the parents of the working
   1.375 +	directory as its parents.</para>
   1.376 +
   1.377 +      <para id="x_310">After a commit, Mercurial will update the
   1.378 +	parents of the working directory, so that the first parent is
   1.379 +	the ID of the new changeset, and the second is the null ID.
   1.380 +	This is shown in <xref
   1.381 +	  linkend="fig:concepts:wdir-after-commit"/>. Mercurial
   1.382 +	doesn't touch any of the files in the working directory when
   1.383 +	you commit; it just modifies the dirstate to note its new
   1.384 +	parents.</para>
   1.385 +
   1.386 +    </sect2>
   1.387 +    <sect2>
   1.388 +      <title>Creating a new head</title>
   1.389 +
   1.390 +      <para id="x_311">It's perfectly normal to update the working directory to a
   1.391 +	changeset other than the current tip.  For example, you might
   1.392 +	want to know what your project looked like last Tuesday, or
   1.393 +	you could be looking through changesets to see which one
   1.394 +	introduced a bug.  In cases like this, the natural thing to do
   1.395 +	is update the working directory to the changeset you're
   1.396 +	interested in, and then examine the files in the working
   1.397 +	directory directly to see their contents as they were when you
   1.398 +	committed that changeset.  The effect of this is shown in
   1.399 +	<xref linkend="fig:concepts:wdir-pre-branch"/>.</para>
   1.400 +
   1.401 +      <figure id="fig:concepts:wdir-pre-branch">
   1.402 +	<title>The working directory, updated to an older
   1.403 +	  changeset</title>
   1.404 +	<mediaobject>
   1.405 +	  <imageobject><imagedata fileref="figs/wdir-pre-branch.png"/></imageobject>
   1.406 +	  <textobject><phrase>XXX add text</phrase></textobject>
   1.407 +	</mediaobject>
   1.408 +      </figure>
   1.409 +
   1.410 +      <para id="x_313">Having updated the working directory to an
   1.411 +	older changeset, what happens if you make some changes, and
   1.412 +	then commit?  Mercurial behaves in the same way as I outlined
   1.413 +	above.  The parents of the working directory become the
   1.414 +	parents of the new changeset.  This new changeset has no
   1.415 +	children, so it becomes the new tip.  And the repository now
   1.416 +	contains two changesets that have no children; we call these
   1.417 +	<emphasis>heads</emphasis>.  You can see the structure that
   1.418 +	this creates in <xref
   1.419 +	  linkend="fig:concepts:wdir-branch"/>.</para>
   1.420 +
   1.421 +      <figure id="fig:concepts:wdir-branch">
   1.422 +	<title>After a commit made while synced to an older
   1.423 +	  changeset</title>
   1.424 +	<mediaobject>
   1.425 +	  <imageobject><imagedata fileref="figs/wdir-branch.png"/></imageobject>
   1.426 +	  <textobject><phrase>XXX add text</phrase></textobject>
   1.427 +	</mediaobject>
   1.428 +      </figure>
   1.429 +
   1.430 +      <note>
   1.431 +	<para id="x_315">If you're new to Mercurial, you should keep
   1.432 +	  in mind a common <quote>error</quote>, which is to use the
   1.433 +	  <command role="hg-cmd">hg pull</command> command without any
   1.434 +	  options.  By default, the <command role="hg-cmd">hg
   1.435 +	    pull</command> command <emphasis>does not</emphasis>
   1.436 +	  update the working directory, so you'll bring new changesets
   1.437 +	  into your repository, but the working directory will stay
   1.438 +	  synced at the same changeset as before the pull.  If you
   1.439 +	  make some changes and commit afterwards, you'll thus create
   1.440 +	  a new head, because your working directory isn't synced to
   1.441 +	  whatever the current tip is.  To combine the operation of a
   1.442 +	  pull, followed by an update, run <command>hg pull
   1.443 +	    -u</command>.</para>
   1.444 +
   1.445 +	<para id="x_316">I put the word <quote>error</quote> in quotes
   1.446 +	  because all that you need to do to rectify the situation
   1.447 +	  where you created a new head by accident is
   1.448 +	  <command role="hg-cmd">hg merge</command>, then <command
   1.449 +	    role="hg-cmd">hg commit</command>.  In other words, this
   1.450 +	  almost never has negative consequences; it's just something
   1.451 +	  of a surprise for newcomers.  I'll discuss other ways to
   1.452 +	  avoid this behavior, and why Mercurial behaves in this
   1.453 +	  initially surprising way, later on.</para>
   1.454 +      </note>
   1.455 +
   1.456 +    </sect2>
   1.457 +    <sect2>
   1.458 +      <title>Merging changes</title>
   1.459 +
   1.460 +      <para id="x_317">When you run the <command role="hg-cmd">hg
   1.461 +	  merge</command> command, Mercurial leaves the first parent
   1.462 +	of the working directory unchanged, and sets the second parent
   1.463 +	to the changeset you're merging with, as shown in <xref
   1.464 +	  linkend="fig:concepts:wdir-merge"/>.</para>
   1.465 +
   1.466 +      <figure id="fig:concepts:wdir-merge">
   1.467 +	<title>Merging two heads</title>
   1.468 +	<mediaobject>
   1.469 +	  <imageobject>
   1.470 +	    <imagedata fileref="figs/wdir-merge.png"/>
   1.471 +	  </imageobject>
   1.472 +	  <textobject><phrase>XXX add text</phrase></textobject>
   1.473 +	</mediaobject>
   1.474 +      </figure>
   1.475 +
   1.476 +      <para id="x_319">Mercurial also has to modify the working directory, to
   1.477 +	merge the files managed in the two changesets.  Simplified a
   1.478 +	little, the merging process goes like this, for every file in
   1.479 +	the manifests of both changesets:</para>
   1.480 +      <itemizedlist>
   1.481 +	<listitem><para id="x_31a">If neither changeset has modified a file, do
   1.482 +	    nothing with that file.</para>
   1.483 +	</listitem>
   1.484 +	<listitem><para id="x_31b">If one changeset has modified a file, and the
   1.485 +	    other hasn't, create the modified copy of the file in the
   1.486 +	    working directory.</para>
   1.487 +	</listitem>
   1.488 +	<listitem><para id="x_31c">If one changeset has removed a file, and the
   1.489 +	    other hasn't (or has also deleted it), delete the file
   1.490 +	    from the working directory.</para>
   1.491 +	</listitem>
   1.492 +	<listitem><para id="x_31d">If one changeset has removed a file, but the
   1.493 +	    other has modified the file, ask the user what to do: keep
   1.494 +	    the modified file, or remove it?</para>
   1.495 +	</listitem>
   1.496 +	<listitem><para id="x_31e">If both changesets have modified a file,
   1.497 +	    invoke an external merge program to choose the new
   1.498 +	    contents for the merged file.  This may require input from
   1.499 +	    the user.</para>
   1.500 +	</listitem>
   1.501 +	<listitem><para id="x_31f">If one changeset has modified a file, and the
   1.502 +	    other has renamed or copied the file, make sure that the
   1.503 +	    changes follow the new name of the file.</para>
   1.504 +	</listitem></itemizedlist>
   1.505 +      <para id="x_320">There are more details&emdash;merging has plenty of corner
   1.506 +	cases&emdash;but these are the most common choices that are
   1.507 +	involved in a merge.  As you can see, most cases are
   1.508 +	completely automatic, and indeed most merges finish
   1.509 +	automatically, without requiring your input to resolve any
   1.510 +	conflicts.</para>
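The per-file choices listed above can be written as one decision function. This is a simplified sketch: the state names and the `merge_action` function are hypothetical, each side is described only relative to the common ancestor, and the corner cases that the text leaves out raise an error instead of being guessed at.

```python
def merge_action(local, other):
    """Pick an action for one file, given what each changeset did to it.

    Each argument is one of "unchanged", "modified", "removed", or "renamed".
    """
    states = {local, other}
    if states == {"unchanged"}:
        return "do nothing"
    if states == {"modified", "unchanged"}:
        return "use the modified copy"
    if states in ({"removed"}, {"removed", "unchanged"}):
        return "delete from the working directory"
    if states == {"removed", "modified"}:
        return "ask the user: keep or remove"
    if states == {"modified"}:
        return "run the merge program"
    if states == {"modified", "renamed"}:
        return "apply the changes under the new name"
    raise ValueError("corner case not covered by this sketch")
```

Only two of the six outcomes can involve the user, which matches the observation that most merges finish automatically.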
   1.511 +
   1.512 +      <para id="x_321">When you're thinking about what happens when you commit
   1.513 +	after a merge, once again the working directory is <quote>the
   1.514 +	  changeset I'm about to commit</quote>.  After the <command
   1.515 +	  role="hg-cmd">hg merge</command> command completes, the
   1.516 +	working directory has two parents; these will become the
   1.517 +	parents of the new changeset.</para>
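The way update, merge, and commit move the two dirstate parents can be traced with a toy model. The `ToyRepo` class, its method names, and the `cs0`, `cs1`, ... identifiers are invented for this sketch; it tracks parent bookkeeping only and ignores file contents entirely.

```python
NULL_ID = "null"  # stand-in for the null ID


class ToyRepo:
    def __init__(self):
        self.changesets = {}                # changeset ID -> (p1, p2)
        self.parents = (NULL_ID, NULL_ID)   # the dirstate's two parent slots
        self._next = 0

    def commit(self):
        # The dirstate parents become the parents of the new changeset.
        cid = "cs%d" % self._next
        self._next += 1
        self.changesets[cid] = self.parents
        self.parents = (cid, NULL_ID)
        return cid

    def update(self, cid):
        if cid not in self.changesets:
            raise KeyError(cid)
        self.parents = (cid, NULL_ID)

    def merge(self, cid):
        if self.parents[1] != NULL_ID:
            raise RuntimeError("commit the outstanding merge first")
        # First parent stays put; second parent is the changeset merged with.
        self.parents = (self.parents[0], cid)

    def heads(self):
        used = {p for pair in self.changesets.values() for p in pair}
        return sorted(c for c in self.changesets if c not in used)


repo = ToyRepo()
repo.commit()        # cs0
repo.commit()        # cs1
repo.update("cs0")
repo.commit()        # cs2, committed on top of cs0: a second head
repo.merge("cs1")
repo.commit()        # cs3, with both cs2 and cs1 as parents
```

The refusal to start a second merge before committing the first mirrors the two-parent limit: there is simply no third slot to put another changeset in.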
   1.518 +
   1.519 +      <para id="x_322">Mercurial lets you perform multiple merges, but
   1.520 +	you must commit the results of each individual merge as you
   1.521 +	go.  This is necessary because Mercurial tracks only two
   1.522 +	parents, both for revisions and for the working directory.  While
   1.523 +	it would be technically feasible to merge multiple changesets
   1.524 +	at once, Mercurial avoids this for simplicity.  With multi-way
   1.525 +	merges, the risks of user confusion, nasty conflict
   1.526 +	resolution, and making a terrible mess of a merge would grow
   1.527 +	intolerable.</para>
   1.528 +
   1.529 +    </sect2>
   1.530 +
   1.531 +    <sect2>
   1.532 +      <title>Merging and renames</title>
   1.533 +
   1.534 +      <para id="x_69a">A surprising number of revision control systems pay little
   1.535 +	or no attention to a file's <emphasis>name</emphasis> over
   1.536 +	time.  For instance, it used to be common that if a file got
   1.537 +	renamed on one side of a merge, the changes from the other
   1.538 +	side would be silently dropped.</para>
   1.539 +
   1.540 +      <para id="x_69b">Mercurial records metadata when you tell it to perform a
   1.541 +	rename or copy. It uses this metadata during a merge to do the
   1.542 +	right thing.  For instance, if I rename
   1.543 +	a file, and you edit it without renaming it, when we merge our
   1.544 +	work the file will be renamed and have your edits
   1.545 +	applied.</para>
   1.546 +    </sect2>
   1.547 +  </sect1>
   1.548 +
   1.549 +  <sect1>
   1.550 +    <title>Other interesting design features</title>
   1.551 +
   1.552 +    <para id="x_323">In the sections above, I've tried to highlight some of the
   1.553 +      most important aspects of Mercurial's design, to illustrate that
   1.554 +      it pays careful attention to reliability and performance.
   1.555 +      However, the attention to detail doesn't stop there.  There are
   1.556 +      a number of other aspects of Mercurial's construction that I
   1.557 +      personally find interesting.  I'll detail a few of them here,
   1.558 +      separate from the <quote>big ticket</quote> items above, so that
   1.559 +      if you're interested, you can gain a better idea of the amount
   1.560 +      of thinking that goes into a well-designed system.</para>
   1.561 +
   1.562 +    <sect2>
   1.563 +      <title>Clever compression</title>
   1.564 +
   1.565 +      <para id="x_324">When appropriate, Mercurial will store both snapshots and
   1.566 +	deltas in compressed form.  It does this by always
   1.567 +	<emphasis>trying to</emphasis> compress a snapshot or delta,
   1.568 +	but only storing the compressed version if it's smaller than
   1.569 +	the uncompressed version.</para>
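The "try to compress, keep whichever is smaller" rule is short enough to show directly. This sketch uses Python's `zlib` module; the function names and the boolean flag used to mark a compressed payload are assumptions for the example, not Mercurial's on-disk format.

```python
import zlib


def store_form(data):
    """Return (is_compressed, payload), keeping the compressed form only if smaller."""
    compressed = zlib.compress(data)
    if len(compressed) < len(data):
        return True, compressed
    return False, data


def load(is_compressed, payload):
    """Recover the original bytes from whatever form was stored."""
    return zlib.decompress(payload) if is_compressed else payload
```

Highly repetitive input comes back flagged as compressed, while very short or already-compressed input is typically stored as-is, because the compressed form would be no smaller.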
   1.570 +
   1.571 +      <para id="x_325">This means that Mercurial does <quote>the right
   1.572 +	  thing</quote> when storing a file whose native form is
   1.573 +	compressed, such as a <literal>zip</literal> archive or a JPEG
   1.574 +	image.  When these types of files are compressed a second
   1.575 +	time, the resulting file is usually bigger than the
   1.576 +	once-compressed form, and so Mercurial will store the plain
   1.577 +	<literal>zip</literal> or JPEG.</para>
   1.578 +
   1.579 +      <para id="x_326">Deltas between revisions of a compressed file are usually
   1.580 +	larger than snapshots of the file, and Mercurial again does
   1.581 +	<quote>the right thing</quote> in these cases.  It finds that
   1.582 +	such a delta exceeds the threshold at which it should store a
   1.583 +	complete snapshot of the file, so it stores the snapshot,
   1.584 +	again saving space compared to a naive delta-only
   1.585 +	approach.</para>
   1.586 +
   1.587 +      <sect3>
   1.588 +	<title>Network recompression</title>
   1.589 +
   1.590 +	<para id="x_327">When storing revisions on disk, Mercurial uses the
   1.591 +	  <quote>deflate</quote> compression algorithm (the same one
   1.592 +	  used by the popular <literal>zip</literal> archive format),
   1.593 +	  which balances good speed with a respectable compression
   1.594 +	  ratio.  However, when transmitting revision data over a
   1.595 +	  network connection, Mercurial uncompresses the compressed
   1.596 +	  revision data.</para>
   1.597 +
   1.598 +	<para id="x_328">If the connection is over HTTP, Mercurial recompresses
   1.599 +	  the entire stream of data using a compression algorithm that
   1.600 +	  gives a better compression ratio (the Burrows-Wheeler
   1.601 +	  algorithm from the widely used <literal>bzip2</literal>
   1.602 +	  compression package).  This combination of algorithm and
   1.603 +	  compression of the entire stream (instead of a revision at a
   1.604 +	  time) substantially reduces the number of bytes to be
   1.605 +	  transferred, yielding better network performance over most
   1.606 +	  kinds of network.</para>
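	<para>The benefit of compressing the whole stream rather than
	  one revision at a time can be seen with a small experiment.
	  The sketch below uses Python's standard
	  <literal>zlib</literal> and <literal>bz2</literal> modules;
	  the helper names are invented for the example, and it makes
	  no attempt to reproduce Mercurial's actual wire
	  format.</para>

```python
import bz2
import zlib


def per_revision_size(revisions):
    """Total bytes when every revision is deflated on its own."""
    return sum(len(zlib.compress(rev)) for rev in revisions)


def whole_stream_size(revisions):
    """Total bytes when all revisions are joined and compressed once with bzip2."""
    return len(bz2.compress(b"".join(revisions)))
```

	<para>For many small, similar revisions, the second number is
	  typically much smaller, since the compressor can exploit
	  redundancy across revisions and pays its fixed overhead only
	  once.</para>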
   1.607 +
   1.608 +	<para id="x_329">If the connection is over
   1.609 +	  <command>ssh</command>, Mercurial
   1.610 +	  <emphasis>doesn't</emphasis> recompress the stream, because
   1.611 +	  <command>ssh</command> can already do this itself.  You can
   1.612 +	  tell Mercurial to always use <command>ssh</command>'s
   1.613 +	  compression feature by editing the
   1.614 +	  <filename>.hgrc</filename> file in your home directory as
   1.615 +	  follows.</para>
   1.616 +
   1.617 +	<programlisting>[ui]
   1.618 +ssh = ssh -C</programlisting>
   1.619 +
   1.620 +      </sect3>
   1.621 +    </sect2>
   1.622 +    <sect2>
   1.623 +      <title>Read/write ordering and atomicity</title>
   1.624 +
   1.625 +      <para id="x_32a">Appending to files isn't the whole story when
   1.626 +	it comes to guaranteeing that a reader won't see a partial
   1.627 +	write.  If you recall <xref linkend="fig:concepts:metadata"/>,
   1.628 +	revisions in the changelog point to revisions in the manifest,
   1.629 +	and revisions in the manifest point to revisions in filelogs.
   1.630 +	This hierarchy is deliberate.</para>
   1.631 +
   1.632 +      <para id="x_32b">A writer starts a transaction by writing filelog and
   1.633 +	manifest data, and doesn't write any changelog data until
   1.634 +	those are finished.  A reader starts by reading changelog
   1.635 +	data, then manifest data, followed by filelog data.</para>
   1.636 +
   1.637 +      <para id="x_32c">Since the writer has always finished writing filelog and
   1.638 +	manifest data before it writes to the changelog, a reader will
   1.639 +	never read a pointer to a partially written manifest revision
   1.640 +	from the changelog, and it will never read a pointer to a
   1.641 +	partially written filelog revision from the manifest.</para>
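      <para>A toy model may make the ordering argument easier to follow.
	The class below is a deliberately simplified, in-memory stand-in
	(one file per manifest, Python lists instead of revlogs); none of
	its names come from Mercurial.  The writer is a generator so that
	a reader can be run between its steps.</para>

```python
class ToyRepo:
    """In-memory model of the filelog -> manifest -> changelog write order."""

    def __init__(self):
        self.filelog = []    # file contents, one entry per file revision
        self.manifest = []   # each entry is an index into filelog
        self.changelog = []  # each entry is (message, index into manifest)

    def commit_steps(self, content, message):
        """Write one commit, yielding after each step so a reader can interleave."""
        self.filelog.append(content)
        yield "filelog"
        self.manifest.append(len(self.filelog) - 1)
        yield "manifest"
        # The changelog entry is written last, once everything it
        # points to is already complete.
        self.changelog.append((message, len(self.manifest) - 1))
        yield "changelog"

    def read_tip(self):
        """Read the newest commit: changelog first, then manifest, then filelog."""
        if not self.changelog:
            return None
        message, manifest_rev = self.changelog[-1]
        file_rev = self.manifest[manifest_rev]
        return message, self.filelog[file_rev]
```

      <para>However far a commit has progressed,
	<literal>read_tip</literal> either sees the previous commit in
	full or the new commit in full, because it only follows pointers
	that it found in the changelog.</para>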
   1.642 +
   1.643 +    </sect2>
   1.644 +    <sect2>
   1.645 +      <title>Concurrent access</title>
   1.646 +
   1.647 +      <para id="x_32d">The read/write ordering and atomicity guarantees mean that
   1.648 +	Mercurial never needs to <emphasis>lock</emphasis> a
   1.649 +	repository when it's reading data, even if the repository is
   1.650 +	being written to while the read is occurring. This has a big
   1.651 +	effect on scalability; you can have an arbitrary number of
   1.652 +	Mercurial processes safely reading data from a repository
   1.653 +	all at once, no matter whether it's being written to or
   1.654 +	not.</para>
   1.655 +
   1.656 +      <para id="x_32e">The lockless nature of reading means that if you're
   1.657 +	sharing a repository on a multi-user system, you don't need to
   1.658 +	grant other local users permission to
   1.659 +	<emphasis>write</emphasis> to your repository in order for
   1.660 +	them to be able to clone it or pull changes from it; they only
   1.661 +	need <emphasis>read</emphasis> permission.  (This is
   1.662 +	<emphasis>not</emphasis> a common feature among revision
   1.663 +	control systems, so don't take it for granted!  Most require
   1.664 +	readers to be able to lock a repository to access it safely,
   1.665 +	and this requires write permission on at least one directory,
   1.666 +	which of course makes for all kinds of nasty and annoying
   1.667 +	security and administrative problems.)</para>
   1.668 +
   1.669 +      <para id="x_32f">Mercurial uses locks to ensure that only one process can
   1.670 +	write to a repository at a time (the locking mechanism is safe
   1.671 +	even over filesystems that are notoriously hostile to locking,
   1.672 +	such as NFS).  If a repository is locked, a writer will wait
   1.673 +	for a while and retry, in case the repository becomes unlocked.
   1.674 +	If the repository remains locked for too long, the process
   1.675 +	attempting to write will give up and time out. This means
   1.676 +	that your daily automated scripts won't get stuck forever and
   1.677 +	pile up if a system crashes unnoticed, for example.  (Yes, the
   1.678 +	timeout is configurable, from zero to infinity.)</para>
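      <para>The <quote>wait, retry, eventually give up</quote> behaviour
	can be sketched as follows.  This is an assumption-laden
	illustration, not Mercurial's locking code: it uses exclusive file
	creation as the lock primitive, makes no claim to be safe over
	NFS, and all names in it are invented for the example.</para>

```python
import os
import time


class LockTimeout(Exception):
    """Raised when the lock could not be acquired before the deadline."""


def acquire_lock(path, timeout=5.0, poll=0.05):
    """Create the lock file exclusively, retrying until the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise LockTimeout(path)
            time.sleep(poll)


def release_lock(path):
    """Release the lock by removing the lock file."""
    os.unlink(path)
```

      <para>A writer that finds the lock held sleeps briefly and tries
	again; once the deadline passes it raises an error instead of
	waiting forever.</para>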
   1.679 +
   1.680 +      <sect3>
   1.681 +	<title>Safe dirstate access</title>
   1.682 +
   1.683 +	<para id="x_330">As with revision data, Mercurial doesn't take a lock to
   1.684 +	  read the dirstate file; it does acquire a lock to write it.
   1.685 +	  To avoid the possibility of reading a partially written copy
   1.686 +	  of the dirstate file, Mercurial writes to a file with a
   1.687 +	  unique name in the same directory as the dirstate file, then
   1.688 +	  renames the temporary file atomically to
   1.689 +	  <filename>dirstate</filename>.  The file named
   1.690 +	  <filename>dirstate</filename> is thus guaranteed to be
   1.691 +	  complete, not partially written.</para>
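	<para>The write-then-rename technique described above looks
	  roughly like this in Python.  It is a minimal sketch under
	  the assumption of a POSIX-like filesystem, where
	  <literal>os.replace</literal> renames atomically; the
	  function name and the <literal>.tmp-</literal> prefix are
	  made up for the example.</para>

```python
import os
import tempfile


def write_atomically(path, data: bytes):
    """Replace the file at path with data so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (and therefore
    # on the same filesystem) for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

	<para>A reader that opens the target name at any moment gets
	  either the old complete file or the new complete file, never
	  a mixture.</para>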
   1.692 +
   1.693 +      </sect3>
   1.694 +    </sect2>
   1.695 +    <sect2>
   1.696 +      <title>Avoiding seeks</title>
   1.697 +
   1.698 +      <para id="x_331">Critical to Mercurial's performance is the avoidance of
   1.699 +	seeks of the disk head, since any seek is far more expensive
   1.700 +	than even a comparatively large read operation.</para>
   1.701 +
   1.702 +      <para id="x_332">This is why, for example, the dirstate is stored in a
   1.703 +	single file.  If there were a dirstate file per directory that
   1.704 +	Mercurial tracked, the disk would seek once per directory.
   1.705 +	Instead, Mercurial reads the entire single dirstate file in
   1.706 +	one step.</para>
   1.707 +
   1.708 +      <para id="x_333">Mercurial also uses a <quote>copy on write</quote> scheme
   1.709 +	when cloning a repository on local storage.  Instead of
   1.710 +	copying every revlog file from the old repository into the new
   1.711 +	repository, it makes a <quote>hard link</quote>, which is a
   1.712 +	shorthand way to say <quote>these two names point to the same
   1.713 +	  file</quote>.  When Mercurial is about to write to one of a
   1.714 +	revlog's files, it checks to see if the number of names
   1.715 +	pointing at the file is greater than one.  If it is, more than
   1.716 +	one repository is using the file, so Mercurial makes a new
   1.717 +	copy of the file that is private to this repository.</para>
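      <para>The link-count check can be illustrated with a short Python
	sketch.  It assumes a filesystem that supports hard links, and
	the function names and the <literal>.cow</literal> suffix are
	hypothetical; it shows the general idea rather than Mercurial's
	own code.</para>

```python
import os
import shutil


def clone_file(src, dst):
    """'Clone' a file by giving it a second name instead of copying its data."""
    os.link(src, dst)


def append_private(path, data: bytes):
    """Append to path, first breaking any hard link shared with another name."""
    if os.stat(path).st_nlink > 1:
        # More than one name points at this file, so make a private copy
        # and move it into place before writing.
        private = path + ".cow"
        shutil.copyfile(path, private)
        os.replace(private, path)
    with open(path, "ab") as f:
        f.write(data)
```

      <para>After <literal>append_private</literal> runs, the other name
	still refers to the untouched original data, while the written
	file has become an independent copy.</para>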
   1.718 +
   1.719 +      <para id="x_334">A few revision control developers have pointed out that
   1.720 +	this idea of making a complete private copy of a file is not
   1.721 +	very efficient in its use of storage.  While this is true,
   1.722 +	storage is cheap, and this method gives the highest
   1.723 +	performance while deferring most book-keeping to the operating
   1.724 +	system.  An alternative scheme would most likely reduce
   1.725 +	performance and increase the complexity of the software, but
   1.726 +	speed and simplicity are key to the <quote>feel</quote> of
   1.727 +	day-to-day use.</para>
   1.728 +
   1.729 +    </sect2>
   1.730 +    <sect2>
   1.731 +      <title>Other contents of the dirstate</title>
   1.732 +
   1.733 +      <para id="x_335">Because Mercurial doesn't force you to tell it when you're
   1.734 +	modifying a file, it uses the dirstate to store some extra
   1.735 +	information so it can determine efficiently whether you have
   1.736 +	modified a file.  For each file in the working directory, it
   1.737 +	stores the time at which Mercurial itself last modified the file, and the
   1.738 +	size of the file at that time.</para>
   1.739 +
   1.740 +      <para id="x_336">When you explicitly <command role="hg-cmd">hg
   1.741 +	  add</command>, <command role="hg-cmd">hg remove</command>,
   1.742 +	<command role="hg-cmd">hg rename</command> or <command
   1.743 +	  role="hg-cmd">hg copy</command> files, Mercurial updates the
   1.744 +	dirstate so that it knows what to do with those files when you
   1.745 +	commit.</para>
   1.746 +
   1.747 +      <para id="x_337">The dirstate helps Mercurial to efficiently
   1.748 +	  check the status of files in a repository.</para>
   1.749 +
   1.750 +      <itemizedlist>
   1.751 +	<listitem>
   1.752 +	  <para id="x_726">When Mercurial checks the state of a file in the
   1.753 +	    working directory, it first checks a file's modification
   1.754 +	    time against the time in the dirstate that records when
   1.755 +	    Mercurial last wrote the file. If the last modified time
   1.756 +	    is the same as the time when Mercurial wrote the file, the
   1.757 +	    file must not have been modified, so Mercurial does not
   1.758 +	    need to check any further.</para>
   1.759 +	</listitem>
   1.760 +	<listitem>
   1.761 +	  <para id="x_727">If the file's size has changed, the file must have
   1.762 +	    been modified.  If the modification time has changed, but
   1.763 +	    the size has not, only then does Mercurial need to
   1.764 +	    actually read the contents of the file to see if it has
   1.765 +	    changed.</para>
   1.766 +	</listitem>
   1.767 +      </itemizedlist>
   1.768 +
   1.769 +      <para id="x_728">Storing the modification time and size dramatically
   1.770 +	reduces the number of read operations that Mercurial needs to
   1.771 +	perform when we run commands like <command>hg status</command>.
   1.772 +	This results in large performance improvements.</para>
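      <para>The checks described in the list above can be sketched as a
	single function.  This is an illustration only: the
	<literal>DirstateEntry</literal> type is invented for the example,
	and the <literal>recorded_content</literal> argument stands in for
	the committed revision that Mercurial would compare against.</para>

```python
import os
from typing import NamedTuple


class DirstateEntry(NamedTuple):
    """Hypothetical record of what was known when the file was last written."""
    mtime_ns: int
    size: int


def is_modified(path, entry: DirstateEntry, recorded_content: bytes) -> bool:
    """Decide whether a file changed, reading its contents only as a last resort."""
    st = os.stat(path)
    # Same modification time: treat the file as unchanged, no read needed.
    if st.st_mtime_ns == entry.mtime_ns:
        return False
    # Different size: the file must have changed, still no read needed.
    if st.st_size != entry.size:
        return True
    # Time changed but size did not: only now compare the actual contents.
    with open(path, "rb") as f:
        return f.read() != recorded_content
```

      <para>Only the last branch touches the file's data; the first two
	are answered from a single <literal>stat</literal> call.</para>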
   1.773 +    </sect2>
   1.774 +  </sect1>
   1.775 +</chapter>
   1.776 +
   1.777 +<!--
   1.778 +local variables: 
   1.779 +sgml-parent-document: ("00book.xml" "book" "chapter")
   1.780 +end:
   1.781 +-->