hgbook

annotate en/ch03-concepts.xml @ 650:7e7c47481e4f

Oops, this is the real merge for my hg's oddity
author Dongsheng Song <dongsheng.song@gmail.com>
date Fri Mar 20 16:43:35 2009 +0800 (2009-03-20)
parents en/ch04-concepts.xml@a13813534ccd
children 1c13ed2130a7
rev   line source
bos@559 1 <!-- vim: set filetype=docbkxml shiftwidth=2 autoindent expandtab tw=77 : -->
bos@559 2
dongsheng@625 3 <chapter id="chap.concepts">
bos@572 4 <?dbhtml filename="behind-the-scenes.html"?>
bos@559 5 <title>Behind the scenes</title>
bos@559 6
bos@559 7 <para>Unlike many revision control systems, the concepts upon which
bos@559 8 Mercurial is built are simple enough that it's easy to understand
bos@559 9 how the software really works. Knowing this certainly isn't
bos@559 10 necessary, but I find it useful to have a <quote>mental
bos@559 11 model</quote> of what's going on.</para>
bos@559 12
bos@559 13 <para>This understanding gives me confidence that Mercurial has been
bos@559 14 carefully designed to be both <emphasis>safe</emphasis> and
bos@559 15 <emphasis>efficient</emphasis>. And just as importantly, if it's
bos@559 16 easy for me to retain a good idea of what the software is doing
bos@559 17 when I perform a revision control task, I'm less likely to be
bos@559 18 surprised by its behaviour.</para>
bos@559 19
bos@559 20 <para>In this chapter, we'll initially cover the core concepts
bos@559 21 behind Mercurial's design, then continue to discuss some of the
bos@559 22 interesting details of its implementation.</para>
bos@559 23
bos@559 24 <sect1>
bos@559 25 <title>Mercurial's historical record</title>
bos@559 26
bos@559 27 <sect2>
bos@559 28 <title>Tracking the history of a single file</title>
bos@559 29
bos@559 30 <para>When Mercurial tracks modifications to a file, it stores
bos@559 31 the history of that file in a metadata object called a
bos@559 32 <emphasis>filelog</emphasis>. Each entry in the filelog
bos@559 33 contains enough information to reconstruct one revision of the
bos@559 34 file that is being tracked. Filelogs are stored as files in
bos@559 35 the <filename role="special"
bos@559 36 class="directory">.hg/store/data</filename> directory. A
bos@559 37 filelog contains two kinds of information: revision data, and
bos@559 38 an index to help Mercurial to find a revision
bos@559 39 efficiently.</para>
bos@559 40
bos@559 41 <para>A file that is large, or has a lot of history, has its
bos@559 42 filelog stored in separate data
bos@559 43 (<quote><literal>.d</literal></quote> suffix) and index
bos@559 44 (<quote><literal>.i</literal></quote> suffix) files. For
bos@559 45 small files without much history, the revision data and index
bos@559 46 are combined in a single <quote><literal>.i</literal></quote>
bos@559 47 file. The correspondence between a file in the working
bos@559 48 directory and the filelog that tracks its history in the
bos@559 49 repository is illustrated in figure <xref
dongsheng@640 50 endterm="fig.concepts.filelog.caption"
dongsheng@625 51 linkend="fig.concepts.filelog"/>.</para>
dongsheng@625 52
dongsheng@625 53 <informalfigure id="fig.concepts.filelog">
dongsheng@640 54 <mediaobject>
dongsheng@640 55 <imageobject><imagedata fileref="images/filelog.png"/></imageobject>
dongsheng@640 56 <textobject><phrase>Diagram showing how files in the working directory correspond to filelogs in the repository</phrase></textobject>
dongsheng@640 57 <caption><para id="fig.concepts.filelog.caption">Relationships between
dongsheng@640 58 files in working directory and filelogs in repository</para>
dongsheng@640 59 </caption>
dongsheng@640 60 </mediaobject>
bos@559 61 </informalfigure>
bos@559 62
bos@559 63 </sect2>
bos@559 64 <sect2>
bos@559 65 <title>Managing tracked files</title>
bos@559 66
bos@559 67 <para>Mercurial uses a structure called a
bos@559 68 <emphasis>manifest</emphasis> to collect together information
bos@559 69 about the files that it tracks. Each entry in the manifest
bos@559 70 contains information about the files present in a single
bos@559 71 changeset. An entry records which files are present in the
bos@559 72 changeset, the revision of each file, and a few other pieces
bos@559 73 of file metadata.</para>
bos@559 74
bos@559 75 </sect2>
bos@559 76 <sect2>
bos@559 77 <title>Recording changeset information</title>
bos@559 78
bos@559 79 <para>The <emphasis>changelog</emphasis> contains information
bos@559 80 about each changeset. Each revision records who committed a
bos@559 81 change, the changeset comment, other pieces of
bos@559 82 changeset-related information, and the revision of the
bos@559 83 manifest to use.</para>
bos@559 84
bos@559 85 </sect2>
bos@559 86 <sect2>
bos@559 87 <title>Relationships between revisions</title>
bos@559 88
bos@559 89 <para>Within a changelog, a manifest, or a filelog, each
bos@559 90 revision stores a pointer to its immediate parent (or to its
bos@559 91 two parents, if it's a merge revision). As I mentioned above,
bos@559 92 there are also relationships between revisions
bos@559 93 <emphasis>across</emphasis> these structures, and they are
bos@559 94 hierarchical in nature.</para>
bos@559 95
bos@559 96 <para>For every changeset in a repository, there is exactly one
bos@559 97 revision stored in the changelog. Each revision of the
bos@559 98 changelog contains a pointer to a single revision of the
bos@559 99 manifest. A revision of the manifest stores a pointer to a
bos@559 100 single revision of each filelog tracked when that changeset
bos@559 101 was created. These relationships are illustrated in figure
dongsheng@640 102 <xref endterm="fig.concepts.metadata.caption"
dongsheng@640 103 linkend="fig.concepts.metadata"/>.</para>
dongsheng@625 104
dongsheng@625 105 <informalfigure id="fig.concepts.metadata">
dongsheng@640 106 <mediaobject>
dongsheng@640 107 <imageobject><imagedata fileref="images/metadata.png"/></imageobject>
dongsheng@640 108 <textobject><phrase>Diagram of the pointers from changelog revisions to manifest revisions to filelog revisions</phrase></textobject>
dongsheng@640 109 <caption><para id="fig.concepts.metadata.caption">Metadata
dongsheng@640 110 relationships</para></caption>
dongsheng@640 111 </mediaobject>
bos@559 112 </informalfigure>
bos@559 113
bos@559 114 <para>As the illustration shows, there is
bos@559 115 <emphasis>not</emphasis> a <quote>one to one</quote>
bos@559 116 relationship between revisions in the changelog, manifest, or
bos@559 117 filelog. If the manifest hasn't changed between two
bos@559 118 changesets, the changelog entries for those changesets will
bos@559 119 point to the same revision of the manifest. If a file that
bos@559 120 Mercurial tracks hasn't changed between two changesets, the
bos@559 121 entry for that file in the two revisions of the manifest will
bos@559 122 point to the same revision of its filelog.</para>
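<para>As a rough illustration of this hierarchy, here is a toy model
in Python. The file names, users, and comments are invented, and
plain lists and dictionaries stand in for the real revlogs; it is a
sketch of the pointer structure only, not of how Mercurial stores
anything on disk.</para>

<programlisting><![CDATA[
```python
# Toy model of the changelog -> manifest -> filelog hierarchy.
# Every name and value below is invented for illustration.

# Each filelog is a list of file revisions (index = filelog revision).
filelogs = {
    "README": ["hello\n", "hello, world\n"],
    "Makefile": ["all:\n"],
}

# Each manifest revision maps a file name to one filelog revision.
manifest = [
    {"README": 0, "Makefile": 0},  # manifest revision 0
    {"README": 1, "Makefile": 0},  # manifest revision 1: only README changed
]

# Each changelog revision points to exactly one manifest revision.
changelog = [
    {"user": "alice", "comment": "initial import", "manifest": 0},
    {"user": "bob", "comment": "greet the world", "manifest": 1},
    {"user": "alice", "comment": "no file changes", "manifest": 1},
]


def file_at(changeset, name):
    """Follow changelog -> manifest -> filelog to get a file's contents."""
    manifest_rev = changelog[changeset]["manifest"]
    filelog_rev = manifest[manifest_rev][name]
    return filelogs[name][filelog_rev]
```
]]></programlisting>

<para>In this sketch, changesets 1 and 2 share manifest revision 1,
and both manifest revisions share revision 0 of the
<filename>Makefile</filename> filelog, which is the kind of sharing
described above.</para>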
bos@559 123
bos@559 124 </sect2>
bos@559 125 </sect1>
bos@559 126 <sect1>
bos@559 127 <title>Safe, efficient storage</title>
bos@559 128
bos@559 129 <para>The underpinnings of changelogs, manifests, and filelogs are
bos@559 130 provided by a single structure called the
bos@559 131 <emphasis>revlog</emphasis>.</para>
bos@559 132
bos@559 133 <sect2>
bos@559 134 <title>Efficient storage</title>
bos@559 135
bos@559 136 <para>The revlog provides efficient storage of revisions using a
bos@559 137 <emphasis>delta</emphasis> mechanism. Instead of storing a
bos@559 138 complete copy of a file for each revision, it stores the
bos@559 139 changes needed to transform an older revision into the new
bos@559 140 revision. For many kinds of file data, these deltas are
bos@559 141 typically a fraction of a percent of the size of a full copy
bos@559 142 of a file.</para>
bos@559 143
bos@559 144 <para>Some obsolete revision control systems can only work with
bos@559 145 deltas of text files. They must either store binary files as
bos@559 146 complete snapshots or encoded into a text representation, both
bos@559 147 of which are wasteful approaches. Mercurial can efficiently
bos@559 148 handle deltas of files with arbitrary binary contents; it
bos@559 149 doesn't need to treat text as special.</para>
bos@559 150
bos@559 151 </sect2>
dongsheng@625 152 <sect2 id="sec.concepts.txn">
bos@559 153 <title>Safe operation</title>
bos@559 154
bos@559 155 <para>Mercurial only ever <emphasis>appends</emphasis> data to
bos@559 156 the end of a revlog file. It never modifies a section of a
bos@559 157 file after it has written it. This is both more robust and more
bos@559 158 efficient than schemes that need to modify or rewrite
bos@559 159 data.</para>
bos@559 160
bos@559 161 <para>In addition, Mercurial treats every write as part of a
bos@559 162 <emphasis>transaction</emphasis> that can span a number of
bos@559 163 files. A transaction is <emphasis>atomic</emphasis>: either
bos@559 164 the entire transaction succeeds and its effects are all
bos@559 165 visible to readers in one go, or the whole thing is undone.
bos@559 166 This guarantee of atomicity means that if you're running two
bos@559 167 copies of Mercurial, where one is reading data and one is
bos@559 168 writing it, the reader will never see a partially written
bos@559 169 result that might confuse it.</para>
bos@559 170
bos@559 171 <para>The fact that Mercurial only appends to files makes it
bos@559 172 easier to provide this transactional guarantee. The easier it
bos@559 173 is to do stuff like this, the more confident you should be
bos@559 174 that it's done correctly.</para>
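<para>To show why append-only files make this easy, here is a
minimal sketch. The <literal>ToyTransaction</literal> class and the
file names are hypothetical, and the journal lives in memory rather
than on disk; Mercurial's real transaction code is more involved. The
point is only that undoing an append means cutting a file back to the
length it had before.</para>

<programlisting><![CDATA[
```python
import os
import tempfile


class ToyTransaction:
    """Append-only writes that can be undone by truncating files."""

    def __init__(self):
        self.journal = {}  # path -> file size before our first append

    def append(self, path, data):
        if path not in self.journal:
            size = os.path.getsize(path) if os.path.exists(path) else 0
            self.journal[path] = size
        with open(path, "ab") as f:
            f.write(data)

    def rollback(self):
        # Undo every append by cutting each file back to its old length.
        for path, size in self.journal.items():
            with open(path, "r+b") as f:
                f.truncate(size)
        self.journal.clear()

    def commit(self):
        # Nothing to rewrite: the appended data simply stays in place.
        self.journal.clear()


workdir = tempfile.mkdtemp()
log_a = os.path.join(workdir, "a.i")  # hypothetical revlog files
log_b = os.path.join(workdir, "b.i")

tx = ToyTransaction()
tx.append(log_a, b"rev0")
tx.commit()

tx = ToyTransaction()
tx.append(log_a, b"rev1")
tx.append(log_b, b"rev0")
tx.rollback()  # pretend something failed part of the way through
```
]]></programlisting>

<para>After the rollback, the first file holds only the data from
the committed transaction, and the second is empty again.</para>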
bos@559 175
bos@559 176 </sect2>
bos@559 177 <sect2>
bos@559 178 <title>Fast retrieval</title>
bos@559 179
bos@559 180 <para>Mercurial cleverly avoids a pitfall common to all earlier
bos@559 181 revision control systems: the problem of <emphasis>inefficient
bos@559 182 retrieval</emphasis>. Most revision control systems store
bos@559 183 the contents of a revision as an incremental series of
bos@559 184 modifications against a <quote>snapshot</quote>. To
bos@559 185 reconstruct a specific revision, you must first read the
bos@559 186 snapshot, and then every one of the revisions between the
bos@559 187 snapshot and your target revision. The more history that a
bos@559 188 file accumulates, the more revisions you must read, hence the
bos@559 189 longer it takes to reconstruct a particular revision.</para>
bos@559 190
dongsheng@625 191 <informalfigure id="fig.concepts.snapshot">
dongsheng@640 192 <mediaobject>
dongsheng@640 193 <imageobject><imagedata fileref="images/snapshot.png"/></imageobject>
dongsheng@640 194 <textobject><phrase>Diagram of a revlog in which periodic snapshots are followed by runs of incremental deltas</phrase></textobject>
dongsheng@640 195 <caption><para id="fig.concepts.snapshot.caption">Snapshot of
dongsheng@640 196 a revlog, with incremental deltas</para></caption>
dongsheng@640 197 </mediaobject>
bos@559 198 </informalfigure>
bos@559 199
bos@559 200 <para>The innovation that Mercurial applies to this problem is
bos@559 201 simple but effective. Once the cumulative amount of delta
bos@559 202 information stored since the last snapshot exceeds a fixed
bos@559 203 threshold, it stores a new snapshot (compressed, of course),
bos@559 204 instead of another delta. This makes it possible to
bos@559 205 reconstruct <emphasis>any</emphasis> revision of a file
bos@559 206 quickly. This approach works so well that it has since been
bos@559 207 copied by several other revision control systems.</para>
bos@559 208
dongsheng@640 209 <para>Figure <xref endterm="fig.concepts.snapshot.caption"
dongsheng@640 210 linkend="fig.concepts.snapshot"/> illustrates
bos@559 211 the idea. In an entry in a revlog's index file, Mercurial
bos@559 212 stores the range of entries from the data file that it must
bos@559 213 read to reconstruct a particular revision.</para>
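<para>The following sketch makes the idea concrete. The
<literal>ToyRevlog</literal> class, its <literal>threshold</literal>
parameter, and the simple prefix/suffix delta format are all invented
for this illustration; Mercurial's real delta algorithm and threshold
are different. Each entry remembers the snapshot its chain starts
from, so reading a revision touches only the entries from that
snapshot onwards.</para>

<programlisting><![CDATA[
```python
def make_delta(old, new):
    """Describe `new` as: keep a prefix of `old`, insert bytes, keep a suffix."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return prefix, suffix, new[prefix:len(new) - suffix]


def apply_delta(old, delta):
    prefix, suffix, middle = delta
    return old[:prefix] + middle + old[len(old) - suffix:]


class ToyRevlog:
    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical cap on delta bytes per chain
        self.entries = []           # (kind, payload, base) for each revision
        self.chain_bytes = 0        # delta bytes stored since the last snapshot

    def add(self, text):
        if self.entries:
            delta = make_delta(self.read(len(self.entries) - 1), text)
            cost = len(delta[2])
            if self.chain_bytes + cost <= self.threshold:
                base = self.entries[-1][2]
                self.entries.append(("delta", delta, base))
                self.chain_bytes += cost
                return
        # First revision, or the chain has grown too large: store a snapshot.
        self.entries.append(("snapshot", text, len(self.entries)))
        self.chain_bytes = 0

    def read(self, rev):
        # Start from the snapshot, then apply only the deltas up to `rev`.
        base = self.entries[rev][2]
        text = self.entries[base][1]
        for _kind, delta, _base in self.entries[base + 1:rev + 1]:
            text = apply_delta(text, delta)
        return text


texts = [
    b"line one\n",
    b"line one\nline two\n",
    b"line one\nline 2\n",
    b"completely different contents\n",
    b"completely different contents!\n",
]
log = ToyRevlog(threshold=16)
for t in texts:
    log.add(t)
kinds = [entry[0] for entry in log.entries]
```
]]></programlisting>

<para>With these inputs, the large rewrite in the fourth revision
pushes the chain over the threshold, so it is stored as a fresh
snapshot and the fifth revision becomes a delta against it.</para>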
bos@559 214
bos@559 215 <sect3>
bos@559 216 <title>Aside: the influence of video compression</title>
bos@559 217
bos@559 218 <para>If you're familiar with video compression or have ever
bos@559 219 watched a TV feed through a digital cable or satellite
bos@559 220 service, you may know that most video compression schemes
bos@559 221 store each frame of video as a delta against its predecessor
bos@559 222 frame. In addition, these schemes use <quote>lossy</quote>
bos@559 223 compression techniques to increase the compression ratio, so
bos@559 224 visual errors accumulate over the course of a number of
bos@559 225 inter-frame deltas.</para>
bos@559 226
bos@559 227 <para>Because it's possible for a video stream to <quote>drop
bos@559 228 out</quote> occasionally due to signal glitches, and to
bos@559 229 limit the accumulation of artefacts introduced by the lossy
bos@559 230 compression process, video encoders periodically insert a
bos@559 231 complete frame (called a <quote>key frame</quote>) into the
bos@559 232 video stream; the next delta is generated against that
bos@559 233 frame. This means that if the video signal gets
bos@559 234 interrupted, it will resume once the next key frame is
bos@559 235 received. Also, the accumulation of encoding errors
bos@559 236 restarts anew with each key frame.</para>
bos@559 237
bos@559 238 </sect3>
bos@559 239 </sect2>
bos@559 240 <sect2>
bos@559 241 <title>Identification and strong integrity</title>
bos@559 242
bos@559 243 <para>Along with delta or snapshot information, a revlog entry
bos@559 244 contains a cryptographic hash of the data that it represents.
bos@559 245 This makes it difficult to forge the contents of a revision,
bos@559 246 and easy to detect accidental corruption.</para>
bos@559 247
bos@559 248 <para>Hashes provide more than a mere check against corruption;
bos@559 249 they are used as the identifiers for revisions. The changeset
bos@559 250 identification hashes that you see as an end user are from
bos@559 251 revisions of the changelog. Although filelogs and the
bos@559 252 manifest also use hashes, Mercurial only uses these behind the
bos@559 253 scenes.</para>
bos@559 254
bos@559 255 <para>Mercurial verifies that hashes are correct when it
bos@559 256 retrieves file revisions and when it pulls changes from
bos@559 257 another repository. If it encounters an integrity problem, it
bos@559 258 will complain and stop whatever it's doing.</para>
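<para>Here is a small sketch of how such an identifier can be
computed and checked. It assumes SHA-1 as the hash and assumes that
the IDs of a revision's two parents are hashed along with its
contents, with an all-zero ID standing in for a missing parent; the
text above does not spell out these details, so treat them as
assumptions. The function names are hypothetical.</para>

<programlisting><![CDATA[
```python
import hashlib

NULL_ID = b"\0" * 20  # assumed "no parent" marker: twenty zero bytes


def node_id(text, p1=NULL_ID, p2=NULL_ID):
    """Hash a revision's contents together with its sorted parent IDs."""
    first, second = sorted([p1, p2])
    h = hashlib.sha1(first)
    h.update(second)
    h.update(text)
    return h.digest()


def verify(text, p1, p2, expected):
    """Return True if the stored ID still matches the stored data."""
    return node_id(text, p1, p2) == expected


root = node_id(b"hello\n")
child = node_id(b"hello, world\n", root)
intact = verify(b"hello, world\n", root, NULL_ID, child)
damaged = verify(b"hello, w0rld\n", root, NULL_ID, child)
```
]]></programlisting>

<para>Changing even one byte of the contents gives a different hash,
so the check on the damaged copy fails.</para>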
bos@559 259
bos@559 260 <para>In addition to the effect it has on retrieval efficiency,
bos@559 261 Mercurial's use of periodic snapshots makes it more robust
bos@559 262 against partial data corruption. If a revlog becomes partly
bos@559 263 corrupted due to a hardware error or system bug, it's often
bos@559 264 possible to reconstruct some or most revisions from the
bos@559 265 uncorrupted sections of the revlog, both before and after the
bos@559 266 corrupted section. This would not be possible with a
bos@559 267 delta-only storage model.</para>
bos@559 268
bos@559 269 </sect2>
bos@559 270 </sect1>
bos@559 271 <sect1>
bos@559 272 <title>Revision history, branching, and merging</title>
bos@559 273
bos@559 274 <para>Every entry in a Mercurial revlog knows the identity of its
bos@559 275 immediate ancestor revision, usually referred to as its
bos@559 276 <emphasis>parent</emphasis>. In fact, a revision contains room
bos@559 277 for not one parent, but two. Mercurial uses a special hash,
bos@559 278 called the <quote>null ID</quote>, to represent the idea
bos@559 279 <quote>there is no parent here</quote>. This hash is simply a
bos@559 280 string of zeroes.</para>
bos@559 281
dongsheng@640 282 <para>In figure <xref endterm="fig.concepts.revlog.caption"
dongsheng@640 283 linkend="fig.concepts.revlog"/>, you can see
bos@559 284 an example of the conceptual structure of a revlog. Filelogs,
bos@559 285 manifests, and changelogs all have this same structure; they
bos@559 286 differ only in the kind of data stored in each delta or
bos@559 287 snapshot.</para>
bos@559 288
bos@559 289 <para>The first revision in a revlog (at the bottom of the image)
bos@559 290 has the null ID in both of its parent slots. For a
bos@559 291 <quote>normal</quote> revision, its first parent slot contains
bos@559 292 the ID of its parent revision, and its second contains the null
bos@559 293 ID, indicating that the revision has only one real parent. Any
bos@559 294 two revisions that have the same parent ID are branches. A
bos@559 295 revision that represents a merge between branches has two normal
bos@559 296 revision IDs in its parent slots.</para>
bos@559 297
dongsheng@625 298 <informalfigure id="fig.concepts.revlog">
dongsheng@640 299 <mediaobject>
dongsheng@640 300 <imageobject><imagedata fileref="images/revlog.png"/></imageobject>
dongsheng@640 301 <textobject><phrase>Diagram of the conceptual structure of a revlog, with the first revision at the bottom</phrase></textobject>
dongsheng@640 302 <caption><para id="fig.concepts.revlog.caption">Revision in revlog</para>
dongsheng@640 303 </caption>
dongsheng@640 304 </mediaobject>
bos@559 305 </informalfigure>
bos@559 306
bos@559 307 </sect1>
bos@559 308 <sect1>
bos@559 309 <title>The working directory</title>
bos@559 310
bos@559 311 <para>In the working directory, Mercurial stores a snapshot of the
bos@559 312 files from the repository as of a particular changeset.</para>
bos@559 313
bos@559 314 <para>The working directory <quote>knows</quote> which changeset
bos@559 315 it contains. When you update the working directory to contain a
bos@559 316 particular changeset, Mercurial looks up the appropriate
bos@559 317 revision of the manifest to find out which files it was tracking
bos@559 318 at the time that changeset was committed, and which revision of
bos@559 319 each file was then current. It then recreates a copy of each of
bos@559 320 those files, with the same contents it had when the changeset
bos@559 321 was committed.</para>
bos@559 322
bos@559 323 <para>The <emphasis>dirstate</emphasis> contains Mercurial's
bos@559 324 knowledge of the working directory. This details which
bos@559 325 changeset the working directory is updated to, and all of the
bos@559 326 files that Mercurial is tracking in the working
bos@559 327 directory.</para>
bos@559 328
bos@559 329 <para>Just as a revision of a revlog has room for two parents, so
bos@559 330 that it can represent either a normal revision (with one parent)
bos@559 331 or a merge of two earlier revisions, the dirstate has slots for
bos@559 332 two parents. When you use the <command role="hg-cmd">hg
bos@559 333 update</command> command, the changeset that you update to is
bos@559 334 stored in the <quote>first parent</quote> slot, and the null ID
bos@559 335 in the second. When you <command role="hg-cmd">hg
bos@559 336 merge</command> with another changeset, the first parent
bos@559 337 remains unchanged, and the second parent is filled in with the
bos@559 338 changeset you're merging with. The <command role="hg-cmd">hg
bos@559 339 parents</command> command tells you what the parents of the
bos@559 340 dirstate are.</para>
bos@559 341
bos@559 342 <sect2>
bos@559 343 <title>What happens when you commit</title>
bos@559 344
bos@559 345 <para>The dirstate stores parent information for more than just
bos@559 346 book-keeping purposes. Mercurial uses the parents of the
bos@559 347 dirstate as <emphasis>the parents of a new
bos@559 348 changeset</emphasis> when you perform a commit.</para>
bos@559 349
dongsheng@625 350 <informalfigure id="fig.concepts.wdir">
dongsheng@640 351 <mediaobject>
dongsheng@640 352 <imageobject><imagedata fileref="images/wdir.png"/></imageobject>
dongsheng@640 353 <textobject><phrase>Diagram of the working directory with a single changeset, the tip, as its parent</phrase></textobject>
dongsheng@640 354 <caption><para id="fig.concepts.wdir.caption">The working
dongsheng@640 355 directory can have two parents</para></caption>
dongsheng@640 356 </mediaobject>
bos@559 357 </informalfigure>
bos@559 358
dongsheng@640 359 <para>Figure <xref endterm="fig.concepts.wdir.caption"
dongsheng@640 360 linkend="fig.concepts.wdir"/> shows the
bos@559 361 normal state of the working directory, where it has a single
bos@559 362 changeset as parent. That changeset is the
bos@559 363 <emphasis>tip</emphasis>, the newest changeset in the
bos@559 364 repository that has no children.</para>
bos@559 365
dongsheng@625 366 <informalfigure id="fig.concepts.wdir-after-commit">
dongsheng@640 367 <mediaobject>
dongsheng@640 368 <imageobject><imagedata fileref="images/wdir-after-commit.png"/>
dongsheng@640 369 </imageobject>
dongsheng@640 370 <textobject><phrase>Diagram of the working directory after a commit, with the new changeset as its first parent</phrase></textobject>
dongsheng@640 371 <caption><para id="fig.concepts.wdir-after-commit.caption">The working
dongsheng@640 372 directory gains new parents after a commit</para></caption>
dongsheng@640 373 </mediaobject>
bos@559 374 </informalfigure>
bos@559 375
bos@559 376 <para>It's useful to think of the working directory as
bos@559 377 <quote>the changeset I'm about to commit</quote>. Any files
bos@559 378 that you tell Mercurial that you've added, removed, renamed,
bos@559 379 or copied will be reflected in that changeset, as will
bos@559 380 modifications to any files that Mercurial is already tracking;
bos@559 381 the new changeset will have the parents of the working
bos@559 382 directory as its parents.</para>
bos@559 383
bos@559 384 <para>After a commit, Mercurial will update the parents of the
bos@559 385 working directory, so that the first parent is the ID of the
bos@559 386 new changeset, and the second is the null ID. This is shown
dongsheng@640 387 in figure <xref endterm="fig.concepts.wdir-after-commit.caption"
dongsheng@640 388 linkend="fig.concepts.wdir-after-commit"/>.
bos@559 389 Mercurial
bos@559 390 doesn't touch any of the files in the working directory when
bos@559 391 you commit; it just modifies the dirstate to note its new
bos@559 392 parents.</para>
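<para>The way the two parent slots change can be sketched in a few
lines of Python. The <literal>ToyDirstate</literal> class is a
hypothetical model, the single-letter changeset IDs are stand-ins for
real hashes, and the null ID is assumed to be written as forty zero
digits.</para>

<programlisting><![CDATA[
```python
NULL_ID = "0" * 40  # assumed hexadecimal form of the null ID


class ToyDirstate:
    def __init__(self):
        self.parents = (NULL_ID, NULL_ID)

    def update(self, changeset):
        # "hg update": the first parent becomes the target changeset.
        self.parents = (changeset, NULL_ID)

    def merge(self, other):
        # "hg merge": the first parent stays, the second is filled in.
        self.parents = (self.parents[0], other)

    def commit(self, new_id, history):
        # The new changeset takes the dirstate's parents as its own ...
        history[new_id] = self.parents
        # ... and then becomes the sole parent of the working directory.
        self.parents = (new_id, NULL_ID)


history = {}  # changeset ID -> (first parent, second parent)
ds = ToyDirstate()
ds.commit("a", history)
ds.commit("b", history)
ds.update("a")
ds.commit("c", history)
ds.merge("b")
parents_during_merge = ds.parents
ds.commit("d", history)
```
]]></programlisting>

<para>Note that <literal>commit</literal> here only records parents;
like the real command, it leaves the files in the working directory
alone.</para>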
bos@559 393
bos@559 394 </sect2>
bos@559 395 <sect2>
bos@559 396 <title>Creating a new head</title>
bos@559 397
bos@559 398 <para>It's perfectly normal to update the working directory to a
bos@559 399 changeset other than the current tip. For example, you might
bos@559 400 want to know what your project looked like last Tuesday, or
bos@559 401 you could be looking through changesets to see which one
bos@559 402 introduced a bug. In cases like this, the natural thing to do
bos@559 403 is update the working directory to the changeset you're
bos@559 404 interested in, and then examine the files in the working
bos@559 405 directory directly to see their contents as they were when you
bos@559 406 committed that changeset. The effect of this is shown in
dongsheng@640 407 figure <xref endterm="fig.concepts.wdir-pre-branch.caption"
dongsheng@640 408 linkend="fig.concepts.wdir-pre-branch"/>.</para>
dongsheng@625 409
dongsheng@625 410 <informalfigure id="fig.concepts.wdir-pre-branch">
dongsheng@640 411 <mediaobject>
dongsheng@640 412 <imageobject><imagedata fileref="images/wdir-pre-branch.png"/>
dongsheng@640 413 </imageobject>
dongsheng@640 414 <textobject><phrase>Diagram of the working directory updated to a changeset older than the tip</phrase></textobject>
dongsheng@640 415 <caption><para id="fig.concepts.wdir-pre-branch.caption">The working
dongsheng@640 416 directory, updated to an older changeset</para></caption>
dongsheng@640 417 </mediaobject>
bos@559 418 </informalfigure>
bos@559 419
bos@559 420 <para>Having updated the working directory to an older
bos@559 421 changeset, what happens if you make some changes, and then
bos@559 422 commit? Mercurial behaves in the same way as I outlined
bos@559 423 above. The parents of the working directory become the
bos@559 424 parents of the new changeset. This new changeset has no
bos@559 425 children, so it becomes the new tip. And the repository now
bos@559 426 contains two changesets that have no children; we call these
bos@559 427 <emphasis>heads</emphasis>. You can see the structure that
bos@559 428 this creates in figure <xref
dongsheng@640 429 endterm="fig.concepts.wdir-branch.caption"
dongsheng@625 430 linkend="fig.concepts.wdir-branch"/>.</para>
dongsheng@625 431
dongsheng@625 432 <informalfigure id="fig.concepts.wdir-branch">
dongsheng@640 433 <mediaobject>
dongsheng@640 434 <imageobject><imagedata fileref="images/wdir-branch.png"/>
dongsheng@640 435 </imageobject>
dongsheng@640 436 <textobject><phrase>Diagram of a repository with two heads after a commit made from an older changeset</phrase></textobject>
dongsheng@640 437 <caption><para id="fig.concepts.wdir-branch.caption">After a
dongsheng@640 438 commit made while synced to an older changeset</para></caption>
dongsheng@640 439 </mediaobject>
bos@559 440 </informalfigure>
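<para>Since a head is simply a changeset with no children, heads can
be found from the parent pointers alone. The sketch below uses a
hypothetical <literal>heads</literal> function over a dictionary of
invented single-letter changeset IDs, with <literal>None</literal>
standing in for the null ID.</para>

<programlisting><![CDATA[
```python
NULL_ID = None  # stand-in for the null ID in this toy model


def heads(history):
    """Return the changesets that are not a parent of any other changeset."""
    have_children = set()
    for p1, p2 in history.values():
        have_children.update(p for p in (p1, p2) if p is not NULL_ID)
    return sorted(c for c in history if c not in have_children)


# A linear history: a, then b, then c.
history = {
    "a": (NULL_ID, NULL_ID),
    "b": ("a", NULL_ID),
    "c": ("b", NULL_ID),
}
before = heads(history)

# Commit "d" while the working directory is updated to the older "b".
history["d"] = ("b", NULL_ID)
after = heads(history)
```
]]></programlisting>

<para>Before the extra commit the only head is the tip; afterwards
both the old tip and the new changeset are heads.</para>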
bos@559 441
bos@559 442 <note>
bos@559 443 <para> If you're new to Mercurial, you should keep in mind a
bos@559 444 common <quote>error</quote>, which is to use the <command
bos@559 445 role="hg-cmd">hg pull</command> command without any
bos@559 446 options. By default, the <command role="hg-cmd">hg
bos@559 447 pull</command> command <emphasis>does not</emphasis>
bos@559 448 update the working directory, so you'll bring new changesets
bos@559 449 into your repository, but the working directory will stay
bos@559 450 synced at the same changeset as before the pull. If you
bos@559 451 make some changes and commit afterwards, you'll thus create
bos@559 452 a new head, because your working directory isn't synced to
bos@559 453 whatever the current tip is.</para>
bos@559 454
bos@559 455 <para> I put the word <quote>error</quote> in quotes because
bos@559 456 all that you need to do to rectify this situation is
bos@559 457 <command role="hg-cmd">hg merge</command>, then <command
bos@559 458 role="hg-cmd">hg commit</command>. In other words, this
bos@559 459 almost never has negative consequences; it just surprises
bos@559 460 people. I'll discuss other ways to avoid this behaviour,
bos@559 461 and why Mercurial behaves in this initially surprising way,
bos@559 462 later on.</para>
bos@559 463 </note>
bos@559 464
bos@559 465 </sect2>
bos@559 466 <sect2>
bos@559 467 <title>Merging heads</title>
bos@559 468
bos@559 469 <para>When you run the <command role="hg-cmd">hg merge</command>
bos@559 470 command, Mercurial leaves the first parent of the working
bos@559 471 directory unchanged, and sets the second parent to the
bos@559 472 changeset you're merging with, as shown in figure <xref
dongsheng@640 473 endterm="fig.concepts.wdir-merge.caption"
dongsheng@625 474 linkend="fig.concepts.wdir-merge"/>.</para>
dongsheng@625 475
dongsheng@625 476 <informalfigure id="fig.concepts.wdir-merge">
dongsheng@640 477 <mediaobject>
dongsheng@640 478 <imageobject><imagedata fileref="images/wdir-merge.png"/>
dongsheng@640 479 </imageobject>
dongsheng@640 480 <textobject><phrase>Diagram of the working directory during a merge, with the other head as its second parent</phrase></textobject>
dongsheng@640 481 <caption><para id="fig.concepts.wdir-merge.caption">Merging two
dongsheng@640 482 heads</para></caption>
dongsheng@640 483 </mediaobject>
bos@559 484 </informalfigure>
bos@559 485
bos@559 486 <para>Mercurial also has to modify the working directory, to
bos@559 487 merge the files managed in the two changesets. Simplified a
bos@559 488 little, the merging process goes like this, for every file in
bos@559 489 the manifests of both changesets:</para>
bos@559 490 <itemizedlist>
bos@559 491 <listitem><para>If neither changeset has modified a file, do
bos@559 492 nothing with that file.</para>
bos@559 493 </listitem>
bos@559 494 <listitem><para>If one changeset has modified a file, and the
bos@559 495 other hasn't, create the modified copy of the file in the
bos@559 496 working directory.</para>
bos@559 497 </listitem>
bos@559 498 <listitem><para>If one changeset has removed a file, and the
bos@559 499 other hasn't (or has also deleted it), delete the file
bos@559 500 from the working directory.</para>
bos@559 501 </listitem>
bos@559 502 <listitem><para>If one changeset has removed a file, but the
bos@559 503 other has modified the file, ask the user what to do: keep
bos@559 504 the modified file, or remove it?</para>
bos@559 505 </listitem>
bos@559 506 <listitem><para>If both changesets have modified a file,
bos@559 507 invoke an external merge program to choose the new
bos@559 508 contents for the merged file. This may require input from
bos@559 509 the user.</para>
bos@559 510 </listitem>
bos@559 511 <listitem><para>If one changeset has modified a file, and the
bos@559 512 other has renamed or copied the file, make sure that the
bos@559 513 changes follow the new name of the file.</para>
bos@559 514 </listitem></itemizedlist>
bos@559 515 <para>There are more details&emdash;merging has plenty of corner
bos@559 516 cases&emdash;but these are the most common choices that are
bos@559 517 involved in a merge. As you can see, most cases are
bos@559 518 completely automatic, and indeed most merges finish
bos@559 519 automatically, without requiring your input to resolve any
bos@559 520 conflicts.</para>
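<para>The simplified rules above can be written as a small decision
function. This is only a sketch: the function name and the three
state labels are hypothetical, each state is assumed to describe what
a changeset did to the file relative to the common ancestor of the two
changesets, and the rename and copy case is left out.</para>

<programlisting><![CDATA[
```python
def merge_action(local, other):
    """Pick an action from how each changeset treated one file.

    Each argument is "unchanged", "modified", or "removed".
    """
    states = {local, other}
    if states == {"unchanged"}:
        return "do nothing"
    if states == {"unchanged", "modified"}:
        return "take the modified copy"
    if states == {"removed"} or states == {"unchanged", "removed"}:
        return "delete the file"
    if states == {"modified", "removed"}:
        return "ask the user: keep or remove?"
    if states == {"modified"}:
        return "run the merge program"
    raise ValueError("unknown state in %r" % (states,))


actions = {
    (a, b): merge_action(a, b)
    for a in ("unchanged", "modified", "removed")
    for b in ("unchanged", "modified", "removed")
}
```
]]></programlisting>

<para>Of the nine combinations in <literal>actions</literal>, only
three can need a decision from the user: the two in which one side
modified the file and the other removed it, and the one in which both
sides modified it.</para>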
bos@559 521
bos@559 522 <para>When you're thinking about what happens when you commit
bos@559 523 after a merge, once again the working directory is <quote>the
bos@559 524 changeset I'm about to commit</quote>. After the <command
bos@559 525 role="hg-cmd">hg merge</command> command completes, the
bos@559 526 working directory has two parents; these will become the
bos@559 527 parents of the new changeset.</para>
bos@559 528
bos@559 529 <para>Mercurial lets you perform multiple merges, but you must
bos@559 530 commit the results of each individual merge as you go. This
bos@559 531 is necessary because Mercurial only tracks two parents for
bos@559 532 both revisions and the working directory. While it would be
bos@559 533 technically possible to merge multiple changesets at once, the
bos@559 534 prospect of user confusion and making a terrible mess of a
bos@559 535 merge immediately becomes overwhelming.</para>
bos@559 536
bos@559 537 </sect2>
bos@559 538 </sect1>
bos@559 539 <sect1>
bos@559 540 <title>Other interesting design features</title>
bos@559 541
bos@559 542 <para>In the sections above, I've tried to highlight some of the
bos@559 543 most important aspects of Mercurial's design, to illustrate that
bos@559 544 it pays careful attention to reliability and performance.
bos@559 545 However, the attention to detail doesn't stop there. There are
bos@559 546 a number of other aspects of Mercurial's construction that I
bos@559 547 personally find interesting. I'll detail a few of them here,
bos@559 548 separate from the <quote>big ticket</quote> items above, so that
bos@559 549 if you're interested, you can gain a better idea of the amount
bos@559 550 of thinking that goes into a well-designed system.</para>
bos@559 551
bos@559 552 <sect2>
bos@559 553 <title>Clever compression</title>
bos@559 554
bos@559 555 <para>When appropriate, Mercurial will store both snapshots and
bos@559 556 deltas in compressed form. It does this by always
bos@559 557 <emphasis>trying to</emphasis> compress a snapshot or delta,
bos@559 558 but only storing the compressed version if it's smaller than
bos@559 559 the uncompressed version.</para>
bos@559 560
bos@559 561 <para>This means that Mercurial does <quote>the right
bos@559 562 thing</quote> when storing a file whose native form is
bos@559 563 compressed, such as a <literal>zip</literal> archive or a JPEG
bos@559 564 image. When these types of files are compressed a second
bos@559 565 time, the resulting file is usually bigger than the
bos@559 566 once-compressed form, and so Mercurial will store the plain
bos@559 567 <literal>zip</literal> or JPEG.</para>
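<para>The <quote>try it, and keep whichever is smaller</quote> rule
is easy to sketch with Python's <literal>zlib</literal> module. The
<literal>store</literal> and <literal>load</literal> functions and the
one-letter flags are hypothetical, and random bytes stand in for data
that is already compressed.</para>

<programlisting><![CDATA[
```python
import os
import zlib


def store(chunk):
    """Return (flag, data), compressed only when that is actually smaller."""
    packed = zlib.compress(chunk)
    if len(packed) < len(chunk):
        return "z", packed
    return "u", chunk


def load(flag, data):
    """Undo store(): decompress only if the chunk was stored compressed."""
    return zlib.decompress(data) if flag == "z" else data


text = b"the quick brown fox\n" * 50  # repetitive, so it compresses well
noise = os.urandom(1000)              # stands in for a zip or JPEG

text_flag, text_data = store(text)
noise_flag, noise_data = store(noise)
```
]]></programlisting>

<para>The repetitive text is kept in compressed form, while the
incompressible bytes are kept exactly as they were given.</para>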
bos@559 568
bos@559 569 <para>Deltas between revisions of a compressed file are usually
bos@559 570 larger than snapshots of the file, and Mercurial again does
bos@559 571 <quote>the right thing</quote> in these cases. It finds that
bos@559 572 such a delta exceeds the threshold at which it should store a
bos@559 573 complete snapshot of the file, so it stores the snapshot,
bos@559 574 again saving space compared to a naive delta-only
bos@559 575 approach.</para>
bos@559 576
bos@559 577 <sect3>
bos@559 578 <title>Network recompression</title>
bos@559 579
bos@559 580 <para>When storing revisions on disk, Mercurial uses the
bos@559 581 <quote>deflate</quote> compression algorithm (the same one
bos@559 582 used by the popular <literal>zip</literal> archive format),
bos@559 583 which balances good speed with a respectable compression
bos@559 584 ratio. However, when transmitting revision data over a
bos@559 585 network connection, Mercurial uncompresses the compressed
bos@559 586 revision data.</para>
bos@559 587
bos@559 588 <para>If the connection is over HTTP, Mercurial recompresses
bos@559 589 the entire stream of data using a compression algorithm that
bos@559 590 gives a better compression ratio (the Burrows-Wheeler
bos@559 591 algorithm from the widely used <literal>bzip2</literal>
bos@559 592 compression package). This combination of algorithm and
bos@559 593 compression of the entire stream (instead of a revision at a
bos@559 594 time) substantially reduces the number of bytes to be
bos@559 595 transferred, yielding better network performance over almost
bos@559 596 all kinds of network.</para>
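<para>The effect can be illustrated with a small Python experiment.
This is only a sketch: the sample revisions are invented, and it
compares per-revision <literal>zlib</literal> compression with
<literal>bz2</literal> compression of one concatenated stream, so both
the change of algorithm and the change of granularity contribute to
the difference.</para>

```python
import bz2
import zlib

# Hypothetical revisions: many small chunks of text that resemble each other.
revisions = [
    ("def handler_%d(request):\n    return respond(request, %d)\n" % (i, i)).encode()
    for i in range(500)
]

# Compress each revision on its own, as a stand-in for on-disk storage.
per_revision = sum(len(zlib.compress(r)) for r in revisions)

# Compress the concatenated stream in one go, as a stand-in for the wire.
whole_stream = len(bz2.compress(b"".join(revisions)))

print("per revision:", per_revision, "bytes")
print("whole stream:", whole_stream, "bytes")
```

<para>Small chunks have little internal redundancy to exploit, while
the whole stream lets the compressor reuse patterns shared between
revisions.</para>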
bos@559 597
bos@559 598 <para>(If the connection is over <command>ssh</command>,
bos@559 599 Mercurial <emphasis>doesn't</emphasis> recompress the
bos@559 600 stream, because <command>ssh</command> can already do this
bos@559 601 itself.)</para>
bos@559 602
bos@559 603 </sect3>
bos@559 604 </sect2>
bos@559 605 <sect2>
bos@559 606 <title>Read/write ordering and atomicity</title>
bos@559 607
bos@559 608 <para>Appending to files isn't the whole story when it comes to
bos@559 609 guaranteeing that a reader won't see a partial write. If you
dongsheng@640 610 recall figure <xref endterm="fig.concepts.metadata.caption"
dongsheng@640 611 linkend="fig.concepts.metadata"/>, revisions in the
bos@559 612 changelog point to revisions in the manifest, and revisions in
bos@559 613 the manifest point to revisions in filelogs. This hierarchy
bos@559 614 is deliberate.</para>
bos@559 615
bos@559 616 <para>A writer starts a transaction by writing filelog and
bos@559 617 manifest data, and doesn't write any changelog data until
bos@559 618 those are finished. A reader starts by reading changelog
bos@559 619 data, then manifest data, followed by filelog data.</para>
bos@559 620
bos@559 621 <para>Since the writer has always finished writing filelog and
bos@559 622 manifest data before it writes to the changelog, a reader will
bos@559 623 never read a pointer to a partially written manifest revision
bos@559 624 from the changelog, and it will never read a pointer to a
bos@559 625 partially written filelog revision from the manifest.</para>
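<para>The ordering argument can be modelled in a few lines of Python.
This is a toy sketch, not Mercurial code: three in-memory lists stand
in for the filelog, manifest and changelog, and the hypothetical
<literal>commit</literal> generator pauses after every write so that a
reader can run between the steps.</para>

```python
filelog, manifest, changelog = [], [], []


def commit(content):
    """Write one revision, yielding after each step so a reader can interleave."""
    filelog.append(content)
    yield "filelog written"
    manifest.append(len(filelog) - 1)    # manifest entry points at a filelog revision
    yield "manifest written"
    changelog.append(len(manifest) - 1)  # changelog entry points at a manifest revision
    yield "changelog written"


def read_tip():
    """Follow pointers from the newest changelog entry, as a reader would."""
    if not changelog:
        return None
    return filelog[manifest[changelog[-1]]]


# Run a reader after every single write step of two commits.
seen = []
for content in ("first", "second"):
    for _step in commit(content):
        seen.append(read_tip())
```

<para>Because the changelog is written last, the reader in this model
either sees the previous complete revision or the new complete
revision; it never follows a pointer to an entry that does not exist
yet.</para>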
bos@559 626
bos@559 627 </sect2>
bos@559 628 <sect2>
bos@559 629 <title>Concurrent access</title>
bos@559 630
bos@559 631 <para>The read/write ordering and atomicity guarantees mean that
bos@559 632 Mercurial never needs to <emphasis>lock</emphasis> a
bos@559 633 repository when it's reading data, even if the repository is
bos@559 634 being written to while the read is occurring. This has a big
bos@559 635 effect on scalability; you can have an arbitrary number of
bos@559 636 Mercurial processes safely reading data from a repository
bos@559 637 all at once, no matter whether it's being written to or
bos@559 638 not.</para>
bos@559 639
bos@559 640 <para>The lockless nature of reading means that if you're
bos@559 641 sharing a repository on a multi-user system, you don't need to
bos@559 642 grant other local users permission to
bos@559 643 <emphasis>write</emphasis> to your repository in order for
bos@559 644 them to be able to clone it or pull changes from it; they only
bos@559 645 need <emphasis>read</emphasis> permission. (This is
bos@559 646 <emphasis>not</emphasis> a common feature among revision
bos@559 647 control systems, so don't take it for granted! Most require
bos@559 648 readers to be able to lock a repository to access it safely,
bos@559 649 and this requires write permission on at least one directory,
bos@559 650 which of course makes for all kinds of nasty and annoying
bos@559 651 security and administrative problems.)</para>
bos@559 652
bos@559 653 <para>Mercurial uses locks to ensure that only one process can
bos@559 654 write to a repository at a time (the locking mechanism is safe
bos@559 655 even over filesystems that are notoriously hostile to locking,
bos@559 656 such as NFS). If a repository is locked, a writer will wait
bos@559 657 and retry for a while in case the repository becomes unlocked, but
bos@559 658 if the repository remains locked for too long, the process
bos@559 659 attempting to write will give up and time out. This means
bos@559 660 that your daily automated scripts won't get stuck forever and
bos@559 661 pile up if a system crashes unnoticed, for example. (Yes, the
bos@559 662 timeout is configurable, from zero to infinity.)</para>
bos@559 663
bos@559 664 <sect3>
bos@559 665 <title>Safe dirstate access</title>
bos@559 666
bos@559 667 <para>As with revision data, Mercurial doesn't take a lock to
bos@559 668 read the dirstate file; it does acquire a lock to write it.
bos@559 669 To avoid the possibility of reading a partially written copy
bos@559 670 of the dirstate file, Mercurial writes to a file with a
bos@559 671 unique name in the same directory as the dirstate file, then
bos@559 672 renames the temporary file atomically to
bos@559 673 <filename>dirstate</filename>. The file named
bos@559 674 <filename>dirstate</filename> is thus guaranteed to be
bos@559 675 complete, not partially written.</para>
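<para>The write-then-rename pattern looks roughly like this in Python.
The function name <literal>write_atomically</literal> is hypothetical,
and the sketch relies on <literal>os.replace</literal> being atomic,
which holds on POSIX systems when both names are on the same
filesystem.</para>

```python
import os
import tempfile


def write_atomically(path, data):
    """Replace `path` with `data` so readers see the old or new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # A uniquely named temporary file in the same directory, so that the
    # final rename stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        os.replace(tmp_path, path)  # the only step that changes what `path` names
    except BaseException:
        # Something failed before the rename completed: discard the partial file.
        os.unlink(tmp_path)
        raise
```

<para>A reader that opens the target file at any moment gets either
the complete old contents or the complete new contents, which is why
no read lock is needed.</para>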
bos@559 676
bos@559 677 </sect3>
bos@559 678 </sect2>
bos@559 679 <sect2>
bos@559 680 <title>Avoiding seeks</title>
bos@559 681
bos@559 682 <para>Critical to Mercurial's performance is the avoidance of
bos@559 683 seeks of the disk head, since any seek is far more expensive
bos@559 684 than even a comparatively large read operation.</para>
bos@559 685
bos@559 686 <para>This is why, for example, the dirstate is stored in a
bos@559 687 single file. If there were a dirstate file per directory that
bos@559 688 Mercurial tracked, the disk would seek once per directory.
bos@559 689 Instead, Mercurial reads the entire single dirstate file in
bos@559 690 one step.</para>
bos@559 691
bos@559 692 <para>Mercurial also uses a <quote>copy on write</quote> scheme
bos@559 693 when cloning a repository on local storage. Instead of
bos@559 694 copying every revlog file from the old repository into the new
bos@559 695 repository, it makes a <quote>hard link</quote>, which is a
bos@559 696 shorthand way to say <quote>these two names point to the same
bos@559 697 file</quote>. When Mercurial is about to write to one of a
bos@559 698 revlog's files, it checks to see if the number of names
bos@559 699 pointing at the file is greater than one. If it is, more than
bos@559 700 one repository is using the file, so Mercurial makes a new
bos@559 701 copy of the file that is private to this repository.</para>
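<para>Here is a rough Python sketch of that check, assuming a
filesystem that supports hard links. The helper names
<literal>break_link_if_shared</literal> and <literal>append</literal>
are made up for this example and do not correspond to functions in
Mercurial.</para>

```python
import os
import shutil
import tempfile


def break_link_if_shared(path):
    """Give `path` private contents if other names still point at the same file."""
    if os.stat(path).st_nlink > 1:
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        os.close(fd)
        shutil.copyfile(path, tmp_path)  # copy the shared data
        os.replace(tmp_path, path)       # `path` now names the private copy


def append(path, data):
    """Append to a file, copying it first if it is shared via hard links."""
    break_link_if_shared(path)
    with open(path, "ab") as f:
        f.write(data)
```

<para>After <literal>os.link(original, clone)</literal>, both names
share one file at no extra storage cost. The first call to
<literal>append</literal> on either name pays for a copy, and the
other name keeps its old contents.</para>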
bos@559 702
bos@559 703 <para>A few revision control developers have pointed out that
bos@559 704 this idea of making a complete private copy of a file is not
bos@559 705 very efficient in its use of storage. While this is true,
bos@559 706 storage is cheap, and this method gives the highest
bos@559 707 performance while deferring most book-keeping to the operating
bos@559 708 system. An alternative scheme would most likely reduce
bos@559 709 performance and increase the complexity of the software, each
bos@559 710 of which is much more important to the <quote>feel</quote> of
bos@559 711 day-to-day use.</para>
bos@559 712
bos@559 713 </sect2>
bos@559 714 <sect2>
bos@559 715 <title>Other contents of the dirstate</title>
bos@559 716
bos@559 717 <para>Because Mercurial doesn't force you to tell it when you're
bos@559 718 modifying a file, it uses the dirstate to store some extra
bos@559 719 information so it can determine efficiently whether you have
bos@559 720 modified a file. For each file in the working directory, it
bos@559 721 stores the time that it last modified the file itself, and the
bos@559 722 size of the file at that time.</para>
bos@559 723
bos@559 724 <para>When you explicitly <command role="hg-cmd">hg
bos@559 725 add</command>, <command role="hg-cmd">hg remove</command>,
bos@559 726 <command role="hg-cmd">hg rename</command> or <command
bos@559 727 role="hg-cmd">hg copy</command> files, Mercurial updates the
bos@559 728 dirstate so that it knows what to do with those files when you
bos@559 729 commit.</para>
bos@559 730
bos@559 731 <para>When Mercurial is checking the states of files in the
bos@559 732 working directory, it first checks a file's modification time.
bos@559 733 If that has not changed, Mercurial treats the file as unmodified.
bos@559 734 If the file's size has changed, the file must have been
bos@559 735 modified. If the modification time has changed, but the size
bos@559 736 has not, only then does Mercurial need to read the actual
bos@559 737 contents of the file to see if they've changed. Storing these
bos@559 738 few extra pieces of information dramatically reduces the
bos@559 739 amount of data that Mercurial needs to read, which yields
bos@559 740 large performance improvements compared to other revision
bos@559 741 control systems.</para>
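<para>The decision procedure described above can be sketched in Python
as follows. This follows the description in this section rather than
Mercurial's real dirstate code; the dictionary layout and the names
<literal>record</literal> and <literal>is_modified</literal> are
assumptions for the example.</para>

```python
import os


def record(path):
    """Remember the modification time and size of a file, as the dirstate does."""
    st = os.stat(path)
    return {"mtime": st.st_mtime_ns, "size": st.st_size}


def is_modified(path, entry, pristine):
    """entry: recorded mtime and size; pristine: bytes of the committed revision."""
    st = os.stat(path)
    if st.st_mtime_ns == entry["mtime"]:
        return False              # same mtime: treat the file as unmodified
    if st.st_size != entry["size"]:
        return True               # size differs: the contents must differ
    with open(path, "rb") as f:   # mtime changed, size did not: compare contents
        return f.read() != pristine
```

<para>Only the last branch reads the file's contents, so in the common
case each file costs a single <literal>stat</literal> call.</para>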
bos@559 742
bos@559 743 </sect2>
bos@559 744 </sect1>
bos@559 745 </chapter>
bos@559 746
bos@559 747 <!--
bos@559 748 local variables:
bos@559 749 sgml-parent-document: ("00book.xml" "book" "chapter")
bos@559 750 end:
bos@559 751 -->