hgbook
diff en/ch03-concepts.xml @ 650:7e7c47481e4f
Oops, this is the real merge for my hg's oddity
author | Dongsheng Song <dongsheng.song@gmail.com> |
---|---|
date | Fri Mar 20 16:43:35 2009 +0800 (2009-03-20) |
parents | en/ch04-concepts.xml@a13813534ccd |
children | 1c13ed2130a7 |
line diff
1.1 --- /dev/null Thu Jan 01 00:00:00 1970 +0000 1.2 +++ b/en/ch03-concepts.xml Fri Mar 20 16:43:35 2009 +0800 1.3 @@ -0,0 +1,751 @@ 1.4 +<!-- vim: set filetype=docbkxml shiftwidth=2 autoindent expandtab tw=77 : --> 1.5 + 1.6 +<chapter id="chap.concepts"> 1.7 + <?dbhtml filename="behind-the-scenes.html"?> 1.8 + <title>Behind the scenes</title> 1.9 + 1.10 + <para>Unlike many revision control systems, the concepts upon which 1.11 + Mercurial is built are simple enough that it's easy to understand 1.12 + how the software really works. Knowing this certainly isn't 1.13 + necessary, but I find it useful to have a <quote>mental 1.14 + model</quote> of what's going on.</para> 1.15 + 1.16 + <para>This understanding gives me confidence that Mercurial has been 1.17 + carefully designed to be both <emphasis>safe</emphasis> and 1.18 + <emphasis>efficient</emphasis>. And just as importantly, if it's 1.19 + easy for me to retain a good idea of what the software is doing 1.20 + when I perform a revision control task, I'm less likely to be 1.21 + surprised by its behaviour.</para> 1.22 + 1.23 + <para>In this chapter, we'll initially cover the core concepts 1.24 + behind Mercurial's design, then continue to discuss some of the 1.25 + interesting details of its implementation.</para> 1.26 + 1.27 + <sect1> 1.28 + <title>Mercurial's historical record</title> 1.29 + 1.30 + <sect2> 1.31 + <title>Tracking the history of a single file</title> 1.32 + 1.33 + <para>When Mercurial tracks modifications to a file, it stores 1.34 + the history of that file in a metadata object called a 1.35 + <emphasis>filelog</emphasis>. Each entry in the filelog 1.36 + contains enough information to reconstruct one revision of the 1.37 + file that is being tracked. Filelogs are stored as files in 1.38 + the <filename role="special" 1.39 + class="directory">.hg/store/data</filename> directory. 
A 1.40 + filelog contains two kinds of information: revision data, and 1.41 + an index to help Mercurial to find a revision 1.42 + efficiently.</para> 1.43 + 1.44 + <para>A file that is large, or has a lot of history, has its 1.45 + filelog stored in separate data 1.46 + (<quote><literal>.d</literal></quote> suffix) and index 1.47 + (<quote><literal>.i</literal></quote> suffix) files. For 1.48 + small files without much history, the revision data and index 1.49 + are combined in a single <quote><literal>.i</literal></quote> 1.50 + file. The correspondence between a file in the working 1.51 + directory and the filelog that tracks its history in the 1.52 + repository is illustrated in figure <xref 1.53 + endterm="fig.concepts.filelog.caption" 1.54 + linkend="fig.concepts.filelog"/>.</para> 1.55 + 1.56 + <informalfigure id="fig.concepts.filelog"> 1.57 + <mediaobject> 1.58 + <imageobject><imagedata fileref="images/filelog.png"/></imageobject> 1.59 + <textobject><phrase>XXX add text</phrase></textobject> 1.60 + <caption><para id="fig.concepts.filelog.caption">Relationships between 1.61 + files in working directory and filelogs in repository</para> 1.62 + </caption> 1.63 + </mediaobject> 1.64 + </informalfigure> 1.65 + 1.66 + </sect2> 1.67 + <sect2> 1.68 + <title>Managing tracked files</title> 1.69 + 1.70 + <para>Mercurial uses a structure called a 1.71 + <emphasis>manifest</emphasis> to collect together information 1.72 + about the files that it tracks. Each entry in the manifest 1.73 + contains information about the files present in a single 1.74 + changeset. An entry records which files are present in the 1.75 + changeset, the revision of each file, and a few other pieces 1.76 + of file metadata.</para> 1.77 + 1.78 + </sect2> 1.79 + <sect2> 1.80 + <title>Recording changeset information</title> 1.81 + 1.82 + <para>The <emphasis>changelog</emphasis> contains information 1.83 + about each changeset. 
Each revision records who committed a 1.84 + change, the changeset comment, other pieces of 1.85 + changeset-related information, and the revision of the 1.86 + manifest to use.</para> 1.87 + 1.88 + </sect2> 1.89 + <sect2> 1.90 + <title>Relationships between revisions</title> 1.91 + 1.92 + <para>Within a changelog, a manifest, or a filelog, each 1.93 + revision stores a pointer to its immediate parent (or to its 1.94 + two parents, if it's a merge revision). As I mentioned above, 1.95 + there are also relationships between revisions 1.96 + <emphasis>across</emphasis> these structures, and they are 1.97 + hierarchical in nature.</para> 1.98 + 1.99 + <para>For every changeset in a repository, there is exactly one 1.100 + revision stored in the changelog. Each revision of the 1.101 + changelog contains a pointer to a single revision of the 1.102 + manifest. A revision of the manifest stores a pointer to a 1.103 + single revision of each filelog tracked when that changeset 1.104 + was created. These relationships are illustrated in figure 1.105 + <xref endterm="fig.concepts.metadata.caption" 1.106 + linkend="fig.concepts.metadata"/>.</para> 1.107 + 1.108 + <informalfigure id="fig.concepts.metadata"> 1.109 + <mediaobject> 1.110 + <imageobject><imagedata fileref="images/metadata.png"/></imageobject> 1.111 + <textobject><phrase>XXX add text</phrase></textobject> 1.112 + <caption><para id="fig.concepts.metadata.caption">Metadata 1.113 + relationships</para></caption> 1.114 + </mediaobject> 1.115 + </informalfigure> 1.116 + 1.117 + <para>As the illustration shows, there is 1.118 + <emphasis>not</emphasis> a <quote>one to one</quote> 1.119 + relationship between revisions in the changelog, manifest, or 1.120 + filelog. If the manifest hasn't changed between two 1.121 + changesets, the changelog entries for those changesets will 1.122 + point to the same revision of the manifest. 
If a file that 1.123 + Mercurial tracks hasn't changed between two changesets, the 1.124 + entry for that file in the two revisions of the manifest will 1.125 + point to the same revision of its filelog.</para> 1.126 + 1.127 + </sect2> 1.128 + </sect1> 1.129 + <sect1> 1.130 + <title>Safe, efficient storage</title> 1.131 + 1.132 + <para>The underpinnings of changelogs, manifests, and filelogs are 1.133 + provided by a single structure called the 1.134 + <emphasis>revlog</emphasis>.</para> 1.135 + 1.136 + <sect2> 1.137 + <title>Efficient storage</title> 1.138 + 1.139 + <para>The revlog provides efficient storage of revisions using a 1.140 + <emphasis>delta</emphasis> mechanism. Instead of storing a 1.141 + complete copy of a file for each revision, it stores the 1.142 + changes needed to transform an older revision into the new 1.143 + revision. For many kinds of file data, these deltas are 1.144 + typically a fraction of a percent of the size of a full copy 1.145 + of a file.</para> 1.146 + 1.147 + <para>Some obsolete revision control systems can only work with 1.148 + deltas of text files. They must either store binary files as 1.149 + complete snapshots or encode them into a text representation, both 1.150 + of which are wasteful approaches. Mercurial can efficiently 1.151 + handle deltas of files with arbitrary binary contents; it 1.152 + doesn't need to treat text as special.</para> 1.153 + 1.154 + </sect2> 1.155 + <sect2 id="sec.concepts.txn"> 1.156 + <title>Safe operation</title> 1.157 + 1.158 + <para>Mercurial only ever <emphasis>appends</emphasis> data to 1.159 + the end of a revlog file. It never modifies a section of a 1.160 + file after it has written it. This is both more robust and 1.161 + more efficient than schemes that need to modify or rewrite 1.162 + data.</para> 1.163 + 1.164 + <para>In addition, Mercurial treats every write as part of a 1.165 + <emphasis>transaction</emphasis> that can span a number of 1.166 + files. 
A transaction is <emphasis>atomic</emphasis>: either 1.167 + the entire transaction succeeds and its effects are all 1.168 + visible to readers in one go, or the whole thing is undone. 1.169 + This guarantee of atomicity means that if you're running two 1.170 + copies of Mercurial, where one is reading data and one is 1.171 + writing it, the reader will never see a partially written 1.172 + result that might confuse it.</para> 1.173 + 1.174 + <para>The fact that Mercurial only appends to files makes it 1.175 + easier to provide this transactional guarantee. The easier it 1.176 + is to do stuff like this, the more confident you should be 1.177 + that it's done correctly.</para> 1.178 + 1.179 + </sect2> 1.180 + <sect2> 1.181 + <title>Fast retrieval</title> 1.182 + 1.183 + <para>Mercurial cleverly avoids a pitfall common to all earlier 1.184 + revision control systems: the problem of <emphasis>inefficient 1.185 + retrieval</emphasis>. Most revision control systems store 1.186 + the contents of a revision as an incremental series of 1.187 + modifications against a <quote>snapshot</quote>. To 1.188 + reconstruct a specific revision, you must first read the 1.189 + snapshot, and then every one of the revisions between the 1.190 + snapshot and your target revision. The more history that a 1.191 + file accumulates, the more revisions you must read, hence the 1.192 + longer it takes to reconstruct a particular revision.</para> 1.193 + 1.194 + <informalfigure id="fig.concepts.snapshot"> 1.195 + <mediaobject> 1.196 + <imageobject><imagedata fileref="images/snapshot.png"/></imageobject> 1.197 + <textobject><phrase>XXX add text</phrase></textobject> 1.198 + <caption><para id="fig.concepts.snapshot.caption">Snapshot of 1.199 + a revlog, with incremental deltas</para></caption> 1.200 + </mediaobject> 1.201 + </informalfigure> 1.202 + 1.203 + <para>The innovation that Mercurial applies to this problem is 1.204 + simple but effective. 
Once the cumulative amount of delta 1.205 + information stored since the last snapshot exceeds a fixed 1.206 + threshold, it stores a new snapshot (compressed, of course), 1.207 + instead of another delta. This makes it possible to 1.208 + reconstruct <emphasis>any</emphasis> revision of a file 1.209 + quickly. This approach works so well that it has since been 1.210 + copied by several other revision control systems.</para> 1.211 + 1.212 + <para>Figure <xref endterm="fig.concepts.snapshot.caption" 1.213 + linkend="fig.concepts.snapshot"/> illustrates 1.214 + the idea. In an entry in a revlog's index file, Mercurial 1.215 + stores the range of entries from the data file that it must 1.216 + read to reconstruct a particular revision.</para> 1.217 + 1.218 + <sect3> 1.219 + <title>Aside: the influence of video compression</title> 1.220 + 1.221 + <para>If you're familiar with video compression or have ever 1.222 + watched a TV feed through a digital cable or satellite 1.223 + service, you may know that most video compression schemes 1.224 + store each frame of video as a delta against its predecessor 1.225 + frame. In addition, these schemes use <quote>lossy</quote> 1.226 + compression techniques to increase the compression ratio, so 1.227 + visual errors accumulate over the course of a number of 1.228 + inter-frame deltas.</para> 1.229 + 1.230 + <para>Because it's possible for a video stream to <quote>drop 1.231 + out</quote> occasionally due to signal glitches, and to 1.232 + limit the accumulation of artefacts introduced by the lossy 1.233 + compression process, video encoders periodically insert a 1.234 + complete frame (called a <quote>key frame</quote>) into the 1.235 + video stream; the next delta is generated against that 1.236 + frame. This means that if the video signal gets 1.237 + interrupted, it will resume once the next key frame is 1.238 + received. 
Also, the accumulation of encoding errors 1.239 + restarts anew with each key frame.</para> 1.240 + 1.241 + </sect3> 1.242 + </sect2> 1.243 + <sect2> 1.244 + <title>Identification and strong integrity</title> 1.245 + 1.246 + <para>Along with delta or snapshot information, a revlog entry 1.247 + contains a cryptographic hash of the data that it represents. 1.248 + This makes it difficult to forge the contents of a revision, 1.249 + and easy to detect accidental corruption.</para> 1.250 + 1.251 + <para>Hashes provide more than a mere check against corruption; 1.252 + they are used as the identifiers for revisions. The changeset 1.253 + identification hashes that you see as an end user are from 1.254 + revisions of the changelog. Although filelogs and the 1.255 + manifest also use hashes, Mercurial only uses these behind the 1.256 + scenes.</para> 1.257 + 1.258 + <para>Mercurial verifies that hashes are correct when it 1.259 + retrieves file revisions and when it pulls changes from 1.260 + another repository. If it encounters an integrity problem, it 1.261 + will complain and stop whatever it's doing.</para> 1.262 + 1.263 + <para>In addition to the effect it has on retrieval efficiency, 1.264 + Mercurial's use of periodic snapshots makes it more robust 1.265 + against partial data corruption. If a revlog becomes partly 1.266 + corrupted due to a hardware error or system bug, it's often 1.267 + possible to reconstruct some or most revisions from the 1.268 + uncorrupted sections of the revlog, both before and after the 1.269 + corrupted section. This would not be possible with a 1.270 + delta-only storage model.</para> 1.271 + 1.272 + </sect2> 1.273 + </sect1> 1.274 + <sect1> 1.275 + <title>Revision history, branching, and merging</title> 1.276 + 1.277 + <para>Every entry in a Mercurial revlog knows the identity of its 1.278 + immediate ancestor revision, usually referred to as its 1.279 + <emphasis>parent</emphasis>. 
In fact, a revision contains room 1.280 + for not one parent, but two. Mercurial uses a special hash, 1.281 + called the <quote>null ID</quote>, to represent the idea 1.282 + <quote>there is no parent here</quote>. This hash is simply a 1.283 + string of zeroes.</para> 1.284 + 1.285 + <para>In figure <xref endterm="fig.concepts.revlog.caption" 1.286 + linkend="fig.concepts.revlog"/>, you can see 1.287 + an example of the conceptual structure of a revlog. Filelogs, 1.288 + manifests, and changelogs all have this same structure; they 1.289 + differ only in the kind of data stored in each delta or 1.290 + snapshot.</para> 1.291 + 1.292 + <para>The first revision in a revlog (at the bottom of the image) 1.293 + has the null ID in both of its parent slots. For a 1.294 + <quote>normal</quote> revision, its first parent slot contains 1.295 + the ID of its parent revision, and its second contains the null 1.296 + ID, indicating that the revision has only one real parent. Any 1.297 + two revisions that have the same parent ID are branches. A 1.298 + revision that represents a merge between branches has two normal 1.299 + revision IDs in its parent slots.</para> 1.300 + 1.301 + <informalfigure id="fig.concepts.revlog"> 1.302 + <mediaobject> 1.303 + <imageobject><imagedata fileref="images/revlog.png"/></imageobject> 1.304 + <textobject><phrase>XXX add text</phrase></textobject> 1.305 + <caption><para id="fig.concepts.revlog.caption">Revision in revlog</para> 1.306 + </caption> 1.307 + </mediaobject> 1.308 + </informalfigure> 1.309 + 1.310 + </sect1> 1.311 + <sect1> 1.312 + <title>The working directory</title> 1.313 + 1.314 + <para>In the working directory, Mercurial stores a snapshot of the 1.315 + files from the repository as of a particular changeset.</para> 1.316 + 1.317 + <para>The working directory <quote>knows</quote> which changeset 1.318 + it contains. 
When you update the working directory to contain a 1.319 + particular changeset, Mercurial looks up the appropriate 1.320 + revision of the manifest to find out which files it was tracking 1.321 + at the time that changeset was committed, and which revision of 1.322 + each file was then current. It then recreates a copy of each of 1.323 + those files, with the same contents it had when the changeset 1.324 + was committed.</para> 1.325 + 1.326 + <para>The <emphasis>dirstate</emphasis> contains Mercurial's 1.327 + knowledge of the working directory. This details which 1.328 + changeset the working directory is updated to, and all of the 1.329 + files that Mercurial is tracking in the working 1.330 + directory.</para> 1.331 + 1.332 + <para>Just as a revision of a revlog has room for two parents, so 1.333 + that it can represent either a normal revision (with one parent) 1.334 + or a merge of two earlier revisions, the dirstate has slots for 1.335 + two parents. When you use the <command role="hg-cmd">hg 1.336 + update</command> command, the changeset that you update to is 1.337 + stored in the <quote>first parent</quote> slot, and the null ID 1.338 + in the second. When you <command role="hg-cmd">hg 1.339 + merge</command> with another changeset, the first parent 1.340 + remains unchanged, and the second parent is filled in with the 1.341 + changeset you're merging with. The <command role="hg-cmd">hg 1.342 + parents</command> command tells you what the parents of the 1.343 + dirstate are.</para> 1.344 + 1.345 + <sect2> 1.346 + <title>What happens when you commit</title> 1.347 + 1.348 + <para>The dirstate stores parent information for more than just 1.349 + book-keeping purposes. 
Mercurial uses the parents of the 1.350 + dirstate as <emphasis>the parents of a new 1.351 + changeset</emphasis> when you perform a commit.</para> 1.352 + 1.353 + <informalfigure id="fig.concepts.wdir"> 1.354 + <mediaobject> 1.355 + <imageobject><imagedata fileref="images/wdir.png"/></imageobject> 1.356 + <textobject><phrase>XXX add text</phrase></textobject> 1.357 + <caption><para id="fig.concepts.wdir.caption">The working 1.358 + directory can have two parents</para></caption> 1.359 + </mediaobject> 1.360 + </informalfigure> 1.361 + 1.362 + <para>Figure <xref endterm="fig.concepts.wdir.caption" 1.363 + linkend="fig.concepts.wdir"/> shows the 1.364 + normal state of the working directory, where it has a single 1.365 + changeset as parent. That changeset is the 1.366 + <emphasis>tip</emphasis>, the newest changeset in the 1.367 + repository that has no children.</para> 1.368 + 1.369 + <informalfigure id="fig.concepts.wdir-after-commit"> 1.370 + <mediaobject> 1.371 + <imageobject><imagedata fileref="images/wdir-after-commit.png"/> 1.372 + </imageobject> 1.373 + <textobject><phrase>XXX add text</phrase></textobject> 1.374 + <caption><para id="fig.concepts.wdir-after-commit.caption">The working 1.375 + directory gains new parents after a commit</para></caption> 1.376 + </mediaobject> 1.377 + </informalfigure> 1.378 + 1.379 + <para>It's useful to think of the working directory as 1.380 + <quote>the changeset I'm about to commit</quote>. Any files 1.381 + that you tell Mercurial that you've added, removed, renamed, 1.382 + or copied will be reflected in that changeset, as will 1.383 + modifications to any files that Mercurial is already tracking; 1.384 + the new changeset will have the parents of the working 1.385 + directory as its parents.</para> 1.386 + 1.387 + <para>After a commit, Mercurial will update the parents of the 1.388 + working directory, so that the first parent is the ID of the 1.389 + new changeset, and the second is the null ID. 
This is shown 1.390 + in figure <xref endterm="fig.concepts.wdir-after-commit.caption" 1.391 + linkend="fig.concepts.wdir-after-commit"/>. 1.392 + Mercurial 1.393 + doesn't touch any of the files in the working directory when 1.394 + you commit; it just modifies the dirstate to note its new 1.395 + parents.</para> 1.396 + 1.397 + </sect2> 1.398 + <sect2> 1.399 + <title>Creating a new head</title> 1.400 + 1.401 + <para>It's perfectly normal to update the working directory to a 1.402 + changeset other than the current tip. For example, you might 1.403 + want to know what your project looked like last Tuesday, or 1.404 + you could be looking through changesets to see which one 1.405 + introduced a bug. In cases like this, the natural thing to do 1.406 + is update the working directory to the changeset you're 1.407 + interested in, and then examine the files in the working 1.408 + directory directly to see their contents as they were when you 1.409 + committed that changeset. The effect of this is shown in 1.410 + figure <xref endterm="fig.concepts.wdir-pre-branch.caption" 1.411 + linkend="fig.concepts.wdir-pre-branch"/>.</para> 1.412 + 1.413 + <informalfigure id="fig.concepts.wdir-pre-branch"> 1.414 + <mediaobject> 1.415 + <imageobject><imagedata fileref="images/wdir-pre-branch.png"/> 1.416 + </imageobject> 1.417 + <textobject><phrase>XXX add text</phrase></textobject> 1.418 + <caption><para id="fig.concepts.wdir-pre-branch.caption">The working 1.419 + directory, updated to an older changeset</para></caption> 1.420 + </mediaobject> 1.421 + </informalfigure> 1.422 + 1.423 + <para>Having updated the working directory to an older 1.424 + changeset, what happens if you make some changes, and then 1.425 + commit? Mercurial behaves in the same way as I outlined 1.426 + above. The parents of the working directory become the 1.427 + parents of the new changeset. This new changeset has no 1.428 + children, so it becomes the new tip. 
And the repository now 1.429 + contains two changesets that have no children; we call these 1.430 + <emphasis>heads</emphasis>. You can see the structure that 1.431 + this creates in figure <xref 1.432 + endterm="fig.concepts.wdir-branch.caption" 1.433 + linkend="fig.concepts.wdir-branch"/>.</para> 1.434 + 1.435 + <informalfigure id="fig.concepts.wdir-branch"> 1.436 + <mediaobject> 1.437 + <imageobject><imagedata fileref="images/wdir-branch.png"/> 1.438 + </imageobject> 1.439 + <textobject><phrase>XXX add text</phrase></textobject> 1.440 + <caption><para id="fig.concepts.wdir-branch.caption">After a 1.441 + commit made while synced to an older changeset</para></caption> 1.442 + </mediaobject> 1.443 + </informalfigure> 1.444 + 1.445 + <note> 1.446 + <para> If you're new to Mercurial, you should keep in mind a 1.447 + common <quote>error</quote>, which is to use the <command 1.448 + role="hg-cmd">hg pull</command> command without any 1.449 + options. By default, the <command role="hg-cmd">hg 1.450 + pull</command> command <emphasis>does not</emphasis> 1.451 + update the working directory, so you'll bring new changesets 1.452 + into your repository, but the working directory will stay 1.453 + synced at the same changeset as before the pull. If you 1.454 + make some changes and commit afterwards, you'll thus create 1.455 + a new head, because your working directory isn't synced to 1.456 + whatever the current tip is.</para> 1.457 + 1.458 + <para> I put the word <quote>error</quote> in quotes because 1.459 + all that you need to do to rectify this situation is 1.460 + <command role="hg-cmd">hg merge</command>, then <command 1.461 + role="hg-cmd">hg commit</command>. In other words, this 1.462 + almost never has negative consequences; it just surprises 1.463 + people. 
I'll discuss other ways to avoid this behaviour, 1.464 + and why Mercurial behaves in this initially surprising way, 1.465 + later on.</para> 1.466 + </note> 1.467 + 1.468 + </sect2> 1.469 + <sect2> 1.470 + <title>Merging heads</title> 1.471 + 1.472 + <para>When you run the <command role="hg-cmd">hg merge</command> 1.473 + command, Mercurial leaves the first parent of the working 1.474 + directory unchanged, and sets the second parent to the 1.475 + changeset you're merging with, as shown in figure <xref 1.476 + endterm="fig.concepts.wdir-merge.caption" 1.477 + linkend="fig.concepts.wdir-merge"/>.</para> 1.478 + 1.479 + <informalfigure id="fig.concepts.wdir-merge"> 1.480 + <mediaobject> 1.481 + <imageobject><imagedata fileref="images/wdir-merge.png"/> 1.482 + </imageobject> 1.483 + <textobject><phrase>XXX add text</phrase></textobject> 1.484 + <caption><para id="fig.concepts.wdir-merge.caption">Merging two 1.485 + heads</para></caption> 1.486 + </mediaobject> 1.487 + </informalfigure> 1.488 + 1.489 + <para>Mercurial also has to modify the working directory, to 1.490 + merge the files managed in the two changesets. 
Simplified a 1.491 + little, the merging process goes like this, for every file in 1.492 + the manifests of both changesets.</para> 1.493 + <itemizedlist> 1.494 + <listitem><para>If neither changeset has modified a file, do 1.495 + nothing with that file.</para> 1.496 + </listitem> 1.497 + <listitem><para>If one changeset has modified a file, and the 1.498 + other hasn't, create the modified copy of the file in the 1.499 + working directory.</para> 1.500 + </listitem> 1.501 + <listitem><para>If one changeset has removed a file, and the 1.502 + other hasn't (or has also deleted it), delete the file 1.503 + from the working directory.</para> 1.504 + </listitem> 1.505 + <listitem><para>If one changeset has removed a file, but the 1.506 + other has modified the file, ask the user what to do: keep 1.507 + the modified file, or remove it?</para> 1.508 + </listitem> 1.509 + <listitem><para>If both changesets have modified a file, 1.510 + invoke an external merge program to choose the new 1.511 + contents for the merged file. This may require input from 1.512 + the user.</para> 1.513 + </listitem> 1.514 + <listitem><para>If one changeset has modified a file, and the 1.515 + other has renamed or copied the file, make sure that the 1.516 + changes follow the new name of the file.</para> 1.517 + </listitem></itemizedlist> 1.518 + <para>There are more details&emdash;merging has plenty of corner 1.519 + cases&emdash;but these are the most common choices that are 1.520 + involved in a merge. As you can see, most cases are 1.521 + completely automatic, and indeed most merges finish 1.522 + automatically, without requiring your input to resolve any 1.523 + conflicts.</para> 1.524 + 1.525 + <para>When you're thinking about what happens when you commit 1.526 + after a merge, once again the working directory is <quote>the 1.527 + changeset I'm about to commit</quote>. 
After the <command 1.528 + role="hg-cmd">hg merge</command> command completes, the 1.529 + working directory has two parents; these will become the 1.530 + parents of the new changeset.</para> 1.531 + 1.532 + <para>Mercurial lets you perform multiple merges, but you must 1.533 + commit the results of each individual merge as you go. This 1.534 + is necessary because Mercurial only tracks two parents for 1.535 + both revisions and the working directory. While it would be 1.536 + technically possible to merge multiple changesets at once, the 1.537 + prospect of user confusion and making a terrible mess of a 1.538 + merge immediately becomes overwhelming.</para> 1.539 + 1.540 + </sect2> 1.541 + </sect1> 1.542 + <sect1> 1.543 + <title>Other interesting design features</title> 1.544 + 1.545 + <para>In the sections above, I've tried to highlight some of the 1.546 + most important aspects of Mercurial's design, to illustrate that 1.547 + it pays careful attention to reliability and performance. 1.548 + However, the attention to detail doesn't stop there. There are 1.549 + a number of other aspects of Mercurial's construction that I 1.550 + personally find interesting. I'll detail a few of them here, 1.551 + separate from the <quote>big ticket</quote> items above, so that 1.552 + if you're interested, you can gain a better idea of the amount 1.553 + of thinking that goes into a well-designed system.</para> 1.554 + 1.555 + <sect2> 1.556 + <title>Clever compression</title> 1.557 + 1.558 + <para>When appropriate, Mercurial will store both snapshots and 1.559 + deltas in compressed form. 
It does this by always 1.560 + <emphasis>trying to</emphasis> compress a snapshot or delta, 1.561 + but only storing the compressed version if it's smaller than 1.562 + the uncompressed version.</para> 1.563 + 1.564 + <para>This means that Mercurial does <quote>the right 1.565 + thing</quote> when storing a file whose native form is 1.566 + compressed, such as a <literal>zip</literal> archive or a JPEG 1.567 + image. When these types of files are compressed a second 1.568 + time, the resulting file is usually bigger than the 1.569 + once-compressed form, and so Mercurial will store the plain 1.570 + <literal>zip</literal> or JPEG.</para> 1.571 + 1.572 + <para>Deltas between revisions of a compressed file are usually 1.573 + larger than snapshots of the file, and Mercurial again does 1.574 + <quote>the right thing</quote> in these cases. It finds that 1.575 + such a delta exceeds the threshold at which it should store a 1.576 + complete snapshot of the file, so it stores the snapshot, 1.577 + again saving space compared to a naive delta-only 1.578 + approach.</para> 1.579 + 1.580 + <sect3> 1.581 + <title>Network recompression</title> 1.582 + 1.583 + <para>When storing revisions on disk, Mercurial uses the 1.584 + <quote>deflate</quote> compression algorithm (the same one 1.585 + used by the popular <literal>zip</literal> archive format), 1.586 + which balances good speed with a respectable compression 1.587 + ratio. However, when transmitting revision data over a 1.588 + network connection, Mercurial uncompresses the compressed 1.589 + revision data.</para> 1.590 + 1.591 + <para>If the connection is over HTTP, Mercurial recompresses 1.592 + the entire stream of data using a compression algorithm that 1.593 + gives a better compression ratio (the Burrows-Wheeler 1.594 + algorithm from the widely used <literal>bzip2</literal> 1.595 + compression package). 
This combination of algorithm and 1.596 + compression of the entire stream (instead of a revision at a 1.597 + time) substantially reduces the number of bytes to be 1.598 + transferred, yielding better network performance over almost 1.599 + all kinds of network.</para> 1.600 + 1.601 + <para>(If the connection is over <command>ssh</command>, 1.602 + Mercurial <emphasis>doesn't</emphasis> recompress the 1.603 + stream, because <command>ssh</command> can already do this 1.604 + itself.)</para> 1.605 + 1.606 + </sect3> 1.607 + </sect2> 1.608 + <sect2> 1.609 + <title>Read/write ordering and atomicity</title> 1.610 + 1.611 + <para>Appending to files isn't the whole story when it comes to 1.612 + guaranteeing that a reader won't see a partial write. If you 1.613 + recall figure <xref endterm="fig.concepts.metadata.caption" 1.614 + linkend="fig.concepts.metadata"/>, revisions in the 1.615 + changelog point to revisions in the manifest, and revisions in 1.616 + the manifest point to revisions in filelogs. This hierarchy 1.617 + is deliberate.</para> 1.618 + 1.619 + <para>A writer starts a transaction by writing filelog and 1.620 + manifest data, and doesn't write any changelog data until 1.621 + those are finished. 
A reader starts by reading changelog 1.622 + data, then manifest data, followed by filelog data.</para> 1.623 + 1.624 + <para>Since the writer has always finished writing filelog and 1.625 + manifest data before it writes to the changelog, a reader will 1.626 + never read a pointer to a partially written manifest revision 1.627 + from the changelog, and it will never read a pointer to a 1.628 + partially written filelog revision from the manifest.</para> 1.629 + 1.630 + </sect2> 1.631 + <sect2> 1.632 + <title>Concurrent access</title> 1.633 + 1.634 + <para>The read/write ordering and atomicity guarantees mean that 1.635 + Mercurial never needs to <emphasis>lock</emphasis> a 1.636 + repository when it's reading data, even if the repository is 1.637 + being written to while the read is occurring. This has a big 1.638 + effect on scalability; you can have an arbitrary number of 1.639 + Mercurial processes safely reading data from a repository 1.640 + all at once, no matter whether it's being written to or 1.641 + not.</para> 1.642 + 1.643 + <para>The lockless nature of reading means that if you're 1.644 + sharing a repository on a multi-user system, you don't need to 1.645 + grant other local users permission to 1.646 + <emphasis>write</emphasis> to your repository in order for 1.647 + them to be able to clone it or pull changes from it; they only 1.648 + need <emphasis>read</emphasis> permission. (This is 1.649 + <emphasis>not</emphasis> a common feature among revision 1.650 + control systems, so don't take it for granted! 
Most require 1.651 + readers to be able to lock a repository to access it safely, 1.652 + and this requires write permission on at least one directory, 1.653 + which of course makes for all kinds of nasty and annoying 1.654 + security and administrative problems.)</para> 1.655 + 1.656 + <para>Mercurial uses locks to ensure that only one process can 1.657 + write to a repository at a time (the locking mechanism is safe 1.658 + even over filesystems that are notoriously hostile to locking, 1.659 + such as NFS). If a repository is locked, a writer will wait 1.660 + for a while and retry in case the repository becomes unlocked, but 1.661 + if the repository remains locked for too long, the process 1.662 + attempting to write will time out. This means 1.663 + that your daily automated scripts won't get stuck forever and 1.664 + pile up if a system crashes unnoticed, for example. (Yes, the 1.665 + timeout is configurable, from zero to infinity.)</para> 1.666 + 1.667 + <sect3> 1.668 + <title>Safe dirstate access</title> 1.669 + 1.670 + <para>As with revision data, Mercurial doesn't take a lock to 1.671 + read the dirstate file; it does acquire a lock to write it. 1.672 + To avoid the possibility of reading a partially written copy 1.673 + of the dirstate file, Mercurial writes to a file with a 1.674 + unique name in the same directory as the dirstate file, then 1.675 + renames the temporary file atomically to 1.676 + <filename>dirstate</filename>.
The file named 1.677 + <filename>dirstate</filename> is thus guaranteed to be 1.678 + complete, not partially written.</para> 1.679 + 1.680 + </sect3> 1.681 + </sect2> 1.682 + <sect2> 1.683 + <title>Avoiding seeks</title> 1.684 + 1.685 + <para>Critical to Mercurial's performance is the avoidance of 1.686 + seeks of the disk head, since any seek is far more expensive 1.687 + than even a comparatively large read operation.</para> 1.688 + 1.689 + <para>This is why, for example, the dirstate is stored in a 1.690 + single file. If there were a dirstate file per directory that 1.691 + Mercurial tracked, the disk would seek once per directory. 1.692 + Instead, Mercurial reads the entire single dirstate file in 1.693 + one step.</para> 1.694 + 1.695 + <para>Mercurial also uses a <quote>copy on write</quote> scheme 1.696 + when cloning a repository on local storage. Instead of 1.697 + copying every revlog file from the old repository into the new 1.698 + repository, it makes a <quote>hard link</quote>, which is a 1.699 + shorthand way to say <quote>these two names point to the same 1.700 + file</quote>. When Mercurial is about to write to one of a 1.701 + revlog's files, it checks to see if the number of names 1.702 + pointing at the file is greater than one. If it is, more than 1.703 + one repository is using the file, so Mercurial makes a new 1.704 + copy of the file that is private to this repository.</para> 1.705 + 1.706 + <para>A few revision control developers have pointed out that 1.707 + this idea of making a complete private copy of a file is not 1.708 + very efficient in its use of storage. While this is true, 1.709 + storage is cheap, and this method gives the highest 1.710 + performance while deferring most book-keeping to the operating 1.711 + system. 
An alternative scheme would most likely reduce 1.712 + performance and increase the complexity of the software, each 1.713 + of which is much more important to the <quote>feel</quote> of 1.714 + day-to-day use.</para> 1.715 + 1.716 + </sect2> 1.717 + <sect2> 1.718 + <title>Other contents of the dirstate</title> 1.719 + 1.720 + <para>Because Mercurial doesn't force you to tell it when you're 1.721 + modifying a file, it uses the dirstate to store some extra 1.722 + information so it can determine efficiently whether you have 1.723 + modified a file. For each file in the working directory, it 1.724 + stores the time that it last modified the file itself, and the 1.725 + size of the file at that time.</para> 1.726 + 1.727 + <para>When you explicitly <command role="hg-cmd">hg 1.728 + add</command>, <command role="hg-cmd">hg remove</command>, 1.729 + <command role="hg-cmd">hg rename</command> or <command 1.730 + role="hg-cmd">hg copy</command> files, Mercurial updates the 1.731 + dirstate so that it knows what to do with those files when you 1.732 + commit.</para> 1.733 + 1.734 + <para>When Mercurial is checking the states of files in the 1.735 + working directory, it first checks a file's modification time. 1.736 + If that has not changed, the file must not have been modified. 1.737 + If the file's size has changed, the file must have been 1.738 + modified. If the modification time has changed, but the size 1.739 + has not, only then does Mercurial need to read the actual 1.740 + contents of the file to see if they've changed. Storing these 1.741 + few extra pieces of information dramatically reduces the 1.742 + amount of data that Mercurial needs to read, which yields 1.743 + large performance improvements compared to other revision 1.744 + control systems.</para> 1.745 + 1.746 + </sect2> 1.747 + </sect1> 1.748 +</chapter> 1.749 + 1.750 +<!-- 1.751 +local variables: 1.752 +sgml-parent-document: ("00book.xml" "book" "chapter") 1.753 +end: 1.754 +-->
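The three-step check described in the final section above (modification time first, then size, then contents) can be sketched in Python roughly as follows. This is only an illustration of the order of the checks as the text states them, not Mercurial's actual code: the `DirstateEntry` record, its fields, and the `is_modified` function are hypothetical names, and the `content` field stands in for whatever committed data the real slow path would compare against.

```python
import os
from dataclasses import dataclass


@dataclass
class DirstateEntry:
    """Hypothetical record of what is remembered for one tracked file."""
    mtime_ns: int   # modification time recorded when the file was last written
    size: int       # size of the file at that time
    content: bytes  # stand-in for the committed contents (slow path only)


def is_modified(path: str, entry: DirstateEntry) -> bool:
    """Apply the checks in the order the text describes."""
    st = os.stat(path)
    # Modification time unchanged: treat the file as unmodified.
    if st.st_mtime_ns == entry.mtime_ns:
        return False
    # Size changed: the file must have been modified.
    if st.st_size != entry.size:
        return True
    # Time changed but size did not: only now read the actual contents.
    with open(path, "rb") as f:
        return f.read() != entry.content
```

Only the last branch touches the file's data, which is where the reduction in reads described above would come from in this sketch.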