hgbook
annotate fr/ch04-concepts.xml @ 964:6b680d569bb4
deleting a bunch of files not longer necessary to build the documentation.
Adding missing newly files needed to build the documentation
author | Romain PELISSE <belaran@gmail.com> |
---|---|
date | Sun Aug 16 04:58:01 2009 +0200 (2009-08-16) |
parents | |
children | e6894aa7baf2 |
rev | line source |
---|---|
belaran@964 | 1 <!-- vim: set filetype=docbkxml shiftwidth=2 autoindent expandtab tw=77 : --> |
belaran@964 | 2 |
belaran@964 | 3 <chapter id="chap:concepts"> |
belaran@964 | 4 <title>Behind the scenes</title> |
belaran@964 | 6 |
belaran@964 | 7 <para>Unlike many revision control systems, the concepts upon which |
belaran@964 | 8 Mercurial is built are simple enough that it's easy to understand how |
belaran@964 | 9 the software really works. Knowing this certainly isn't necessary, |
belaran@964 | 10 but I find it useful to have a <quote>mental model</quote> of what's going on.</para> |
belaran@964 | 11 |
belaran@964 | 12 <para>This understanding gives me confidence that Mercurial has been |
belaran@964 | 13 carefully designed to be both <emphasis>safe</emphasis> and <emphasis>efficient</emphasis>. And |
belaran@964 | 14 just as importantly, if it's easy for me to retain a good idea of what |
belaran@964 | 15 the software is doing when I perform a revision control task, I'm less |
belaran@964 | 16 likely to be surprised by its behaviour.</para> |
belaran@964 | 17 |
belaran@964 | 18 <para>In this chapter, we'll initially cover the core concepts behind |
belaran@964 | 19 Mercurial's design, then continue to discuss some of the interesting |
belaran@964 | 20 details of its implementation.</para> |
belaran@964 | 21 |
belaran@964 | 22 <sect1> |
belaran@964 | 23 <title>Mercurial's historical record</title> |
belaran@964 | 24 |
belaran@964 | 25 <sect2> |
belaran@964 | 26 <title>Tracking the history of a single file</title> |
belaran@964 | 27 |
belaran@964 | 28 <para>When Mercurial tracks modifications to a file, it stores the history |
belaran@964 | 29 of that file in a metadata object called a <emphasis>filelog</emphasis>. Each entry |
belaran@964 | 30 in the filelog contains enough information to reconstruct one revision |
belaran@964 | 31 of the file that is being tracked. Filelogs are stored as files in |
belaran@964 | 32 the <filename role="special" class="directory">.hg/store/data</filename> directory. A filelog contains two kinds |
belaran@964 | 33 of information: revision data, and an index to help Mercurial to find |
belaran@964 | 34 a revision efficiently.</para> |
belaran@964 | 35 |
belaran@964 | 36 <para>A file that is large, or has a lot of history, has its filelog stored |
belaran@964 | 37 in separate data (<quote><literal>.d</literal></quote> suffix) and index (<quote><literal>.i</literal></quote> |
belaran@964 | 38 suffix) files. For small files without much history, the revision |
belaran@964 | 39 data and index are combined in a single <quote><literal>.i</literal></quote> file. The |
belaran@964 | 40 correspondence between a file in the working directory and the filelog |
belaran@964 | 41 that tracks its history in the repository is illustrated in |
belaran@964 | 42 figure <xref linkend="fig:concepts:filelog"/>.</para> |
belaran@964 | 43 |
belaran@964 | 44 <informalfigure id="fig:concepts:filelog"> |
belaran@964 | 45 |
belaran@964 | 46 <para> <mediaobject><imageobject><imagedata fileref="filelog"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject> |
belaran@964 | 47 <caption><para>Relationships between files in working directory and |
belaran@964 | 48 filelogs in repository</para></caption> |
belaran@964 | 49 </para> |
belaran@964 | 50 </informalfigure> |
belaran@964 | 51 |
belaran@964 | 52 </sect2> |
belaran@964 | 53 <sect2> |
belaran@964 | 54 <title>Managing tracked files</title> |
belaran@964 | 55 |
belaran@964 | 56 <para>Mercurial uses a structure called a <emphasis>manifest</emphasis> to collect |
belaran@964 | 57 together information about the files that it tracks. Each entry in |
belaran@964 | 58 the manifest contains information about the files present in a single |
belaran@964 | 59 changeset. An entry records which files are present in the changeset, |
belaran@964 | 60 the revision of each file, and a few other pieces of file metadata.</para> |
belaran@964 | 61 |
belaran@964 | 62 </sect2> |
belaran@964 | 63 <sect2> |
belaran@964 | 64 <title>Recording changeset information</title> |
belaran@964 | 65 |
belaran@964 | 66 <para>The <emphasis>changelog</emphasis> contains information about each changeset. Each |
belaran@964 | 67 revision records who committed a change, the changeset comment, other |
belaran@964 | 68 pieces of changeset-related information, and the revision of the |
belaran@964 | 69 manifest to use. |
belaran@964 | 70 </para> |
belaran@964 | 71 |
belaran@964 | 72 </sect2> |
belaran@964 | 73 <sect2> |
belaran@964 | 74 <title>Relationships between revisions</title> |
belaran@964 | 75 |
belaran@964 | 76 <para>Within a changelog, a manifest, or a filelog, each revision stores a |
belaran@964 | 77 pointer to its immediate parent (or to its two parents, if it's a |
belaran@964 | 78 merge revision). As I mentioned above, there are also relationships |
belaran@964 | 79 between revisions <emphasis>across</emphasis> these structures, and they are |
belaran@964 | 80 hierarchical in nature. |
belaran@964 | 81 </para> |
belaran@964 | 82 |
belaran@964 | 83 <para>For every changeset in a repository, there is exactly one revision |
belaran@964 | 84 stored in the changelog. Each revision of the changelog contains a |
belaran@964 | 85 pointer to a single revision of the manifest. A revision of the |
belaran@964 | 86 manifest stores a pointer to a single revision of each filelog tracked |
belaran@964 | 87 when that changeset was created. These relationships are illustrated |
belaran@964 | 88 in figure <xref linkend="fig:concepts:metadata"/>. |
belaran@964 | 89 </para> |
belaran@964 | 90 |
belaran@964 | 91 <informalfigure id="fig:concepts:metadata"> |
belaran@964 | 92 |
belaran@964 | 93 <para> <mediaobject><imageobject><imagedata fileref="metadata"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject> |
belaran@964 | 94 <caption><para>Metadata relationships</para></caption> |
belaran@964 | 96 </para> |
belaran@964 | 97 </informalfigure> |
belaran@964 | 98 |
belaran@964 | 99 <para>As the illustration shows, there is <emphasis>not</emphasis> a <quote>one to one</quote> |
belaran@964 | 100 relationship between revisions in the changelog, manifest, or filelog. |
belaran@964 | 101 If the manifest hasn't changed between two changesets, the changelog |
belaran@964 | 102 entries for those changesets will point to the same revision of the |
belaran@964 | 103 manifest. If a file that Mercurial tracks hasn't changed between two |
belaran@964 | 104 changesets, the entry for that file in the two revisions of the |
belaran@964 | 105 manifest will point to the same revision of its filelog. |
belaran@964 | 106 </para> |
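<para>The pointer chain described above can be pictured with a small toy model. This is only a sketch: the dictionaries, field names, and the <literal>files_at</literal> helper are invented for illustration and do not reflect Mercurial's on-disk format.</para>

```python
# Toy model of the changelog / manifest / filelog pointers.
# Every name here is hypothetical; nothing below is Mercurial's real format.
filelogs = {
    "README": ["v1 of README", "v2 of README"],  # filelog revisions 0 and 1
    "main.c": ["v1 of main.c"],                  # filelog revision 0
}

# Each manifest revision maps a tracked file name to one filelog revision.
manifest = [
    {"README": 0, "main.c": 0},  # manifest revision 0
    {"README": 1, "main.c": 0},  # manifest revision 1: only README changed
]

# Each changelog revision points to exactly one manifest revision.
changelog = [
    {"user": "alice", "comment": "initial import", "manifest": 0},
    {"user": "bob", "comment": "update README", "manifest": 1},
]


def files_at(changeset):
    """Follow changelog, then manifest, then filelogs to rebuild every file."""
    entry = manifest[changelog[changeset]["manifest"]]
    return {name: filelogs[name][rev] for name, rev in entry.items()}
```

<para>Because <literal>main.c</literal> did not change in the second changeset, both manifest revisions point to the same filelog revision for it, which is the sharing the paragraph above describes.</para>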
belaran@964 | 107 |
belaran@964 | 108 </sect2> |
belaran@964 | 109 </sect1> |
belaran@964 | 110 <sect1> |
belaran@964 | 111 <title>Safe, efficient storage</title> |
belaran@964 | 112 |
belaran@964 | 113 <para>The underpinnings of changelogs, manifests, and filelogs are provided |
belaran@964 | 114 by a single structure called the <emphasis>revlog</emphasis>. |
belaran@964 | 115 </para> |
belaran@964 | 116 |
belaran@964 | 117 <sect2> |
belaran@964 | 118 <title>Efficient storage</title> |
belaran@964 | 119 |
belaran@964 | 120 <para>The revlog provides efficient storage of revisions using a |
belaran@964 | 121 <emphasis>delta</emphasis> mechanism. Instead of storing a complete copy of a file |
belaran@964 | 122 for each revision, it stores the changes needed to transform an older |
belaran@964 | 123 revision into the new revision. For many kinds of file data, these |
belaran@964 | 124 deltas are typically a fraction of a percent of the size of a full |
belaran@964 | 125 copy of a file. |
belaran@964 | 126 </para> |
belaran@964 | 127 |
belaran@964 | 128 <para>Some obsolete revision control systems can only work with deltas of |
belaran@964 | 129 text files. They must either store binary files as complete snapshots |
belaran@964 | 130 or encoded into a text representation, both of which are wasteful |
belaran@964 | 131 approaches. Mercurial can efficiently handle deltas of files with |
belaran@964 | 132 arbitrary binary contents; it doesn't need to treat text as special. |
belaran@964 | 133 </para> |
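<para>To make the idea of a content-agnostic delta concrete, here is a deliberately naive sketch. It assumes a delta is a single <literal>(start, end, replacement)</literal> edit found by trimming the common prefix and suffix; the function names are hypothetical and this is not Mercurial's actual delta algorithm. The point is only that nothing in it cares whether the bytes are text.</para>

```python
def make_delta(old: bytes, new: bytes):
    """Return one (start, end, replacement) edit that turns old into new."""
    limit = min(len(old), len(new))
    # Length of the common prefix.
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    # Length of the common suffix that does not overlap the prefix.
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return (prefix, len(old) - suffix, new[prefix:len(new) - suffix])


def apply_delta(old: bytes, delta):
    """Rebuild the new revision from the old one and a stored delta."""
    start, end, replacement = delta
    return old[:start] + replacement + old[end:]
```

<para>For a small change in the middle of a large binary blob, the stored replacement is only a few bytes long, while the full revision can still be rebuilt exactly with <literal>apply_delta</literal>.</para>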
belaran@964 | 134 |
belaran@964 | 135 </sect2> |
belaran@964 | 136 <sect2 id="sec:concepts:txn"> |
belaran@964 | 137 <title>Safe operation</title> |
belaran@964 | 140 |
belaran@964 | 141 <para>Mercurial only ever <emphasis>appends</emphasis> data to the end of a revlog file. |
belaran@964 | 142 It never modifies a section of a file after it has written it. This |
belaran@964 | 143 is both more robust and more efficient than schemes that need to modify or |
belaran@964 | 144 rewrite data. |
belaran@964 | 145 </para> |
belaran@964 | 146 |
belaran@964 | 147 <para>In addition, Mercurial treats every write as part of a |
belaran@964 | 148 <emphasis>transaction</emphasis> that can span a number of files. A transaction is |
belaran@964 | 149 <emphasis>atomic</emphasis>: either the entire transaction succeeds and its effects |
belaran@964 | 150 are all visible to readers in one go, or the whole thing is undone. |
belaran@964 | 151 This guarantee of atomicity means that if you're running two copies of |
belaran@964 | 152 Mercurial, where one is reading data and one is writing it, the reader |
belaran@964 | 153 will never see a partially written result that might confuse it. |
belaran@964 | 154 </para> |
belaran@964 | 155 |
belaran@964 | 156 <para>The fact that Mercurial only appends to files makes it easier to |
belaran@964 | 157 provide this transactional guarantee. The easier it is to do stuff |
belaran@964 | 158 like this, the more confident you should be that it's done correctly. |
belaran@964 | 159 </para> |
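<para>One way to see why append-only files make transactions easier: undoing a failed write only requires cutting each file back to the length it had before. The sketch below assumes a journal of original file sizes; the <literal>Transaction</literal> class and its methods are hypothetical, and it does not model how readers are kept from seeing partial data.</para>

```python
import os
import tempfile


class Transaction:
    """Toy append-only transaction: remember old sizes, truncate on rollback."""

    def __init__(self):
        self.journal = {}  # path -> file size before the first append

    def append(self, path, data):
        # Record the original length once, before touching the file.
        if path not in self.journal:
            size = os.path.getsize(path) if os.path.exists(path) else 0
            self.journal[path] = size
        with open(path, "ab") as f:
            f.write(data)

    def commit(self):
        # Nothing to undo any more; forget the recorded sizes.
        self.journal.clear()

    def rollback(self):
        # Cut every touched file back to its recorded length.
        for path, size in self.journal.items():
            with open(path, "r+b") as f:
                f.truncate(size)
        self.journal.clear()
```

<para>Since earlier bytes are never modified, truncation restores every file exactly, however many files the transaction touched.</para>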
belaran@964 | 160 |
belaran@964 | 161 </sect2> |
belaran@964 | 162 <sect2> |
belaran@964 | 163 <title>Fast retrieval</title> |
belaran@964 | 164 |
belaran@964 | 165 <para>Mercurial cleverly avoids a pitfall common to many earlier |
belaran@964 | 166 revision control systems: the problem of <emphasis>inefficient retrieval</emphasis>. |
belaran@964 | 167 Most revision control systems store the contents of a revision as an |
belaran@964 | 168 incremental series of modifications against a <quote>snapshot</quote>. To |
belaran@964 | 169 reconstruct a specific revision, you must first read the snapshot, and |
belaran@964 | 170 then every one of the revisions between the snapshot and your target |
belaran@964 | 171 revision. The more history that a file accumulates, the more |
belaran@964 | 172 revisions you must read, hence the longer it takes to reconstruct a |
belaran@964 | 173 particular revision. |
belaran@964 | 174 </para> |
belaran@964 | 175 |
belaran@964 | 176 <informalfigure id="fig:concepts:snapshot"> |
belaran@964 | 177 |
belaran@964 | 178 <para> <mediaobject><imageobject><imagedata fileref="snapshot"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject> |
belaran@964 | 179 <caption><para>Snapshot of a revlog, with incremental deltas</para></caption> |
belaran@964 | 181 </para> |
belaran@964 | 182 </informalfigure> |
belaran@964 | 183 |
belaran@964 | 184 <para>The innovation that Mercurial applies to this problem is simple but |
belaran@964 | 185 effective. Once the cumulative amount of delta information stored |
belaran@964 | 186 since the last snapshot exceeds a fixed threshold, it stores a new |
belaran@964 | 187 snapshot (compressed, of course), instead of another delta. This |
belaran@964 | 188 makes it possible to reconstruct <emphasis>any</emphasis> revision of a file |
belaran@964 | 189 quickly. This approach works so well that it has since been copied by |
belaran@964 | 190 several other revision control systems. |
belaran@964 | 191 </para> |
belaran@964 | 192 |
belaran@964 | 193 <para>Figure <xref linkend="fig:concepts:snapshot"/> illustrates the idea. In an entry |
belaran@964 | 194 in a revlog's index file, Mercurial stores the range of entries from |
belaran@964 | 195 the data file that it must read to reconstruct a particular revision. |
belaran@964 | 196 </para> |
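<para>The snapshot rule can be sketched as a toy revlog. Everything here is an assumption made for illustration: the <literal>ToyRevlog</literal> class, the single-edit delta format, and a threshold counted in bytes of delta payload. The real thresholds and index layout are not described in this chapter.</para>

```python
def _delta(old: bytes, new: bytes):
    """Naive single-edit delta: trim the common prefix and suffix."""
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    return (prefix, len(old) - suffix, new[prefix:len(new) - suffix])


class ToyRevlog:
    """Stores snapshots or deltas; the index records each revision's snapshot."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed cap on delta bytes per chain
        self.data = []              # ("snap", text) or ("delta", edit) entries
        self.base = []              # index: revision of the governing snapshot
        self.pending = 0            # delta bytes stored since that snapshot

    def add(self, text):
        if self.data:
            delta = _delta(self.get(len(self.data) - 1), text)
            if self.pending + len(delta[2]) <= self.threshold:
                self.pending += len(delta[2])
                self.data.append(("delta", delta))
                self.base.append(self.base[-1])
                return
        # First revision, or threshold exceeded: store a fresh snapshot.
        self.data.append(("snap", text))
        self.base.append(len(self.data) - 1)
        self.pending = 0

    def get(self, rev):
        """Read the snapshot, then apply only the deltas up to rev."""
        first = self.base[rev]
        text = self.data[first][1]
        for _kind, (start, end, repl) in self.data[first + 1:rev + 1]:
            text = text[:start] + repl + text[end:]
        return text
```

<para>However long the history grows, <literal>get</literal> never reads further back than the nearest snapshot, so the cost of reconstructing a revision stays bounded by the threshold rather than by the age of the file.</para>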
belaran@964 | 197 |
belaran@964 | 198 <sect3> |
belaran@964 | 199 <title>Aside: the influence of video compression</title> |
belaran@964 | 200 |
belaran@964 | 201 <para>If you're familiar with video compression or have ever watched a TV |
belaran@964 | 202 feed through a digital cable or satellite service, you may know that |
belaran@964 | 203 most video compression schemes store each frame of video as a delta |
belaran@964 | 204 against its predecessor frame. In addition, these schemes use |
belaran@964 | 205 <quote>lossy</quote> compression techniques to increase the compression ratio, so |
belaran@964 | 206 visual errors accumulate over the course of a number of inter-frame |
belaran@964 | 207 deltas. |
belaran@964 | 208 </para> |
belaran@964 | 209 |
belaran@964 | 210 <para>Because it's possible for a video stream to <quote>drop out</quote> occasionally |
belaran@964 | 211 due to signal glitches, and to limit the accumulation of artefacts |
belaran@964 | 212 introduced by the lossy compression process, video encoders |
belaran@964 | 213 periodically insert a complete frame (called a <quote>key frame</quote>) into the |
belaran@964 | 214 video stream; the next delta is generated against that frame. This |
belaran@964 | 215 means that if the video signal gets interrupted, it will resume once |
belaran@964 | 216 the next key frame is received. Also, the accumulation of encoding |
belaran@964 | 217 errors restarts anew with each key frame. |
belaran@964 | 218 </para> |
belaran@964 | 219 |
belaran@964 | 220 </sect3> |
belaran@964 | 221 </sect2> |
belaran@964 | 222 <sect2> |
belaran@964 | 223 <title>Identification and strong integrity</title> |
belaran@964 | 224 |
belaran@964 | 225 <para>Along with delta or snapshot information, a revlog entry contains a |
belaran@964 | 226 cryptographic hash of the data that it represents. This makes it |
belaran@964 | 227 difficult to forge the contents of a revision, and easy to detect |
belaran@964 | 228 accidental corruption. |
belaran@964 | 229 </para> |
belaran@964 | 230 |
belaran@964 | 231 <para>Hashes provide more than a mere check against corruption; they are |
belaran@964 | 232 used as the identifiers for revisions. The changeset identification |
belaran@964 | 233 hashes that you see as an end user are from revisions of the |
belaran@964 | 234 changelog. Although filelogs and the manifest also use hashes, |
belaran@964 | 235 Mercurial only uses these behind the scenes. |
belaran@964 | 236 </para> |
belaran@964 | 237 |
belaran@964 | 238 <para>Mercurial verifies that hashes are correct when it retrieves file |
belaran@964 | 239 revisions and when it pulls changes from another repository. If it |
belaran@964 | 240 encounters an integrity problem, it will complain and stop whatever |
belaran@964 | 241 it's doing. |
belaran@964 | 242 </para> |
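<para>A minimal sketch of hash-based identification and checking follows. It assumes SHA-1 computed over the two parent IDs (in sorted order) followed by the revision data, with twenty zero bytes standing for a missing parent; the text above does not spell out these details, and the function names are hypothetical.</para>

```python
import hashlib

NULL_ID = b"\0" * 20  # assumed placeholder meaning "no parent here"


def node_id(text: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> bytes:
    """Identifier derived from the revision data and its parents."""
    first, second = sorted([p1, p2])
    return hashlib.sha1(first + second + text).digest()


def verify(text: bytes, p1: bytes, p2: bytes, expected: bytes) -> bytes:
    """Return text if it matches its recorded ID, otherwise stop loudly."""
    if node_id(text, p1, p2) != expected:
        raise ValueError("integrity check failed")
    return text
```

<para>Because the identifier depends on the content, a single flipped byte in stored data yields a different hash, which is what lets the check detect corruption.</para>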
belaran@964 | 243 |
belaran@964 | 244 <para>In addition to the effect it has on retrieval efficiency, Mercurial's |
belaran@964 | 245 use of periodic snapshots makes it more robust against partial data |
belaran@964 | 246 corruption. If a revlog becomes partly corrupted due to a hardware |
belaran@964 | 247 error or system bug, it's often possible to reconstruct some or most |
belaran@964 | 248 revisions from the uncorrupted sections of the revlog, both before and |
belaran@964 | 249 after the corrupted section. This would not be possible with a |
belaran@964 | 250 delta-only storage model. |
belaran@964 | 251 </para> |
belaran@964 | 252 |
belaran@964 | 253 </sect2> |
belaran@964 | 254 <sect2> |
belaran@964 | 255 <title>Revision history, branching, and merging</title> |
belaran@964 | 256 |
belaran@964 | 257 <para>Every entry in a Mercurial revlog knows the identity of its immediate |
belaran@964 | 258 ancestor revision, usually referred to as its <emphasis>parent</emphasis>. In fact, |
belaran@964 | 259 a revision contains room for not one parent, but two. Mercurial uses |
belaran@964 | 260 a special hash, called the <quote>null ID</quote>, to represent the idea <quote>there |
belaran@964 | 261 is no parent here</quote>. This hash is simply a string of zeroes. |
belaran@964 | 262 </para> |
belaran@964 | 263 |
belaran@964 | 264 <para>In figure <xref linkend="fig:concepts:revlog"/>, you can see an example of the |
belaran@964 | 265 conceptual structure of a revlog. Filelogs, manifests, and changelogs |
belaran@964 | 266 all have this same structure; they differ only in the kind of data |
belaran@964 | 267 stored in each delta or snapshot. |
belaran@964 | 268 </para> |
belaran@964 | 269 |
belaran@964 | 270 <para>The first revision in a revlog (at the bottom of the image) has the |
belaran@964 | 271 null ID in both of its parent slots. For a <quote>normal</quote> revision, its |
belaran@964 | 272 first parent slot contains the ID of its parent revision, and its |
belaran@964 | 273 second contains the null ID, indicating that the revision has only one |
belaran@964 | 274 real parent. Any two revisions that have the same parent ID are |
belaran@964 | 275 branches. A revision that represents a merge between branches has two |
belaran@964 | 276 normal revision IDs in its parent slots. |
belaran@964 | 277 </para> |
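<para>The parent-slot rules above are simple enough to state as code. In this sketch the null ID is written as forty hexadecimal zeroes, which is an assumption about its printed form, and <literal>describe</literal> is a hypothetical helper.</para>

```python
NULL_ID = "0" * 40  # assumed hex spelling of the null ID: a string of zeroes


def describe(p1: str, p2: str) -> str:
    """Classify a revision by the contents of its two parent slots."""
    if p1 == NULL_ID and p2 == NULL_ID:
        return "first revision"  # no parents at all
    if p2 == NULL_ID:
        return "normal"          # exactly one real parent
    return "merge"               # two real parents
```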
belaran@964 | 278 |
belaran@964 | 279 <informalfigure id="fig:concepts:revlog"> |
belaran@964 | 280 |
belaran@964 | 281 <para> <mediaobject><imageobject><imagedata fileref="revlog"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject> |
belaran@964 | 284 </para> |
belaran@964 | 285 </informalfigure> |
belaran@964 | 286 |
belaran@964 | 287 </sect2> |
belaran@964 | 288 </sect1> |
belaran@964 | 289 <sect1> |
belaran@964 | 290 <title>The working directory</title> |
belaran@964 | 291 |
belaran@964 | 292 <para>In the working directory, Mercurial stores a snapshot of the files |
belaran@964 | 293 from the repository as of a particular changeset. |
belaran@964 | 294 </para> |
belaran@964 | 295 |
belaran@964 | 296 <para>The working directory <quote>knows</quote> which changeset it contains. When you |
belaran@964 | 297 update the working directory to contain a particular changeset, |
belaran@964 | 298 Mercurial looks up the appropriate revision of the manifest to find |
belaran@964 | 299 out which files it was tracking at the time that changeset was |
belaran@964 | 300 committed, and which revision of each file was then current. It then |
belaran@964 | 301 recreates a copy of each of those files, with the same contents it had |
belaran@964 | 302 when the changeset was committed. |
belaran@964 | 303 </para> |
belaran@964 | 304 |
belaran@964 | 305 <para>The <emphasis>dirstate</emphasis> contains Mercurial's knowledge of the working |
belaran@964 | 306 directory. This details which changeset the working directory is |
belaran@964 | 307 updated to, and all of the files that Mercurial is tracking in the |
belaran@964 | 308 working directory. |
belaran@964 | 309 </para> |
belaran@964 | 310 |
belaran@964 | 311 <para>Just as a revision of a revlog has room for two parents, so that it |
belaran@964 | 312 can represent either a normal revision (with one parent) or a merge of |
belaran@964 | 313 two earlier revisions, the dirstate has slots for two parents. When |
belaran@964 | 314 you use the <command role="hg-cmd">hg update</command> command, the changeset that you update to |
belaran@964 | 315 is stored in the <quote>first parent</quote> slot, and the null ID in the second. |
belaran@964 | 316 When you <command role="hg-cmd">hg merge</command> with another changeset, the first parent |
belaran@964 | 317 remains unchanged, and the second parent is filled in with the |
belaran@964 | 318 changeset you're merging with. The <command role="hg-cmd">hg parents</command> command tells you |
belaran@964 | 319 what the parents of the dirstate are. |
belaran@964 | 320 </para> |
belaran@964 | 321 |
belaran@964 | 322 <sect2> |
belaran@964 | 323 <title>What happens when you commit</title> |
belaran@964 | 324 |
belaran@964 | 325 <para>The dirstate stores parent information for more than just book-keeping |
belaran@964 | 326 purposes. Mercurial uses the parents of the dirstate as <emphasis>the |
belaran@964 | 327 parents of a new changeset</emphasis> when you perform a commit. |
belaran@964 | 328 </para> |
belaran@964 | 329 |
belaran@964 | 330 <informalfigure id="fig:concepts:wdir"> |
belaran@964 | 331 |
belaran@964 | 332 <para> <mediaobject><imageobject><imagedata fileref="wdir"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject> |
belaran@964 | 333 <caption><para>The working directory can have two parents</para></caption> |
belaran@964 | 335 </para> |
belaran@964 | 336 </informalfigure> |
belaran@964 | 337 |
belaran@964 | 338 <para>Figure <xref linkend="fig:concepts:wdir"/> shows the normal state of the working |
belaran@964 | 339 directory, where it has a single changeset as parent. That changeset |
belaran@964 | 340 is the <emphasis>tip</emphasis>, the newest changeset in the repository that has no |
belaran@964 | 341 children. |
belaran@964 | 342 </para> |
belaran@964 | 343 |
belaran@964 | 344 <informalfigure id="fig:concepts:wdir-after-commit"> |
belaran@964 | 345 |
belaran@964 | 346 <para> <mediaobject><imageobject><imagedata fileref="wdir-after-commit"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject> |
belaran@964 | 347 <caption><para>The working directory gains new parents after a commit</para></caption> |
belaran@964 | 349 </para> |
belaran@964 | 350 </informalfigure> |
belaran@964 | 351 |
belaran@964 | 352 <para>It's useful to think of the working directory as <quote>the changeset I'm |
belaran@964 | 353 about to commit</quote>. Any files that you tell Mercurial that you've |
belaran@964 | 354 added, removed, renamed, or copied will be reflected in that |
belaran@964 | 355 changeset, as will modifications to any files that Mercurial is |
belaran@964 | 356 already tracking; the new changeset will have the parents of the |
belaran@964 | 357 working directory as its parents. |
belaran@964 | 358 </para> |
belaran@964 | 359 |
belaran@964 | 360 <para>After a commit, Mercurial will update the parents of the working |
belaran@964 | 361 directory, so that the first parent is the ID of the new changeset, |
belaran@964 | 362 and the second is the null ID. This is shown in |
belaran@964 | 363 figure <xref linkend="fig:concepts:wdir-after-commit"/>. Mercurial doesn't touch |
belaran@964 | 364 any of the files in the working directory when you commit; it just |
belaran@964 | 365 modifies the dirstate to note its new parents. |
belaran@964 | 366 </para> |
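<para>The bookkeeping performed by update, merge, and commit can be summarised in a few lines. This is a toy model only: <literal>ToyDirstate</literal>, its methods, and the <literal>history</literal> dictionary are hypothetical names, and changeset IDs are plain strings.</para>

```python
NULL = "null"  # stands in for the null ID


class ToyDirstate:
    """Tracks only the two parent slots of the working directory."""

    def __init__(self):
        self.parents = (NULL, NULL)

    def update(self, changeset):
        # Updating sets the first parent and clears the second.
        self.parents = (changeset, NULL)

    def merge(self, other):
        # Merging keeps the first parent and fills in the second.
        self.parents = (self.parents[0], other)

    def commit(self, new_id, history):
        # The new changeset takes the dirstate's parents as its own...
        history[new_id] = self.parents
        # ...and the working directory now sits on the new changeset.
        self.parents = (new_id, NULL)
```

<para>Note that <literal>commit</literal> changes nothing except the recorded parents, mirroring the fact that a commit leaves the files in the working directory untouched.</para>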
belaran@964 | 367 |
belaran@964 | 368 </sect2> |
belaran@964 | 369 <sect2> |
belaran@964 | 370 <title>Creating a new head</title> |
belaran@964 | 371 |
belaran@964 | 372 <para>It's perfectly normal to update the working directory to a changeset |
belaran@964 | 373 other than the current tip. For example, you might want to know what |
belaran@964 | 374 your project looked like last Tuesday, or you could be looking through |
belaran@964 | 375 changesets to see which one introduced a bug. In cases like this, the |
belaran@964 | 376 natural thing to do is update the working directory to the changeset |
belaran@964 | 377 you're interested in, and then examine the files in the working |
belaran@964 | 378 directory directly to see their contents as they were when you |
belaran@964 | 379 committed that changeset. The effect of this is shown in |
belaran@964 | 380 figure <xref linkend="fig:concepts:wdir-pre-branch"/>. |
belaran@964 | 381 </para> |
belaran@964 | 382 |
belaran@964 | 383 <informalfigure id="fig:concepts:wdir-pre-branch"> |
belaran@964 | 384 |
belaran@964 | 385 <para> <mediaobject><imageobject><imagedata fileref="wdir-pre-branch"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject> |
belaran@964 | 386 <caption><para>The working directory, updated to an older changeset</para></caption> |
belaran@964 | 388 </para> |
belaran@964 | 389 </informalfigure> |
belaran@964 | 390 |
belaran@964 | 391 <para>Having updated the working directory to an older changeset, what |
belaran@964 | 392 happens if you make some changes, and then commit? Mercurial behaves |
belaran@964 | 393 in the same way as I outlined above. The parents of the working |
belaran@964 | 394 directory become the parents of the new changeset. This new changeset |
belaran@964 | 395 has no children, so it becomes the new tip. And the repository now |
belaran@964 | 396 contains two changesets that have no children; we call these |
belaran@964 | 397 <emphasis>heads</emphasis>. You can see the structure that this creates in |
belaran@964 | 398 figure <xref linkend="fig:concepts:wdir-branch"/>. |
belaran@964 | 399 </para> |
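<para>Since a head is just a changeset with no children, the heads of a repository can be computed from parent pointers alone. The sketch below assumes history is given as a dictionary from changeset name to a tuple of parent names; the names and the <literal>heads</literal> function are illustrative.</para>

```python
def heads(parents_of):
    """Return the changesets that no other changeset names as a parent."""
    has_child = {p for parents in parents_of.values() for p in parents}
    return sorted(c for c in parents_of if c not in has_child)


# c2 was the tip; c3 was committed after updating back to c1.
history = {
    "c0": (),
    "c1": ("c0",),
    "c2": ("c1",),
    "c3": ("c1",),
}
```

<para>Here <literal>heads(history)</literal> yields both <literal>c2</literal> and <literal>c3</literal>, the two-headed situation described above.</para>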
belaran@964 | 400 |
belaran@964 | 401 <informalfigure id="fig:concepts:wdir-branch"> |
belaran@964 | 402 |
belaran@964 | 403 <para> <mediaobject><imageobject><imagedata fileref="wdir-branch"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject> |
belaran@964 | 404 <caption><para>After a commit made while synced to an older changeset</para></caption> |
belaran@964 | 406 </para> |
belaran@964 | 407 </informalfigure> |
belaran@964 | 408 |
belaran@964 | 409 <note> |
belaran@964 | 410 <para> If you're new to Mercurial, you should keep in mind a common |
belaran@964 | 411 <quote>error</quote>, which is to use the <command role="hg-cmd">hg pull</command> command without any |
belaran@964 | 412 options. By default, the <command role="hg-cmd">hg pull</command> command <emphasis>does not</emphasis> |
belaran@964 | 413 update the working directory, so you'll bring new changesets into |
belaran@964 | 414 your repository, but the working directory will stay synced at the |
belaran@964 | 415 same changeset as before the pull. If you make some changes and |
belaran@964 | 416 commit afterwards, you'll thus create a new head, because your |
belaran@964 | 417 working directory isn't synced to whatever the current tip is. |
belaran@964 | 418 </para> |
belaran@964 | 419 |
belaran@964 | 420 <para> I put the word <quote>error</quote> in quotes because all that you need to do |
belaran@964 | 421 to rectify this situation is <command role="hg-cmd">hg merge</command>, then <command role="hg-cmd">hg commit</command>. In |
belaran@964 | 422 other words, this almost never has negative consequences; it just |
belaran@964 | 423 surprises people. I'll discuss other ways to avoid this behaviour, |
belaran@964 | 424 and why Mercurial behaves in this initially surprising way, later |
belaran@964 | 425 on. |
belaran@964 | 426 </para> |
belaran@964 | 427 </note> |
belaran@964 | 428 |
belaran@964 | 429 </sect2> |
belaran@964 | 430 <sect2> |
belaran@964 | 431 <title>Merging heads</title> |
belaran@964 | 432 |
belaran@964 | 433 <para>When you run the <command role="hg-cmd">hg merge</command> command, Mercurial leaves the first |
belaran@964 | 434 parent of the working directory unchanged, and sets the second parent |
belaran@964 | 435 to the changeset you're merging with, as shown in |
belaran@964 | 436 figure <xref linkend="fig:concepts:wdir-merge"/>. |
belaran@964 | 437 </para> |
belaran@964 | 438 |
belaran@964 | 439 <informalfigure id="fig:concepts:wdir-merge"> |
belaran@964 | 440 |
belaran@964 | 441 <para> <mediaobject><imageobject><imagedata fileref="wdir-merge"/></imageobject><textobject><phrase>XXX add text</phrase></textobject></mediaobject> |
belaran@964 | 442 <caption><para>Merging two heads</para></caption> |
belaran@964 | 444 </para> |
belaran@964 | 445 </informalfigure> |
belaran@964 | 446 |
belaran@964 | 447 <para>Mercurial also has to modify the working directory, to merge the files |
belaran@964 | 448 managed in the two changesets. Simplified a little, the merging |
belaran@964 | 449 process goes like this, for every file in the manifests of both |
belaran@964 | 450 changesets: |
belaran@964 | 451 </para> |
belaran@964 | 452 <itemizedlist> |
belaran@964 | 453 <listitem><para>If neither changeset has modified a file, do nothing with that |
belaran@964 | 454 file. |
belaran@964 | 455 </para> |
belaran@964 | 456 </listitem> |
belaran@964 | 457 <listitem><para>If one changeset has modified a file, and the other hasn't, |
belaran@964 | 458 create the modified copy of the file in the working directory. |
belaran@964 | 459 </para> |
belaran@964 | 460 </listitem> |
belaran@964 | 461 <listitem><para>If one changeset has removed a file, and the other hasn't (or |
belaran@964 | 462 has also deleted it), delete the file from the working directory. |
belaran@964 | 463 </para> |
belaran@964 | 464 </listitem> |
belaran@964 | 465 <listitem><para>If one changeset has removed a file, but the other has modified |
belaran@964 | 466 the file, ask the user what to do: keep the modified file, or remove |
belaran@964 | 467 it? |
belaran@964 | 468 </para> |
belaran@964 | 469 </listitem> |
belaran@964 | 470 <listitem><para>If both changesets have modified a file, invoke an external |
belaran@964 | 471 merge program to choose the new contents for the merged file. This |
belaran@964 | 472 may require input from the user. |
belaran@964 | 473 </para> |
belaran@964 | 474 </listitem> |
belaran@964 | 475 <listitem><para>If one changeset has modified a file, and the other has renamed |
belaran@964 | 476 or copied the file, make sure that the changes follow the new name |
belaran@964 | 477 of the file. |
belaran@964 | 478 </para> |
belaran@964 | 479 </listitem></itemizedlist> |
belaran@964 | 480 <para>There are more details&mdash;merging has plenty of corner cases&mdash;but |
belaran@964 | 481 these are the most common choices that are involved in a merge. As |
belaran@964 | 482 you can see, most cases are completely automatic, and indeed most |
belaran@964 | 483 merges finish automatically, without requiring your input to resolve |
belaran@964 | 484 any conflicts. |
belaran@964 | 485 </para> |
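<para>The per-file choices listed above can be sketched as a small decision function. It assumes each side of the merge is compared with a common ancestor version of the file, with <literal>None</literal> meaning the file is absent; the function name and the returned strings are hypothetical, and renames and copies are left out.</para>

```python
def merge_action(base, local, other):
    """Pick an action for one file.

    base, local and other are the file's contents in the common ancestor
    and in the two changesets being merged, or None if the file is absent.
    """
    local_changed = local != base
    other_changed = other != base
    if not local_changed and not other_changed:
        return "do nothing"
    # A removal only makes sense if the ancestor had the file.
    if base is not None and (local is None or other is None):
        survivor = other if local is None else local
        if survivor is None or survivor == base:
            return "delete"    # removed on one side, untouched or removed on the other
        return "ask user"      # removed on one side, modified on the other
    if local_changed and other_changed:
        return "run merge tool"  # both sides modified the file
    return "take modified copy"  # only one side modified the file
```

<para>Only two of the five outcomes can involve the user, which matches the observation that most merges complete automatically.</para>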
belaran@964 | 486 |
belaran@964 | 487 <para>When you're thinking about what happens when you commit after a merge, |
belaran@964 | 488 once again the working directory is <quote>the changeset I'm about to |
belaran@964 | 489 commit</quote>. After the <command role="hg-cmd">hg merge</command> command completes, the working |
belaran@964 | 490 directory has two parents; these will become the parents of the new |
belaran@964 | 491 changeset. |
belaran@964 | 492 </para> |
belaran@964 | 493 |
belaran@964 | 494 <para>Mercurial lets you perform multiple merges, but you must commit the |
belaran@964 | 495 results of each individual merge as you go. This is necessary because |
belaran@964 | 496 Mercurial only tracks two parents for both revisions and the working |
belaran@964 | 497 directory. While it would be technically possible to merge multiple |
belaran@964 | 498 changesets at once, the prospect of user confusion and making a |
belaran@964 | 499 terrible mess of a merge immediately becomes overwhelming. |
belaran@964 | 500 </para> |
belaran@964 | 501 |
belaran@964 | 502 </sect2> |
belaran@964 | 503 </sect1> |
belaran@964 | 504 <sect1> |
belaran@964 | 505 <title>Other interesting design features</title> |
belaran@964 | 506 |
belaran@964 | 507 <para>In the sections above, I've tried to highlight some of the most |
belaran@964 | 508 important aspects of Mercurial's design, to illustrate that it pays |
belaran@964 | 509 careful attention to reliability and performance. However, the |
belaran@964 | 510 attention to detail doesn't stop there. There are a number of other |
belaran@964 | 511 aspects of Mercurial's construction that I personally find |
belaran@964 | 512 interesting. I'll detail a few of them here, separate from the <quote>big |
belaran@964 | 513 ticket</quote> items above, so that if you're interested, you can gain a |
belaran@964 | 514 better idea of the amount of thinking that goes into a well-designed |
belaran@964 | 515 system. |
belaran@964 | 516 </para> |
belaran@964 | 517 |
belaran@964 | 518 <sect2> |
belaran@964 | 519 <title>Clever compression</title> |
belaran@964 | 520 |
belaran@964 | 521 <para>When appropriate, Mercurial will store both snapshots and deltas in |
belaran@964 | 522 compressed form. It does this by always <emphasis>trying to</emphasis> compress a |
belaran@964 | 523 snapshot or delta, but only storing the compressed version if it's |
belaran@964 | 524 smaller than the uncompressed version. |
belaran@964 | 525 </para> |
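The "try to compress, keep whichever is smaller" rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the one-byte tags (`z` for compressed, `u` for uncompressed) are made up for the example and are not claimed to match Mercurial's on-disk format.

```python
import zlib


def store(data: bytes) -> bytes:
    """Return the smaller of the compressed and uncompressed forms.

    The leading tag byte is a hypothetical marker for this sketch only.
    """
    packed = zlib.compress(data)
    if len(data) > len(packed):
        return b"z" + packed   # compression helped, keep the compressed form
    return b"u" + data         # compression did not help, keep the data as-is


def load(chunk: bytes) -> bytes:
    """Undo store(): decompress only if the chunk was stored compressed."""
    tag, body = chunk[:1], chunk[1:]
    if tag == b"z":
        return zlib.decompress(body)
    return body
```

Repetitive text comes back from `store` tagged as compressed, while data that is already compressed (or random) is kept verbatim, at a cost of one extra byte in this sketch.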
belaran@964 | 526 |
belaran@964 | 527 <para>This means that Mercurial does <quote>the right thing</quote> when storing a file |
belaran@964 | 528 whose native form is compressed, such as a <literal>zip</literal> archive or a |
belaran@964 | 529 JPEG image. When these types of files are compressed a second time, |
belaran@964 | 530 the resulting file is usually bigger than the once-compressed form, |
belaran@964 | 531 and so Mercurial will store the plain <literal>zip</literal> or JPEG. |
belaran@964 | 532 </para> |
belaran@964 | 533 |
belaran@964 | 534 <para>Deltas between revisions of a compressed file are usually larger than |
belaran@964 | 535 snapshots of the file, and Mercurial again does <quote>the right thing</quote> in |
belaran@964 | 536 these cases. It finds that such a delta exceeds the threshold at |
belaran@964 | 537 which it should store a complete snapshot of the file, so it stores |
belaran@964 | 538 the snapshot, again saving space compared to a naive delta-only |
belaran@964 | 539 approach. |
belaran@964 | 540 </para> |
belaran@964 | 541 |
belaran@964 | 542 <sect3> |
belaran@964 | 543 <title>Network recompression</title> |
belaran@964 | 544 |
belaran@964 | 545 <para>When storing revisions on disk, Mercurial uses the <quote>deflate</quote> |
belaran@964 | 546 compression algorithm (the same one used by the popular <literal>zip</literal> |
belaran@964 | 547 archive format), which balances good speed with a respectable |
belaran@964 | 548 compression ratio. However, when transmitting revision data over a |
belaran@964 | 549 network connection, Mercurial uncompresses the compressed revision |
belaran@964 | 550 data. |
belaran@964 | 551 </para> |
belaran@964 | 552 |
belaran@964 | 553 <para>If the connection is over HTTP, Mercurial recompresses the entire |
belaran@964 | 554 stream of data using a compression algorithm that gives a better |
belaran@964 | 555 compression ratio (the Burrows-Wheeler algorithm from the widely used |
belaran@964 | 556 <literal>bzip2</literal> compression package). This combination of algorithm |
belaran@964 | 557 and compression of the entire stream (instead of a revision at a time) |
belaran@964 | 558 substantially reduces the number of bytes to be transferred, yielding |
belaran@964 | 559 better network performance over almost all kinds of network. |
belaran@964 | 560 </para> |
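To get a feel for why compressing the whole stream beats compressing one revision at a time, here is a rough size comparison using Python's standard `zlib` and `bz2` modules. The revision data is invented for the example, and the sketch says nothing about Mercurial's actual wire format; it only compares byte counts.

```python
import bz2
import zlib


def per_revision_size(revisions):
    """Total size when each revision is deflated on its own."""
    return sum(len(zlib.compress(rev)) for rev in revisions)


def whole_stream_size(revisions):
    """Size when all revisions are joined and compressed once with bzip2."""
    return len(bz2.compress(b"".join(revisions)))


# Made-up data: 200 revisions that differ only slightly from each other.
revisions = [
    ("line %d of a file that changes a little in every revision\n" % i).encode() * 20
    for i in range(200)
]
```

On data like this, `whole_stream_size(revisions)` comes out far smaller than `per_revision_size(revisions)`, because the single compressor can exploit the similarity between revisions.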
belaran@964 | 561 |
belaran@964 | 562 <para>(If the connection is over <command>ssh</command>, Mercurial <emphasis>doesn't</emphasis> |
belaran@964 | 563 recompress the stream, because <command>ssh</command> can already do this |
belaran@964 | 564 itself.) |
belaran@964 | 565 </para> |
belaran@964 | 566 |
belaran@964 | 567 </sect3> |
belaran@964 | 568 </sect2> |
belaran@964 | 569 <sect2> |
belaran@964 | 570 <title>Read/write ordering and atomicity</title> |
belaran@964 | 571 |
belaran@964 | 572 <para>Appending to files isn't the whole story when it comes to guaranteeing |
belaran@964 | 573 that a reader won't see a partial write. If you recall |
belaran@964 | 574 figure <xref linkend="fig:concepts:metadata"/>, revisions in the changelog point to |
belaran@964 | 575 revisions in the manifest, and revisions in the manifest point to |
belaran@964 | 576 revisions in filelogs. This hierarchy is deliberate. |
belaran@964 | 577 </para> |
belaran@964 | 578 |
belaran@964 | 579 <para>A writer starts a transaction by writing filelog and manifest data, |
belaran@964 | 580 and doesn't write any changelog data until those are finished. A |
belaran@964 | 581 reader starts by reading changelog data, then manifest data, followed |
belaran@964 | 582 by filelog data. |
belaran@964 | 583 </para> |
belaran@964 | 584 |
belaran@964 | 585 <para>Since the writer has always finished writing filelog and manifest data |
belaran@964 | 586 before it writes to the changelog, a reader will never read a pointer |
belaran@964 | 587 to a partially written manifest revision from the changelog, and it will |
belaran@964 | 588 never read a pointer to a partially written filelog revision from the |
belaran@964 | 589 manifest. |
belaran@964 | 590 </para> |
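The ordering argument is easy to check with a toy model. The class below is a hypothetical in-memory stand-in (plain Python lists, not real revlogs): the writer appends to the filelog, then the manifest, then the changelog, and the reader always starts from the changelog.

```python
class ToyRepo:
    """In-memory stand-in for the changelog/manifest/filelog hierarchy."""

    def __init__(self):
        self.filelog = []    # file contents
        self.manifest = []   # each entry is an index into filelog
        self.changelog = []  # each entry is (message, index into manifest)

    def commit(self, message, content):
        # Writer order: filelog first, manifest second, changelog last.
        self.filelog.append(content)
        self.manifest.append(len(self.filelog) - 1)
        self.changelog.append((message, len(self.manifest) - 1))

    def read_tip(self):
        # Reader order: changelog first, then follow the pointers downward.
        message, manifest_rev = self.changelog[-1]
        file_rev = self.manifest[manifest_rev]
        return message, self.filelog[file_rev]
```

If a write is interrupted after the filelog and manifest steps, the changelog has no entry pointing at the new data, so `read_tip` keeps returning the previous, complete revision.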
belaran@964 | 591 |
belaran@964 | 592 </sect2> |
belaran@964 | 593 <sect2> |
belaran@964 | 594 <title>Concurrent access</title> |
belaran@964 | 595 |
belaran@964 | 596 <para>The read/write ordering and atomicity guarantees mean that Mercurial |
belaran@964 | 597 never needs to <emphasis>lock</emphasis> a repository when it's reading data, even |
belaran@964 | 598 if the repository is being written to while the read is occurring. |
belaran@964 | 599 This has a big effect on scalability; you can have an arbitrary number |
belaran@964 | 600 of Mercurial processes safely reading data from a repository |
belaran@964 | 601 all at once, no matter whether it's being written to or not. |
belaran@964 | 602 </para> |
belaran@964 | 603 |
belaran@964 | 604 <para>The lockless nature of reading means that if you're sharing a |
belaran@964 | 605 repository on a multi-user system, you don't need to grant other local |
belaran@964 | 606 users permission to <emphasis>write</emphasis> to your repository in order for them |
belaran@964 | 607 to be able to clone it or pull changes from it; they only need |
belaran@964 | 608 <emphasis>read</emphasis> permission. (This is <emphasis>not</emphasis> a common feature among |
belaran@964 | 609 revision control systems, so don't take it for granted! Most require |
belaran@964 | 610 readers to be able to lock a repository to access it safely, and this |
belaran@964 | 611 requires write permission on at least one directory, which of course |
belaran@964 | 612 makes for all kinds of nasty and annoying security and administrative |
belaran@964 | 613 problems.) |
belaran@964 | 614 </para> |
belaran@964 | 615 |
belaran@964 | 616 <para>Mercurial uses locks to ensure that only one process can write to a |
belaran@964 | 617 repository at a time (the locking mechanism is safe even over |
belaran@964 | 618 filesystems that are notoriously hostile to locking, such as NFS). If |
belaran@964 | 619 a repository is locked, a writer will wait and retry in case the |
belaran@964 | 620 repository becomes unlocked, but if the repository remains locked for |
belaran@964 | 621 too long, the process attempting to write will give up and time out. |
belaran@964 | 622 This means that your daily automated scripts won't get stuck forever |
belaran@964 | 623 and pile up if a system crashes unnoticed, for example. (Yes, the |
belaran@964 | 624 timeout is configurable, from zero to infinity.) |
belaran@964 | 625 </para> |
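The wait, retry, and time out behaviour of a writer can be sketched as follows. This is a generic lock-file pattern written for illustration, not Mercurial's own locking code, and it makes no claim to be safe over NFS; the names `acquire_lock`, `release_lock` and `LockTimeout` are hypothetical.

```python
import os
import time


class LockTimeout(Exception):
    """Raised when the lock stays held for longer than the timeout."""


def acquire_lock(path, timeout=5.0, poll=0.1):
    """Create a lock file, retrying until timeout seconds have passed.

    timeout=0 means a single attempt; timeout=None means wait forever.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if the lock file already exists.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            if deadline is not None and time.monotonic() >= deadline:
                raise LockTimeout(path)
            time.sleep(poll)   # someone else holds the lock, try again later
        else:
            os.close(fd)
            return


def release_lock(path):
    """Remove the lock file so that the next writer can proceed."""
    os.unlink(path)
```

A second writer calling `acquire_lock` while the file exists keeps polling and eventually raises `LockTimeout` instead of hanging forever.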
belaran@964 | 626 |
belaran@964 | 627 <sect3> |
belaran@964 | 628 <title>Safe dirstate access</title> |
belaran@964 | 629 |
belaran@964 | 630 <para>As with revision data, Mercurial doesn't take a lock to read the |
belaran@964 | 631 dirstate file; it does acquire a lock to write it. To avoid the |
belaran@964 | 632 possibility of reading a partially written copy of the dirstate file, |
belaran@964 | 633 Mercurial writes to a file with a unique name in the same directory as |
belaran@964 | 634 the dirstate file, then renames the temporary file atomically to |
belaran@964 | 635 <filename>dirstate</filename>. The file named <filename>dirstate</filename> is thus |
belaran@964 | 636 guaranteed to be complete, not partially written. |
belaran@964 | 637 </para> |
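The write-then-rename trick looks roughly like this in Python. It is a minimal sketch of the general technique using the standard library; the function name and the temporary file prefix are assumptions made for the example.

```python
import os
import tempfile


def write_atomically(path, data: bytes):
    """Write data to a uniquely named temporary file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory as the target, so the
    # rename never has to cross a filesystem boundary.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.replace(tmp, path)   # atomic replacement of the old file
    except BaseException:
        # Clean up the temporary file if anything failed before the rename.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

A reader that opens `path` at any moment sees either the complete old contents or the complete new contents, never a half-written file.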
belaran@964 | 638 |
belaran@964 | 639 </sect3> |
belaran@964 | 640 </sect2> |
belaran@964 | 641 <sect2> |
belaran@964 | 642 <title>Avoiding seeks</title> |
belaran@964 | 643 |
belaran@964 | 644 <para>Critical to Mercurial's performance is the avoidance of seeks of the |
belaran@964 | 645 disk head, since any seek is far more expensive than even a |
belaran@964 | 646 comparatively large read operation. |
belaran@964 | 647 </para> |
belaran@964 | 648 |
belaran@964 | 649 <para>This is why, for example, the dirstate is stored in a single file. If |
belaran@964 | 650 there were a dirstate file per directory that Mercurial tracked, the |
belaran@964 | 651 disk would seek once per directory. Instead, Mercurial reads the |
belaran@964 | 652 entire single dirstate file in one step. |
belaran@964 | 653 </para> |
belaran@964 | 654 |
belaran@964 | 655 <para>Mercurial also uses a <quote>copy on write</quote> scheme when cloning a |
belaran@964 | 656 repository on local storage. Instead of copying every revlog file |
belaran@964 | 657 from the old repository into the new repository, it makes a <quote>hard |
belaran@964 | 658 link</quote>, which is a shorthand way to say <quote>these two names point to the |
belaran@964 | 659 same file</quote>. When Mercurial is about to write to one of a revlog's |
belaran@964 | 660 files, it checks to see if the number of names pointing at the file is |
belaran@964 | 661 greater than one. If it is, more than one repository is using the |
belaran@964 | 662 file, so Mercurial makes a new copy of the file that is private to |
belaran@964 | 663 this repository. |
belaran@964 | 664 </para> |
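The link-count check described above can be sketched with Python's `os.stat`. This is an illustration of the idea only; the function names and the `.copy` temporary suffix are hypothetical, and the real implementation may differ in detail.

```python
import os
import shutil


def break_link_before_write(path):
    """Give this repository a private copy if the file has several names."""
    if os.stat(path).st_nlink > 1:
        tmp = path + ".copy"       # hypothetical temporary name
        shutil.copy2(path, tmp)    # full private copy of the shared file
        os.replace(tmp, path)      # our name now points at the copy


def append_private(path, data: bytes):
    """Append to a file without disturbing other hard links to it."""
    break_link_before_write(path)
    with open(path, "ab") as f:
        f.write(data)
```

After `append_private` runs on one of two hard-linked names, the other name still refers to the original, unchanged file.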
belaran@964 | 665 |
belaran@964 | 666 <para>A few revision control developers have pointed out that this idea of |
belaran@964 | 667 making a complete private copy of a file is not very efficient in its |
belaran@964 | 668 use of storage. While this is true, storage is cheap, and this method |
belaran@964 | 669 gives the highest performance while deferring most book-keeping to the |
belaran@964 | 670 operating system. An alternative scheme would most likely reduce |
belaran@964 | 671 performance and increase the complexity of the software, each of which |
belaran@964 | 672 is much more important to the <quote>feel</quote> of day-to-day use. |
belaran@964 | 673 </para> |
belaran@964 | 674 |
belaran@964 | 675 </sect2> |
belaran@964 | 676 <sect2> |
belaran@964 | 677 <title>Other contents of the dirstate</title> |
belaran@964 | 678 |
belaran@964 | 679 <para>Because Mercurial doesn't force you to tell it when you're modifying a |
belaran@964 | 680 file, it uses the dirstate to store some extra information so it can |
belaran@964 | 681 determine efficiently whether you have modified a file. For each file |
belaran@964 | 682 in the working directory, it stores the time at which Mercurial itself |
belaran@964 | 683 last modified the file, and the size of the file at that time. |
belaran@964 | 684 </para> |
belaran@964 | 685 |
belaran@964 | 686 <para>When you explicitly <command role="hg-cmd">hg add</command>, <command role="hg-cmd">hg remove</command>, <command role="hg-cmd">hg rename</command> or |
belaran@964 | 687 <command role="hg-cmd">hg copy</command> files, Mercurial updates the dirstate so that it knows |
belaran@964 | 688 what to do with those files when you commit. |
belaran@964 | 689 </para> |
belaran@964 | 690 |
belaran@964 | 691 <para>When Mercurial is checking the states of files in the working |
belaran@964 | 692 directory, it first compares a file's modification time with the time |
belaran@964 | 693 recorded in the dirstate. If they match, the file is unmodified. If the file's size |
belaran@964 | 694 has changed, the file must have been modified. If the modification |
belaran@964 | 695 time has changed, but the size has not, only then does Mercurial need |
belaran@964 | 696 to read the actual contents of the file to see if they've changed. |
belaran@964 | 697 Storing these few extra pieces of information dramatically reduces the |
belaran@964 | 698 amount of data that Mercurial needs to read, which yields large |
belaran@964 | 699 performance improvements compared to other revision control systems. |
belaran@964 | 700 </para> |
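The three-step check (modification time, then size, then contents) can be written down as a small Python function. This is a sketch under stated assumptions: the dictionary layout of a dirstate entry and the `committed` argument, which stands in for the contents Mercurial would fetch from its history, are invented for the example.

```python
import os


def snapshot(path):
    """Record what this sketch keeps per file: mtime and size."""
    st = os.stat(path)
    return {"mtime": st.st_mtime_ns, "size": st.st_size}


def is_modified(path, entry, committed: bytes) -> bool:
    """Decide whether a file changed, reading it only as a last resort."""
    st = os.stat(path)
    if st.st_mtime_ns == entry["mtime"]:
        return False                  # same timestamp: treat as unmodified
    if st.st_size != entry["size"]:
        return True                   # different size: certainly modified
    with open(path, "rb") as f:       # same size, new timestamp: compare bytes
        return f.read() != committed
```

Only the last branch touches the file's contents, so in the common case the check costs a single `stat` call per file.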
belaran@964 | 701 |
belaran@964 | 702 </sect2> |
belaran@964 | 703 </sect1> |
belaran@964 | 704 </chapter> |
belaran@964 | 705 |
belaran@964 | 706 <!-- |
belaran@964 | 707 local variables: |
belaran@964 | 708 sgml-parent-document: ("00book.xml" "book" "chapter") |
belaran@964 | 709 end: |
belaran@964 | 710 --> |