Behind the scenes

Unlike many revision control systems, the concepts upon which Mercurial is built are simple enough that it's easy to understand how the software really works. Knowing this certainly isn't necessary, but I find it useful to have a mental model of what's going on.

This understanding gives me confidence that Mercurial has been carefully designed to be both safe and efficient. And just as importantly, if it's easy for me to retain a good idea of what the software is doing when I perform a revision control task, I'm less likely to be surprised by its behaviour.

In this chapter, we'll initially cover the core concepts behind Mercurial's design, then continue to discuss some of the interesting details of its implementation.

Mercurial's historical record

Tracking the history of a single file

When Mercurial tracks modifications to a file, it stores the history of that file in a metadata object called a filelog. Each entry in the filelog contains enough information to reconstruct one revision of the file that is being tracked. Filelogs are stored as files in the .hg/store/data directory. A filelog contains two kinds of information: revision data, and an index to help Mercurial to find a revision efficiently.

A file that is large, or has a lot of history, has its filelog stored in separate data (.d suffix) and index (.i suffix) files. For small files without much history, the revision data and index are combined in a single .i file.
The correspondence between a file in the working directory and the filelog that tracks its history in the repository is illustrated in the figure below.

[Figure: Relationships between files in working directory and filelogs in repository]

Managing tracked files

Mercurial uses a structure called a manifest to collect together information about the files that it tracks. Each entry in the manifest contains information about the files present in a single changeset. An entry records which files are present in the changeset, the revision of each file, and a few other pieces of file metadata.

Recording changeset information

The changelog contains information about each changeset. Each revision records who committed a change, the changeset comment, other pieces of changeset-related information, and the revision of the manifest to use.

Relationships between revisions

Within a changelog, a manifest, or a filelog, each revision stores a pointer to its immediate parent (or to its two parents, if it's a merge revision). As I mentioned above, there are also relationships between revisions across these structures, and they are hierarchical in nature.

For every changeset in a repository, there is exactly one revision stored in the changelog. Each revision of the changelog contains a pointer to a single revision of the manifest.
A revision of the manifest stores a pointer to a single revision of each filelog tracked when that changeset was created. These relationships are illustrated in the figure below.

[Figure: Metadata relationships]

As the illustration shows, there is not a "one to one" relationship between revisions in the changelog, manifest, or filelog. If the manifest hasn't changed between two changesets, the changelog entries for those changesets will point to the same revision of the manifest. If a file that Mercurial tracks hasn't changed between two changesets, the entry for that file in the two revisions of the manifest will point to the same revision of its filelog.

Safe, efficient storage

The underpinnings of changelogs, manifests, and filelogs are provided by a single structure called the revlog.

Efficient storage

The revlog provides efficient storage of revisions using a delta mechanism. Instead of storing a complete copy of a file for each revision, it stores the changes needed to transform an older revision into the new revision. For many kinds of file data, these deltas are typically a fraction of a percent of the size of a full copy of a file.

Some obsolete revision control systems can only work with deltas of text files. They must either store binary files as complete snapshots or encode them into a text representation, both of which are wasteful approaches.
Mercurial can efficiently handle deltas of files with arbitrary binary contents; it doesn't need to treat text as special.

Safe operation

Mercurial only ever appends data to the end of a revlog file. It never modifies a section of a file after it has written it. This is both more robust and more efficient than schemes that need to modify or rewrite data.

In addition, Mercurial treats every write as part of a transaction that can span a number of files. A transaction is atomic: either the entire transaction succeeds and its effects are all visible to readers in one go, or the whole thing is undone. This guarantee of atomicity means that if you're running two copies of Mercurial, where one is reading data and one is writing it, the reader will never see a partially written result that might confuse it.

The fact that Mercurial only appends to files makes it easier to provide this transactional guarantee. The easier it is to do stuff like this, the more confident you should be that it's done correctly.

Fast retrieval

Mercurial cleverly avoids a pitfall common to all earlier revision control systems: the problem of inefficient retrieval. Most revision control systems store the contents of a revision as an incremental series of modifications against a snapshot. To reconstruct a specific revision, you must first read the snapshot, and then every one of the revisions between the snapshot and your target revision.
The more history that a file accumulates, the more revisions you must read, hence the longer it takes to reconstruct a particular revision.

[Figure: Snapshot of a revlog, with incremental deltas]

The innovation that Mercurial applies to this problem is simple but effective. Once the cumulative amount of delta information stored since the last snapshot exceeds a fixed threshold, it stores a new snapshot (compressed, of course), instead of another delta. This makes it possible to reconstruct any revision of a file quickly. This approach works so well that it has since been copied by several other revision control systems.

The figure above illustrates the idea. In an entry in a revlog's index file, Mercurial stores the range of entries from the data file that it must read to reconstruct a particular revision.

Aside: the influence of video compression

If you're familiar with video compression or have ever watched a TV feed through a digital cable or satellite service, you may know that most video compression schemes store each frame of video as a delta against its predecessor frame. In addition, these schemes use lossy compression techniques to increase the compression ratio, so visual errors accumulate over the course of a number of inter-frame deltas.

Because it's possible for a video stream to drop out occasionally due to signal glitches, and to limit the accumulation of artefacts introduced by the lossy compression process, video encoders periodically insert a complete frame (called a key frame) into the video stream; the next delta is generated against that frame. This means that if the video signal gets interrupted, it will resume once the next key frame is received. Also, the accumulation of encoding errors restarts anew with each key frame.

Identification and strong integrity

Along with delta or snapshot information, a revlog entry contains a cryptographic hash of the data that it represents. This makes it difficult to forge the contents of a revision, and easy to detect accidental corruption.

Hashes provide more than a mere check against corruption; they are used as the identifiers for revisions. The changeset identification hashes that you see as an end user are from revisions of the changelog. Although filelogs and the manifest also use hashes, Mercurial only uses these behind the scenes.

Mercurial verifies that hashes are correct when it retrieves file revisions and when it pulls changes from another repository. If it encounters an integrity problem, it will complain and stop whatever it's doing.

In addition to the effect it has on retrieval efficiency, Mercurial's use of periodic snapshots makes it more robust against partial data corruption.
If a revlog becomes partly corrupted due to a hardware error or system bug, it's often possible to reconstruct some or most revisions from the uncorrupted sections of the revlog, both before and after the corrupted section. This would not be possible with a delta-only storage model.

Revision history, branching, and merging

Every entry in a Mercurial revlog knows the identity of its immediate ancestor revision, usually referred to as its parent. In fact, a revision contains room for not one parent, but two. Mercurial uses a special hash, called the null ID, to represent the idea "there is no parent here". This hash is simply a string of zeroes.

In the figure below, you can see an example of the conceptual structure of a revlog. Filelogs, manifests, and changelogs all have this same structure; they differ only in the kind of data stored in each delta or snapshot.

The first revision in a revlog (at the bottom of the image) has the null ID in both of its parent slots. For a normal revision, its first parent slot contains the ID of its parent revision, and its second contains the null ID, indicating that the revision has only one real parent. Any two revisions that have the same parent ID are branches. A revision that represents a merge between branches has two normal revision IDs in its parent slots.
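
To make the two parent slots concrete, here is a minimal Python sketch. It is not Mercurial's code: the Revision class, the kind and branch_points helpers, and the made-up IDs are all hypothetical, and I'm assuming the null ID is written as 40 zero hex digits (the text only says it is a string of zeroes).

```python
from dataclasses import dataclass

# Assumption: IDs are rendered as 40 hex digits, so the null ID is 40 zeroes.
NULL_ID = "0" * 40


@dataclass(frozen=True)
class Revision:
    """Hypothetical model of a revlog entry's identity and parent slots."""
    node: str
    p1: str = NULL_ID
    p2: str = NULL_ID


def kind(rev):
    """Classify a revision by the number of real (non-null) parents."""
    real_parents = sum(p != NULL_ID for p in (rev.p1, rev.p2))
    return ("root", "normal", "merge")[real_parents]


def branch_points(revs):
    """Map each parent ID shared by several revisions to those revisions."""
    children = {}
    for rev in revs:
        for parent in (rev.p1, rev.p2):
            if parent != NULL_ID:
                children.setdefault(parent, []).append(rev.node)
    return {p: kids for p, kids in children.items() if len(kids) > 1}


# Made-up IDs: a root, two branches off it, and a merge of the branches.
A, B, C, D = ("a" * 40, "b" * 40, "c" * 40, "d" * 40)
history = [
    Revision(A),
    Revision(B, p1=A),
    Revision(C, p1=A),
    Revision(D, p1=B, p2=C),
]
```

Here B and C share the parent A, so they are branches, and D, with two real parents, is the merge between them.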

[Figure: conceptual structure of a revlog]

The working directory

In the working directory, Mercurial stores a snapshot of the files from the repository as of a particular changeset.

The working directory knows which changeset it contains. When you update the working directory to contain a particular changeset, Mercurial looks up the appropriate revision of the manifest to find out which files it was tracking at the time that changeset was committed, and which revision of each file was then current. It then recreates a copy of each of those files, with the same contents it had when the changeset was committed.

The dirstate contains Mercurial's knowledge of the working directory. This details which changeset the working directory is updated to, and all of the files that Mercurial is tracking in the working directory.

Just as a revision of a revlog has room for two parents, so that it can represent either a normal revision (with one parent) or a merge of two earlier revisions, the dirstate has slots for two parents. When you use the hg update command, the changeset that you update to is stored in the first parent slot, and the null ID in the second. When you hg merge with another changeset, the first parent remains unchanged, and the second parent is filled in with the changeset you're merging with. The hg parents command tells you what the parents of the dirstate are.
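
The way the two dirstate parent slots change can be sketched in a few lines of Python. This is only an illustration of the behaviour described above: the DirstateParents class and its methods are hypothetical names, not Mercurial's API, and the 40-zero null ID is an assumption.

```python
NULL_ID = "0" * 40  # assumption: the null ID written as 40 zero hex digits


class DirstateParents:
    """Hypothetical stand-in for the two parent slots of the dirstate."""

    def __init__(self):
        self.p1 = NULL_ID
        self.p2 = NULL_ID

    def update(self, changeset):
        # Mirrors `hg update`: the target goes in the first slot,
        # and the second slot is reset to the null ID.
        self.p1, self.p2 = changeset, NULL_ID

    def merge(self, other):
        # Mirrors `hg merge`: the first parent is left alone,
        # and the second slot is filled in.
        self.p2 = other

    def parents(self):
        # Mirrors what `hg parents` reports: only the real parents.
        return [p for p in (self.p1, self.p2) if p != NULL_ID]
```

After update("a" * 40), parents() lists one ID; a following merge("b" * 40) makes it list two, with the first unchanged.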

What happens when you commit

The dirstate stores parent information for more than just book-keeping purposes. Mercurial uses the parents of the dirstate as the parents of a new changeset when you perform a commit.

[Figure: The working directory can have two parents]

The figure above shows the normal state of the working directory, where it has a single changeset as parent. That changeset is the tip, the newest changeset in the repository that has no children.

[Figure: The working directory gains new parents after a commit]

It's useful to think of the working directory as "the changeset I'm about to commit". Any files that you tell Mercurial that you've added, removed, renamed, or copied will be reflected in that changeset, as will modifications to any files that Mercurial is already tracking; the new changeset will have the parents of the working directory as its parents.

After a commit, Mercurial will update the parents of the working directory, so that the first parent is the ID of the new changeset, and the second is the null ID. This is shown in the figure "The working directory gains new parents after a commit". Mercurial doesn't touch any of the files in the working directory when you commit; it just modifies the dirstate to note its new parents.
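
As a rough sketch of the parent book-keeping at commit time, consider the Python below. The commit function, the dictionary standing in for the changelog, and the short IDs c1, c2, c3 and c9 are all hypothetical; only the movement of the parents follows the description above.

```python
NULL_ID = "0" * 40  # assumption: the null ID written as 40 zero hex digits


def commit(changelog, dirstate_parents, new_id):
    """Record new_id with the dirstate's parents as its parents.

    Returns the dirstate's parents after the commit: the new changeset
    in the first slot and the null ID in the second. No working
    directory files are involved at all.
    """
    changelog[new_id] = dirstate_parents
    return (new_id, NULL_ID)


changelog = {}                      # changeset ID -> (first parent, second parent)
parents = (NULL_ID, NULL_ID)        # a fresh, empty repository
parents = commit(changelog, parents, "c1")
parents = commit(changelog, parents, "c2")

# Pretend a merge filled the second slot with some other changeset, c9.
parents = commit(changelog, ("c2", "c9"), "c3")
```

At the end, c2 has c1 as its only real parent, c3 has both c2 and c9, and the dirstate's parents are c3 and the null ID.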

Creating a new head

It's perfectly normal to update the working directory to a changeset other than the current tip. For example, you might want to know what your project looked like last Tuesday, or you could be looking through changesets to see which one introduced a bug. In cases like this, the natural thing to do is update the working directory to the changeset you're interested in, and then examine the files in the working directory directly to see their contents as they were when you committed that changeset. The effect of this is shown in the figure below.

[Figure: The working directory, updated to an older changeset]

Having updated the working directory to an older changeset, what happens if you make some changes, and then commit? Mercurial behaves in the same way as I outlined above. The parents of the working directory become the parents of the new changeset. This new changeset has no children, so it becomes the new tip. And the repository now contains two changesets that have no children; we call these heads. You can see the structure that this creates in the figure below.

[Figure: After a commit made while synced to an older changeset]

If you're new to Mercurial, you should keep in mind a common "error", which is to use the hg pull command without any options.
By default, the hg pull command does not update the working directory, so you'll bring new changesets into your repository, but the working directory will stay synced at the same changeset as before the pull. If you make some changes and commit afterwards, you'll thus create a new head, because your working directory isn't synced to whatever the current tip is.

I put the word "error" in quotes because all that you need to do to rectify this situation is hg merge, then hg commit. In other words, this almost never has negative consequences; it just surprises people. I'll discuss other ways to avoid this behaviour, and why Mercurial behaves in this initially surprising way, later on.

Merging heads

When you run the hg merge command, Mercurial leaves the first parent of the working directory unchanged, and sets the second parent to the changeset you're merging with, as shown in the figure below.

[Figure: Merging two heads]

Mercurial also has to modify the working directory, to merge the files managed in the two changesets. Simplified a little, the merging process goes like this, for every file in the manifests of both changesets.

- If neither changeset has modified a file, do nothing with that file.
- If one changeset has modified a file, and the other hasn't, create the modified copy of the file in the working directory.
- If one changeset has removed a file, and the other hasn't (or has also deleted it), delete the file from the working directory.
- If one changeset has removed a file, but the other has modified the file, ask the user what to do: keep the modified file, or remove it?
- If both changesets have modified a file, invoke an external merge program to choose the new contents for the merged file. This may require input from the user.
- If one changeset has modified a file, and the other has renamed or copied the file, make sure that the changes follow the new name of the file.

There are more details (merging has plenty of corner cases), but these are the most common choices that are involved in a merge. As you can see, most cases are completely automatic, and indeed most merges finish automatically, without requiring your input to resolve any conflicts.

When you're thinking about what happens when you commit after a merge, once again the working directory is "the changeset I'm about to commit". After the hg merge command completes, the working directory has two parents; these will become the parents of the new changeset.

Mercurial lets you perform multiple merges, but you must commit the results of each individual merge as you go. This is necessary because Mercurial only tracks two parents for both revisions and the working directory.
While it would be technically possible to merge multiple changesets at once, the prospect of user confusion and making a terrible mess of a merge immediately becomes overwhelming.

Other interesting design features

In the sections above, I've tried to highlight some of the most important aspects of Mercurial's design, to illustrate that it pays careful attention to reliability and performance. However, the attention to detail doesn't stop there. There are a number of other aspects of Mercurial's construction that I personally find interesting. I'll detail a few of them here, separate from the big-ticket items above, so that if you're interested, you can gain a better idea of the amount of thinking that goes into a well-designed system.

Clever compression

When appropriate, Mercurial will store both snapshots and deltas in compressed form. It does this by always trying to compress a snapshot or delta, but only storing the compressed version if it's smaller than the uncompressed version.

This means that Mercurial does the right thing when storing a file whose native form is compressed, such as a zip archive or a JPEG image. When these types of files are compressed a second time, the resulting file is usually bigger than the once-compressed form, and so Mercurial will store the plain zip or JPEG.

Deltas between revisions of a compressed file are usually larger than snapshots of the file, and Mercurial again does the right thing in these cases.
It finds that such a delta exceeds the threshold at which it should store a complete snapshot of the file, so it stores the snapshot, again saving space compared to a naive delta-only approach.

Network recompression

When storing revisions on disk, Mercurial uses the deflate compression algorithm (the same one used by the popular zip archive format), which balances good speed with a respectable compression ratio. However, when transmitting revision data over a network connection, Mercurial uncompresses the compressed revision data.

If the connection is over HTTP, Mercurial recompresses the entire stream of data using a compression algorithm that gives a better compression ratio (the Burrows-Wheeler algorithm from the widely used bzip2 compression package). This combination of algorithm and compression of the entire stream (instead of a revision at a time) substantially reduces the number of bytes to be transferred, yielding better network performance over almost all kinds of network.

(If the connection is over ssh, Mercurial doesn't recompress the stream, because ssh can already do this itself.)

Read/write ordering and atomicity

Appending to files isn't the whole story when it comes to guaranteeing that a reader won't see a partial write. If you recall the "Metadata relationships" figure, revisions in the changelog point to revisions in the manifest, and revisions in the manifest point to revisions in filelogs. This hierarchy is deliberate.
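
The pointer hierarchy recalled above can be sketched in Python. This is a toy model under loud assumptions: plain lists and dictionaries stand in for the real revlogs, integer positions stand in for hashes, and the file names, users, and the file_at helper are all invented for illustration.

```python
# Toy stand-ins: each structure is a list, and a "revision" is an index.
filelogs = {
    "README": ["hello\n", "hello, world\n"],
    "main.c": ["int main(void) { return 0; }\n"],
}

# Each manifest revision maps a file name to a revision of its filelog.
manifest = [
    {"README": 0, "main.c": 0},
    {"README": 1, "main.c": 0},  # main.c unchanged: same filelog revision
]

# Each changelog revision points at exactly one manifest revision.
changelog = [
    {"user": "alice", "comment": "initial import", "manifest": 0},
    {"user": "bob", "comment": "expand README", "manifest": 1},
]


def file_at(changeset_rev, path):
    """Follow changelog -> manifest -> filelog, top of the hierarchy first."""
    manifest_rev = changelog[changeset_rev]["manifest"]
    filelog_rev = manifest[manifest_rev][path]
    return filelogs[path][filelog_rev]
```

A lookup always starts at the changelog and works downwards, so nothing below the changelog is reachable until a changelog entry points at it.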

A writer starts a transaction by writing filelog and manifest data, and doesn't write any changelog data until those are finished. A reader starts by reading changelog data, then manifest data, followed by filelog data.

Since the writer has always finished writing filelog and manifest data before it writes to the changelog, a reader will never read a pointer to a partially written manifest revision from the changelog, and it will never read a pointer to a partially written filelog revision from the manifest.

Concurrent access

The read/write ordering and atomicity guarantees mean that Mercurial never needs to lock a repository when it's reading data, even if the repository is being written to while the read is occurring. This has a big effect on scalability; you can have an arbitrary number of Mercurial processes safely reading data from a repository all at once, no matter whether it's being written to or not.

The lockless nature of reading means that if you're sharing a repository on a multi-user system, you don't need to grant other local users permission to write to your repository in order for them to be able to clone it or pull changes from it; they only need read permission. (This is not a common feature among revision control systems, so don't take it for granted!
Most require readers to be able to lock a repository to access it safely, and this requires write permission on at least one directory, which of course makes for all kinds of nasty and annoying security and administrative problems.)

Mercurial uses locks to ensure that only one process can write to a repository at a time (the locking mechanism is safe even over filesystems that are notoriously hostile to locking, such as NFS). If a repository is locked, a writer will wait and retry for a while in case the repository becomes unlocked, but if the repository remains locked for too long, the process attempting to write will time out. This means that your daily automated scripts won't get stuck forever and pile up if a system crashes unnoticed, for example. (Yes, the timeout is configurable, from zero to infinity.)

Safe dirstate access

As with revision data, Mercurial doesn't take a lock to read the dirstate file; it does acquire a lock to write it. To avoid the possibility of reading a partially written copy of the dirstate file, Mercurial writes to a file with a unique name in the same directory as the dirstate file, then renames the temporary file atomically to dirstate. The file named dirstate is thus guaranteed to be complete, not partially written.

Avoiding seeks

Critical to Mercurial's performance is the avoidance of seeks of the disk head, since any seek is far more expensive than even a comparatively large read operation.

This is why, for example, the dirstate is stored in a single file. If there were a dirstate file per directory that Mercurial tracked, the disk would seek once per directory. Instead, Mercurial reads the entire single dirstate file in one step.

Mercurial also uses a "copy on write" scheme when cloning a repository on local storage. Instead of copying every revlog file from the old repository into the new repository, it makes a hard link, which is a shorthand way to say "these two names point to the same file". When Mercurial is about to write to one of a revlog's files, it checks to see if the number of names pointing at the file is greater than one. If it is, more than one repository is using the file, so Mercurial makes a new copy of the file that is private to this repository.

A few revision control developers have pointed out that this idea of making a complete private copy of a file is not very efficient in its use of storage. While this is true, storage is cheap, and this method gives the highest performance while deferring most book-keeping to the operating system. An alternative scheme would most likely reduce performance and increase the complexity of the software, each of which is much more important to the feel of day-to-day use.

Other contents of the dirstate

Because Mercurial doesn't force you to tell it when you're modifying a file, it uses the dirstate to store some extra information so it can determine efficiently whether you have modified a file.
For each file in the working directory, it stores the time that it last modified the file itself, and the size of the file at that time.

When you explicitly hg add, hg remove, hg rename or hg copy files, Mercurial updates the dirstate so that it knows what to do with those files when you commit.

When Mercurial is checking the states of files in the working directory, it first checks a file's modification time. If that has not changed, the file must not have been modified. If the file's size has changed, the file must have been modified. If the modification time has changed, but the size has not, only then does Mercurial need to read the actual contents of the file to see if they've changed. Storing these few extra pieces of information dramatically reduces the amount of data that Mercurial needs to read, which yields large performance improvements compared to other revision control systems.
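
The order of those checks can be sketched as follows. This is a simplified illustration and not Mercurial's implementation: snapshot and is_modified are hypothetical helpers, and the contents_differ callback stands in for whatever comparison against the stored revision would really be done.

```python
import os


def snapshot(path):
    """Return the (modification time, size) pair to remember for a file."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def is_modified(recorded, current, contents_differ):
    """Decide whether a file changed, reading its contents only if needed.

    recorded and current are (mtime, size) pairs; contents_differ is a
    callable that performs the expensive comparison of file contents.
    """
    recorded_mtime, recorded_size = recorded
    current_mtime, current_size = current
    if current_mtime == recorded_mtime:
        return False              # same modification time: treat as unmodified
    if current_size != recorded_size:
        return True               # size differs: certainly modified
    return contents_differ()      # new time, same size: must look inside
```

The point of the ordering is that the two cheap comparisons settle most files, so the expensive callback runs only in the one ambiguous case.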