\chapter{Introduction}
\label{chap:intro}

\section{About revision control}

Revision control is the process of managing multiple versions of a
piece of information.  In its simplest form, this is something that
many people do by hand: every time you modify a file, save it under a
new name that contains a number, each one higher than the number of
the preceding version.

Manually managing multiple versions of even a single file is an
error-prone task, though, so software tools to help automate this
process have long been available.  The earliest automated revision
control tools were intended to help a single user to manage revisions
of a single file.  Over the past few decades, the scope of revision
control tools has expanded greatly; they now manage multiple files,
and help multiple people to work together.  The best modern revision
control tools have no problem coping with thousands of people working
together on projects that consist of hundreds of thousands of files.

\subsection{Why use revision control?}

There are a number of reasons why you or your team might want to use
an automated revision control tool for a project.
\begin{itemize}
\item It will track the history and evolution of your project, so you
  don't have to.  For every change, you'll have a log of \emph{who}
  made it; \emph{why} they made it; \emph{when} they made it; and
  \emph{what} the change was.
\item When you're working with other people, revision control
  software makes it easier for you to collaborate.  For example, when
  people more or less simultaneously make potentially incompatible
  changes, the software will help you to identify and resolve those
  conflicts.
\item It can help you to recover from mistakes.  If you make a change
  that later turns out to be in error, you can revert to an earlier
  version of one or more files.  In fact, a \emph{really} good
  revision control tool will even help you to efficiently figure out
  exactly when a problem was introduced (see
  section~\ref{sec:undo:bisect} for details).
\item It will help you to work simultaneously on, and manage the
  drift between, multiple versions of your project.
\end{itemize}
Most of these reasons are equally valid---at least in theory---whether
you're working on a project by yourself, or with a hundred other
people.

A key question about the practicality of revision control at these
two different scales (``lone hacker'' and ``huge team'') is how its
\emph{benefits} compare to its \emph{costs}.  A revision control tool
that's difficult to understand or use is going to impose a high cost.

A five-hundred-person project is likely to collapse under its own
weight almost immediately without a revision control tool and
process.  In this case, the cost of using revision control might
hardly seem worth considering, since \emph{without} it, failure is
almost guaranteed.

On the other hand, a one-person ``quick hack'' might seem like a poor
place to use a revision control tool, because surely the cost of
using one must be close to the overall cost of the project.  Right?

Mercurial uniquely supports \emph{both} of these scales of
development.  You can learn the basics in just a few minutes, and due
to its low overhead, you can apply revision control to the smallest
of projects with ease.
Its simplicity means you won't have a lot of
abstruse concepts or command sequences competing for mental space
with whatever you're \emph{really} trying to do.  At the same time,
Mercurial's high performance and peer-to-peer nature let you scale
painlessly to handle large projects.

No revision control tool can rescue a poorly run project, but a good
choice of tools can make a huge difference to the fluidity with which
you can work on a project.

\subsection{The many names of revision control}

Revision control is a diverse field, so much so that it doesn't
actually have a single name or acronym.  Here are a few of the more
common names and acronyms you'll encounter:
\begin{itemize}
\item Revision control (RCS)
\item Software configuration management (SCM), or configuration
  management
\item Source code management
\item Source code control, or source control
\item Version control (VCS)
\end{itemize}
Some people claim that these terms actually have different meanings,
but in practice they overlap so much that there's no agreed or even
useful way to tease them apart.

\section{A short history of revision control}

The best known of the old-time revision control tools is SCCS (Source
Code Control System), which Marc Rochkind wrote at Bell Labs, in the
early 1970s.  SCCS operated on individual files, and required every
person working on a project to have access to a shared workspace on a
single system.  Only one person could modify a file at any time;
arbitration for access to files was via locks.  It was common for
people to lock files, and later forget to unlock them, preventing
anyone else from modifying those files without the help of an
administrator.
Walter Tichy developed a free alternative to SCCS in the early 1980s;
he called his program RCS (Revision Control System).  Like SCCS, RCS
required developers to work in a single shared workspace, and to lock
files to prevent multiple people from modifying them simultaneously.

Later in the 1980s, Dick Grune used RCS as a building block for a set
of shell scripts he initially called cmt, but then renamed to CVS
(Concurrent Versions System).  The big innovation of CVS was that it
let developers work simultaneously and somewhat independently in
their own personal workspaces.  The personal workspaces prevented
developers from stepping on each other's toes all the time, as was
common with SCCS and RCS.  Each developer had a copy of every project
file, and could modify their copies independently.  They had to merge
their edits prior to committing changes to the central repository.

Brian Berliner took Grune's original scripts and rewrote them in~C,
releasing in 1989 the code that has since developed into the modern
version of CVS.  CVS subsequently acquired the ability to operate
over a network connection, giving it a client/server architecture.
CVS's architecture is centralised; only the server has a copy of the
history of the project.  Client workspaces just contain copies of
recent versions of the project's files, and a little metadata to tell
them where the server is.  CVS has been enormously successful; it is
probably the world's most widely used revision control system.

In the early 1990s, Sun Microsystems developed an early distributed
revision control system, called TeamWare.  A TeamWare workspace
contains a complete copy of the project's history.  TeamWare has no
notion of a central repository.
(CVS relied upon RCS for its history storage; TeamWare used SCCS.)

As the 1990s progressed, awareness grew of a number of problems with
CVS.  It records simultaneous changes to multiple files individually,
instead of grouping them together as a single logically atomic
operation.  It does not manage its file hierarchy well; it is easy to
make a mess of a repository by renaming files and directories.
Worse, its source code is difficult to read and maintain, which made
the ``pain level'' of fixing these architectural problems
prohibitive.

In 2001, Jim Blandy and Karl Fogel, two developers who had worked on
CVS, started a project to replace it with a tool that would have a
better architecture and cleaner code.  The result, Subversion, does
not stray from CVS's centralised client/server model, but it adds
multi-file atomic commits, better namespace management, and a number
of other features that make it a generally better tool than CVS.
Since its initial release, it has rapidly grown in popularity.

More or less simultaneously, Graydon Hoare began working on an
ambitious distributed revision control system that he named Monotone.
While Monotone addresses many of CVS's design flaws and has a
peer-to-peer architecture, it goes beyond earlier (and subsequent)
revision control tools in a number of innovative ways.  It uses
cryptographic hashes as identifiers, and has an integral notion of
``trust'' for code from different sources.

Mercurial began life in 2005.  While a few aspects of its design are
influenced by Monotone, Mercurial focuses on ease of use, high
performance, and scalability to very large projects.
\section{Trends in revision control}

There has been an unmistakable trend in the development and use of
revision control tools over the past four decades, as people have
become familiar with the capabilities of their tools and constrained
by their limitations.

The first generation began by managing single files on individual
computers.  Although these tools represented a huge advance over
ad-hoc manual revision control, their locking model and reliance on a
single computer limited them to small, tightly-knit teams.

The second generation loosened these constraints by moving to
network-centered architectures, and managing entire projects at a
time.  As projects grew larger, they ran into new problems.  With
clients needing to talk to servers very frequently, server scaling
became an issue for large projects.  An unreliable network connection
could prevent remote users from being able to talk to the server at
all.  As open source projects started making read-only access
available anonymously to anyone, people without commit privileges
found that they could not use the tools to interact with a project in
a natural way, as they could not record their changes.

The current generation of revision control tools is peer-to-peer in
nature.  All of these systems have dropped the dependency on a single
central server, and allow people to distribute their revision control
data to where it's actually needed.  Collaboration over the Internet
has moved from being constrained by technology to being a matter of
choice and consensus.  Modern tools can operate offline indefinitely
and autonomously, with a network connection needed only when syncing
changes with another repository.
\section{A few of the advantages of distributed revision control}

Even though distributed revision control tools have for several years
been as robust and usable as their previous-generation counterparts,
people using older tools have not yet necessarily woken up to their
advantages.  There are a number of ways in which distributed tools
shine relative to centralised ones.

For an individual developer, distributed tools are almost always much
faster than centralised tools.  This is for a simple reason: a
centralised tool needs to talk over the network for many common
operations, because most metadata is stored in a single copy on the
central server.  A distributed tool stores all of its metadata
locally.  All else being equal, talking over the network adds
overhead to a centralised tool.  Don't underestimate the value of a
snappy, responsive tool: you're going to spend a lot of time
interacting with your revision control software.

Distributed tools are indifferent to the vagaries of your server
infrastructure, again because they replicate metadata to so many
locations.  If you use a centralised system and your server catches
fire, you'd better hope that your backup media are reliable, and that
your last backup was recent and actually worked.  With a distributed
tool, you have many backups available on every contributor's
computer.

The reliability of your network will affect distributed tools far
less than it will centralised tools.  You can't even use a
centralised tool without a network connection, except for a few
highly constrained commands.  With a distributed tool, if your
network connection goes down while you're working, you may not even
notice.
The only thing you won't be able to do is talk to repositories on
other computers, something that is relatively rare compared with
local operations.  If you have a far-flung team of collaborators,
this may be significant.

\subsection{Advantages for open source projects}

If you take a shine to an open source project and decide that you
would like to start hacking on it, and that project uses a
distributed revision control tool, you are at once a peer with the
people who consider themselves the ``core'' of that project.  If they
publish their repositories, you can immediately copy their project
history, start making changes, and record your work, using the same
tools in the same ways as insiders.  By contrast, with a centralised
tool, you must use the software in a ``read only'' mode unless
someone grants you permission to commit changes to their central
server.  Until then, you won't be able to record changes, and your
local modifications will be at risk of corruption any time you try to
update your client's view of the repository.

\subsubsection{The forking non-problem}

It has been suggested that distributed revision control tools pose
some sort of risk to open source projects because they make it easy
to ``fork'' the development of a project.  A fork happens when there
are differences in opinion or attitude between groups of developers
that cause them to decide that they can't work together any longer.
Each side takes a more or less complete copy of the project's source
code, and goes off in its own direction.

Sometimes the camps in a fork decide to reconcile their differences.
With a centralised revision control system, the \emph{technical}
process of reconciliation is painful, and has to be performed largely
by hand.  You have to decide whose revision history is going to
``win'', and graft the other team's changes into the tree somehow.
This usually loses some or all of one side's revision history.

With respect to forking, distributed tools make forking the
\emph{only} way to develop a project.  Every single change that you
make is potentially a fork point.  The great strength of this
approach is that a distributed revision control tool has to be really
good at \emph{merging} forks, because forks are absolutely
fundamental: they happen all the time.

If every piece of work that everybody does, all the time, is framed
in terms of forking and merging, then what the open source world
refers to as a ``fork'' becomes \emph{purely} a social issue.  If
anything, distributed tools \emph{lower} the likelihood of a fork:
\begin{itemize}
\item They eliminate the social distinction that centralised tools
  impose: that between insiders (people with commit access) and
  outsiders (people without).
\item They make it easier to reconcile after a social fork, because
  all that's involved from the perspective of the revision control
  software is just another merge.
\end{itemize}

Some people resist distributed tools because they want to retain
tight control over their projects, and they believe that centralised
tools give them this control.
However, if you're of this belief, and you
publish your CVS or Subversion repositories publicly, there are
plenty of tools available that can pull out your entire project's
history (albeit slowly) and recreate it somewhere that you don't
control.  So while your control in this case is illusory, you are
forgoing the ability to fluidly collaborate with whatever people feel
compelled to mirror and fork your history.

\subsection{Advantages for commercial projects}

Many commercial projects are undertaken by teams that are scattered
across the globe.  Contributors who are far from a central server
will see slower command execution and perhaps less reliability.
Commercial revision control systems attempt to ameliorate these
problems with remote-site replication add-ons that are typically
expensive to buy and cantankerous to administer.  A distributed
system doesn't suffer from these problems in the first place.  Better
yet, you can easily set up multiple authoritative servers, say one
per site, so that there's no redundant communication between
repositories over expensive long-haul network links.

Centralised revision control systems tend to have relatively low
scalability.  It's not unusual for an expensive centralised system to
fall over under the combined load of just a few dozen concurrent
users.  Once again, the typical response tends to be an expensive and
clunky replication facility.  Since the load on a central server---if
you have one at all---is orders of magnitude lower with a distributed
tool (because all of the data is replicated everywhere), a single
cheap server can handle the needs of a much larger team, and
replication to balance load becomes a simple matter of scripting.
If you have an employee in the field, troubleshooting a problem at a
customer's site, they'll benefit from distributed revision control.
The tool will let them generate custom builds, try different fixes in
isolation from each other, and search efficiently through history for
the sources of bugs and regressions in the customer's environment,
all without needing to connect to your company's network.

\section{Why choose Mercurial?}

Mercurial has a unique set of properties that make it a particularly
good choice as a revision control system.
\begin{itemize}
\item It is easy to learn and use.
\item It is lightweight.
\item It scales excellently.
\item It is easy to customise.
\end{itemize}

If you are at all familiar with revision control systems, you should
be able to get up and running with Mercurial in less than five
minutes.  Even if not, it will take no more than a few minutes
longer.  Mercurial's command and feature sets are generally uniform
and consistent, so you can keep track of a few general rules instead
of a host of exceptions.

On a small project, you can start working with Mercurial in moments.
Creating new changes and branches; transferring changes around
(whether locally or over a network); and history and status
operations are all fast.  Mercurial attempts to stay nimble and
largely out of your way by combining low cognitive overhead with
blazingly fast operations.

The usefulness of Mercurial is not limited to small projects: it is
used by projects with hundreds to thousands of contributors, whose
repositories contain tens of thousands of files and hundreds of
megabytes of source code.
If the core functionality of Mercurial is not enough for you, it's
easy to build on.  Mercurial is well suited to scripting tasks, and
its clean internals and implementation in Python make it easy to add
features in the form of extensions.  There are a number of popular
and useful extensions already available, ranging from helping to
identify bugs to improving performance.

\section{Mercurial compared with other tools}

Before you read on, please understand that this section necessarily
reflects my own experiences, interests, and (dare I say it) biases.
I have used every one of the revision control tools listed below, in
most cases for several years at a time.

\subsection{Subversion}

Subversion is a popular revision control tool, developed to replace
CVS.  It has a centralised client/server architecture.

Subversion and Mercurial have similarly named commands for performing
the same operations, so it is easy for a person who is familiar with
one to learn to use the other.  Both tools are portable to all
popular operating systems.

Mercurial has a substantial performance advantage over Subversion on
every revision control operation I have benchmarked.  I have measured
its advantage as ranging from a factor of two to a factor of six when
compared with Subversion~1.4.3's \emph{ra\_local} file store (which
is the fastest access method available).  In more realistic
deployments involving a network-based store, Subversion will be at a
substantially larger disadvantage.
Additionally, Subversion incurs a substantial storage overhead to
avoid network transactions for a few common operations, such as
finding modified files (\texttt{status}) and displaying modifications
against the current revision (\texttt{diff}).  A Subversion working
copy is, as a result, often approximately the same size as, or larger
than, a Mercurial repository and working directory, even though the
Mercurial repository contains a complete history of the project.

Subversion does not have a history-aware merge capability, forcing
its users to know exactly which revisions to merge between branches.
This shortcoming is scheduled to be addressed in version 1.5.

Subversion is currently more widely supported by
revision-control-aware third party tools than is Mercurial, although
this gap is closing.  Like Mercurial, Subversion has an excellent
user manual.

Several tools exist to accurately and completely import revision
history from a Subversion repository into a Mercurial repository,
making the transition from the older tool relatively painless.

\subsection{Git}

Git is a distributed revision control tool that was developed for
managing the Linux kernel source tree.  Like Mercurial, its early
design was somewhat influenced by Monotone.

Git has an overwhelming command set, with version~1.5.0
providing~139 individual commands.  It has a reputation for being
difficult to learn.  It does not have a user manual, only
documentation for individual commands.

In terms of performance, Git is extremely fast.  There are several
common cases in which it is faster than Mercurial, at least on Linux.
However, its performance on (and general support for) Windows is, at
the time of writing, far behind that of Mercurial.

While a Mercurial repository needs no maintenance, a Git repository
requires frequent manual ``repacks'' of its metadata.  Without these,
performance degrades, while space usage grows rapidly.  A server that
contains many Git repositories that are not rigorously and frequently
repacked will become heavily disk-bound during backups, and there
have been instances of daily backups taking far longer than~24 hours
as a result.  A freshly packed Git repository is slightly smaller
than a Mercurial repository, but an unpacked repository is several
orders of magnitude larger.

The core of Git is written in C.  Many Git commands are implemented
as shell or Perl scripts, and the quality of these scripts varies
widely.  I have encountered a number of instances where scripts
charged along blindly in the presence of errors that should have been
fatal.

\subsection{CVS}

CVS is probably the most widely used revision control tool in the
world.  Due to its age and internal untidiness, it has been
essentially unmaintained for many years.

It has a centralised client/server architecture.  It does not group
related file changes into atomic commits, making it easy for people
to ``break the build''.  It has a muddled and incoherent notion of
tags and branches that I will not attempt to even describe.  It does
not support renaming of files or directories well, making it easy to
corrupt a repository.  It has almost no internal consistency checking
capabilities, so it is usually not even possible to tell whether or
how a repository is corrupt.
I would not recommend CVS for any project, existing or new.

Mercurial can import CVS revision history.  However, there are a few
caveats that apply; these are true of every other revision control
tool's CVS importer, too.  Due to CVS's lack of atomic changes and
unversioned filesystem hierarchy, it is not possible to reconstruct
CVS history completely accurately; some guesswork is involved, and
renames will usually not show up.  Because a lot of advanced CVS
administration has to be done by hand and is hence error-prone, it's
common for CVS importers to run into multiple problems with corrupted
repositories (completely bogus revision timestamps and files that
have remained locked for over a decade are just two of the less
interesting problems I can recall from personal experience).

\subsection{Commercial tools}

Perforce has a centralised client/server architecture, with no
client-side caching of any data.  Unlike modern revision control
tools, Perforce requires that a user run a command to inform the
server about every file they intend to edit.

The performance of Perforce is quite good for small teams, but it
falls off rapidly as the number of users grows beyond a few dozen.
Modestly large Perforce installations require the deployment of
proxies to cope with the load their users generate.

%%% Local Variables:
%%% mode: latex
%%% TeX-master: "00book"
%%% End: