hgbook

view en/intro.tex @ 232:2469608b4a08

Start writing up the patchbomb extension.
author Bryan O'Sullivan <bos@serpentine.com>
date Sun May 27 09:40:12 2007 -0700 (2007-05-27)
parents 0ca9045035f7
children 649a93bb45ae
1 \chapter{Introduction}
2 \label{chap:intro}
4 \section{About revision control}
6 Revision control is the process of managing multiple versions of a
7 piece of information. In its simplest form, this is something that
8 many people do by hand: every time you modify a file, save it under a
9 new name that contains a number, each one higher than the number of
10 the preceding version.
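As a rough illustration of the by-hand scheme described above, the
following Python sketch copies a file to a new name that carries the
next unused number. The function name \texttt{save\_numbered\_copy}
and the \texttt{name.N.ext} naming pattern are assumptions made purely
for this example; no particular tool prescribes them.

\begin{verbatim}
import re
import shutil
from pathlib import Path


def save_numbered_copy(path):
    """Copy `path` to `<stem>.<N><suffix>`, where N is one higher than
    the highest number already used for this file (hypothetical scheme)."""
    src = Path(path)
    # Matches earlier copies such as "report.3.txt" for "report.txt".
    pattern = re.compile(
        re.escape(src.stem) + r"\.(\d+)" + re.escape(src.suffix) + r"$"
    )
    used = [
        int(m.group(1))
        for m in (pattern.match(p.name) for p in src.parent.iterdir())
        if m
    ]
    # Start at 1 when no numbered copy exists yet.
    target = src.with_name(f"{src.stem}.{max(used, default=0) + 1}{src.suffix}")
    shutil.copy2(src, target)
    return target
\end{verbatim}

Calling \texttt{save\_numbered\_copy("report.txt")} twice would leave
\texttt{report.1.txt} and \texttt{report.2.txt} beside the original.
Nothing in such a scheme records who made a change, or why.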
12 Manually managing multiple versions of even a single file is an
13 error-prone task, though, so software tools to help automate this
14 process have long been available. The earliest automated revision
15 control tools were intended to help a single user to manage revisions
16 of a single file. Over the past few decades, the scope of revision
17 control tools has expanded greatly; they now manage multiple files,
18 and help multiple people to work together. The best modern revision
19 control tools have no problem coping with thousands of people working
20 together on projects that consist of hundreds of thousands of files.
22 \subsection{Why use revision control?}
24 There are a number of reasons why you or your team might want to use
25 an automated revision control tool for a project.
26 \begin{itemize}
27 \item It will track the history and evolution of your project, so you
28 don't have to. For every change, you'll have a log of \emph{who}
29 made it; \emph{why} they made it; \emph{when} they made it; and
30 \emph{what} the change was.
31 \item When you're working with other people, revision control software
32 makes it easier for you to collaborate. For example, when people
33 more or less simultaneously make potentially incompatible changes,
34 the software will help you to identify and resolve those conflicts.
35 \item It can help you to recover from mistakes. If you make a change
36 that later turns out to be in error, you can revert to an earlier
37 version of one or more files. In fact, a \emph{really} good
38 revision control tool will even help you to efficiently figure out
39 exactly when a problem was introduced (see
40 section~\ref{sec:undo:bisect} for details).
41 \item It will help you to work simultaneously on, and manage the drift
42 between, multiple versions of your project.
43 \end{itemize}
44 Most of these reasons are equally valid---at least in theory---whether
45 you're working on a project by yourself, or with a hundred other
46 people.
48 A key question about the practicality of revision control at these two
49 different scales (``lone hacker'' and ``huge team'') is how its
50 \emph{benefits} compare to its \emph{costs}. A revision control tool
51 that's difficult to understand or use is going to impose a high cost.
53 A five-hundred-person project is likely to collapse under its own
54 weight almost immediately without a revision control tool and process.
55 In this case, the cost of using revision control might hardly seem
56 worth considering, since \emph{without} it, failure is almost
57 guaranteed.
59 On the other hand, a one-person ``quick hack'' might seem like a poor
60 place to use a revision control tool, because surely the cost of using
61 one must be close to the overall cost of the project. Right?
63 Mercurial uniquely supports \emph{both} of these scales of
64 development. You can learn the basics in just a few minutes, and due
65 to its low overhead, you can apply revision control to the smallest of
66 projects with ease. Its simplicity means you won't have a lot of
67 abstruse concepts or command sequences competing for mental space with
68 whatever you're \emph{really} trying to do. At the same time,
69 Mercurial's high performance and peer-to-peer nature let you scale
70 painlessly to handle large projects.
72 No revision control tool can rescue a poorly run project, but a good
73 choice of tools can make a huge difference to the fluidity with which
74 you can work on a project.
76 \subsection{The many names of revision control}
78 Revision control is a diverse field, so much so that it doesn't
79 actually have a single name or acronym. Here are a few of the more
80 common names and acronyms you'll encounter:
81 \begin{itemize}
82 \item Revision control (RCS)
83 \item Software configuration management (SCM), or configuration management
84 \item Source code management
85 \item Source code control, or source control
86 \item Version control (VCS)
87 \end{itemize}
88 Some people claim that these terms actually have different meanings,
89 but in practice they overlap so much that there's no agreed or even
90 useful way to tease them apart.
92 \section{A short history of revision control}
94 The best known of the old-time revision control tools is SCCS (Source
95 Code Control System), which Marc Rochkind wrote at Bell Labs in the
96 early 1970s. SCCS operated on individual files, and required every
97 person working on a project to have access to a shared workspace on a
98 single system. Only one person could modify a file at any time;
99 arbitration for access to files was via locks. It was common for
100 people to lock files, and later forget to unlock them, preventing
101 anyone else from modifying those files without the help of an
102 administrator.
104 Walter Tichy developed a free alternative to SCCS in the early 1980s;
105 he called his program RCS (Revision Control System). Like SCCS, RCS
106 required developers to work in a single shared workspace, and to lock
107 files to prevent multiple people from modifying them simultaneously.
109 Later in the 1980s, Dick Grune used RCS as a building block for a set
110 of shell scripts he initially called cmt, but then renamed to CVS
111 (Concurrent Versions System). The big innovation of CVS was that it
112 let developers work simultaneously and somewhat independently in their
113 own personal workspaces. The personal workspaces prevented developers
114 from stepping on each other's toes all the time, as was common with
115 SCCS and RCS. Each developer had a copy of every project file, and
116 could modify their copies independently. They had to merge their
117 edits prior to committing changes to the central repository.
119 Brian Berliner took Grune's original scripts and rewrote them in~C,
120 releasing in 1989 the code that has since developed into the modern
121 version of CVS. CVS subsequently acquired the ability to operate over
122 a network connection, giving it a client/server architecture. CVS's
123 architecture is centralised; only the server has a copy of the history
124 of the project. Client workspaces just contain copies of recent
125 versions of the project's files, and a little metadata to tell them
126 where the server is. CVS has been enormously successful; it is
127 probably the world's most widely used revision control system.
129 In the early 1990s, Sun Microsystems developed an early distributed
130 revision control system, called TeamWare. A TeamWare workspace
131 contains a complete copy of the project's history. TeamWare has no
132 notion of a central repository. (CVS relied upon RCS for its history
133 storage; TeamWare used SCCS.)
135 As the 1990s progressed, awareness grew of a number of problems with
136 CVS. It records simultaneous changes to multiple files individually,
137 instead of grouping them together as a single logically atomic
138 operation. It does not manage its file hierarchy well; it is easy to
139 make a mess of a repository by renaming files and directories. Worse,
140 its source code is difficult to read and maintain, which made the
141 ``pain level'' of fixing these architectural problems prohibitive.
143 In 2000, Jim Blandy and Karl Fogel, two developers who had worked on
144 CVS, started a project to replace it with a tool that would have a
145 better architecture and cleaner code. The result, Subversion, does
146 not stray from CVS's centralised client/server model, but it adds
147 multi-file atomic commits, better namespace management, and a number
148 of other features that make it a generally better tool than CVS.
149 Since its initial release, it has rapidly grown in popularity.
151 More or less simultaneously, Graydon Hoare began working on an
152 ambitious distributed revision control system that he named Monotone.
153 While Monotone addresses many of CVS's design flaws and has a
154 peer-to-peer architecture, it goes beyond earlier (and subsequent)
155 revision control tools in a number of innovative ways. It uses
156 cryptographic hashes as identifiers, and has an integral notion of
157 ``trust'' for code from different sources.
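The idea of using a cryptographic hash as an identifier can be
sketched in a few lines of Python. This is only an illustration of
the general technique: the function name \texttt{content\_id}, the use
of SHA-1, and the choice to hash a parent identifier together with the
content are assumptions made for the example, and do not describe the
actual storage format of Monotone or of any other tool.

\begin{verbatim}
import hashlib


def content_id(data, parent_id=""):
    """Return a hex identifier derived from `data` and an optional
    parent identifier (an illustrative scheme, not a real tool's format)."""
    h = hashlib.sha1()
    h.update(parent_id.encode("ascii"))
    h.update(data)
    return h.hexdigest()


first = content_id(b"hello\n")
second = content_id(b"hello, world\n", parent_id=first)
# The same inputs always give the same identifier, while any change to
# the content or to its ancestry gives a different one.
\end{verbatim}

Because such an identifier depends only on the data, two people who
independently hold the same content can compute the same identifier
without consulting a central server.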
159 Mercurial began life in 2005. While a few aspects of its design are
160 influenced by Monotone, Mercurial focuses on ease of use, high
161 performance, and scalability to very large projects.
163 \section{Trends in revision control}
165 There has been an unmistakable trend in the development and use of
166 revision control tools over the past four decades, as people have
167 become familiar with the capabilities of their tools and constrained
168 by their limitations.
170 The first generation began by managing single files on individual
171 computers. Although these tools represented a huge advance over
172 ad-hoc manual revision control, their locking model and reliance on a
173 single computer limited them to small, tightly-knit teams.
175 The second generation loosened these constraints by moving to
176 network-centered architectures, and managing entire projects at a
177 time. As projects grew larger, they ran into new problems. With
178 clients needing to talk to servers very frequently, server scaling
179 became an issue for large projects. An unreliable network connection
180 could prevent remote users from being able to talk to the server at
181 all. As open source projects started making read-only access
182 available anonymously to anyone, people without commit privileges
183 found that they could not use the tools to interact with a project in
184 a natural way, as they could not record their changes.
186 The current generation of revision control tools is peer-to-peer in
187 nature. All of these systems have dropped the dependency on a single
188 central server, and allow people to distribute their revision control
189 data to where it's actually needed. Collaboration over the Internet
190 has moved from being constrained by technology to a matter of choice and
191 consensus. Modern tools can operate offline indefinitely and
192 autonomously, with a network connection only needed when syncing
193 changes with another repository.
195 \section{A few of the advantages of distributed revision control}
197 Even though distributed revision control tools have for several years
198 been as robust and usable as their previous-generation counterparts,
199 people using older tools have not yet necessarily woken up to their
200 advantages. There are a number of ways in which distributed tools
201 shine relative to centralised ones.
203 For an individual developer, distributed tools are almost always much
204 faster than centralised tools. This is for a simple reason: a
205 centralised tool needs to talk over the network for many common
206 operations, because most metadata is stored in a single copy on the
207 central server. A distributed tool stores all of its metadata
208 locally. All else being equal, talking over the network adds overhead
209 to a centralised tool. Don't underestimate the value of a snappy,
210 responsive tool: you're going to spend a lot of time interacting with
211 your revision control software.
213 Distributed tools are indifferent to the vagaries of your server
214 infrastructure, again because they replicate metadata to so many
215 locations. If you use a centralised system and your server catches
216 fire, you'd better hope that your backup media are reliable, and that
217 your last backup was recent and actually worked. With a distributed
218 tool, you have many backups available on every contributor's computer.
220 The reliability of your network will affect distributed tools far less
221 than it will centralised tools. You can't even use a centralised tool
222 without a network connection, except for a few highly constrained
223 commands. With a distributed tool, if your network connection goes
224 down while you're working, you may not even notice. The only thing
225 you won't be able to do is talk to repositories on other computers,
226 something that is relatively rare compared with local operations. If
227 you have a far-flung team of collaborators, this may be significant.
229 \subsection{Advantages for open source projects}
231 If you take a shine to an open source project and decide that you
232 would like to start hacking on it, and that project uses a distributed
233 revision control tool, you are at once a peer with the people who
234 consider themselves the ``core'' of that project. If they publish
235 their repositories, you can immediately copy their project history,
236 start making changes, and record your work, using the same tools in
237 the same ways as insiders. By contrast, with a centralised tool, you
238 must use the software in a ``read only'' mode unless someone grants
239 you permission to commit changes to their central server. Until then,
240 you won't be able to record changes, and your local modifications will
241 be at risk of corruption any time you try to update your client's view
242 of the repository.
244 \subsubsection{The forking non-problem}
246 It has been suggested that distributed revision control tools pose
247 some sort of risk to open source projects because they make it easy to
248 ``fork'' the development of a project. A fork happens when there are
249 differences in opinion or attitude between groups of developers that
250 cause them to decide that they can't work together any longer. Each
251 side takes a more or less complete copy of the project's source code,
252 and goes off in its own direction.
254 Sometimes the camps in a fork decide to reconcile their differences.
255 With a centralised revision control system, the \emph{technical}
256 process of reconciliation is painful, and has to be performed largely
257 by hand. You have to decide whose revision history is going to
258 ``win'', and graft the other team's changes into the tree somehow.
259 This usually loses some or all of one side's revision history.
261 With respect to forking, distributed tools make it
262 the \emph{only} way to develop a project. Every single change that
263 you make is potentially a fork point. The great strength of this
264 approach is that a distributed revision control tool has to be really
265 good at \emph{merging} forks, because forks are absolutely
266 fundamental: they happen all the time.
268 If every piece of work that everybody does, all the time, is framed in
269 terms of forking and merging, then what the open source world refers
270 to as a ``fork'' becomes \emph{purely} a social issue. If anything,
271 distributed tools \emph{lower} the likelihood of a fork:
272 \begin{itemize}
273 \item They eliminate the social distinction that centralised tools
274 impose: that between insiders (people with commit access) and
275 outsiders (people without).
276 \item They make it easier to reconcile after a social fork, because
277 all that's involved from the perspective of the revision control
278 software is just another merge.
279 \end{itemize}
281 Some people resist distributed tools because they want to retain tight
282 control over their projects, and they believe that centralised tools
283 give them this control. However, if you're of this belief, and you
284 publish your CVS or Subversion repositories publicly, there are
285 plenty of tools available that can pull out your entire project's
286 history (albeit slowly) and recreate it somewhere that you don't
287 control. So while your control in this case is illusory, you are
288 forgoing the ability to fluidly collaborate with whatever people feel
289 compelled to mirror and fork your history.
291 \subsection{Advantages for commercial projects}
293 Many commercial projects are undertaken by teams that are scattered
294 across the globe. Contributors who are far from a central server will
295 see slower command execution and perhaps less reliability. Commercial
296 revision control systems attempt to ameliorate these problems with
297 remote-site replication add-ons that are typically expensive to buy
298 and cantankerous to administer. A distributed system doesn't suffer
299 from these problems in the first place. Better yet, you can easily
300 set up multiple authoritative servers, say one per site, so that
301 there's no redundant communication between repositories over expensive
302 long-haul network links.
304 Centralised revision control systems tend to have relatively low
305 scalability. It's not unusual for an expensive centralised system to
306 fall over under the combined load of just a few dozen concurrent
307 users. Once again, the typical response tends to be an expensive and
308 clunky replication facility. Since the load on a central server---if
309 you have one at all---is orders of magnitude lower with a distributed
310 tool (because all of the data is replicated everywhere), a single
311 cheap server can handle the needs of a much larger team, and
312 replication to balance load becomes a simple matter of scripting.
314 If you have an employee in the field, troubleshooting a problem at a
315 customer's site, they'll benefit from distributed revision control.
316 The tool will let them generate custom builds, try different fixes in
317 isolation from each other, and search efficiently through history for
318 the sources of bugs and regressions in the customer's environment, all
319 without needing to connect to your company's network.
321 \section{Why choose Mercurial?}
323 Mercurial has a unique set of properties that make it a particularly
324 good choice as a revision control system.
325 \begin{itemize}
326 \item It is easy to learn and use.
327 \item It is lightweight.
328 \item It scales excellently.
329 \item It is easy to customise.
330 \end{itemize}
332 If you are at all familiar with revision control systems, you should
333 be able to get up and running with Mercurial in less than five
334 minutes. Even if not, it will take no more than a few minutes
335 longer. Mercurial's command and feature sets are generally uniform
336 and consistent, so you can keep track of a few general rules instead
337 of a host of exceptions.
339 On a small project, you can start working with Mercurial in moments.
340 Creating new changes and branches; transferring changes around
341 (whether locally or over a network); and history and status operations
342 are all fast. Mercurial attempts to stay nimble and largely out of
343 your way by combining low cognitive overhead with blazingly fast
344 operations.
346 The usefulness of Mercurial is not limited to small projects: it is
347 used by projects with hundreds to thousands of contributors, each
348 project containing tens of thousands of files and hundreds of megabytes of
349 source code.
351 If the core functionality of Mercurial is not enough for you, it's
352 easy to build on. Mercurial is well suited to scripting tasks, and
353 its clean internals and implementation in Python make it easy to add
354 features in the form of extensions. There are a number of popular and
355 useful extensions already available, ranging from helping to identify
356 bugs to improving performance.
358 \section{Mercurial compared with other tools}
360 Before you read on, please understand that this section necessarily
361 reflects my own experiences, interests, and (dare I say it) biases. I
362 have used every one of the revision control tools listed below, in
363 most cases for several years at a time.
365 \subsection{Subversion}
367 Subversion is a popular revision control tool, developed to replace
368 CVS. It has a centralised client/server architecture.
370 Subversion and Mercurial have similarly named commands for performing
371 the same operations, so it is easy for a person who is familiar with
372 one to learn to use the other. Both tools are portable to all popular
373 operating systems.
375 Mercurial has a substantial performance advantage over Subversion on
376 every revision control operation I have benchmarked. I have measured
377 its advantage as ranging from a factor of two to a factor of six when
378 compared with Subversion~1.4.3's \emph{ra\_local} file store, which is
379 the fastest access method available. In more realistic deployments
380 involving a network-based store, Subversion will be at a substantially
381 larger disadvantage.
383 Additionally, Subversion incurs a substantial storage overhead to
384 avoid network transactions for a few common operations, such as
385 finding modified files (\texttt{status}) and displaying modifications
386 against the current revision (\texttt{diff}). A Subversion working
387 copy is, as a result, often approximately the same size as, or larger
388 than, a Mercurial repository and working directory, even though the
389 Mercurial repository contains a complete history of the project.
391 Subversion does not have a history-aware merge capability, forcing its
392 users to know exactly which revisions to merge between branches. This
393 shortcoming is scheduled to be addressed in version 1.5.
395 Subversion is currently more widely supported by
396 revision-control-aware third-party tools than is Mercurial, although
397 this gap is closing. Like Mercurial, Subversion has an excellent user
398 manual.
400 Several tools exist to accurately and completely import revision
401 history from a Subversion repository into a Mercurial repository,
402 making the transition from the older tool relatively painless.
404 \subsection{Git}
406 Git is a distributed revision control tool that was developed for
407 managing the Linux kernel source tree. Like Mercurial, its early
408 design was somewhat influenced by Monotone.
410 Git has an overwhelming command set, with version~1.5.0 providing~139
411 individual commands. It has a reputation for being difficult to
412 learn. It does not have a user manual, only documentation for
413 individual commands.
415 In terms of performance, Git is extremely fast. There are several
416 common cases in which it is faster than Mercurial, at least on Linux.
417 However, its performance on (and general support for) Windows is, at
418 the time of writing, far behind that of Mercurial.
420 While a Mercurial repository needs no maintenance, a Git repository
421 requires frequent manual ``repacks'' of its metadata. Without these,
422 performance degrades, while space usage grows rapidly. A server that
423 contains many Git repositories that are not rigorously and frequently
424 repacked will become heavily disk-bound during backups, and there have
425 been instances of daily backups taking far longer than~24 hours as a
426 result. A freshly packed Git repository is slightly smaller than a
427 Mercurial repository, but an unpacked repository is several orders of
428 magnitude larger.
430 The core of Git is written in C. Many Git commands are implemented as
431 shell or Perl scripts, and the quality of these scripts varies widely.
432 I have encountered a number of instances where scripts charged along
433 blindly in the presence of errors that should have been fatal.
435 \subsection{CVS}
437 CVS is probably the most widely used revision control tool in the
438 world. Due to its age and internal untidiness, it has been
439 essentially unmaintained for many years.
441 It has a centralised client/server architecture. It does not group
442 related file changes into atomic commits, making it easy for people to
443 ``break the build''. It has a muddled and incoherent notion of tags
444 and branches that I will not attempt to even describe. It does not
445 support renaming of files or directories well, making it easy to
446 corrupt a repository. It has almost no internal consistency checking
447 capabilities, so it is usually not even possible to tell whether or
448 how a repository is corrupt. I would not recommend CVS for any
449 project, existing or new.
451 Mercurial can import CVS revision history. However, there are a few
452 caveats that apply; these are true of every other revision control
453 tool's CVS importer, too. Due to CVS's lack of atomic changes and
454 unversioned filesystem hierarchy, it is not possible to reconstruct
455 CVS history completely accurately; some guesswork is involved, and
456 renames will usually not show up. Because a lot of advanced CVS
457 administration has to be done by hand and is hence error-prone, it's
458 common for CVS importers to run into multiple problems with corrupted
459 repositories (completely bogus revision timestamps and files that have
460 remained locked for over a decade are just two of the less interesting
461 problems I can recall from personal experience).
463 \subsection{Commercial tools}
465 Perforce has a centralised client/server architecture, with no
466 client-side caching of any data. Unlike modern revision control
467 tools, Perforce requires that a user run a command to inform the
468 server about every file they intend to edit.
470 The performance of Perforce is quite good for small teams, but it
471 falls off rapidly as the number of users grows beyond a few dozen.
472 Modestly large Perforce installations require the deployment of
473 proxies to cope with the load their users generate.
475 %%% Local Variables:
476 %%% mode: latex
477 %%% TeX-master: "00book"
478 %%% End: