hgbook: d0160b0b1a9e en/ch04-concepts.xml

view en/ch04-concepts.xml @ 647:d0160b0b1a9e

Merge with http://hg.serpentine.com/mercurial/book

author	Dongsheng Song <dongsheng.song@gmail.com>
date	Wed Mar 18 20:32:37 2009 +0800 (2009-03-18)
parents	cfdb601a3c8b

3 <chapter id="chap.concepts">

4 <?dbhtml filename="behind-the-scenes.html"?>

5 <title>Behind the scenes</title>

7 <para>Unlike many revision control systems, the concepts upon which

8 Mercurial is built are simple enough that it's easy to understand

9 how the software really works. Knowing this certainly isn't

10 necessary, but I find it useful to have a <quote>mental

11 model</quote> of what's going on.</para>

13 <para>This understanding gives me confidence that Mercurial has been

14 carefully designed to be both <emphasis>safe</emphasis> and

15 <emphasis>efficient</emphasis>. And just as importantly, if it's

16 easy for me to retain a good idea of what the software is doing

17 when I perform a revision control task, I'm less likely to be

18 surprised by its behaviour.</para>

20 <para>In this chapter, we'll initially cover the core concepts

21 behind Mercurial's design, then continue to discuss some of the

22 interesting details of its implementation.</para>

24 <sect1>

25 <title>Mercurial's historical record</title>

27 <sect2>

28 <title>Tracking the history of a single file</title>

30 <para>When Mercurial tracks modifications to a file, it stores

31 the history of that file in a metadata object called a

32 <emphasis>filelog</emphasis>. Each entry in the filelog

33 contains enough information to reconstruct one revision of the

34 file that is being tracked. Filelogs are stored as files in

35 the <filename role="special"

36 class="directory">.hg/store/data</filename> directory. A

37 filelog contains two kinds of information: revision data, and

38 an index to help Mercurial to find a revision

39 efficiently.</para>

41 <para>A file that is large, or has a lot of history, has its

42 filelog stored in separate data

43 (<quote><literal>.d</literal></quote> suffix) and index

44 (<quote><literal>.i</literal></quote> suffix) files. For

45 small files without much history, the revision data and index

46 are combined in a single <quote><literal>.i</literal></quote>

47 file. The correspondence between a file in the working

48 directory and the filelog that tracks its history in the

49 repository is illustrated in figure <xref

50 endterm="fig.concepts.filelog.caption"

51 linkend="fig.concepts.filelog"/>.</para>

53 <informalfigure id="fig.concepts.filelog">

54 <mediaobject>

55 <imageobject><imagedata fileref="images/filelog.png"/></imageobject>

56 <textobject><phrase>XXX add text</phrase></textobject>

57 <caption><para id="fig.concepts.filelog.caption">Relationships between

58 files in working directory and filelogs in repository</para>

59 </caption>

60 </mediaobject>

61 </informalfigure>

63 </sect2>

64 <sect2>

65 <title>Managing tracked files</title>

67 <para>Mercurial uses a structure called a

68 <emphasis>manifest</emphasis> to collect together information

69 about the files that it tracks. Each entry in the manifest

70 contains information about the files present in a single

71 changeset. An entry records which files are present in the

72 changeset, the revision of each file, and a few other pieces

73 of file metadata.</para>

75 </sect2>

76 <sect2>

77 <title>Recording changeset information</title>

79 <para>The <emphasis>changelog</emphasis> contains information

80 about each changeset. Each revision records who committed a

81 change, the changeset comment, other pieces of

82 changeset-related information, and the revision of the

83 manifest to use.</para>

85 </sect2>

86 <sect2>

87 <title>Relationships between revisions</title>

89 <para>Within a changelog, a manifest, or a filelog, each

90 revision stores a pointer to its immediate parent (or to its

91 two parents, if it's a merge revision). As I mentioned above,

92 there are also relationships between revisions

93 <emphasis>across</emphasis> these structures, and they are

94 hierarchical in nature.</para>

96 <para>For every changeset in a repository, there is exactly one

97 revision stored in the changelog. Each revision of the

98 changelog contains a pointer to a single revision of the

99 manifest. A revision of the manifest stores a pointer to a

100 single revision of each filelog tracked when that changeset

101 was created. These relationships are illustrated in figure

102 <xref endterm="fig.concepts.metadata.caption"

103 linkend="fig.concepts.metadata"/>.</para>

104

105 <informalfigure id="fig.concepts.metadata">

106 <mediaobject>

107 <imageobject><imagedata fileref="images/metadata.png"/></imageobject>

108 <textobject><phrase>XXX add text</phrase></textobject>

109 <caption><para id="fig.concepts.metadata.caption">Metadata

110 relationships</para></caption>

111 </mediaobject>

112 </informalfigure>

113

114 <para>As the illustration shows, there is

115 <emphasis>not</emphasis> a <quote>one to one</quote>

116 relationship between revisions in the changelog, manifest, or

117 filelog. If the manifest hasn't changed between two

118 changesets, the changelog entries for those changesets will

119 point to the same revision of the manifest. If a file that

120 Mercurial tracks hasn't changed between two changesets, the

121 entry for that file in the two revisions of the manifest will

122 point to the same revision of its filelog.</para>

123

124 </sect2>

125 </sect1>

126 <sect1>

127 <title>Safe, efficient storage</title>

128

129 <para>The underpinnings of changelogs, manifests, and filelogs are

130 provided by a single structure called the

131 <emphasis>revlog</emphasis>.</para>

132

133 <sect2>

134 <title>Efficient storage</title>

135

136 <para>The revlog provides efficient storage of revisions using a

137 <emphasis>delta</emphasis> mechanism. Instead of storing a

138 complete copy of a file for each revision, it stores the

139 changes needed to transform an older revision into the new

140 revision. For many kinds of file data, these deltas are

141 typically a fraction of a percent of the size of a full copy

142 of a file.</para>

143

144 <para>Some obsolete revision control systems can only work with

145 deltas of text files. They must either store binary files as

146 complete snapshots or encoded into a text representation, both

147 of which are wasteful approaches. Mercurial can efficiently

148 handle deltas of files with arbitrary binary contents; it

149 doesn't need to treat text as special.</para>

150

151 </sect2>

152 <sect2 id="sec.concepts.txn">

153 <title>Safe operation</title>

154

155 <para>Mercurial only ever <emphasis>appends</emphasis> data to

156 the end of a revlog file. It never modifies a section of a

157 file after it has written it. This is both more robust and

158 more efficient than schemes that need to modify or rewrite

159 data.</para>

160

161 <para>In addition, Mercurial treats every write as part of a

162 <emphasis>transaction</emphasis> that can span a number of

163 files. A transaction is <emphasis>atomic</emphasis>: either

164 the entire transaction succeeds and its effects are all

165 visible to readers in one go, or the whole thing is undone.

166 This guarantee of atomicity means that if you're running two

167 copies of Mercurial, where one is reading data and one is

168 writing it, the reader will never see a partially written

169 result that might confuse it.</para>

170

171 <para>The fact that Mercurial only appends to files makes it

172 easier to provide this transactional guarantee. The easier it

173 is to do stuff like this, the more confident you should be

174 that it's done correctly.</para>

175

176 </sect2>

177 <sect2>

178 <title>Fast retrieval</title>

179

180 <para>Mercurial cleverly avoids a pitfall common to many earlier

181 revision control systems: the problem of <emphasis>inefficient

182 retrieval</emphasis>. Most revision control systems store

183 the contents of a revision as an incremental series of

184 modifications against a <quote>snapshot</quote>. To

185 reconstruct a specific revision, you must first read the

186 snapshot, and then every one of the revisions between the

187 snapshot and your target revision. The more history that a

188 file accumulates, the more revisions you must read, hence the

189 longer it takes to reconstruct a particular revision.</para>

190

191 <informalfigure id="fig.concepts.snapshot">

192 <mediaobject>

193 <imageobject><imagedata fileref="images/snapshot.png"/></imageobject>

194 <textobject><phrase>XXX add text</phrase></textobject>

195 <caption><para id="fig.concepts.snapshot.caption">Snapshot of

196 a revlog, with incremental deltas</para></caption>

197 </mediaobject>

198 </informalfigure>

199

200 <para>The innovation that Mercurial applies to this problem is

201 simple but effective. Once the cumulative amount of delta

202 information stored since the last snapshot exceeds a fixed

203 threshold, it stores a new snapshot (compressed, of course),

204 instead of another delta. This makes it possible to

205 reconstruct <emphasis>any</emphasis> revision of a file

206 quickly. This approach works so well that it has since been

207 copied by several other revision control systems.</para>

208

209 <para>Figure <xref endterm="fig.concepts.snapshot.caption"

210 linkend="fig.concepts.snapshot"/> illustrates

211 the idea. In an entry in a revlog's index file, Mercurial

212 stores the range of entries from the data file that it must

213 read to reconstruct a particular revision.</para>
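<para>The snapshot rule can be sketched in a few lines of Python.
This is a toy model, not Mercurial's revlog code: the
<literal>ToyRevlog</literal> class, its <literal>threshold</literal>
parameter, and the crude prefix/suffix delta format are all
assumptions made for illustration. Each index entry records the
snapshot it builds on, so a read touches only the entries from that
snapshot up to the requested revision.</para>

<programlisting>
```python
def make_delta(old, new):
    """Encode new as (prefix_len, suffix_len, middle) relative to old."""
    limit = min(len(old), len(new))
    p = 0
    while p != limit and old[p] == new[p]:
        p += 1
    s = 0
    while s != limit - p and old[len(old) - 1 - s] == new[len(new) - 1 - s]:
        s += 1
    return (p, s, new[p:len(new) - s])


def apply_delta(old, delta):
    p, s, middle = delta
    return old[:p] + middle + old[len(old) - s:]


class ToyRevlog:
    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical limit on delta bytes
        self.entries = []           # ("snapshot", text) or ("delta", delta)
        self.base = []              # index: snapshot each revision builds on
        self.pending = 0            # delta bytes stored since last snapshot
        self.last = None            # text of the newest revision

    def add(self, text):
        if self.entries:
            delta = make_delta(self.last, text)
            cost = len(delta[2])
            if self.threshold >= self.pending + cost:
                self.entries.append(("delta", delta))
                self.base.append(self.base[-1])
                self.pending += cost
                self.last = text
                return
        # First revision, or too much delta data piled up: store a snapshot.
        self.entries.append(("snapshot", text))
        self.base.append(len(self.entries) - 1)
        self.pending = 0
        self.last = text

    def read(self, rev):
        start = self.base[rev]
        text = self.entries[start][1]
        for _kind, delta in self.entries[start + 1:rev + 1]:
            text = apply_delta(text, delta)
        return text
```
</programlisting>

<para>With a small threshold, a revision that differs a lot from its
predecessor is stored as a fresh snapshot, and reading it back needs
no deltas at all.</para>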

214

215 <sect3>

216 <title>Aside: the influence of video compression</title>

217

218 <para>If you're familiar with video compression or have ever

219 watched a TV feed through a digital cable or satellite

220 service, you may know that most video compression schemes

221 store each frame of video as a delta against its predecessor

222 frame. In addition, these schemes use <quote>lossy</quote>

223 compression techniques to increase the compression ratio, so

224 visual errors accumulate over the course of a number of

225 inter-frame deltas.</para>

226

227 <para>Because it's possible for a video stream to <quote>drop

228 out</quote> occasionally due to signal glitches, and to

229 limit the accumulation of artefacts introduced by the lossy

230 compression process, video encoders periodically insert a

231 complete frame (called a <quote>key frame</quote>) into the

232 video stream; the next delta is generated against that

233 frame. This means that if the video signal gets

234 interrupted, it will resume once the next key frame is

235 received. Also, the accumulation of encoding errors

236 restarts anew with each key frame.</para>

237

238 </sect3>

239 </sect2>

240 <sect2>

241 <title>Identification and strong integrity</title>

242

243 <para>Along with delta or snapshot information, a revlog entry

244 contains a cryptographic hash of the data that it represents.

245 This makes it difficult to forge the contents of a revision,

246 and easy to detect accidental corruption.</para>

247

248 <para>Hashes provide more than a mere check against corruption;

249 they are used as the identifiers for revisions. The changeset

250 identification hashes that you see as an end user are from

251 revisions of the changelog. Although filelogs and the

252 manifest also use hashes, Mercurial only uses these behind the

253 scenes.</para>

254

255 <para>Mercurial verifies that hashes are correct when it

256 retrieves file revisions and when it pulls changes from

257 another repository. If it encounters an integrity problem, it

258 will complain and stop whatever it's doing.</para>

259

260 <para>In addition to the effect it has on retrieval efficiency,

261 Mercurial's use of periodic snapshots makes it more robust

262 against partial data corruption. If a revlog becomes partly

263 corrupted due to a hardware error or system bug, it's often

264 possible to reconstruct some or most revisions from the

265 uncorrupted sections of the revlog, both before and after the

266 corrupted section. This would not be possible with a

267 delta-only storage model.</para>

268

269 </sect2>

270 </sect1>

271 <sect1>

272 <title>Revision history, branching, and merging</title>

273

274 <para>Every entry in a Mercurial revlog knows the identity of its

275 immediate ancestor revision, usually referred to as its

276 <emphasis>parent</emphasis>. In fact, a revision contains room

277 for not one parent, but two. Mercurial uses a special hash,

278 called the <quote>null ID</quote>, to represent the idea

279 <quote>there is no parent here</quote>. This hash is simply a

280 string of zeroes.</para>

281

282 <para>In figure <xref endterm="fig.concepts.revlog.caption"

283 linkend="fig.concepts.revlog"/>, you can see

284 an example of the conceptual structure of a revlog. Filelogs,

285 manifests, and changelogs all have this same structure; they

286 differ only in the kind of data stored in each delta or

287 snapshot.</para>

288

289 <para>The first revision in a revlog (at the bottom of the image)

290 has the null ID in both of its parent slots. For a

291 <quote>normal</quote> revision, its first parent slot contains

292 the ID of its parent revision, and its second contains the null

293 ID, indicating that the revision has only one real parent. Any

294 two revisions that have the same parent ID are branches. A

295 revision that represents a merge between branches has two normal

296 revision IDs in its parent slots.</para>
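<para>The parent slots and the null ID are easy to model. The sketch
below is illustrative only. It assumes, as a detail this chapter does
not spell out, that a revision's ID is a SHA-1 hash over its two
parent IDs (in sorted order) followed by the revision data; the
helper names <literal>node_id</literal> and <literal>kind</literal>
are invented for the example.</para>

<programlisting>
```python
import hashlib

NULL_ID = b"\0" * 20  # the "null ID": a string of zeroes


def node_id(text, p1=NULL_ID, p2=NULL_ID):
    """Hash the sorted parent IDs followed by the revision data (assumed scheme)."""
    first, second = sorted([p1, p2])
    return hashlib.sha1(first + second + text).digest()


def kind(p1, p2):
    """Classify a revision by looking at its two parent slots."""
    if p1 == NULL_ID and p2 == NULL_ID:
        return "root"
    if p2 == NULL_ID:
        return "normal"
    return "merge"


root = node_id(b"first version\n")
left = node_id(b"left change\n", root)    # two revisions sharing a parent
right = node_id(b"right change\n", root)  # are branches
merged = node_id(b"merged\n", left, right)
```
</programlisting>

<para>Because the parents feed into the hash, the same file contents
committed on top of different parents get different IDs.</para>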

297

298 <informalfigure id="fig.concepts.revlog">

299 <mediaobject>

300 <imageobject><imagedata fileref="images/revlog.png"/></imageobject>

301 <textobject><phrase>XXX add text</phrase></textobject>

302 <caption><para id="fig.concepts.revlog.caption">Revision in revlog</para>

303 </caption>

304 </mediaobject>

305 </informalfigure>

306

307 </sect1>

308 <sect1>

309 <title>The working directory</title>

310

311 <para>In the working directory, Mercurial stores a snapshot of the

312 files from the repository as of a particular changeset.</para>

313

314 <para>The working directory <quote>knows</quote> which changeset

315 it contains. When you update the working directory to contain a

316 particular changeset, Mercurial looks up the appropriate

317 revision of the manifest to find out which files it was tracking

318 at the time that changeset was committed, and which revision of

319 each file was then current. It then recreates a copy of each of

320 those files, with the same contents it had when the changeset

321 was committed.</para>

322

323 <para>The <emphasis>dirstate</emphasis> contains Mercurial's

324 knowledge of the working directory. This details which

325 changeset the working directory is updated to, and all of the

326 files that Mercurial is tracking in the working

327 directory.</para>

328

329 <para>Just as a revision of a revlog has room for two parents, so

330 that it can represent either a normal revision (with one parent)

331 or a merge of two earlier revisions, the dirstate has slots for

332 two parents. When you use the <command role="hg-cmd">hg

333 update</command> command, the changeset that you update to is

334 stored in the <quote>first parent</quote> slot, and the null ID

335 in the second. When you <command role="hg-cmd">hg

336 merge</command> with another changeset, the first parent

337 remains unchanged, and the second parent is filled in with the

338 changeset you're merging with. The <command role="hg-cmd">hg

339 parents</command> command tells you what the parents of the

340 dirstate are.</para>

341

342 <sect2>

343 <title>What happens when you commit</title>

344

345 <para>The dirstate stores parent information for more than just

346 book-keeping purposes. Mercurial uses the parents of the

347 dirstate as <emphasis>the parents of a new

348 changeset</emphasis> when you perform a commit.</para>

349

350 <informalfigure id="fig.concepts.wdir">

351 <mediaobject>

352 <imageobject><imagedata fileref="images/wdir.png"/></imageobject>

353 <textobject><phrase>XXX add text</phrase></textobject>

354 <caption><para id="fig.concepts.wdir.caption">The working

355 directory can have two parents</para></caption>

356 </mediaobject>

357 </informalfigure>

358

359 <para>Figure <xref endterm="fig.concepts.wdir.caption"

360 linkend="fig.concepts.wdir"/> shows the

361 normal state of the working directory, where it has a single

362 changeset as parent. That changeset is the

363 <emphasis>tip</emphasis>, the newest changeset in the

364 repository that has no children.</para>

365

366 <informalfigure id="fig.concepts.wdir-after-commit">

367 <mediaobject>

368 <imageobject><imagedata fileref="images/wdir-after-commit.png"/>

369 </imageobject>

370 <textobject><phrase>XXX add text</phrase></textobject>

371 <caption><para id="fig.concepts.wdir-after-commit.caption">The working

372 directory gains new parents after a commit</para></caption>

373 </mediaobject>

374 </informalfigure>

375

376 <para>It's useful to think of the working directory as

377 <quote>the changeset I'm about to commit</quote>. Any files

378 that you tell Mercurial that you've added, removed, renamed,

379 or copied will be reflected in that changeset, as will

380 modifications to any files that Mercurial is already tracking;

381 the new changeset will have the parents of the working

382 directory as its parents.</para>

383

384 <para>After a commit, Mercurial will update the parents of the

385 working directory, so that the first parent is the ID of the

386 new changeset, and the second is the null ID. This is shown

387 in figure <xref endterm="fig.concepts.wdir-after-commit.caption"

388 linkend="fig.concepts.wdir-after-commit"/>.

389 Mercurial

390 doesn't touch any of the files in the working directory when

391 you commit; it just modifies the dirstate to note its new

392 parents.</para>
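<para>The way the dirstate's two parent slots change can be modelled
with a tiny class. This is a sketch for building intuition, not
Mercurial's implementation: <literal>ToyDirstate</literal> and its
methods are hypothetical names, and changeset IDs are plain
strings.</para>

<programlisting>
```python
NULL_ID = "0" * 40  # stand-in for the null ID


class ToyDirstate:
    def __init__(self):
        self.parents = (NULL_ID, NULL_ID)

    def update(self, changeset):
        # "hg update": the target becomes the first parent.
        self.parents = (changeset, NULL_ID)

    def merge(self, other):
        # "hg merge": the first parent stays, the second is filled in.
        self.parents = (self.parents[0], other)

    def commit(self, new_id):
        # The dirstate parents become the new changeset's parents;
        # afterwards the dirstate points at the new changeset alone.
        new_parents = self.parents
        self.parents = (new_id, NULL_ID)
        return new_parents
```
</programlisting>

<para>Note that <literal>commit</literal> only rewrites the parent
slots; nothing in the model touches working directory files.</para>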

393

394 </sect2>

395 <sect2>

396 <title>Creating a new head</title>

397

398 <para>It's perfectly normal to update the working directory to a

399 changeset other than the current tip. For example, you might

400 want to know what your project looked like last Tuesday, or

401 you could be looking through changesets to see which one

402 introduced a bug. In cases like this, the natural thing to do

403 is update the working directory to the changeset you're

404 interested in, and then examine the files in the working

405 directory directly to see their contents as they were when you

406 committed that changeset. The effect of this is shown in

407 figure <xref endterm="fig.concepts.wdir-pre-branch.caption"

408 linkend="fig.concepts.wdir-pre-branch"/>.</para>

409

410 <informalfigure id="fig.concepts.wdir-pre-branch">

411 <mediaobject>

412 <imageobject><imagedata fileref="images/wdir-pre-branch.png"/>

413 </imageobject>

414 <textobject><phrase>XXX add text</phrase></textobject>

415 <caption><para id="fig.concepts.wdir-pre-branch.caption">The working

416 directory, updated to an older changeset</para></caption>

417 </mediaobject>

418 </informalfigure>

419

420 <para>Having updated the working directory to an older

421 changeset, what happens if you make some changes, and then

422 commit? Mercurial behaves in the same way as I outlined

423 above. The parents of the working directory become the

424 parents of the new changeset. This new changeset has no

425 children, so it becomes the new tip. And the repository now

426 contains two changesets that have no children; we call these

427 <emphasis>heads</emphasis>. You can see the structure that

428 this creates in figure <xref

429 endterm="fig.concepts.wdir-branch.caption"

430 linkend="fig.concepts.wdir-branch"/>.</para>

431

432 <informalfigure id="fig.concepts.wdir-branch">

433 <mediaobject>

434 <imageobject><imagedata fileref="images/wdir-branch.png"/>

435 </imageobject>

436 <textobject><phrase>XXX add text</phrase></textobject>

437 <caption><para id="fig.concepts.wdir-branch.caption">After a

438 commit made while synced to an older changeset</para></caption>

439 </mediaobject>

440 </informalfigure>

441

442 <note>

443 <para> If you're new to Mercurial, you should keep in mind a

444 common <quote>error</quote>, which is to use the <command

445 role="hg-cmd">hg pull</command> command without any

446 options. By default, the <command role="hg-cmd">hg

447 pull</command> command <emphasis>does not</emphasis>

448 update the working directory, so you'll bring new changesets

449 into your repository, but the working directory will stay

450 synced at the same changeset as before the pull. If you

451 make some changes and commit afterwards, you'll thus create

452 a new head, because your working directory isn't synced to

453 whatever the current tip is.</para>

454

455 <para> I put the word <quote>error</quote> in quotes because

456 all that you need to do to rectify this situation is

457 <command role="hg-cmd">hg merge</command>, then <command

458 role="hg-cmd">hg commit</command>. In other words, this

459 almost never has negative consequences; it just surprises

460 people. I'll discuss other ways to avoid this behaviour,

461 and why Mercurial behaves in this initially surprising way,

462 later on.</para>

463 </note>

464

465 </sect2>

466 <sect2>

467 <title>Merging heads</title>

468

469 <para>When you run the <command role="hg-cmd">hg merge</command>

470 command, Mercurial leaves the first parent of the working

471 directory unchanged, and sets the second parent to the

472 changeset you're merging with, as shown in figure <xref

473 endterm="fig.concepts.wdir-merge.caption"

474 linkend="fig.concepts.wdir-merge"/>.</para>

475

476 <informalfigure id="fig.concepts.wdir-merge">

477 <mediaobject>

478 <imageobject><imagedata fileref="images/wdir-merge.png"/>

479 </imageobject>

480 <textobject><phrase>XXX add text</phrase></textobject>

481 <caption><para id="fig.concepts.wdir-merge.caption">Merging two

482 heads</para></caption>

483 </mediaobject>

484 </informalfigure>

485

486 <para>Mercurial also has to modify the working directory, to

487 merge the files managed in the two changesets. Simplified a

488 little, the merging process goes like this, for every file in

489 the manifests of both changesets.</para>

490 <itemizedlist>

491 <listitem><para>If neither changeset has modified a file, do

492 nothing with that file.</para>

493 </listitem>

494 <listitem><para>If one changeset has modified a file, and the

495 other hasn't, create the modified copy of the file in the

496 working directory.</para>

497 </listitem>

498 <listitem><para>If one changeset has removed a file, and the

499 other hasn't (or has also deleted it), delete the file

500 from the working directory.</para>

501 </listitem>

502 <listitem><para>If one changeset has removed a file, but the

503 other has modified the file, ask the user what to do: keep

504 the modified file, or remove it?</para>

505 </listitem>

506 <listitem><para>If both changesets have modified a file,

507 invoke an external merge program to choose the new

508 contents for the merged file. This may require input from

509 the user.</para>

510 </listitem>

511 <listitem><para>If one changeset has modified a file, and the

512 other has renamed or copied the file, make sure that the

513 changes follow the new name of the file.</para>

514 </listitem></itemizedlist>
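<para>The per-file choices in the list above can be written as a
small decision function. This is a simplified sketch: the function
name and the state labels are invented, each state describes what one
changeset did to the file relative to the common ancestor, and the
rename/copy case is left out.</para>

<programlisting>
```python
def merge_action(local, other):
    """Pick an action for one file.

    local and other are each "unchanged", "modified" or "removed".
    """
    states = {local, other}
    if states == {"unchanged"}:
        return "do nothing"
    if states == {"modified"}:
        return "run merge program"
    if states == {"modified", "removed"}:
        return "ask user"
    if "removed" in states:
        return "delete file"
    return "take modified copy"
```
</programlisting>

<para>Only two of the five outcomes can involve the user, which
matches the observation that most merges finish on their own.</para>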

515 <para>There are more details&emdash;merging has plenty of corner

516 cases&emdash;but these are the most common choices that are

517 involved in a merge. As you can see, most cases are

518 completely automatic, and indeed most merges finish

519 automatically, without requiring your input to resolve any

520 conflicts.</para>

521

522 <para>When you're thinking about what happens when you commit

523 after a merge, once again the working directory is <quote>the

524 changeset I'm about to commit</quote>. After the <command

525 role="hg-cmd">hg merge</command> command completes, the

526 working directory has two parents; these will become the

527 parents of the new changeset.</para>

528

529 <para>Mercurial lets you perform multiple merges, but you must

530 commit the results of each individual merge as you go. This

531 is necessary because Mercurial only tracks two parents for

532 both revisions and the working directory. While it would be

533 technically possible to merge multiple changesets at once, the

534 prospect of user confusion and making a terrible mess of a

535 merge immediately becomes overwhelming.</para>

536

537 </sect2>

538 </sect1>

539 <sect1>

540 <title>Other interesting design features</title>

541

542 <para>In the sections above, I've tried to highlight some of the

543 most important aspects of Mercurial's design, to illustrate that

544 it pays careful attention to reliability and performance.

545 However, the attention to detail doesn't stop there. There are

546 a number of other aspects of Mercurial's construction that I

547 personally find interesting. I'll detail a few of them here,

548 separate from the <quote>big ticket</quote> items above, so that

549 if you're interested, you can gain a better idea of the amount

550 of thinking that goes into a well-designed system.</para>

551

552 <sect2>

553 <title>Clever compression</title>

554

555 <para>When appropriate, Mercurial will store both snapshots and

556 deltas in compressed form. It does this by always

557 <emphasis>trying to</emphasis> compress a snapshot or delta,

558 but only storing the compressed version if it's smaller than

559 the uncompressed version.</para>
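<para>The <quote>try it, and keep whichever is smaller</quote> rule
is simple to sketch with Python's standard
<literal>zlib</literal> module. The tagged-tuple format and the
names <literal>pack</literal> and <literal>unpack</literal> are
assumptions for the example, not Mercurial's on-disk format.</para>

<programlisting>
```python
import zlib


def pack(data):
    """Compress data, but keep the original bytes if that is not smaller."""
    compressed = zlib.compress(data)
    if len(data) > len(compressed):
        return ("zlib", compressed)
    return ("plain", data)


def unpack(entry):
    kind, payload = entry
    return zlib.decompress(payload) if kind == "zlib" else payload
```
</programlisting>

<para>Repetitive text comes back tagged as compressed, while data
that is already compressed (or otherwise random-looking) is normally
stored as-is.</para>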

560

561 <para>This means that Mercurial does <quote>the right

562 thing</quote> when storing a file whose native form is

563 compressed, such as a <literal>zip</literal> archive or a JPEG

564 image. When these types of files are compressed a second

565 time, the resulting file is usually bigger than the

566 once-compressed form, and so Mercurial will store the plain

567 <literal>zip</literal> or JPEG.</para>

568

569 <para>Deltas between revisions of a compressed file are usually

570 larger than snapshots of the file, and Mercurial again does

571 <quote>the right thing</quote> in these cases. It finds that

572 such a delta exceeds the threshold at which it should store a

573 complete snapshot of the file, so it stores the snapshot,

574 again saving space compared to a naive delta-only

575 approach.</para>

576

577 <sect3>

578 <title>Network recompression</title>

579

580 <para>When storing revisions on disk, Mercurial uses the

581 <quote>deflate</quote> compression algorithm (the same one

582 used by the popular <literal>zip</literal> archive format),

583 which balances good speed with a respectable compression

584 ratio. However, when transmitting revision data over a

585 network connection, Mercurial uncompresses the compressed

586 revision data.</para>

587

588 <para>If the connection is over HTTP, Mercurial recompresses

589 the entire stream of data using a compression algorithm that

590 gives a better compression ratio (the Burrows-Wheeler

591 algorithm from the widely used <literal>bzip2</literal>

592 compression package). This combination of algorithm and

593 compression of the entire stream (instead of a revision at a

594 time) substantially reduces the number of bytes to be

595 transferred, yielding better network performance over almost

596 all kinds of network.</para>

597

598 <para>(If the connection is over <command>ssh</command>,

599 Mercurial <emphasis>doesn't</emphasis> recompress the

600 stream, because <command>ssh</command> can already do this

601 itself.)</para>

602

603 </sect3>

604 </sect2>

605 <sect2>

606 <title>Read/write ordering and atomicity</title>

607

608 <para>Appending to files isn't the whole story when it comes to

609 guaranteeing that a reader won't see a partial write. If you

610 recall figure <xref endterm="fig.concepts.metadata.caption"

611 linkend="fig.concepts.metadata"/>, revisions in the

612 changelog point to revisions in the manifest, and revisions in

613 the manifest point to revisions in filelogs. This hierarchy

614 is deliberate.</para>

615

616 <para>A writer starts a transaction by writing filelog and

617 manifest data, and doesn't write any changelog data until

618 those are finished. A reader starts by reading changelog

619 data, then manifest data, followed by filelog data.</para>

620

621 <para>Since the writer has always finished writing filelog and

622 manifest data before it writes to the changelog, a reader will

623 never read a pointer to a partially written manifest revision

624 from the changelog, and it will never read a pointer to a

625 partially written filelog revision from the manifest.</para>

626

627 </sect2>

628 <sect2>

629 <title>Concurrent access</title>

630

631 <para>The read/write ordering and atomicity guarantees mean that

632 Mercurial never needs to <emphasis>lock</emphasis> a

633 repository when it's reading data, even if the repository is

634 being written to while the read is occurring. This has a big

635 effect on scalability; you can have an arbitrary number of

636 Mercurial processes safely reading data from a repository

637 all at once, no matter whether it's being written to or

638 not.</para>

639

640 <para>The lockless nature of reading means that if you're

641 sharing a repository on a multi-user system, you don't need to

642 grant other local users permission to

643 <emphasis>write</emphasis> to your repository in order for

644 them to be able to clone it or pull changes from it; they only

645 need <emphasis>read</emphasis> permission. (This is

646 <emphasis>not</emphasis> a common feature among revision

647 control systems, so don't take it for granted! Most require

648 readers to be able to lock a repository to access it safely,

649 and this requires write permission on at least one directory,

650 which of course makes for all kinds of nasty and annoying

651 security and administrative problems.)</para>

652

653 <para>Mercurial uses locks to ensure that only one process can

654 write to a repository at a time (the locking mechanism is safe

655 even over filesystems that are notoriously hostile to locking,

656 such as NFS). If a repository is locked, a writer will wait

657 and retry for a while in case the repository becomes unlocked, but

658 if the repository remains locked for too long, the process

659 attempting to write will eventually time out. This means

660 that your daily automated scripts won't get stuck forever and

661 pile up if a system crashes unnoticed, for example. (Yes, the

662 timeout is configurable, from zero to infinity.)</para>

663

664 <sect3>

665 <title>Safe dirstate access</title>

666

667 <para>As with revision data, Mercurial doesn't take a lock to

668 read the dirstate file; it does acquire a lock to write it.

669 To avoid the possibility of reading a partially written copy

670 of the dirstate file, Mercurial writes to a file with a

671 unique name in the same directory as the dirstate file, then

672 renames the temporary file atomically to

673 <filename>dirstate</filename>. The file named

674 <filename>dirstate</filename> is thus guaranteed to be

675 complete, not partially written.</para>
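<para>The write-then-rename technique looks roughly like this in
Python. It is a generic sketch of the pattern rather than Mercurial's
own code; <literal>atomic_write</literal> is a hypothetical helper,
and it relies on <literal>os.replace</literal> being atomic when the
temporary file and the target are on the same filesystem.</para>

<programlisting>
```python
import os
import tempfile


def atomic_write(path, data):
    """Write data to a uniquely named temp file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # readers see the old file or the new one
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```
</programlisting>

<para>A reader that opens the target at any moment gets a complete
file, because the name only ever points at fully written
data.</para>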

676

677 </sect3>

678 </sect2>

679 <sect2>

680 <title>Avoiding seeks</title>

681

682 <para>Critical to Mercurial's performance is the avoidance of

683 seeks of the disk head, since any seek is far more expensive

684 than even a comparatively large read operation.</para>

685

686 <para>This is why, for example, the dirstate is stored in a

687 single file. If there were a dirstate file per directory that

688 Mercurial tracked, the disk would seek once per directory.

689 Instead, Mercurial reads the entire single dirstate file in

690 one step.</para>

691

692 <para>Mercurial also uses a <quote>copy on write</quote> scheme

693 when cloning a repository on local storage. Instead of

694 copying every revlog file from the old repository into the new

695 repository, it makes a <quote>hard link</quote>, which is a

696 shorthand way to say <quote>these two names point to the same

697 file</quote>. When Mercurial is about to write to one of a

698 revlog's files, it checks to see if the number of names

699 pointing at the file is greater than one. If it is, more than

700 one repository is using the file, so Mercurial makes a new

701 copy of the file that is private to this repository.</para>
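<para>The link-count check can be sketched as follows. The helper
names are invented for the example, and the code assumes a filesystem
that supports hard links and reports the link count in
<literal>st_nlink</literal>.</para>

<programlisting>
```python
import os
import shutil


def break_link_before_write(path):
    """If other names share this file, swap in a private copy first."""
    if os.stat(path).st_nlink > 1:
        tmp = path + ".copy"
        shutil.copyfile(path, tmp)
        os.replace(tmp, path)  # path now names the private copy


def append_private(path, data):
    break_link_before_write(path)
    with open(path, "ab") as f:
        f.write(data)
```
</programlisting>

<para>After the first write, the two repositories no longer share the
file, so an append in one can never show up in the other.</para>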

702

703 <para>A few revision control developers have pointed out that

704 this idea of making a complete private copy of a file is not

705 very efficient in its use of storage. While this is true,

706 storage is cheap, and this method gives the highest

707 performance while deferring most book-keeping to the operating

708 system. An alternative scheme would most likely reduce

709 performance and increase the complexity of the software, each

710 of which is much more important to the <quote>feel</quote> of

711 day-to-day use.</para>

712

713 </sect2>

714 <sect2>

715 <title>Other contents of the dirstate</title>

716

717 <para>Because Mercurial doesn't force you to tell it when you're

718 modifying a file, it uses the dirstate to store some extra

719 information so it can determine efficiently whether you have

720 modified a file. For each file in the working directory, it

721 stores the time that it last modified the file itself, and the

722 size of the file at that time.</para>

723

724 <para>When you explicitly <command role="hg-cmd">hg

725 add</command>, <command role="hg-cmd">hg remove</command>,

726 <command role="hg-cmd">hg rename</command> or <command

727 role="hg-cmd">hg copy</command> files, Mercurial updates the

728 dirstate so that it knows what to do with those files when you

729 commit.</para>

730

731 <para>When Mercurial is checking the states of files in the

732 working directory, it first checks a file's modification time.

733 If that has not changed, the file must not have been modified.

734 If the file's size has changed, the file must have been

735 modified. If the modification time has changed, but the size

736 has not, only then does Mercurial need to read the actual

737 contents of the file to see if they've changed. Storing these

738 few extra pieces of information dramatically reduces the

739 amount of data that Mercurial needs to read, which yields

740 large performance improvements compared to other revision

741 control systems.</para>
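<para>The order of checks described above can be sketched like this.
It is a simplified model: <literal>snapshot_entry</literal> and
<literal>is_modified</literal> are hypothetical names, the recorded
entry is a plain dictionary, and the committed contents are passed in
directly instead of being read from a filelog.</para>

<programlisting>
```python
import os


def snapshot_entry(path):
    """Record what a dirstate-like structure would remember about a file."""
    st = os.stat(path)
    return {"mtime": st.st_mtime_ns, "size": st.st_size}


def is_modified(path, entry, committed_data):
    st = os.stat(path)
    if st.st_mtime_ns == entry["mtime"]:
        return False                 # timestamp unchanged: treat as clean
    if st.st_size != entry["size"]:
        return True                  # size differs: certainly modified
    with open(path, "rb") as f:      # only now read the contents
        return f.read() != committed_data
```
</programlisting>

<para>The expensive step, reading the file, runs only when the
timestamp has changed but the size has not.</para>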

742

743 </sect2>

744 </sect1>

745 </chapter>

746

747 <!--

748 local variables:

749 sgml-parent-document: ("00book.xml" "book" "chapter")

750 end:

751 -->