Re: Arch as a replacement for CVS for OpenBSD?
> Judging by the amount of "marketing speak", [arch] is not ready
Arch is indeed new: ready for "early adopters" not for people wanting
a glitch-free product. That fact is well documented in numerous
places and I have never assumed that the OpenBSD project as a whole
should or would be an "early adopter" at this stage.
[various complaints about the documentation]
I started to reply point-by-point to each question, but quickly came
to the conclusion that it might be more helpful to write a brief
overview for programmers. That might make the existing documentation
easier to navigate and understand. Does the enclosed overview (at the
end of this message) help?
Here are answers to a few of the more interesting specific questions:
> What are the costs associated with doing large-scale diffs (300MB of
In the current release, a few of us have measured 5-10x the cost of
"diff -r", depending on what's cached. We have a prototype for a new
version of `mkpatch' that benchmarks at 1-2x the cost of "diff -r".
> What are the costs associated with doing large-scale "cvs annotate"
See the note below about "revision libraries". You can do this sort
of operation very quickly. The built-in commands which perform such
operations are, admittedly, minimal, but revision libraries let you
get the same information more flexibly using ordinary shell tools.
> What is that different approach? Where is the documentation
> that explains this philosophy [the difference between arch and
I don't mean to be too flippant, but the documentation you're asking
for starts with P. J. Plauger's "Software Tools" and goes on from
there. arch is built out of simple tools that each do one thing well
and that are designed to be combined together. It combines those
tools in useful ways to yield a system of rich functionality that
requires very few lines of code to implement.
A Hacker-Oriented Overview of Arch
The most central (low-level) functionality in arch consists of
three commands: `mkpatch', `dopatch', and `inventory'. It is
easiest to understand the high level functionality of arch in terms
of those three commands.
Conceptually, `mkpatch' and `dopatch' are very similar to `diff -r'
and `patch'. The most important difference is that the arch
commands also handle:
    renamed files and directories
    files which `diff' thinks are binary files
`inventory' is used to identify which files in a tree are
significant, and to assign a logical identity to each file and
directory. The logical identity remains the same even if a file
is renamed and is the basis on which renames are detected.
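
As a rough illustration of how a stable logical identity makes renames
detectable (this is not arch's code; the ids, paths, and function name
below are invented for the example): if an inventory of each tree is
modeled as a mapping from logical id to path, a rename is simply the
same id appearing at a different path.

```python
def detect_renames(old_inventory, new_inventory):
    """Return (id, old_path, new_path) for every id whose path changed.

    Each inventory maps a logical file id to that file's path in one tree.
    """
    renames = []
    for file_id, old_path in old_inventory.items():
        new_path = new_inventory.get(file_id)
        if new_path is not None and new_path != old_path:
            renames.append((file_id, old_path, new_path))
    return sorted(renames)


# Hypothetical ids and paths, for illustration only.
old = {"id-main": "src/main.c", "id-util": "src/util.c"}
new = {"id-main": "src/main.c", "id-util": "lib/util.c"}

print(detect_renames(old, new))
# [('id-util', 'src/util.c', 'lib/util.c')]
```

A plain `diff -r' would see the same change as one deleted file and one
unrelated new file.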
* Sequences of Revisions
When you import a new tree to an arch repository, the essence of what
happens is that the tree is stored in the repository as a compressed
tar file. When you commit successive revisions of that tree, each
revision is stored as a compressed tar file containing the patch set
for that revision. (Also see below about "revision libraries".)
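
The storage model above can be sketched as follows. This is a toy
model, not arch's on-disk format: a tree is assumed to be a dict from
path to contents, and a patch set is assumed to be a dict with
hypothetical "changed" and "removed" keys. Revision n is then the
imported base tree with the first n patch sets applied in order.

```python
def apply_patch_set(tree, patch_set):
    """Return a new tree with one patch set applied; the input is not modified."""
    result = dict(tree)
    for path in patch_set.get("removed", []):
        result.pop(path, None)
    result.update(patch_set.get("changed", {}))
    return result


def build_revision(base_tree, patch_sets, n):
    """Rebuild revision n: the imported base tree plus the first n patch sets."""
    tree = dict(base_tree)
    for patch_set in patch_sets[:n]:
        tree = apply_patch_set(tree, patch_set)
    return tree


# Hypothetical contents, for illustration only.
base = {"README": "v0", "main.c": "int main;"}
patches = [
    {"changed": {"README": "v1"}},
    {"removed": ["main.c"], "changed": {"main.py": "pass"}},
]

print(build_revision(base, patches, 2))
# {'README': 'v1', 'main.py': 'pass'}
```

The cost of rebuilding a revision this way grows with the number of
patch sets, which is the motivation for the revision library.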
* Atomic Operations
Storing new revisions in an archive is an atomic operation. In
normal operation, locks are held transiently, only while the write
is being performed (similarly to CVS' "lock-less" operation).
Concurrent reads and writes do not interfere with one another.
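
One common way to get this kind of all-or-nothing behavior on a POSIX
filesystem is to build the new revision in a scratch directory and then
publish it with a single rename. The sketch below shows that technique;
it is an assumption for illustration, not a description of arch's
actual locking, and the function and directory names are made up.

```python
import os
import tempfile


def store_revision(archive_dir, name, files):
    """Store a revision so that readers see all of it or none of it.

    `files` maps plain file names (no subdirectories) to text contents.
    """
    final_path = os.path.join(archive_dir, name)
    if os.path.exists(final_path):
        raise FileExistsError(final_path)
    # Build the revision under a scratch name that readers ignore ...
    tmp_path = tempfile.mkdtemp(prefix=".tmp-", dir=archive_dir)
    for filename, data in files.items():
        with open(os.path.join(tmp_path, filename), "w") as f:
            f.write(data)
    # ... then publish it in one step. A rename within a single
    # filesystem is atomic on POSIX systems.
    os.rename(tmp_path, final_path)
    return final_path


with tempfile.TemporaryDirectory() as archive:
    store_revision(archive, "rev-1", {"log": "first commit"})
    print(os.listdir(archive))
    # ['rev-1']
```

A reader listing the archive either finds the complete revision or does
not find it at all; it never observes a half-written one. The existence
check before the rename is not itself race-free, so a real
implementation would still need some form of transient lock between
competing writers.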
* Revision Libraries
In addition to a repository of patch sets, arch is typically
configured to maintain a "revision library". A revision library is
a collection of revisions stored as complete copies of the source
tree, but with an important space optimization: unmodified files are
shared among these trees using hard links.
Many operations (such as checking out a new revision or computing
diffs between arbitrary revisions) use the revision library as a
performance optimization. In addition, programmers can use the
library directly with their favorite tools to explore various
revisions. (This is similar functionality to that offered by
ClearCase, but it is implemented in a portable way.)
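
The hard-link space optimization can be sketched like this. It is a
minimal illustration under stated assumptions, not arch's code: the
function name and directory layout are hypothetical, and both library
trees are assumed to live on the same filesystem (hard links cannot
cross filesystems).

```python
import filecmp
import os
import shutil


def add_to_library(prev_rev_dir, new_tree_dir, new_rev_dir):
    """Store new_tree_dir as new_rev_dir, hard-linking every file that is
    byte-for-byte identical to its counterpart in prev_rev_dir."""
    for root, _dirs, files in os.walk(new_tree_dir):
        rel = os.path.relpath(root, new_tree_dir)
        dst_dir = os.path.normpath(os.path.join(new_rev_dir, rel))
        os.makedirs(dst_dir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            prev = os.path.normpath(os.path.join(prev_rev_dir, rel, name))
            dst = os.path.join(dst_dir, name)
            if os.path.isfile(prev) and filecmp.cmp(src, prev, shallow=False):
                os.link(prev, dst)      # unmodified: share the existing copy
            else:
                shutil.copy2(src, dst)  # new or modified: store a fresh copy
```

Because unmodified files are shared between trees, the trees in such a
library should be treated as read-only: editing a shared file in place
would change it in every revision that links to it.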
* Tags
Any revision, instead of being a complete tar file of the entire
tree or a simple patch set, can be a "tag". Conceptually, a tag is
a symbolic link to some other revision. Tags are how branches are
implemented (the baseline revision of a branch is a tag of the
revision being branched from).
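
The "symbolic link" picture of tags can be made concrete with a small
sketch. The data layout and the revision names here ("A-0", "B-0", and
so on) are invented for the example and do not reflect arch's naming
or storage: each revision is assumed to be a record with a "kind", and
a tag additionally names its "target".

```python
def resolve(revisions, name):
    """Follow tags until reaching a revision that holds real content."""
    seen = set()
    while revisions[name]["kind"] == "tag":
        if name in seen:
            raise ValueError("tag cycle at " + name)
        seen.add(name)
        name = revisions[name]["target"]
    return name


# Hypothetical revisions: branch B starts as a tag of A-1,
# and branch C starts as a tag of B-0.
revisions = {
    "A-0": {"kind": "import"},
    "A-1": {"kind": "patch"},
    "B-0": {"kind": "tag", "target": "A-1"},
    "C-0": {"kind": "tag", "target": "B-0"},
}

print(resolve(revisions, "C-0"))
# A-1
```

Creating a branch in this model costs one small record, regardless of
the size of the tree being branched.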
* Patch Logs
Each "project tree" (or, in CVS terms, "working directory") contains
meta-data that records what patch sets have been applied to that
tree. For example, when you merge one branch into another, the
merged-into branch gains patch logs for the changes from the
merged-from branch.
Arch contains some higher-level merge operations (e.g. star-merge,
replay) which use the patch logs to perform merges intelligently.
(For an example of why merging is a non-trivial problem, requiring
higher-level operations, see
Patch logs are also useful for data-mining about the history of a
tree. For example, they can be used to produce ChangeLog files.
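
Both uses of patch logs can be illustrated with a toy model. The record
layout below (dicts with "name", "date", and "summary") and all of the
sample entries are assumptions made for the example; arch's real patch
log format is not shown here. The first function finds which patch
sets a merge still needs to bring over; the second formats a log as
ChangeLog-style text.

```python
def missing_patches(source_log, target_log):
    """Names of patch sets in source_log not yet recorded in target_log,
    in the order they appear in source_log."""
    applied = {entry["name"] for entry in target_log}
    return [e["name"] for e in source_log if e["name"] not in applied]


def changelog(log):
    """Format patch log entries as ChangeLog-style text, newest first."""
    lines = []
    for entry in sorted(log, key=lambda e: e["date"], reverse=True):
        lines.append("%s  %s" % (entry["date"], entry["name"]))
        lines.append("\t* " + entry["summary"])
        lines.append("")
    return "\n".join(lines)


# Hypothetical patch logs for two branches of one project.
branch_a = [
    {"name": "a-1", "date": "2002-01-05", "summary": "fix typo"},
    {"name": "a-2", "date": "2002-01-09", "summary": "add option"},
]
branch_b = [
    {"name": "b-1", "date": "2002-01-06", "summary": "port to new OS"},
    {"name": "a-1", "date": "2002-01-05", "summary": "fix typo"},
]

print(missing_patches(branch_a, branch_b))
# ['a-2']
print(changelog(branch_b))
```

Since branch_b already carries the log entry for a-1, a merge from
branch_a only needs to apply a-2; without that record, a-1 would be
applied a second time.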
* Name-spaces and Distributed Repositories
Every repository has a globally unique name. The name is location
independent: it remains the same for all mirrors of the repository
and if the repository is migrated.
Every revision in a repository has a unique name:
Putting those two together, every revision has a globally unique
"fully qualified name" of the form:
Tags use fully qualified names: thus you can form a branch from
one repository to another. Patch logs use fully qualified names,
thus the history of each project tree includes a record of all patch
sets merged into that tree, from any repository.
The effect of this is that, as far as arch is concerned, there is
just one global repository, stored in many distributed parts. There
is no centralized operation involved: anyone can create a new
repository, extending the global repository. Programmers can create
private repositories for day to day work. Loosely cooperating teams
can create branches of one another's projects.