
Re: Arch as a replacement for CVS for OpenBSD?



	Judging by the amount of "marketing speak", [arch] is not ready
	for prime-time.

Arch is indeed new: ready for "early adopters", not for people wanting
a glitch-free product.  That fact is well documented in numerous
places and I have never assumed that the OpenBSD project as a whole
should or would be an "early adopter" at this stage.

	[various complaints about the documentation]

I started to reply point-by-point to each question, but quickly came
to the conclusion that it might be more helpful to write a brief
overview for programmers.  That might make the existing documentation
easier to navigate and understand.  Does the enclosed overview (at the
end of this message) help?

Here are answers to a few of the more interesting specific questions:

	What are the costs associated with doing large-scale diffs (300MB of
	source)?

In the current release, a few of us have measured 5-10x the cost of
"diff -r", depending on what's cached.  We have a prototype for a new
version of `mkpatch' that benchmarks at 1-2x the cost of "diff -r".


	What are the costs associated with doing large-scale "cvs annotate"
	style operations?

See the note below about "revision libraries".  You can do this sort
of operation very quickly.  The built-in commands which perform such
operations are, admittedly, minimal, but revision libraries let you
get the same information more flexibly using ordinary shell tools.


	What is that different approach?  Where is the documentation
	that explains this philosophy [the difference between arch and
	CVS]?

I don't mean to be too flippant, but the documentation you're asking
for starts with P. J. Plauger's "Software Tools" and goes on from
there.  arch is built out of simple tools that each do one thing well
and that are designed to be combined together.  It combines those
tools in useful ways to yield a system of rich functionality that
requires very few lines of code to implement.

-t


		  A Hacker-Oriented Overview of Arch

* Foundations

  The core of arch's low-level functionality is three commands:
  `mkpatch', `dopatch', and `inventory'.  It is easiest to understand
  the high-level functionality of arch in terms of those three
  commands.

  Conceptually, `mkpatch' and `dopatch' are very similar to `diff -r'
  and `patch'.  The most important difference is that the arch
  versions handle:

	symbolic links
	file permissions
	renamed files and directories
	files which `diff' thinks are binary files

  `inventory' is used to identify which files in a tree are
  significant, and to assign a logical identity to each file and
  directory.  The logical identity remains the same even if a file
  is renamed and is the basis on which renames are detected.
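
  To make the idea concrete, here is a minimal sketch (in Python, not
  arch's own code) of how renames fall out of comparing two
  inventories.  It assumes each inventory is simply a mapping from
  logical id to path; the ids and paths below are made up for
  illustration, and the real `inventory' output format is not shown
  here.

```python
def detect_renames(old_inventory, new_inventory):
    """Return {logical_id: (old_path, new_path)} for ids whose path changed.

    Each inventory is modelled as a dict mapping a logical id to a path.
    This is an illustrative model, not arch's actual data format.
    """
    renames = {}
    for file_id, old_path in old_inventory.items():
        new_path = new_inventory.get(file_id)
        # Same logical id, different path: the file was renamed or moved.
        if new_path is not None and new_path != old_path:
            renames[file_id] = (old_path, new_path)
    return renames


# Hypothetical ids and paths, for illustration only.
old = {"id-main": "src/main.c", "id-util": "src/util.c"}
new = {"id-main": "src/main.c", "id-util": "lib/util.c", "id-new": "README"}
renames = detect_renames(old, new)
```

  An id present on only one side (such as "id-new" above) is an added
  or deleted file rather than a rename.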

* Sequences of Revisions

  When you import a new tree to an arch repository, the essence of what
  happens is that the tree is stored in the repository as a compressed
  tar file.  When you commit successive revisions of that tree, each
  revision is stored as a compressed tar file containing the patch set
  for that revision.  (Also see below about "revision libraries".)
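
  The storage model above can be sketched as follows.  This is only a
  toy model in Python under loud assumptions: a tree is a dict from
  path to contents, and a patch set is a dict of changed paths (with
  None meaning "deleted").  Real patch sets are produced by `mkpatch'
  and applied by `dopatch'; the function names here are hypothetical.

```python
def apply_patch_set(tree, patch_set):
    """Return a new tree with one patch set applied (toy model)."""
    result = dict(tree)
    for path, content in patch_set.items():
        if content is None:
            result.pop(path, None)   # None marks a deleted file
        else:
            result[path] = content   # added or modified file
    return result


def build_revision(base_tree, patch_sets, patch_level):
    """Rebuild revision `patch_level` from the import plus its patch sets."""
    tree = dict(base_tree)
    for patch_set in patch_sets[:patch_level]:
        tree = apply_patch_set(tree, patch_set)
    return tree


# Hypothetical import and two successive commits.
base = {"README": "v0", "main.c": "int main;"}
patches = [{"README": "v1"}, {"main.c": None, "util.c": "x"}]
rev2 = build_revision(base, patches, 2)
```

  The cost of rebuilding a revision this way grows with the number of
  patch sets that must be replayed, which is one motivation for the
  revision libraries described below.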

* Atomic Operations

  Storing new revisions in an archive is an atomic operation.  In
  normal operation, locks are held transiently, only while the write
  is being performed (similarly to CVS' "lock-less" operation).
  Concurrent reads and writes do not interfere with one another.

* Revision Libraries

  In addition to a repository of patch sets, arch is typically
  configured to maintain a "revision library".  A revision library is
  a collection of revisions stored as complete copies of the source
  tree, but with an important space optimization: unmodified files are
  shared among these trees using hard links.

  Many operations (such as checking out a new revision or computing
  diffs between arbitrary revisions) use the revision library as a
  performance optimization.  In addition, programmers can use the
  library directly with their favorite tools to explore various
  revisions.  (This is similar functionality to that offered by
  ClearCase, but it is implemented in a portable way.)
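
  The hard-link optimization can be sketched in a few lines of Python.
  This is not how arch itself builds a library; the function name
  `add_revision' and the directory layout are assumptions made for
  the example.  It copies changed files and hard-links unchanged ones
  to the previous revision's copy.

```python
import filecmp
import os
import shutil
import tempfile


def add_revision(prev_tree, new_source, dest_tree):
    """Populate dest_tree from new_source, sharing unchanged files
    with prev_tree through hard links (illustrative sketch)."""
    for dirpath, _dirnames, filenames in os.walk(new_source):
        rel_dir = os.path.relpath(dirpath, new_source)
        os.makedirs(os.path.normpath(os.path.join(dest_tree, rel_dir)),
                    exist_ok=True)
        for name in filenames:
            rel = os.path.normpath(os.path.join(rel_dir, name))
            src = os.path.join(new_source, rel)
            prev = os.path.join(prev_tree, rel)
            dest = os.path.join(dest_tree, rel)
            if os.path.isfile(prev) and filecmp.cmp(prev, src, shallow=False):
                os.link(prev, dest)      # unchanged: share the old copy
            else:
                shutil.copy2(src, dest)  # new or modified: store a fresh copy


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    rev1 = os.path.join(root, "rev1")
    work = os.path.join(root, "work")
    rev2 = os.path.join(root, "rev2")
    for tree, text in ((rev1, "old\n"), (work, "new\n")):
        os.makedirs(tree)
        with open(os.path.join(tree, "changed.txt"), "w") as f:
            f.write(text)
        with open(os.path.join(tree, "same.txt"), "w") as f:
            f.write("same\n")
    add_revision(rev1, work, rev2)
    same_is_shared = os.path.samefile(os.path.join(rev1, "same.txt"),
                                      os.path.join(rev2, "same.txt"))
    changed_is_shared = os.path.samefile(os.path.join(rev1, "changed.txt"),
                                         os.path.join(rev2, "changed.txt"))
```

  Because unchanged files are shared, trees built this way should be
  treated as read-only: editing a shared file in place would alter
  every revision that links to it.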

* Tags

  Any revision, instead of being a complete tar file of the entire
  tree or a simple patch set, can be a "tag".  Conceptually, a tag is
  a symbolic link to some other revision.  Tags are how branches are 
  implemented (the baseline revision of a branch is a tag of the
  revision being branched from).
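
  As a rough illustration of the "symbolic link" idea, the sketch
  below follows tags until it reaches a revision with real contents.
  The dict layout, the "kind" and "target" keys, and the revision
  names are all assumptions made for this example (and are shortened:
  they omit the repository part).

```python
def resolve(revisions, name):
    """Follow tags until reaching a revision that is not itself a tag."""
    seen = set()
    while revisions[name]["kind"] == "tag":
        if name in seen:
            raise ValueError("tag cycle at " + name)
        seen.add(name)
        name = revisions[name]["target"]
    return name


# Hypothetical revisions: a branch whose baseline is a tag of patch-1.
revisions = {
    "hello--main--1.0--base-0": {"kind": "import"},
    "hello--main--1.0--patch-1": {"kind": "patch"},
    "hello--fix--1.0--base-0": {
        "kind": "tag",
        "target": "hello--main--1.0--patch-1",
    },
}
branch_base = resolve(revisions, "hello--fix--1.0--base-0")
```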

* Patch Logs

  Each "project tree" (or, in CVS terms, "working directory") contains
  meta-data that records what patch sets have been applied to that
  tree.  For example, when you merge one branch into another, the
  merged-into branch gains patch logs for the changes from the
  merged-from branch.

  Arch contains some higher-level merge operations (e.g. star-merge,
  replay) which use the patch logs to perform merges intelligently.
  (For an example of why merging is a non-trivial problem, requiring
  higher-level operations, see
  http://www.regexps.com/src/docs.d/arch/html/star-topology.html) 

  Patch logs are also useful for data-mining about the history of a
  tree.  For example, they can be used to produce ChangeLog files.
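
  The bookkeeping that patch logs make possible can be sketched as
  below.  This is not the algorithm used by star-merge or replay; it
  only shows, under the assumption that a patch log is an ordered
  list of patch-set names, how a tool could tell which patch sets a
  tree has not yet received.  The names are made up.

```python
def missing_patches(source_log, target_log):
    """Patch sets recorded in source_log but absent from target_log,
    kept in the order they appear in source_log."""
    applied = set(target_log)
    return [name for name in source_log if name not in applied]


# Hypothetical patch logs for two trees.
main_log = ["repo-a/hello--main--1.0--patch-1",
            "repo-a/hello--main--1.0--patch-2",
            "repo-a/hello--main--1.0--patch-3"]
branch_log = ["repo-a/hello--main--1.0--patch-1",
              "repo-b/hello--fix--1.0--patch-1"]
to_merge = missing_patches(main_log, branch_log)
```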


* Name-spaces and Distributed Repositories

  Every repository has a globally unique name.  The name is location
  independent: it is the same for every mirror of the repository and
  does not change if the repository is migrated.

  Every revision in a repository has a unique name:

	CATEGORY--BRANCH--VERSION--PATCH-LEVEL

  Putting those two together, every revision has a globally unique
  "fully qualified name" of the form:

	REPOSITORY/CATEGORY--BRANCH--VERSION--PATCH-LEVEL

  Tags use fully qualified names; thus you can form a branch from
  one repository to another.  Patch logs also use fully qualified
  names; thus the history of each project tree includes a record of
  all patch sets merged into that tree, from any repository.
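
  For readers who like to see the naming scheme as code, here is a
  small Python sketch that splits a fully qualified name into its
  parts.  It assumes the repository name contains no "/" and that no
  component contains "--"; neither rule is stated above, and the
  example name is hypothetical.

```python
from typing import NamedTuple


class RevisionName(NamedTuple):
    repository: str
    category: str
    branch: str
    version: str
    patch_level: str


def parse_fully_qualified(name):
    """Split REPOSITORY/CATEGORY--BRANCH--VERSION--PATCH-LEVEL."""
    repository, sep, revision = name.partition("/")
    if not sep or not repository:
        raise ValueError("missing repository part: " + name)
    parts = revision.split("--")
    if len(parts) != 4 or not all(parts):
        raise ValueError("expected four '--'-separated parts: " + name)
    return RevisionName(repository, *parts)


# Hypothetical name, for illustration only.
parsed = parse_fully_qualified("example-repo/hello--main--1.0--patch-3")
```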

  The effect of this is that, as far as arch is concerned, there is
  just one global repository, stored in many distributed parts.  There
  is no centralized operation involved: anyone can create a new
  repository, extending the global repository.  Programmers can create 
  private repositories for day to day work.  Loosely cooperating teams 
  can create branches of one another's projects.