Re: Arch as a replacement for CVS for OpenBSD?
On Thursday, March 7, Tom Lord wrote:
>
> Arch is indeed new: ready for "early adopters" not for people wanting
> a glitch-free product. That fact is well documented in numerous
> places and I have never assumed that the OpenBSD project as a whole
> should or would be an "early adopter" at this stage.
I was not advocating its use for OpenBSD. I have also been looking
around for a different SCM tool. Arch I did not know about, so
I've had a quick (cursory) look at it.
> I started to reply point-by-point to each question, but quickly came
> to the conclusion that it might be more helpful to write a brief
> overview for programmers. That might make the existing documentation
> easier to navigate and understand. Does the enclosed overview (at the
> end of this message) help?
Well, it does reveal some limitations within your implementation,
which I will try to point out below. Thank you for the concise writeup.
> What is that different approach? Where is the documentation
> that explains this philosophy [the difference between arch and
> CVS]?
>
> I don't mean to be too flippant, but the documentation you're asking
> for starts with P. J. Plauger's "Software Tools" and goes on from
> there. arch is built out of simple tools that each do one thing well
> and that are designed to be combined together. It combines those
> tools in useful ways to yield a system of rich functionality that
> requires very few lines of code to implement.
That is the unix way. However, there has to be an underlying
cohesion to the whole mess. For example, unix uses the fork/exec
model of giving birth to a new process. That is fundamentally
different than what VMS and/or DOS give you. Both environments
have small blocks, but due to the nature of them, they are combined
differently, and have different end functionality. For example, POSIX
compliance under VMS is actually quite hard, and it took
DEC/Compaq/HP/whatever-next quite some time to get something at
least somewhat usable (and reasonably portable) out there.
> * Foundations
>
> The most central piece of (low level) functionality in arch is the
> three commands: `mkpatch', `dopatch', and `inventory'. It is
> easiest to understand the high level functionality of arch in terms
> of those three commands.
In other words, you have 2 tools to manage change-sets (of sorts).
> Conceptually, `mkpatch' and `dopatch' are very similar to `diff -r'
> and `patch'. The most important difference is that the arch
> versions handle:
>
> symbolic links
> file permissions
> renamed files and directories
> files which `diff' thinks are binary files
This answers one of my questions already. You do not have a general
attribute system (or multiple file-streams) for each file. For
example, being able to largely interoperate between Mac and Unix
systems may require you to be able to store other additional
information about each file/directory. VMS may require file
structure information.
> `inventory' is used to identify which files in a tree are
> significant, and to assign a logical identity to each file and
> directory. The logical identity remains the same even if a file
> is renamed and is the basis on which renames are detected.
I gather that this identity will exist for eternity once it is in
existence? If so, how do you manage these identities? Is there
a cost associated with looking them up or managing them? What
operations give rise to a new identity? Can you rename a branch
of a file, and have the trunk still be the original name?
I assume that directories are lists of such identities? If so,
can you have the equivalent of hard links (one file showing up in two
directory objects)?
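To make the rename question concrete, here is a minimal sketch of how
rename detection could work if an inventory is nothing more than a map
from logical identity to path. This is an assumption for illustration,
not arch's actual data format; the ids ("id-001" etc.) and the function
name are made up.

```python
def detect_renames(old_inventory, new_inventory):
    """Compare two {logical_id: path} maps (hypothetical inventory format).

    Returns (renamed, added, removed):
      renamed: {id: (old_path, new_path)} for ids present in both with new paths
      added:   {id: path} for ids only in the new inventory
      removed: {id: path} for ids only in the old inventory
    """
    renamed = {
        file_id: (old_path, new_inventory[file_id])
        for file_id, old_path in old_inventory.items()
        if file_id in new_inventory and new_inventory[file_id] != old_path
    }
    added = {i: p for i, p in new_inventory.items() if i not in old_inventory}
    removed = {i: p for i, p in old_inventory.items() if i not in new_inventory}
    return renamed, added, removed


# Made-up example data: one file renamed, one file added.
old = {"id-001": "src/main.c", "id-002": "README"}
new = {"id-001": "src/app.c", "id-002": "README", "id-003": "src/util.c"}
renamed, added, removed = detect_renames(old, new)
```

With a flat map like this, a hard link would be one id appearing under
two paths, which the sketch cannot represent; that is exactly the case
I am asking about.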
> * Sequences of Revisions
>
> When you import a new tree to an arch repository, the essence of what
> happens is that the tree is stored in the repository as a compressed
> tar file. When you commit successive revisions of that tree, each
> revision is stored as a compressed tar file containing the patch set
> for that revision. (Also see below about "revision libraries".)
In other words, I need to decompress/detar each revision (possibly quite
large) in order to construct a "diff" between 1.1 and 1.112 (that would
possibly be 112 decomp/detar)? Also, what ordering do you impose on the
construction of these diff sets? Are they in SCCS or RCS order? Some
other ordering?
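The cost I am worried about can be sketched as follows. It assumes
(my assumption, not something stated in the overview) that patch sets
are applied strictly forward from the imported base tree, and it models
a patch set as a plain map from path to new content. None of this is
arch code.

```python
def apply_patch(tree, patch):
    """Apply one patch set, modelled as {path: new_content or None}.

    A value of None means the file is deleted. Returns a new tree dict.
    """
    result = dict(tree)
    for path, content in patch.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result


def reconstruct(base_tree, patches, level):
    """Rebuild revision `level` by applying patches 1..level to the base.

    Returns the tree and the number of patch sets that had to be applied,
    which stands in for the number of decompress/untar steps.
    """
    tree = dict(base_tree)
    applied = 0
    for patch in patches[:level]:
        tree = apply_patch(tree, patch)
        applied += 1
    return tree, applied


# Made-up example: a base import followed by three patch sets.
base = {"a.c": "v0", "b.c": "v0"}
patches = [{"a.c": "v1"}, {"b.c": None}, {"c.c": "v1"}]
tree, applied = reconstruct(base, patches, 3)
```

Under that forward-only assumption the work grows linearly with the
patch level, which is why the ordering of the stored diffs matters.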
> * Atomic Operations
>
> Storing new revisions in an archive is an atomic operation. In
> normal operation, locks are held transiently, only while the write
> is being performed (similarly to CVS' "lock-less" operation).
> Concurrent reads and writes do not interfere with one another.
This is a nice property.
> * Revision Libraries
>
> In addition to a repository of patch sets, arch is typically
> configured to maintain a "revision library". A revision library is
> a collection of revisions stored as complete copies of the source
> tree, but with an important space optimization: unmodified files are
> shared among these trees using hard links.
These are uncompressed/detarred versions of particular revisions of the
patch sets? I.e., they exist wholesale within the repo? If so, how do
you handle moved/renamed files? Where do they exist? In both places?
Hardlink? Also, are the patch sets based on any of these revision
libraries? If so, how do you optimize speed/space wrt the number of
patch sets you need to apply to recover any one revision?
This functionality seems dangerously close to having a lock-step version
of the repo checked out. What I mean is that, in some sense, you have a
global counter, which counts the "step" the repo is at. Each operation
increments the counter.
> Many operations (such as checking out a new revision or computing
> diffs between arbitrary revisions) use the revision library as a
> performance optimization. In addition, programmers can use the
> library directly with their favorite tools to explore various
> revisions. (This is similar functionality to that offered by
> ClearCase, but it is implemented in a portable way.)
How is this optimization implemented? What instrumentation do you use,
and what policy does the repo use to store/delete things?
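For reference, the hard-link sharing described above can be sketched
like this. It is only an illustration of the general technique using
standard library calls; the directory names ("patch-1", "patch-2") and
the add_revision helper are made up, and I am not claiming this is how
arch lays out its library.

```python
import os
import tempfile


def add_revision(prev_dir, new_dir, changed):
    """Create new_dir from prev_dir, hard-linking every unmodified file.

    `changed` maps relative paths to new file contents; only those files
    get fresh copies, everything else shares storage with prev_dir.
    """
    for root, _dirs, files in os.walk(prev_dir):
        rel_root = os.path.relpath(root, prev_dir)
        os.makedirs(os.path.normpath(os.path.join(new_dir, rel_root)),
                    exist_ok=True)
        for name in files:
            rel = os.path.normpath(os.path.join(rel_root, name))
            if rel not in changed:
                # Unmodified file: share the same inode between revisions.
                os.link(os.path.join(root, name), os.path.join(new_dir, rel))
    for rel, content in changed.items():
        target = os.path.join(new_dir, rel)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, "w") as f:
            f.write(content)


with tempfile.TemporaryDirectory() as lib:
    rev1 = os.path.join(lib, "patch-1")
    os.makedirs(rev1)
    for name, text in {"a.c": "int a;\n", "b.c": "int b;\n"}.items():
        with open(os.path.join(rev1, name), "w") as f:
            f.write(text)

    rev2 = os.path.join(lib, "patch-2")
    add_revision(rev1, rev2, {"b.c": "int b = 1;\n"})

    # a.c is shared between the two trees, b.c is not.
    shared = os.path.samefile(os.path.join(rev1, "a.c"),
                              os.path.join(rev2, "a.c"))
    distinct = not os.path.samefile(os.path.join(rev1, "b.c"),
                                    os.path.join(rev2, "b.c"))
```

Note that with this scheme, editing a shared file in place changes it
in every revision that links to it, so the trees have to be treated as
read-only.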
> * Tags
>
> Any revision, instead of being a complete tar file of the entire
> tree or a simple patch set, can be a "tag". Conceptually, a tag is
> a symbolic link to some other revision. Tags are how branches are
> implemented (the baseline revision of a branch is a tag of the
> revision being branched from).
Does this mean tags are global to the repo? This concept is very fuzzy
to me. How do I compare this to the cvs tag/branch concept?
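My reading of "a tag is a symbolic link to some other revision" is
roughly the following. The record layout ("kind", "target") and the
revision names are made up for illustration; only the idea of
following a tag to the revision it points at comes from the overview.

```python
def resolve(revisions, name):
    """Follow tag entries until a non-tag revision is reached.

    `revisions` maps a revision name to a record with a "kind" key
    ("import", "patch" or "tag"); tag records also carry a "target".
    """
    seen = set()
    while revisions[name]["kind"] == "tag":
        if name in seen:
            raise ValueError("tag cycle at " + name)
        seen.add(name)
        name = revisions[name]["target"]
    return name


# Made-up example: the baseline of a branch is a tag of a trunk revision.
revisions = {
    "hello--main--1.0--base-0": {"kind": "import"},
    "hello--main--1.0--patch-1": {"kind": "patch"},
    "hello--fix--1.0--base-0": {"kind": "tag",
                                "target": "hello--main--1.0--patch-1"},
}
baseline = resolve(revisions, "hello--fix--1.0--base-0")
```

If that reading is right, a tag is itself a revision in a branch,
rather than a label attached to individual files as in cvs.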
> * Patch Logs
>
> Each "project tree" (or, in CVS terms, "working directory") contains
> meta-data that records what patch sets have been applied to that
> tree. For example, when you merge one branch into another, the
> merged-into branch gains patch logs for the changes from the
> merged-from branch.
This can be very useful. On "commit" does this information get saved
within the repository? If so, how?
> Arch contains some higher-level merge operations (e.g. star-merge,
> replay) which use the patch logs to perform merges intelligently.
> (For an example of why merging is a non-trivial problem, requiring
> higher-level operations, see
> http://www.regexps.com/src/docs.d/arch/html/star-topology.html)
Interesting way of looking at the problem. I've not read the complete
description yet, so I cannot comment on this.
> Patch logs are also useful for data-mining about the history of a
> tree. For example, they can be used to produce ChangeLog files.
Patch logs? Is this the meta-data you talked about before, or something
else?
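If the patch log is simply an ordered list of the patch-set names
already applied to a tree, then the bookkeeping for a merge could look
like the sketch below. This is my guess at the idea, not arch's
star-merge or replay; the function names and patch names are invented.

```python
def missing_patches(into_log, from_log):
    """Patch names recorded in from_log but not yet in into_log, in order."""
    already = set(into_log)
    return [p for p in from_log if p not in already]


def merge_logs(into_log, from_log):
    """Patch log of the merged-into tree after the merge."""
    return list(into_log) + missing_patches(into_log, from_log)


# Made-up example: the branch already contains trunk patch-1.
trunk_log = ["main--patch-1", "main--patch-2", "main--patch-3"]
branch_log = ["main--patch-1", "fix--patch-1"]
to_apply = missing_patches(branch_log, trunk_log)
merged = merge_logs(branch_log, trunk_log)
```

The point is only that a merge can skip whatever the log says is
already there.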
> * Name-spaces and Distributed Repositories
>
> Every repository has a globally unique name. The name is location
> independent: it remains the same for all mirrors of the repository
> and if the repository is migrated.
How are they located? DNS entries? Config file?
> Every revision in a repository has a unique name:
>
> CATEGORY--BRANCH--VERSION--PATCH-LEVEL
Can you explain CATEGORY and how you deal with VERSION/PATCH-LEVEL?
I.e., if something is in the repo, why should it have a PATCH-LEVEL?
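Mechanically, at least, the name splits into four fields. A minimal
sketch, assuming that "--" is the only separator and never occurs
inside a field (the overview does not say), with a made-up example
name:

```python
from typing import NamedTuple


class RevisionName(NamedTuple):
    category: str
    branch: str
    version: str
    patch_level: str


def parse_revision_name(name):
    """Split CATEGORY--BRANCH--VERSION--PATCH-LEVEL into its four fields."""
    parts = name.split("--")
    if len(parts) != 4 or not all(parts):
        raise ValueError("expected CATEGORY--BRANCH--VERSION--PATCH-LEVEL: "
                         + repr(name))
    return RevisionName(*parts)


# Hypothetical revision name, for illustration only.
rev = parse_revision_name("hello--main--1.0--patch-3")
```

That shows the syntax, but not what each field is supposed to mean.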
> Tags use fully qualified names: thus you can form a branch from
> one repository to another. Patch logs use fully qualified names,
> thus the history of each project tree includes a record of all patch
> sets merged into that tree, from any repository.
Nice.
> The effect of this is that, as far as arch is concerned, there is
> just one global repository, stored in many distributed parts. There
> is no centralized operation involved: anyone can create a new
> repository, extending the global repository. Programmers can create
> private repositories for day to day work. Loosely cooperating teams
> can create branches of one another's projects.
How do you deal with people that wish to have their own repo? I.e., they
explicitly do not wish to be part of a branch or have any other connection
with the official repo? How do you deal with patches from such places?
Do you have support for that, or is it handled much like cvs, except that
you maintain your own "vendor" branch?
--Toby.