Benefits from a real world switch from CVS to darcs

I recently switched the source control management for a project at work from CVS to darcs. For others who may be considering such a switch, I'd like to explain the areas that I feel have improved, and those where darcs is currently weak.

First, I'll give some details about the project scope, so you can better compare it to your own usage model. My team consists of four darcs users. We use a mix of Mac OS X and Linux on the desktop, although we primarily interact with darcs through shell accounts on FreeBSD servers.

The project itself is a website, including about 40,000 lines of Perl code, with over 3,000 files stored in over 600 directories. It uses about 150 Megs of disk space per copy, which includes some video files as big as 10 Megs each. At the time we switched, we had over 2,000 commits in the CVS tree, spanning about a year's worth of work.

Background: Our usage model and reasons for switching

Each developer had their own working copy of the code. We used an "alphasite" for internal quality reviews, a "betasite" for client work reviews, and a production copy for the live website.

We weren't trying to maintain branches in CVS; I was aware that they are one of the hassles of using it. If we wanted to share code with each other, or update the alphasite, the betasite or the production site, we would commit to CVS HEAD and update the desired location.

This worked OK, with some coordination among ourselves about when it was a good time to commit or update and when it wasn't.

The problem was the exceptional cases: While some work was being previewed on the alphasite or betasite, a high-priority request would come in that should be fast-tracked to production, ahead of the work that was already committed.

We handled that with a little bit of luck and a fair amount of headache. If we were lucky, the fast-track change affected different files, and only those would be updated on the live site. If we weren't so lucky, some one-off solution would be devised for the particular case, until the project could reach equilibrium again.

I felt there should be a better way.

Looking Past CVS: Comparing the Alternatives

So I got the itch to look at CVS alternatives. I read a fair amount about alternatives, and tried Arch on some personal projects before arriving at darcs. You can read more about why I think distributed source control is the way to go.

I preferred Arch to CVS and found it usable. I simply found that darcs was simpler and more pleasant to use, with a feature set I found sufficient for my needs.

How darcs has made my life easier

Basic work flow design and our new smoke bot

I took a chance and designed a new work flow around darcs' decentralized nature, taking advantage of its metaphor that every repository is a working directory.

First, I eliminated the abstract "central repository". Now our "alphasite" is the place the developers push their changes.

This on its own has multiple advantages. First, the alphasite is constantly updated, and doesn't need to be updated specially for an alphasite review. When these reviews do happen, which are relatively brief and infrequent compared to development time, we can simply not push during them, keeping the alphasite stable.

This design also helped with a problem that's perhaps somewhat unique to website projects. To run our automated test suite, we need a mod_perl server dedicated to this code line, and many Perl modules installed in the system libraries.

Since we already had to have a mod_perl setup for the alphasite, there was little to do to create a "smoke bot", a script that frequently tests the code via cron and e-mails the developers if there are any problems. In fact, our current solution is a single-line cron entry which restarts the Apache server and runs the test suite.

With CVS this would be more complex, as a new check-out of the code would need to be made. In our case, the extra effort was enough that the smoke bot system never got set up to run under CVS.
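
A smoke bot along these lines could be sketched as a small shell script. This is only an illustration under stated assumptions, not the actual cron entry described above: the restart command, the test runner ("prove -r t") and the script name smoke.sh are all placeholders.

```shell
#!/bin/sh
# Smoke-bot sketch. RESTART_CMD and TEST_CMD are assumptions, not the
# commands used on the real alphasite; override them for your own setup.
RESTART_CMD=${RESTART_CMD:-"apachectl restart"}
TEST_CMD=${TEST_CMD:-"prove -r t"}

# Restart the server, then run the test suite. Output is printed only
# when a step fails: cron mails whatever a job prints, so a silent run
# means no e-mail is sent.
smoke_run() {
    if ! out=$($RESTART_CMD 2>&1); then
        printf 'server restart failed:\n%s\n' "$out"
        return 1
    fi
    if ! out=$($TEST_CMD 2>&1); then
        printf 'test suite failed:\n%s\n' "$out"
        return 1
    fi
}

# Only act when invoked as "smoke.sh --run", so the file can also be
# sourced without side effects.
if [ "${1:-}" = "--run" ]; then
    smoke_run
fi
```

A crontab line such as "*/30 * * * * cd /path/to/alphasite && sh smoke.sh --run" (path hypothetical) would then run it every half hour, with cron's MAILTO setting deciding who receives the failure reports.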

A better launch process

I made one other tweak to the repository flow. Although the betasite pulls its changes from the alphasite, the production copy does not. Instead, it pulls from the betasite by default.

Although we could subvert this flow, by design our agreed quality control flow is now built into the system. With CVS, it was easy for me to make a change to my personal copy, and then switch to the production copy to do a cvs update to get that change.

Now there is at least one extra step: I have to pull the change to the betasite before pulling it to the production site. This helps to prevent subversion of the quality control mechanisms simply because it's easy to do so.
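
The one-way flow described in this section can be sketched as a small helper that only ever pulls into a site from the stage directly before it. The repository paths and the function name promote are hypothetical, chosen for illustration.

```shell
#!/bin/sh
# Sketch of the one-way flow alpha -> beta -> production.
# All paths below are hypothetical placeholders.
DARCS=${DARCS:-darcs}
ALPHA=${ALPHA:-/sites/alpha}
BETA=${BETA:-/sites/beta}
PROD=${PROD:-/sites/production}

# promote SITE [darcs pull options...]
# Pull into SITE only from the stage directly before it.
promote() {
    case ${1:-} in
        beta)       src=$ALPHA; dest=$BETA ;;
        production) src=$BETA;  dest=$PROD ;;
        *) echo "usage: promote beta|production [pull options]" >&2
           return 2 ;;
    esac
    shift
    # Run in a subshell so the caller's working directory is untouched.
    (cd "$dest" && $DARCS pull "$@" "$src")
}
```

Calling "promote beta" and then "promote production" walks a change through both review stages. There is deliberately no shortcut from a personal repository to production.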

Better personal change management

Admittedly, we are all still adjusting to the fact that a darcs record does not share our changes like cvs commit does.

We are also noticing the benefits of it. Each of us may be personally working on a task of some complexity, such as "optimize site", which may involve several individual optimization tasks. With darcs, a developer can record several individual optimization changes, but only push them once the whole task is complete.

This is another way in which we keep changes that are not ready for launch or review from getting in the way of something that is. It illustrates that each copy is indeed like its own branch, without the overhead of learning extra commands to deal with branches.

Easy cherry picking


We use RT as our issue tracking system. So we may track a particular programming task as RT#123.

We use darcs' ability to select patches by name to create another branch-like feature based on our issue numbers.

The "fast track" change request is a great example of the benefits of this. Let's say my personal repo, the alphasite and betasite all have various changes that are not ready to launch, and each is in a different state.

A request comes in as RT#654 that should be launched ASAP, ahead of other work. I complete the work with three records, including "RT#654" as a prefix to each record message. From there, it's easy to let these updates flow past the others towards production. I do:

darcs push -p 'RT#654'

And just those patches will flow to the alphasite. Then on the betasite and then production, I simply do:

darcs pull -p 'RT#654'

The change launches with a minimum of fuss, and we all leave early to play mini-golf. (That's actually happened at my office...)

As the launch-master, this feature alone is worth the switch.
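
The fast-track steps above could be wrapped in one helper, sketched here under assumptions: the repository paths and the function name fast_track are hypothetical, and each pull names its source explicitly rather than relying on a default repository.

```shell
#!/bin/sh
# Fast-track sketch: move only the patches for one ticket along the flow.
# All paths below are hypothetical placeholders.
DARCS=${DARCS:-darcs}
DEV=${DEV:-/home/me/site}
ALPHA=${ALPHA:-/sites/alpha}
BETA=${BETA:-/sites/beta}
PROD=${PROD:-/sites/production}

# fast_track TICKET
# Push matching patches to alpha, then pull them into beta and
# production in turn. Stops at the first step that fails.
fast_track() {
    if [ $# -ne 1 ]; then
        echo "usage: fast_track TICKET" >&2
        return 2
    fi
    (cd "$DEV" && $DARCS push -p "$1" "$ALPHA") &&
    (cd "$BETA" && $DARCS pull -p "$1" "$ALPHA") &&
    (cd "$PROD" && $DARCS pull -p "$1" "$BETA")
}
```

Running fast_track 'RT#654' would walk the three records through each stage. One caveat: -p takes a regular expression matched against the patch name, so 'RT#654' would also select a patch named "RT#6543". A delimiter in the naming convention, such as "RT#654:", avoids that.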

Easier developer collaboration

On occasion, I'll want to help another developer on a task before his work is ready to send to our central repository. With darcs, I can pull a changeset from one of my peers' personal repos. With CVS, we may have resorted to committing something broken, just so another developer could have a copy to work on, or to manually shuffling files around in our directories. Darcs is a cleaner solution.

Better exception handling

Although we try to eliminate special cases for each place we checkout code, we have at least one exception on the production site: There is an e-mail handling script that we haven't yet figured out how to make work with relative paths, so the complete paths of the production environment are hardcoded in it.

In CVS, the files involved were left un-committed, thus CVS reported them with a "modified" status, which is usually something that shouldn't happen on the production site. This was messy, because it relied on remembering about uncommitted changes.

Darcs has a cleaner solution. Until the production-specific cases are eliminated, I can record them as darcs changesets which only exist on the production site.

Then instead of wondering why there are some modified files in production, I see (using darcs pull --dry-run --verbose) that there are a few changes which only exist in production and darcs shows me the human-friendly patch names to remind me why the heck that is.
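
One way to keep such production-only patches easy to spot is a naming convention, sketched below. The "PROD-ONLY" prefix and both function names are assumptions made for illustration, not something the workflow above prescribes. Both functions are meant to be run from inside the production repository.

```shell
#!/bin/sh
# Sketch: mark production-only patches with a recognizable name prefix.
DARCS=${DARCS:-darcs}
# Hypothetical naming convention for patches that must never leave
# the production repository.
PROD_PREFIX=${PROD_PREFIX:-PROD-ONLY}

# record_prod_only DESCRIPTION
# Record a change under a name that starts with the prefix.
record_prod_only() {
    $DARCS record -m "$PROD_PREFIX: $1"
}

# List the patches in this repository whose names start with the prefix.
list_prod_only() {
    $DARCS changes -p "^$PROD_PREFIX"
}
```

For example, record_prod_only 'hardcode mail script paths' leaves a self-explanatory patch name behind, and list_prod_only later shows every exception at a glance.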

Better code review

By default, darcs works with changes at a more detailed level than CVS does. It not only tells you that there are changes in particular files, it shows you each change and interactively asks you to confirm each patch "hunk" you are about to record.

If "show not tell" is a recipe for good storytelling, it's also a good recipe for a source control system interface.

Seeing the changes in detail encourages better habits. For example, I may see a change I had forgotten about and record it as a second change, or not at all. I may see some extra debugging statements which I left in my personal copy, but don't want to include in the changeset.

Soon I will have recorded all my changes, except a few debugging statements which I now want to remove. Instead of searching through the code for spurious output to STDERR, I can just run darcs revert. Like record, I will be prompted to review each pending change, but this time to remove them. Cleaning up my code just got a little faster.

Better infrastructure: Built-in XML support

Besides the improvements to my user-level experience, I also appreciate some of the deeper design decisions in darcs. Here's an anecdote which illustrates the benefit of darcs' built-in support for XML.

Pedro Melo wrote a Darcs-Changes-to-RSS Perl script in 82 lines and less than one hour. Alexander Staubo noted that with the right XML stylesheet, this function can be done in a single line:


darcs changes --xml | xsltproc rss.xsl -

That's notable because to accomplish the same function for CVS, the cvs2rss Perl script takes over 3 times as much code, and in turn relies on the non-standard cvs2cl.pl script, adding almost another 1,200 lines of complexity to the system.

Darcs was designed to play well with others, and it shows in its simplicity of integration.

Performance: the good and the bad.

When darcs performs better

In my common usage, there is one case where darcs is noticeably faster than CVS: Tagging the repository. With CVS it takes 5 minutes or more to tag a release, as I wait for CVS to update records for the thousands of files in the project.

With darcs, tagging is instant because it's unrelated to the size of the repository.

When darcs performs worse and what I can do about it.

There are some cases when darcs performs noticeably slower. In the most common cases with the most common commands (record, push and pull), performance is fine.

One that is much slower than CVS right now is using darcs diff to see what's changed in a file since the last version.

This is more of an annoyance than a real problem. With CVS the equivalent command was nearly instant; with darcs it can take more like 10 seconds, enough to notice and become impatient. This issue may have been adequately solved for me today through the release of darcs.vim, a Vim text editor plugin which provides darcs integration. It takes a known shortcut which makes the speed instant, not to mention presenting me a nicely formatted diff in my editor! (For those in the Emacs camp, I believe there is an Emacs mode which uses the same technique.)

The real problem is that sometimes when there are conflicts, darcs can go into an exponential computation, hogging the processor for hours to compute the correct result. I should add that this is at the top of the author's priority list and is currently being worked on over the 2004-2005 winter break. The problem cuts to the core of darcs' design. Since there aren't any other systems designed quite like this, no one really knows how much the performance can be improved, although the author is optimistic. (Or perhaps that should be 'optimizationistic'?)

Since our switch, this hasn't been a problem for us that I recall, although I have seen it happen a number of times in other places. I expect it to occur at some point.

However, as a small isolated team, I think we could work around this fairly quickly. With the design of our patch flows, we should notice the potential conflict before we pull a patch into production. After isolating the conflicting patches, we can eliminate one from our system, and create another one that means the same thing, but conflicts a minimal amount.

I still see these performance issues as a notable drawback, but considering all the benefits I presented above, I think they are worthwhile to put up with, especially if they will be addressed soon.

Final words

I have no illusions that darcs has reached the maturity that CVS has. CVS is stable and consistent, even if it's not ideal.

Darcs still has some performance problems to work through and some other rough edges. Yet, darcs has helped me at every level of source control management, from designing a better flow between repositories, to helping me work productively with individual changes in a file.

So, I feel like darcs has contributed enough to my productivity and source code management to merit the switch of an important real world project.

I encourage others to evaluate it as well.
