Sunday, September 12, 2010

So, Why Isn't PostgreSQL Using Git Yet?

Just over a month ago, I wrote a blog post entitled Git is Coming to PostgreSQL, in which I said that we planned to move to git sometime in the next several weeks.  But a funny thing happened on the way to the conversion.  After we had frozen the CVS repository, and while Magnus Hagander was performing the migration using cvs2git, I happened to notice - just by coincidence - that the conversion had big problems.  cvs2git had interpreted some of the cases where we'd back-patched commits from newer branches into older branches as merges, and generated merge commits.  This made the history look really weird: the merge commits pulled in the entire history of the branch behind them, with the result that newer commits appeared in the commit logs of older branches, even though we didn't commit them there and the changes were not present there.
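To see why a spurious merge commit has this effect, here is a minimal sketch of a toy commit graph.  It is not how git or cvs2git is implemented; the commit names and the log() helper are hypothetical, chosen only for illustration.  The idea is that a branch's log shows every commit reachable from its tip by following parent links, so giving a back-patch commit a second parent on the newer branch drags that branch's whole history into the older branch's log.

```python
# Toy model of a commit graph: each commit maps to the list of its parents.
# All commit names are hypothetical and exist only for this illustration.

def log(parents, tip):
    """Return the set of commits reachable from tip by following parent links."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return seen

# The back-patch recorded as an ordinary, single-parent commit on the back branch.
plain = {
    "base": [],
    "master-1": ["base"],
    "master-2": ["master-1"],
    "back-1": ["base"],
    "back-2": ["back-1"],
}

# The same back-patch misread as a merge: it gains a second parent on master.
merged = dict(plain)
merged["back-2"] = ["back-1", "master-2"]

print(sorted(log(plain, "back-2")))   # only commits made on the back branch
print(sorted(log(merged, "back-2")))  # now also includes master-1 and master-2
```

In the second graph the back branch's log lists master-1 and master-2 even though neither was ever committed to that branch, which is the symptom described above.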

Fortunately, Max Bowsher and Michael Haggerty of the cvs2git project were able to jump in and help us out, first by advising us not to panic, and secondly by making it possible to run cvs2git in a way that doesn't generate merge commits.  Once this was done, Magnus reran the conversion.  The results looked a lot better, but there were still a few things we weren't quite happy with.  There were a number of "manufactured commits" in the history, for a variety of reasons.  Some of these were the result of spurious revisions in the CVS history of generated files that were removed from CVS many years ago; Max Bowsher figured out how to fix this for us.  Others represented cases where a file was deleted from the trunk and then later re-added to a back branch.  Because we are running a very old version of CVS (shame on us!), not enough information was recorded in the RCS files that make up the CVS repository to reconstruct the commit history correctly in those cases.  Tom Lane, again with help from the cvs2git folks, has figured out how to fix this.  We also ended up with a few spurious branches (which are easily deleted), and there are some other manufactured commits that Tom is still investigating.

In spite of the difficulties, I'm feeling optimistic again.  We seem to have gotten past the worst of the issues, and we are making progress on the ones that remain.  It seems likely that we will postpone the migration until after the upcoming CommitFest is over (get your patches in by September 14!), so it may be a bit longer before we get this done - but we're making headway.


  1. Why do you feel that you have to move the whole history to git?

    My experience of several (commercial, closed-source) teams moving respectably sized codebases from one repo technology to another is that so long as you don't delete or decommission the old repo, you're fine.

    Bring the recent history of the active branches across and let anyone who wants deep history go to the old repo.

  2. We've had the same question asked at NetBSD (we're still on CVS), by both git people and subversion people.

    History is valuable information. FSF projects often put changelog text into the codebase itself, but BSD projects usually dislike this. That could affect how often people want to go to the VCS history.

    Putting in any barriers to convenience that make it harder to look at the history is detrimental to development, in the same way that making documentation difficult to access is bad.

    I think in a closed-source environment, people are often focused on 'the next release' - and history matters less. In long-term projects like NetBSD and PostgreSQL, history is important for understanding the intent behind other developers' changes, and for maintaining consistency with past behavior.

  3. From a blog post elsewhere on Git by an X developer: "It's come to the point where the most annoying X server bugs are the ones where the git history stops at the original import from XFree86." That's why to get the whole history.

    The case is a little bit weird, in that I suspect they didn't have the cooperation of in migrating it all in, and that's made ongoing development more difficult.

  4. There were a significant number of problems in the "recent" history, so even if we were willing to drop ancient history, we'd still have work to do to have an acceptable git conversion.

  5. keithb is right. Keep the old repo as is and copy the head of the new repo into git. It's a no-brainer.

  6. The Drupal project is also moving from CVS to git. Maybe you should have a look at what they did.

  7. I did the CVS to git migration for OpenAFS.

    One of our key requirements was to preserve revision history. We already have a year zero (we have no history before IBM open sourced the code) and were determined to avoid another. When it comes to tracing bugs and determining why things are the way they are in such a large, long-lived project, history is vital in our experience.

    In terms of your migration, good luck. I ended up taking some existing tools (cvsps, particularly), modifying them to suit the way we'd used CVS, and writing a Perl constraint solver to sort out the final commit ordering. It wasn't much fun.

  8. It is important to have the complete history in Git. If the CVS version should be fixed up first to make this cleaner, so be it. The CVS version should also be kept around for a while as a backup, but a redundant one.