Tuesday, September 28, 2010

Stupid Git Tricks for PostgreSQL

Even before PostgreSQL switched to git, we had a git mirror of our old CVS repository.  So I suppose I could have hacked up these scripts any time.  But I didn't get around to it until we really did the switch.  Here's the first one.  It's a one-liner.  For some definition of "one line".

git log --format='%H' --shortstat `git merge-base REL9_0_STABLE master`..master | perl -ne 'chomp; if (/^[0-9a-f]/) { print $_, " "; } elsif (/files changed/) { s/^\s+//; my @a = split /\s+/; print $a[3] + $a[5], "\n" }' | sort -k2 -n -r | head | cut -d' ' -f1 | while read commit; do git log --shortstat -n 1 $commit | cat; echo ""; done

This will show you the ten "biggest" commits since the REL9_0_STABLE branch was created, according to number of lines of code touched.  Of course, this isn't a great proxy for significance, as the output shows.  Heavily abbreviated, largest first:

66424a284879b Fix indentation of verbatim block elements (Peter Eisentraut)
9f2e211386931 Remove cvs keywords from all files (Magnus Hagander)
4d355a8336e0f Add a SECURITY LABEL command (Robert Haas)
c10575ff005c3 Rewrite comment code for better modularity, and add necessary locking (Robert Haas)
53e757689ce94 Make NestLoop plan nodes pass outer-relation variables into their inner relation using the general PARAM_EXEC executor parameter mechanism, rather than the ad-hoc kluge of passing the outer tuple down through ExecReScan (Tom Lane)
5194b9d04988a Spell and markup checking (Peter Eisentraut)
005e427a22e3b Make an editorial pass over the 9.0 release notes. (Tom Lane)
3186560f46b50 Replace doc references to install-win32 with install-windows (Robert Haas)
debcec7dc31a9 Include the backend ID in the relpath of temporary relations (Robert Haas)
2746e5f21d4dc Introduce latches. A latch is a boolean variable, with the capability to wait until it is set (Heikki Linnakangas)

Of course, some of these are not-very-interesting commits that happen to touch a lot of lines of code, but a number of them represented significant refactoring work that can be expected to lead to good things down the line.  In particular, latches are intended to reduce replication latency and eventually facilitate synchronous replication; and Tom's PARAM_EXEC refactoring is one step towards support for the SQL construct LATERAL().

OK, one more.

#!/bin/sh

BP=`git merge-base master REL9_0_STABLE`

git log --format='format:%an' $BP..master | sort -u |
while read author; do
    printf "%s: " "$author"
    git log --author="$author" --numstat $BP..master |
    awk '/^[0-9]/ { P += $1; M += $2 }
         /^commit/ { C++ }
         END { print C " commits, " P " additions, " M " deletions, " (P+M) " total"}'
done

This one shows you the total number of lines of code committed to 9.1devel, summed up by committer.  It has the same problem as the previous script, which is that you can sometimes change a lot of lines of code without actually doing anything terribly important.  It has a further problem, too: it only takes into account the committer, rather than other important roles, including reporter, authors, and reviewers.  Unfortunately, that information can't easily be extracted from the commit logs in a structured way.  I would like to see us address that defect in the future, but we're going to need something more clever than git's Author field.  Most non-trivial patches, in the form in which they are eventually committed, are the work of more than one person; and, at least IMO, crediting only the main author (if there even is one) would be misleading and unfair in many cases.
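If you want to convince yourself that the awk aggregation in the script does the right thing, you can feed it fake `git log --numstat` output by hand.  This is just a sketch; the commit hashes and file names are made up:

```shell
# Two fake commits; numstat lines are additions<TAB>deletions<TAB>file.
printf 'commit aaa\n3\t1\tfile1.c\ncommit bbb\n10\t2\tfile2.c\n' |
awk '/^[0-9]/ { P += $1; M += $2 }
     /^commit/ { C++ }
     END { print C " commits, " P " additions, " M " deletions, " (P+M) " total" }'
# → 2 commits, 13 additions, 3 deletions, 16 total
```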

I think the most interesting tidbit I learned from playing around with this stuff is that git merge-base can be used to find the branch point for a release.  That's definitely handy.
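As a quick sanity check, here's a self-contained sketch (assuming git is installed; it uses a throwaway repository rather than the PostgreSQL tree, and the branch and identity names are made up) showing merge-base recovering a branch point:

```shell
# Build a throwaway repo: one shared commit, then divergent work on a
# release branch and on the default branch.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email demo@example.com
git config user.name demo
main=$(git symbolic-ref --short HEAD)   # "master" or "main", depending on git version
git commit -q --allow-empty -m "base"
base=$(git rev-parse HEAD)
git branch REL_STABLE
git commit -q --allow-empty -m "development on $main"
git checkout -q REL_STABLE
git commit -q --allow-empty -m "back-patched fix"
# merge-base should recover the commit at which REL_STABLE diverged:
bp=$(git merge-base REL_STABLE "$main")
[ "$bp" = "$base" ] && echo "branch point recovered"
```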

Friday, September 24, 2010

Enjoying Git

OK, I admit it.  This is awesome.  I'm still getting used to committing to PostgreSQL with git rather than CVS, but it's sort of like the feeling of being let out of the dungeon.  Wow, sunlight, what am I supposed to do about that?

Actually, I've never really been into CVS bashing; it's an OK system for what it does.  And compared to RCS, which I actually used once or twice a long time ago, it's positively phenomenal.  But git, despite its imperfections, is just a lot better.

There are two major things that caused problems for me when committing to CVS.  First, it was painfully slow.  Second, since I was doing all of my development work in git, each commit meant extracting the patch, applying it to CVS, making sure to CVS add/rm any new/deleted files, retyping (or copying) the commit message, and double-checking that I hadn't messed anything up while moving the patch around.  Now, it's just:

$ git commit
$ git show
$ git push

Nice!  I feel like someone gave me an easy button.

Saturday, September 18, 2010

Git Conversion, Take Two

The PostgreSQL project will be making its second attempt to migrate from CVS to git this coming Monday.  In a previous blog post, I talked about some of the difficulties we've had getting a clean conversion of our project history to git.  I was surprised that a number of people suggested throwing out our development history and just moving the head of each branch to git; and I agree with some of the later comments that this would be a bad idea.  I refer back to our development history fairly frequently, for a variety of reasons: to determine when particular features were introduced, to determine what patch last touched a particular area of the code, to see how old a particular bit of code is, and sometimes even to model a new patch on a previous patch that added a similar feature.  So I'd find it very painful to lose convenient access to all of that history.  Even a somewhat messed-up conversion would be better than no conversion at all.

Fortunately, it looks like we're going to end up with a pretty darn good conversion.  Tom Lane spent most of last weekend cleaning up most of the remaining infelicities.  The newest conversions are a huge improvement over both our current, incrementally-updated conversion (which is what I use for day-to-day development) and earlier attempts at a final conversion.  Only a handful of minor artifacts remain, mostly because of wacky things that were done in CVS many years ago.  Our use of CVS in recent years has been quite disciplined, which is why such a clean conversion is possible.

Tuesday, September 14, 2010

SE-Linux For PostgreSQL: Part 2

In part 1 of this blog post, I talked about my recent visit to BWPUG to discuss SE-Linux and PostgreSQL, and reviewed the work that has been done so far, as well as a few upcoming patches.  In this second part of the post, I'm going to review what I learned at the meeting and where the SE-Linux folks told me they'd like to see us go with this.



One of the most interesting concepts we discussed was the idea of a type transition.  This may be old hat for experienced SE-Linux users, but it was a new idea for me.  I'm not sure that I understand this concept in its full generality, but Joshua Brindle explained two specific applications of it to me.  First, when objects are created, SE-Linux apparently allows the context of that object to depend not only on the context of the creator, but also on where the object was created.  For example, if Apache is running in a context called apache_t and creates a temporary file in /tmp, the context of the new file might be apache_tmp_t.  Similarly, if a PostgreSQL client creates a table, the SE-Linux folks would like to be able to determine the security label for the table based on a combination of the client's context and the containing schema's label.
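In SE-Linux policy language, the Apache example above corresponds to a type_transition rule.  This is an illustrative sketch, not taken from any real policy; the type names simply follow the example in the text:

```
# When a process in domain apache_t creates a file in a directory of
# type tmp_t, label the new file apache_tmp_t instead of the default.
type_transition apache_t tmp_t : file apache_tmp_t;
```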

The second application of type transitions which we discussed in the meeting related to what KaiGai Kohei has been calling a trusted procedure.  The idea here seems to be that when a certain function is executed, SE-Linux should have the option, based on the user's context and the function's context, to transition to a new security context for just the period of time during which the function is executing.  This doesn't involve a kernel call: it's just internal recordkeeping.  I'm imagining that SE-Linux support for PostgreSQL will be provided by a loadable module, so essentially we'd need a bullet-proof way of allowing the SE-Linux module to gain control briefly at function entry and exit time (one that would be certain to be called even if, say, we exit the function due to an error condition).

We also talked about SE-Linux control of operations other than DML, which is what the ExecCheckRTPerms hook I talked about in part 1 of this posting will support.  In particular, Joshua Brindle and David Quigley were very concerned about proper control over the LOAD statement.  It looks like this can be easily accomplished using the existing ProcessUtility_hook.  They were also concerned about DDL, but again it seems like the existing ProcessUtility_hook would be sufficient to impose coarse-grained restrictions.  Ultimately, that may not be the best way to go forward, as it may not provide easy access to all the bits they care about - in particular, I think we will need one or more special-purpose hooks in ALTER TABLE - but it may be enough to do something crude.

Another very large hole that will need to be plugged is control over large objects.  These will need security labels and appropriate access checks.

Finally, the SE-Linux folks indicated that in the long run they would really like to have row-level access control, but they believe that they can accomplish useful things with an implementation which does not include that capability, as long as they have the "trusted procedure" facility discussed above.


I'm not sure how far we're going to get with this work during the PostgreSQL 9.1 time frame.  KaiGai Kohei has poured a tremendous amount of time into this work over the last several years, but progress has been slow.  I think one of the big reasons is that doing this work in a way that is acceptable to the PostgreSQL community can sometimes require significant refactoring of the existing code.  It's not always obvious how to accomplish that, and many of the people who are in the best position to carry it off successfully can't put a lot of time into it unless there is funding attached.  So far, no one has stepped forward in this area; if that changes, I expect to see much more rapid progress.

Sunday, September 12, 2010

So, Why Isn't PostgreSQL Using Git Yet?

Just over a month ago, I wrote a blog posting entitled Git is Coming to PostgreSQL, in which I stated that we planned to move to git sometime in the next several weeks.  But a funny thing happened on the way to the conversion.  After we had frozen the CVS repository and while Magnus Hagander was in the process of performing the migration, using cvs2git, I happened to notice - just by coincidence - that the conversion had big problems.  cvs2git had interpreted some of the cases where we'd back-patched commits from newer branches into older branches as merges, and generated merge commits.  This made the history look really weird: the merge commits pulled in the entire history of the branch behind them, with the result that newer commits appeared in the commit logs of older branches, even though we didn't commit them there and the changes were not present there.

Fortunately, Max Bowsher and Michael Haggerty of the cvs2git project were able to jump in and help us out, first by advising us not to panic, and secondly by making it possible to run cvs2git in a way that doesn't generate merge commits.  Once this was done, Magnus reran the conversion.  The results looked a lot better, but there were still a few things we weren't quite happy with.  There were a number of "manufactured commits" in the history, for a variety of reasons.  Some of these were the result of spurious revisions in the CVS history of generated files that were removed from CVS many years ago; Max Bowsher figured out how to fix this for us.  Others represented cases where a file was deleted from the trunk and then later re-added to a back branch.  But because we are running a very old version of CVS (shame on us!), not enough information was recorded in the RCS files that make up the CVS repository to reconstruct the commit history correctly.  Tom Lane, again with help from the cvs2git folks, has figured out how to fix this.  We also ended up with a few spurious branches (which are easily deleted), and there are some other manufactured commits that Tom is still investigating.

In spite of the difficulties, I'm feeling optimistic again.  We seem to have gotten past the worst of the issues, and to be making progress on the ones that remain.  It seems likely that we'll postpone the migration until after the upcoming CommitFest is over (get your patches in by September 14!), so it may be a bit longer before we get this done - but we're making headway.

Friday, September 10, 2010

SE-Linux For PostgreSQL: Part 1

I made the trip down to OmniTI headquarters just south of Baltimore, MD this Wednesday for BWPUG. This month's topic was the ongoing project to integrate SE-Linux with PostgreSQL. Besides myself, Stephen Frost, Greg Smith, and Robert Treat were all there from the PostgreSQL community, along with David Quigley and Joshua Brindle from the SE-Linux community. It was a very productive meeting and I learned a lot about what the SE-Linux community is looking for from PostgreSQL.

We first discussed the current status of the project. Following discussions with Stephen Frost, KaiGai Kohei, and Greg Smith at PGCon 2010, I wrote and committed four patches which have, I think, helped to clear the way for an eventual loadable module implementing basic SE-Linux support for PostgreSQL; and I also committed a fifth patch by KaiGai Kohei. These were, in order of commit:

1. Add a hook in ExecCheckRTPerms(). This is just a very simple hook to allow a loadable module to gain control at the time DML relation permissions are checked. Whenever a SELECT, INSERT, UPDATE, or DELETE statement is invoked, this function gets a list of the relation OIDs and can choose to allow the statement to proceed or throw an error. (It could also potentially do other things, like write to a log file, if that were useful for some reason.)

2. Centralize DML permissions-checking logic. KaiGai Kohei spotted the fact that the previous patch didn't actually work for a couple of important cases. In particular, COPY did not previously go through ExecCheckRTPerms(), and there is some hairy code inside the foreign key stuff that also needed adjustment to work properly with this hook. This patch, by KaiGai Kohei, cleaned all of that up. So as far as I know, we now have a single point for all DML permissions checking, and a hook function at that point. Yeah!

Unfortunately, in order to do label-based security, a simple hook function is not enough. You also need a place to store the labels, and ideally that place should be a PostgreSQL system catalog. I had initially thought that we would add a security label column to the system catalog for each object type, but that would require fairly invasive changes across the whole system and carry some minor performance penalty even for people who did not use it. At PGCon, we came up with the idea of storing all security labels for all objects in a separate catalog. Security labels are, in essence, just strings, which we don't try to interpret but which have some meaning (the details of which we need not understand) to an external security provider such as SE-Linux.

As luck would have it, we already have a model for such a facility: the COMMENT statement already knows how to store arbitrary strings which it does not attempt to interpret for arbitrary database objects, using a catalog (actually two catalogs) dedicated to that purpose. Unfortunately, the comment code is quite large, and, as it turned out, buggy, so it didn't seem like a good idea to copy-and-paste it into a new file and then hack it up from there, as I had initially hoped to do. So that led to three more patches.

3. Standardize get_whatever_oid functions for object types with unqualified names. As it turns out, one of the things that the comment code needed to do over and over again was examine the parse tree representation of an object and convert it to an OID by looking up the name in a system catalog. But there wasn't any standard way to do this, and in some cases the code was quite lengthy and already duplicated in multiple places throughout our source base. This patch cleaned that up, by introducing a standard API and adjusting the existing OID-getter functions, or adding new ones, for tablespaces, databases, roles, schemas, languages, and access methods, to conform to that API.

4. Standardize get_whatever_oid functions for other object types. More of the same, this time for text search parsers, dictionaries, templates, and configs; as well as for conversions, constraints, operator classes, operator families, rules, triggers, and casts.

5. Rewrite comment code for better modularity, and add necessary locking. This patch took the refactoring in the previous two patches one step further. The functions in the previous patches provide a way to translate a named object of a given type to an OID. This patch creates a more general API that can be passed an object type and a parse tree and return an ObjectAddress, which is an internal representation that can point to a database object of any type. The ObjectAddress representation is used for management of dependencies between database objects (e.g. you can't drop a table if there's a view using it, unless you also drop the view) as well as by the comment code, and it will be useful for security label support as well.

This new facility also fixes a longstanding locking bug in the COMMENT code, which still exists (and likely won't be fixed) in 9.0 and all prior releases. An object that is dropped concurrently with a COMMENT operation on that same object could lead to an orphaned comment in the pg_description or pg_shdescription catalog. If another object of the same type is subsequently assigned the same OID, it will inherit the orphaned comment. This is fairly unlikely and, for comments, fairly innocuous, but it would obviously create a potential security hole for security labels.

With these preliminary patches in place, I think we're now well-positioned to introduce the major piece of functionality which we will need to support SE-Linux integration: an in-core security label facility for use by SE-Linux and perhaps other label-based security systems. Stephen Frost, KaiGai Kohei, and I have had extensive discussions about the design of this facility and there are currently two pending patches by KaiGai Kohei which are intended to implement that design: one adds the basic security facility and commands for manually applying labels, and the other adds hooks at table creation time to allow enhanced security providers to automatically set a label on newly created tables. I have not yet reviewed these patches in detail, but I hope to see them committed - likely with some modifications - within the next month.

In the second part of this blog post, I'll go over what I learned from David and Joshua (who were extremely helpful in explaining SE-Linux to me), the additional facilities which they felt would be necessary for a minimally useful SE-Linux integration, and what they'd like to see over the longer term.