Monday, March 05, 2012

The Perils of Collation-Aware Comparisons

If you use psql's \l command to list all the databases in your PostgreSQL cluster, you'll notice a column in the output labelled "Collate"; on my system, that column has the value "en_US.UTF-8".  This means that when, for example, you sort strings, you'll use the "English" sort order rather than the traditional "C" sort order.  You might ask: what's the difference?

In the "C" sort order, all capital letters come before all lower-case letters, whereas in en_US.UTF8, a comes before A which comes before b which comes before B, and so on.  In other words, every collation can have its own rules for sorting strings, consistent with the way that the people who speak that language like to alphabetize things; or at least with the way that the people who wrote the locale definitions for that language think that they like to alphabetize things.  Collations are OS-dependent: some operating systems don't support them at all, while Windows has a completely different naming convention from every other operating system, and probably different behaviors as well.

Friday, March 02, 2012

The Git Workflow

When the PostgreSQL project decided to migrate to git, we decided not to allow merge commits.  A number of people made comments, in a number of different fora, to the effect that we weren't following "the git workflow".  While a few commentators seemed to think that was reasonable, many thought that it demonstrated our ignorance and stupidity, and some felt it outright heresy.

So I noted with some interest Julio Hamano's blog post about the forthcoming release of git 1.7.10, which is slated to include a change to the way that merge commits work: users will now be prompted to edit the commit message, rather than just accepting a default one.  Actually, what I found most interesting where Linus Torvalds' comments on this change, particularly where he says this: "This change hopefully makes people write merge messages to explain their merges, and maybe even decide not to merge at all when it's not necessary."  His comments are quoted more fully in the above-linked blog article; unfortunately I don't know how to link directly to his Google+ post.  And Julio Hamano makes this remark: "Merging updated upstream into your work-in-progress topic without having a good reason is generally considered a bad practice.  [...] Otherwise, your topic branch will stop being about any particular topic but just a garbage heap that absorbs commits from many sources, both from you to work on a specific goal, and also from the upstream that contains work by others made for random other unfocused purposes."

Thursday, March 01, 2012

Performance and Scalability on IBM POWER7

I recently had a chance to run some benchmarks of PostgreSQL 9.2devel on an IBM POWER7 machine provided by IBM and hosted by Oregon State University's Open Source Lab.  I'd like to run more tests, but for a first pass I decided to run a SELECT-only pgbench test at various client counts and scale factors.  This basically measures how quickly we can do primary key lookups in a completely read-only environment.  This generated a pretty graph.  Here it is.

Wednesday, February 08, 2012

My Patches Are Breeding

One of the great things about being a long-term contributor to an open source project like PostgreSQL is that you get to see other people take the stuff you've done and use it as a stepping stone to bigger and better things. One of my early PostgreSQL hacking projects was a patch to extend the syntax of EXPLAIN. Prior to 9.0, the grammar looked like this:

EXPLAIN [ ANALYZE ] [ VERBOSE ] statement

Now, there's nothing particularly wrong with that grammar from a usability perspective, but it turns out to be pretty terrible for extensibility.  Let's suppose we want to add a new EXPLAIN option that does something new and different - say, omit the costing information from the output. Then we have to change the grammar to something like this:

EXPLAIN [ ANALYZE ] [ VERBOSE ] [ NOCOSTS ] statement

There are a couple of problems with this. One is that, as the number of options increases, it gets hard to remember the order in which they must be specified.  You might think that it would be easy enough to recode the grammar to look like this:

EXPLAIN [ ANALYZE | VERBOSE | NOCOSTS ]... statement

...and it might be, but it's surprisingly easy, when using bison, to create situations that bison finds ambiguous, even though a human being might not. Another problem is that NOCOSTS has to become what's called a keyword, which has a very small but nonzero distributed cost across our entire grammar. Rightly or wrongly, a number of patches that proposed to enhance EXPLAIN in various ways got shot down because of these issues.

Thursday, January 12, 2012

Linux Memory Reporting

As much as I like Linux (and, really, I do: I ran Linux 0.99.something on my desktop in college, and wrote my class papers using vim and LaTeX), there are certain things about it that drive me crazy, and the way it reports memory usage is definitely on the list.  It should be possible for a reasonably intelligent human being (in which category I place myself) to answer simple questions about system memory usage, such as "How much memory is my database using?" or "How much memory is my web server using?" relatively simply.

Thursday, December 15, 2011

Write Scalability

Time flies when you're benchmarking.  I noticed today that it's been over a month since my last blog post, so it's past time for an update.  One of the great things about the PostgreSQL community is that it is full of smart people.  One of them is my colleague Pavan Deolasee, who came up with a great idea for reducing contention on one of PostgreSQL's most heavily-trafficked locks: ProcArrayLock.  Heikki Linnakangas (another really smart guy, who is also a colleague of mine) did some more work on the patch, and then I cleaned it up further and committed it.

Monday, November 14, 2011

Linux lseek scalability

I don't normally follow Linux kernel development, but I was pleased to hear (via Andres Freund) that the Linux kernel developers have committed a series of patches by Andi Kleen to reduce locking around the lseek() system call.  As I blogged about back in August, PostgreSQL calls lseek quite frequently (to determine the file length, not to actually move the file pointer), and due to the performance enhancements in 9.2devel, it's now much easier to hit the contention problems that can be caused by frequently acquiring and releasing the inode mutex.  But it looks like this should be fixed in Linux 3.2, which is now at rc1, and therefore on track to be released well before PostgreSQL 9.2.

Meanwhile, we're gearing up for CommitFest #3.  Interesting stuff in this CommitFest includes Álvaro Herrera's work on reducing foreign key lock strength and a PostgreSQL foreign data wrapper (pgsql_fdw) by Hanada Shigeru.  Reviewers are needed, for those and many other patches!