Thursday, December 15, 2011
Time flies when you're benchmarking. I noticed today that it's been over a month since my last blog post, so it's past time for an update. One of the great things about the PostgreSQL community is that it is full of smart people. One of them is my colleague Pavan Deolasee, who came up with a great idea for reducing contention on one of PostgreSQL's most heavily-trafficked locks: ProcArrayLock. Heikki Linnakangas (another really smart guy, who is also a colleague of mine) did some more work on the patch, and then I cleaned it up further and committed it.
Monday, November 14, 2011
Linux lseek scalability
I don't normally follow Linux kernel development, but I was pleased to hear (via Andres Freund) that the Linux kernel developers have committed a series of patches by Andi Kleen to reduce locking around the lseek() system call. As I blogged about back in August, PostgreSQL calls lseek quite frequently (to determine the file length, not to actually move the file pointer), and due to the performance enhancements in 9.2devel, it's now much easier to hit the contention problems that can be caused by frequently acquiring and releasing the inode mutex. But it looks like this should be fixed in Linux 3.2, which is now at rc1, and therefore on track to be released well before PostgreSQL 9.2.
Meanwhile, we're gearing up for CommitFest #3. Interesting stuff in this CommitFest includes Álvaro Herrera's work on reducing foreign key lock strength and a PostgreSQL foreign data wrapper (pgsql_fdw) by Hanada Shigeru. Reviewers are needed, for those and many other patches!
Thursday, November 10, 2011
Unsticking VACUUM
Every PostgreSQL release adds new features, but sometimes the key to a release has less to do with what you add than with what you take away. PostgreSQL 8.4, for example, removed the settings max_fsm_pages and max_fsm_relations, and replaced them with a per-relation free space map that no longer requires manual sizing. Those parameters are now gone, and more importantly, something that you previously needed to understand and manage was replaced with something that just works. People who are still running PostgreSQL 8.3, or older versions, want to understand exactly how the free space map works; people who are running PostgreSQL 8.4, or newer, don't care. It's enough to know that it does work.
Now, about eight months ago, I wrote a blog entry on troubleshooting stuck vacuums. I would not say that this is an everyday problem, but in ten years of working with PostgreSQL, I've seen it a few times, and it's very unpleasant. It's easy to miss the fact that you have a problem at all, because in most cases, nothing immediately breaks. Instead, system performance just slowly degrades, gradually enough that you may not realize what the problem is until things have gotten pretty bad and you need to CLUSTER or VACUUM FULL to recover.
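If you suspect you might have a stuck VACUUM right now, a quick first check is to look for a long-running VACUUM or autovacuum worker in pg_stat_activity. This is a minimal sketch using the 9.0/9.1 column names (procpid and current_query were renamed in later releases):

SELECT procpid, waiting, now() - xact_start AS xact_runtime, current_query
FROM pg_stat_activity
WHERE current_query LIKE 'autovacuum:%' OR current_query LIKE '%VACUUM%'
ORDER BY xact_start;

A VACUUM that has been running for hours or days, or that shows waiting = true, is a good candidate for further investigation.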
Monday, November 07, 2011
Hint Bits
Heikki Linnakangas was doing some benchmarking last week and discovered something surprising: in some circumstances, unlogged tables were actually slower than permanent tables. Upon examination, he discovered that the problem was caused by CLOG contention, due to hint bits not being set soon enough. This leads to a few questions:
1. What is CLOG?
2. What are hint bits?
3. How does setting hint bits prevent CLOG contention?
4. Why weren't hint bits being set sooner?
Let's take those in order.
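As a concrete reference point for the discussion, here's one way to look at hint bits directly, using the pageinspect contrib extension. This is just a sketch: the bit values 256 (0x0100) and 512 (0x0200) correspond to HEAP_XMIN_COMMITTED and HEAP_XMIN_INVALID in the server source and could in principle change between versions, and the table name is just whatever you happen to be testing with.

CREATE EXTENSION pageinspect;
SELECT lp, t_xmin, t_xmax,
       (t_infomask & 256) <> 0 AS xmin_committed_hint,   -- HEAP_XMIN_COMMITTED
       (t_infomask & 512) <> 0 AS xmin_aborted_hint      -- HEAP_XMIN_INVALID
FROM heap_page_items(get_raw_page('pgbench_accounts', 0));

Rows whose inserting transaction has committed but which nobody has re-read since will show both hint columns false; once some later reader has consulted CLOG, xmin_committed_hint flips to true.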
Monday, October 31, 2011
Fast Counting
Since I wrote my previous blog entry on index-only scans, quite a bit of additional work has been done. Tom Lane cleaned up the code and improved the costing model, but possibly the most interesting thing he did was to allow index-only scans to be used for queries that don't involve an indexable condition at all. The classic example is SELECT COUNT(*) FROM table. In previous versions of PostgreSQL, there's just one way to implement this: sequentially scan the table and count 'em up. In PostgreSQL 9.2, that method will still, of course, be available, but now there will be another choice: pick any index you like and do a full index scan, checking whether each tuple is all-visible either using the visibility map or via a heap fetch. So, how well does it work?
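Here's a rough way to see the new behavior for yourself on a 9.2 development build (a sketch with a made-up table; the plan text is what I'd expect, not a promise):

CREATE TABLE t (id serial PRIMARY KEY, payload text);
INSERT INTO t (payload) SELECT md5(g::text) FROM generate_series(1, 1000000) g;
VACUUM ANALYZE t;   -- sets the visibility map so most pages are all-visible
EXPLAIN SELECT COUNT(*) FROM t;
-- Expected plan shape:
--   Aggregate
--     ->  Index Only Scan using t_pkey on t

On older releases, or if the visibility map hasn't been set by a VACUUM, you'll see the familiar Aggregate over a Seq Scan instead.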
Monday, October 24, 2011
PostgreSQL Crash Debugging
As I mentioned in a previous blog post, I spend some of my time working in and with EnterpriseDB's support department. And what that means is that every customer I talk to has a problem, typically a fairly serious problem, and they want me to help them fix it. Of course, to fix it, you first have to be able to identify the problem, and sometimes that's not so simple. Database crashes can be among the more difficult cases to debug.
Monday, October 17, 2011
Deadlocks
Last week, someone pinged me on instant messenger to ask about the following message, which their PostgreSQL instance had just produced:
DETAIL: Process 22986 waits for ShareLock on transaction 939; blocked by process 22959.
Process 22959 waits for ShareLock on transaction 940; blocked by process 22986.
This message is complaining about a deadlock. But unless you've seen and debugged these a few times before, it might not be entirely obvious what's actually going on here. What, exactly, did the offending processes do that caused the problem?
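Here is a minimal recipe that produces exactly this kind of deadlock, using a hypothetical accounts table: each session updates one row and then tries to update the row the other session has already locked.

-- Session 1:
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
-- Session 2:
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 2;
-- Session 1 (blocks, waiting for session 2's row lock):
UPDATE accounts SET balance = balance + 10 WHERE id = 2;
-- Session 2 (now each session is waiting for the other; after deadlock_timeout,
-- the deadlock detector aborts one of them with a message like the one above):
UPDATE accounts SET balance = balance + 10 WHERE id = 1;

The ShareLock "on transaction 939" in the message is the lock each waiter takes on the other session's transaction ID while waiting for that transaction to commit or abort and release its row lock.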
Friday, October 07, 2011
Index-Only Scans: We've Got 'Em
After hacking on it some more himself, Tom Lane committed a patch for index-only scans written by me and Ibrar Ahmed, which also incorporated some previous work by Heikki Linnakangas. Woohoo!
There is, of course, more work to be done here - performance fine-tuning, cost estimation, extensions to the core functionality - but the core of the feature is now in. If you get a chance, please test it out and let us know how it works for you.
For those that may not have been following along at home, what we're essentially doing here is allowing any index to act as a "covering index". If all of the columns the query needs are available from the index tuple, we'll skip fetching the corresponding heap (table) page if every tuple on that page is visible to all running transactions.
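Here's a sketch of what that looks like in practice, using the pgbench tables as an example (the index and plan shown are illustrative, not from an actual run):

CREATE INDEX pgbench_accounts_aid_abalance ON pgbench_accounts (aid, abalance);
VACUUM pgbench_accounts;   -- make sure the visibility map is set
EXPLAIN SELECT aid, abalance FROM pgbench_accounts WHERE aid BETWEEN 1000 AND 2000;
-- Hoped-for plan: Index Only Scan using pgbench_accounts_aid_abalance on pgbench_accounts

Since the query needs only aid and abalance, both of which are stored in the index, the heap needs to be visited only for pages that the visibility map says might contain not-all-visible tuples.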
Although I know we're not even really done with this feature yet, I can't help wondering what's next. Index-only scans have so often been cited as "the big performance feature that PostgreSQL is missing" that it's become something of a cliché. Now that we have them, what will take their place as the next big thing?
Tuesday, October 04, 2011
CommitFest In Progress
I've seen a lot of articles lately about the great new features (and removed limitations) in PostgreSQL 9.1. Unless you're a regular reader of pgsql-hackers, you could almost forget about the fact that PostgreSQL 9.2 development is in full swing. In fact, there's a CommitFest going on right now and we could use a few more reviewers.
Many of the features that were submitted to this CommitFest are small improvements - minor fine-tuning of existing features, like generating better column names for subquery expressions, or fixing things so that LIKE can more reliably make use of indexes when non-English characters are involved. But some of the big features that will hopefully become part of PostgreSQL 9.2 are also beginning to materialize.
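For context on the LIKE item: the long-standing issue is that under a non-C locale an ordinary btree index can't be used for prefix pattern matches, and the usual workaround has been to build a second index with the text_pattern_ops operator class. A sketch, with a hypothetical table:

CREATE INDEX customers_name_pattern_idx ON customers (name text_pattern_ops);
EXPLAIN SELECT * FROM customers WHERE name LIKE 'Sm%';
-- With the pattern-ops index in place, the planner can use an index scan for the prefix match.

The patch mentioned above aims to make this sort of thing work more reliably when non-English characters are involved.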
Friday, September 30, 2011
Scalability, in Graphical Form, Analyzed
I'm at Surge this week, where I just listened to Baron Schwartz give a talk about scalability and performance. As usual, Baron was careful to distinguish between performance (which is how fast it is) and scalability (which is how much faster you can make it by adding more resources). One of the things Baron talked about was Neil Gunther's Universal Scalability Law, which attempts to model the behavior of complex systems (such as database systems) as you add threads, or nodes, or users.
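For those who haven't run into it, the Universal Scalability Law models relative capacity C(N) at concurrency level N with two parameters, a contention term and a coherency-delay term (the symbols vary between presentations; σ and κ are one common choice):

C(N) = N / (1 + σ(N - 1) + κN(N - 1))

When κ is nonzero, throughput doesn't just flatten out as you add concurrency - it eventually goes back down, which matches what you often see in real benchmark curves.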
Tuesday, September 20, 2011
Postgres Open, and What Makes a Good Talk
My last few blog posts have been all about scalability and performance, mostly because that's what I've been spending most of my time on these last few months. But I'm pleased to have another subject: Postgres Open. I've been going to PostgreSQL conferences for a few years now, but Postgres Open - which is a new conference this year - was my first experience being on the program committee.
Thursday, August 18, 2011
Index-Only Scans: Now There's a Patch
In November of 2010, I blogged about a much-requested PostgreSQL feature: index-only scans. We've made some progress! In June of this year, I committed a patch (and then, after Heikki found some bugs, another patch) to make the visibility map crash-safe. In previous releases, it was possible for the visibility map to become incorrect after a system crash, which means that it could not be relied on for anything very critical. That should be fixed now. Last week, I posted a patch for the main feature: index-only scans.
Thursday, August 04, 2011
Linux and glibc Scalability
As some of you probably already know from following the traffic on pgsql-hackers, I've been continuing to beat away at the scalability issues around PostgreSQL. Interestingly, the last two problems I found turned out, somewhat unexpectedly, not to be internal bottlenecks in PostgreSQL. Instead, they were bottlenecks with other software with which PostgreSQL was interacting during the test runs.
Monday, July 25, 2011
More On Read Scaling
In my previous blog post on read scaling out to 32 cores, I wrote about a patch I recently committed to improve read scalability in PostgreSQL. It turns out, however, that there's a downside to that patch, which was uncovered in testing by Stefan Kaltenbrunner. (Fortunately, I have a fix.) And, there's an opportunity for further performance improvement by applying a similar technique to an additional type of lock. For full details, read on.
Thursday, July 21, 2011
Read Scaling Out to 32 Cores
With the exception of a week's vacation, my last month has been mostly absorbed by PostgreSQL performance work. Specifically, I've been looking at the workload generated by "pgbench -S", which essentially fires off lots and lots of SELECT queries that all do primary key lookups against a single table. Even more specifically, I've been looking at the way this workload performs on systems with many CPU cores where (I found) PostgreSQL was not able to use all of the available CPU time to answer queries. Although the single-core performance was around 4,300 tps, performance with 36 clients (on a 24-core server) was only about 36,000 tps.
Research revealed that performance was being limited mostly by PostgreSQL's lock manager. Each SELECT query needed to lock the table being queried - and its index - against a concurrent DROP operation. Since PostgreSQL 8.2, the lock manager has been partitioned: a lock request against a database object will be assigned to one of 16 "partitions", and lock requests against objects that fall into different partitions can proceed in parallel. Unfortunately, that doesn't help much in this case, because only two objects are being locked: the table, and its index. Most of the traffic therefore targets just two of the sixteen lock manager partitions. Furthermore, because the query itself is so trivial, the rate of lock and unlock requests is extremely high - on a more complex query, the bottleneck wouldn't be as severe.
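For reference, the query that "pgbench -S" executes is just the standard select-only script's primary key lookup (the :aid below is a pgbench variable, substituted with a random account number on each execution):

SELECT abalance FROM pgbench_accounts WHERE aid = :aid;

Each execution of that statement takes an AccessShareLock on pgbench_accounts and on its primary key index, and releases both at the end of the transaction, which is exactly the lock manager traffic described above.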
Monday, June 13, 2011
False Contention
For the last few weeks, I've been buried in PostgreSQL performance analysis, mostly focusing on the workload generated by "pgbench -S" under high concurrency. In other words, lots and lots of very simple SELECT statements on a single table. Such workloads can generate serious internal contention within PostgreSQL on systems with many CPU cores.
But it's interesting to note that most of the contention points I've so far identified are what might be called "false contention". The transactions generated by this workload need to perform operations such as:
- acquire an AccessShareLock on the target relation and its index to guard against a concurrent drop or schema change
- acquire an ExclusiveLock on their VXID in case another transaction wishes to wait for transaction end
- read the list of pending "invalidation" events, which are generally created by DDL
- read the list of in-progress XIDs, to generate a snapshot
- find the root index block for the table's primary key index in the shared buffer pool
- momentarily pin the block containing the root index page into the buffer pool, so that it can't be evicted while we're examining it
- momentarily lock the root index page, so that no new items can be added until we've decided which downlink to follow
Now, in fact, in this workload, there is no concurrent DDL against the relevant table and index, or any other one; no one is attempting to wait for the completion of any VXID; the list of in-progress XIDs can easily be simultaneously read by many backends; and the root index page is in no danger either of being modified or of being evicted. In short, the problem is not that there are resource conflicts, but that verifying that no resource conflicts exist is itself eating up too many resources.
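You can actually watch most of this locking happen. Here's a sketch: run one of these trivial SELECTs inside an open transaction, and then look at pg_locks from a second session (the pgbench table names are just an example):

-- Session 1:
BEGIN;
SELECT abalance FROM pgbench_accounts WHERE aid = 1;
-- Session 2:
SELECT locktype, relation::regclass, virtualxid, mode, granted
FROM pg_locks
WHERE pid <> pg_backend_pid();

You should see an AccessShareLock on pgbench_accounts and on its primary key index, plus an ExclusiveLock on session 1's own virtualxid - all granted instantly, and all, in this workload, pure overhead.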
Monday, June 06, 2011
Reducing Lock Contention
In a recent blog post on Performance Optimization, I mentioned that Noah Misch and I had discussed some methods of reducing the overhead of frequent relation locks. Every transaction that touches a given table or index locks it and, at commit time, unlocks it. This adds up to a lot of locking and unlocking, which ends up being very significant on machines with many CPU cores. I ended up spending a good chunk of last week hacking on this problem, with very promising results: I have a prototype patch that improves throughput on a SELECT-only pgbench test by about 3.5x on a system with 24 cores. Not bad for a couple days work.
Friday, May 27, 2011
Open Source Licensing
I can't resist the opportunity to comment on the FSF's guidelines - apparently just published - for how to choose a license for your work. A story on this was posted on Slashdot this morning. The FSF's guidelines are a little more nuanced than one might expect; for example, they recommend contributing code to existing projects under the licenses used by those projects, rather than starting a giant war with the maintainers of those projects. And if you're including trivial test programs in your work, or your work is shorter than the text of the GPL itself, you might just want to put the work in the public domain. Very sensible!
But that's about as far as they're willing to go. For example, the section on contributing to existing projects suggests that, if you're making major enhancements to a non-GPL program, you might want to fork it and release your changes under the GPL. In other words, in the FSF's opinion, sometimes you should start a giant war with the maintainers. In the open source world, forks are a part of life, and people are free to choose the licenses they want, but this seems awfully cavalier to me. Forks are really damaging. Maintaining a large and unified developer community has value; having those people spend their time developing, rather than arguing about licensing, is a good thing.
Thursday, May 26, 2011
Performance Optimization
There have been several lively discussions this week on possible performance optimizations for PostgreSQL 9.2. PostgreSQL 9.1 is still in beta, and there's still plenty of bug-fixing going on there, but the number of people who need to and can be involved in that process is not as large as it was a month or two ago. So it's a good time to start thinking about development for PostgreSQL 9.2. Not a lot of code is being written yet, which is probably just right for where we are in the development cycle, but we're kicking the tires of various ideas and trying to figure out what projects are worth spending time on. Here are a few that I'm excited about right now.
Monday, May 23, 2011
PostgreSQL 9.2 Development: CommitFest Managers Needed
There's an old joke that project managers don't know how to do anything; they only know how to do it better. Of course, the project manager is the butt of the joke: how can you know how to do something better, if you don't know how to do it in the first place?
Wednesday, May 11, 2011
PostgreSQL 9.1 In Review
As the PostgreSQL 9.1 development cycle winds down to a close (we're now in beta), I find myself reflecting on what we got done during this development cycle, and what we didn't get done. Just over a year ago, I published a blog post entitled Big Ideas, basically asking for feedback on what we should attempt to tackle in PostgreSQL 9.1. The feedback was overwhelming, and led to a follow-on blog post summarizing the most-requested features. Here's the short version: Of the 10 or 11 most requested features, we implemented (drum roll, please) one. Credit for granular collation support goes to Peter Eisentraut, with much additional hackery by Tom Lane.
You can find a second list of possible 9.1 features in the minutes of the PGCon 2010 Developer Meeting. Of the 30 features listed there, we got, by my count, 12.5, or more than 40%. While even that may seem like a somewhat small percentage, a dozen or so major enhancements is nothing to sneeze at. And certainly, if you were a feature longing to be implemented, you'd much rather be on the second list than the first.
Thursday, May 05, 2011
shared_buffers on 32-bit systems
Every once in a while, I read something that's so obvious after the fact that I wonder why I didn't think of it myself. A recent thread on pgsql-performance had that effect. The PostgreSQL documentation gives some rough guidelines for tuning shared_buffers, which recommends (on UNIX-like systems) 25% of system memory up to a maximum of 8GB. There are similar guidelines on the PostgreSQL wiki page, in the article on Tuning Your PostgreSQL Server.
What became obvious to me in reading the thread - and probably should have been obvious to me all along - is that this advice only makes sense if you're running a 64-bit build of PostgreSQL, because, on a 32-bit build, each process is limited to 4GB of address space, of which (at least on Linux) 1GB is reserved for the kernel. That means that no matter how much physical memory the machine has, each PostgreSQL backend will be able to address at most 3GB of data. That includes (most significantly) shared_buffers, but also the process executable text, stack, and heap, including backend-local memory allocations for sorting and hashing, various internal caches that each backend process uses, local buffers for accessing temporary relations, and so on. And 3GB is really a theoretical upper limit, if everything is packed optimally into the available address space: I'm not sure how close you can get to that limit in practice before things start breaking.
So, if you're running a 32-bit PostgreSQL on a UNIX-like operating system, you probably need to limit shared_buffers to at most 2GB or 2.5GB. If that is less than about 25% of your system memory, you should seriously consider an upgrade to a 64-bit PostgreSQL.
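A quick way to check which situation you're in from SQL (a sketch; the exact version string format varies by platform):

SELECT version();
-- Look at the build target in the output: something like x86_64-unknown-linux-gnu
-- indicates a 64-bit build, while i686-pc-linux-gnu indicates a 32-bit one.
SHOW shared_buffers;
-- On a 32-bit build, per the arithmetic above, keep this at roughly 2GB or below.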
Tuesday, April 26, 2011
A Change of Role
I'm pleased to report that I have changed roles within EnterpriseDB and am now officially a member of EnterpriseDB's support and professional services team. I have a number of other responsibilities as well, so I will not be spending all of my time on support, but I have been and will continue to spend a significant chunk of time on it each day. When I've talked about this with people, they sometimes look at me as if I have two heads. After all, I am a backend server developer, and developers as a breed are not known for enjoying support.
Tuesday, April 19, 2011
PostgreSQL East, and The MySQL Conference and Expo
Last month, I attended (and spoke at) PostgreSQL East in New York City, which this year featured a MongoDB track. This past week, I was in Santa Clara at the O'Reilly MySQL Conference & Expo, which had a substantial PostgreSQL track this year, where I also spoke.
Both conferences had some very good talks. The first talk I attended at the MySQL conference turned out to be one of the best - it was entitled Linux and H/W optimizations for MySQL. I had a little difficulty understanding Yoshinori Matsunobu's accent at times, but the slides were excellent, and very detailed. Some of his more interesting findings: (1) SSDs speed things up both on the master and on replication slaves, but the speedup is larger on the slaves, so it's useful to put hard disks on the master and SSDs on the slave to make it possible for single-threaded recovery there to keep up with the master; (2) while SSDs are much faster for random access, they are actually slower for sequential access and fsync, so a RAID array with a battery-backed or flash-backed write cache may still be a better option in those cases; (3) Fusion I/O drives were far faster than Intel drives; (4) the Intel Nehalem architecture was much faster than the AMD Opteron architecture when used in combination with SSDs; and (5) HyperThreading helps more in SSD environments than it does otherwise, because the system, overall, becomes more heavily CPU-bound; for the same reasons, mutex contention hurts more.
Another very good talk was Peter Zaitsev's discussion of Innodb and XtraDB Architecture and Performance Optimization, which gave me the feeling of looking into a sort of carnival mirror, where you recognize yourself, but it's all distorted. Two of the problems that give PostgreSQL DBAs heartburn - bloat, and checkpoint I/O spikes (and less frequently, purge not keeping up a la vacuum not keeping up) - are apparently problems for MySQL as well, though with significantly different details. I'm not even going to attempt to summarize the differences, or say which problem is worse or occurs more often, because I honestly have no idea. I was a bit surprised to hear dump-and-reload recommended to recover from certain worst-case scenarios, though.
There were other good talks, too, which helped me understand what's going on in the world of MySQL forks. Apparently, the Drizzle team is busy removing features that they consider half-baked and modularizing the code so that it is easier to understand and improve, while the MariaDB team is busy adding optimizer features, including support for hash joins and persistent statistics. From what I understand, the MySQL optimizer has typically worked by gathering statistics through on-the-fly index probes, which can be a problem in some situations. It's not so easy to categorize the work that Oracle is doing, but it seems to involve a fair amount of filing down of rough edges, and various improvements to replication, including, perhaps most significantly, parallel replication apply.
At PostgreSQL East, I think my favorite talk was Ken Rosensteel's talk, somewhat misleadingly titled Large Customers Want PostgreSQL, Too. This talk turned out to be about migrating a large Oracle mainframe application to use PostgreSQL, and the challenges faced during that migration. He, or his team, built an Oracle-to-PostgreSQL converter for stored procedures; it was interesting to see that they got bitten by our bizarre casting rules around the smallint data type. They also ended up doing some very interesting work optimizing the performance of ECPG for small FETCH statements; these are areas of the code that I think don't normally get a lot of attention, and it was great to hear about the optimization work that got done.
I was disappointed that Jon Hoffman's talk on Experiences with Postgres and MongoDB at foursquare.com got cancelled; I think that would have been an interesting talk. I did have an opportunity to attend Jake Luciani's talk Comparing the Apache Cassandra Architecture to PostgreSQL, which turned out to be more about Cassandra than PostgreSQL, but was nevertheless interesting. I would have been interested to hear a more technical talk, though, about how problems like distributed serialization anomalies and distributed checkpointing are handled.
Next month, I'll be speaking at PGCon 2011 on Using The PostgreSQL System Catalogs and How To Get Your PostgreSQL Patch Accepted. And after that, unlike Bruce, I'm going to stay home for a few months!
Thursday, April 07, 2011
One-Way Tuning
I've been noticing that many of the parameters in postgresql.conf only ever need to be adjusted in one direction from the default.
shared_buffers, temp_buffers, work_mem, maintenance_work_mem, checkpoint_segments, checkpoint_timeout, checkpoint_completion_target, join_collapse_limit, and from_collapse_limit only ever need to be adjusted upward from the default.
bgwriter_delay, seq_page_cost, and random_page_cost only ever need to be adjusted downward from the default.
Does this mean we should change some of the defaults?
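For reference, here are the shipped defaults in question (values typical of the 8.4/9.0 era; they can vary by version and by what initdb chose on your platform):

SHOW shared_buffers;               -- typically 32MB
SHOW work_mem;                     -- 1MB
SHOW maintenance_work_mem;         -- 16MB
SHOW checkpoint_segments;          -- 3
SHOW checkpoint_completion_target; -- 0.5
SHOW random_page_cost;             -- 4
SHOW bgwriter_delay;               -- 200ms

Most production tuning exercises end with the first five raised and the last two lowered, which is what prompted the question above.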
Thursday, March 31, 2011
CommitFests and Meritocracy
Yesterday, I wrote a blog post on whether and to what extent the PostgreSQL community is a welcoming community, in which I quoted some remarks that Selena Deckelmann made, and she responded with her own post on where meritocracy fails. I want to just recap a few things that may not be totally obvious to casual observers of our community; and then I have a few remarks about meritocracy.
Wednesday, March 30, 2011
A Welcoming Community?
I attended PostgreSQL East last week and, as usual, the most interesting discussions were the ones that nobody planned. Ed Boyajian, the CEO here at EnterpriseDB where I work, threw out a remark to the effect that the PostgreSQL community is a meritocracy. Selena Deckelmann wasn't convinced, "not to say that we don't try hard, and do pretty well." During the closing session, Josh Drake made some remarks to the effect that the process for submitting a patch to PostgreSQL is long, difficult, and fraught with arbitrary and capricious rejection, which caused me to rise to my feet and object that while it's not perfect, I think we do pretty well, and better than we used to. Josh agreed with some of what I had to say, and I see his point, too. But it got me thinking about the whole way that we operate as a development community, and what is good and bad about it.
Monday, March 21, 2011
Working Toward PostgreSQL 9.1beta
I'm pleased to report that we seem to be making good progress toward PostgreSQL 9.1beta. As we did for PostgreSQL 9.0, we are maintaining a list of open items that we need either to address or to decide that they aren't so important after all (there are also some issues that are being discussed on the mailing list that aren't reflected in that open items list). What we're really doing here is trying to stabilize the code, particularly with respect to the features committed at the very end of the development cycle. So far, the main problem areas appear to be (1) collation support, (2) changes to streaming replication, especially the new synchronous replication feature, and (3) the new serializable isolation level. But we are making rapid progress in sorting out the loose ends, and I feel a lot better about the release now than I did even a week ago.
Barring objections or unforeseen problems, I'm hoping to bundle an alpha5 release a week from today, on Monday, March 28. We haven't discussed the time line for beta yet, other than to hope that we'll be able to get there in April. I still think that's a realistic timeline, although there is quite a bit of work that must be done between now and then for us to hit it. In the meantime, if you're in a position to build from source, please keep testing and sending in bug reports. Thanks!
Wednesday, March 16, 2011
Query Planner Enhancements in PostgreSQL 9.1
PostgreSQL has an awesome query planner. And it keeps getting better. postgresql.org has a survey asking people Which PostgreSQL 9.1 Feature Are You Most Excited About? (to vote on it, if it's still open when you read this, go here). For some reason, query planner features tend not to be listed when people talk about what they're most excited about in a new release, but there's ample room for excitement. If a particular query planner feature happens to apply to your use case, it's possible to see multiple-order-of-magnitude speedups, avoiding the need for a major application redesign or even an expensive hardware purchase. This is a big deal.
Thursday, March 10, 2011
Advice for Google Summer of Code Students (and other prospective contributors)
Periodically, someone whose name I've never seen before contacts me -- or posts to pgsql-hackers -- to say that either (1) they've already written a great patch for PostgreSQL and they'd like to know how to get it committed or (2) they're interested in writing a patch to implement some major new feature in PostgreSQL. Some of these people are prospective Google Summer of Code students, while others are researchers or other people who, for whatever reason, are interested in PostgreSQL. I am always thrilled to see more people take an interest in PostgreSQL, but unfortunately I've seen a number of people who seemed very smart and promising crash and burn when it comes to actually making a successful contribution of code. In fact, no matter how promising things seem at the outset, the failure rate - in my experience - is very close to 100%.
Thursday, March 03, 2011
More Musings on Logical Replication
There were some interesting comments on my previous blog post on logical replication, which have inspired me to write a bit more about this topic. Exactly how should this actually work? Every time a tuple is inserted, updated, or deleted, we already write to the transaction log the entire contents of any new tuple, and the physical position of any old tuple. It would be nice to piggyback somehow on the work that's already being done there, but it's hard to see exactly how. I see three main problems.
Tuesday, March 01, 2011
Troubleshooting Stuck VACUUMs
Over the years, I've occasionally encountered situations where VACUUM fails to make progress, and not fully understood why that was happening. Recently, I've come to a better understanding of how lock conflicts can result in VACUUM stalling out either at the beginning of the table, or part-way through. (If you're not already familiar with the types of locks that PostgreSQL uses, you may find it helpful to read through my earlier blog post on locking in PostgreSQL before reading on.)
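As a companion to that post, here's a sketch of the query I usually start with when a VACUUM (or anything else) seems to be stuck behind a lock; any row with granted = false is a process that is waiting:

SELECT pid, locktype, relation::regclass AS relation, mode, granted
FROM pg_locks
WHERE NOT granted;

From there you can join back to pg_stat_activity (or the rest of pg_locks) to figure out who is holding the conflicting lock and what it's doing.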
Saturday, February 26, 2011
The Case For Logical Replication
Replication, as it exists in PostgreSQL today, is physical replication. That is, the disk files as they exist on the standby are essentially byte-for-byte identical to the ones on the master (with the exception that the same hint bits may not be set). When a change is made on the master, the write-ahead log is streamed to the standby, which makes the same change to the corresponding disk block on the standby. The alternative is logical replication, in which what is transferred is not an instruction to write a certain sequence of bytes at a certain location, but the information that a certain tuple was inserted, or that a table with a given schema was created.
Thursday, February 17, 2011
Working in the Community
The PostgreSQL community makes decisions by consensus. The craziness of this approach should be readily apparent. The United States Senate has come under withering criticism for rules which permit one or a small number of Senators to hold up the business of the chamber to such an extent that it's difficult to get anything done. Forty-one senators can filibuster just about anything, to the frustration of the fifty-nine who want to have a vote and get on with it; and sometimes just one Senator can make himself or herself a near-complete obstacle to progress.
Keeping Local Git Branches Up To Date
Because I spend most of my time working on the master branch of the PostgreSQL git repository, I prefer to work with just a single clone. The PostgreSQL wiki page Committing with Git describes several ways of using multiple clones and/or git-new-workdir, but my personal preference is to just use one clone. Most of the time, I keep the main branch checked out, but every once in a while I check out one of the back-branches to look at something, or to back-patch. (If you're unfamiliar with the PostgreSQL workflow, note that we do not merge into our official branches; we always rebase, so that there are no merge commits in our official repository. You may or may not like this workflow, but it works for us.)
One small annoyance is that "git pull" doesn't leave my clone in the state I want. Say I have the master branch checked out. "git pull" will update all of my remote tracking branches, but it will only update the local branch that I currently have checked out. This is annoying, first of all because if I later type "git log REL9_0_STABLE" I'll only get the commits since the last time I checked out and pulled that branch, rather than as I intended the latest state of the upstream, and secondly because it leads to spurious griping when I later do "git push": it complains that the old branches can't be pushed because it wouldn't be a fast-forward merge. This is of course a little silly: since my branch tip is an ancestor of the tracking branch, it would be more reasonable to conclude that I haven't updated it than to imagine I meant to clobber the origin.
Tuesday, February 01, 2011
MySQL vs. PostgreSQL, Part 2: VACUUM vs. Purge
Almost two months ago, I wrote part one of what I indicated would be an occasional series of blog posts comparing the architecture of PostgreSQL to that of MySQL. Here's part two. Please note that the caveats set forth in part one apply to this and all future installments as well, so if you haven't read part one already, please click on the link above and read at least the first two paragraphs before reading this post.
Monday, January 24, 2011
PostgreSQL 9.1: Neat Stuff is Coming
Up until the last month or so, the development arc for PostgreSQL 9.1 was looking fairly humdrum. Quite a bit of good refactoring and some good general enhancements, but not a lot of sizzle. That seems to be changing now. Of course, in one sense, that's a bad thing: we're only three weeks from the end of the development cycle, and ideally it would have been nicer to have some of these big patches land just a little bit sooner. Still, I feel pretty good about the quality of what's been committed.
Thursday, January 20, 2011
Locking in PostgreSQL
Have you ever had one of those annoying problems where a query, or some kind of maintenance task such as vacuum, seems to hang, without making any discernible forward progress, basically forever? If it's a foreground task, you typically think to yourself "oh, that's taking longer than normal" - but then you start to become suspicious as to whether anything's happening at all. You fire up top, or strace, or gdb, or iostat, or some combination of those tools, and eventually decide that, indeed, nothing is happening.
Tuesday, January 18, 2011
What Kind of Replication Do You Need?
As you probably know by now if you're a regular reader of Planet PostgreSQL, or if you've read the PostgreSQL 9.0 release notes, PostgreSQL 9.0 offers a much-improved form of built-in replication. In PostgreSQL 8.4 and prior releases, replication was possible via WAL shipping, but the delay between master and standby could be several minutes, or even longer, depending on configuration parameters. This delay has been essentially eliminated in PostgreSQL 9.0, which allows the write-ahead log to be streamed from master to standby as it's generated.
But it's still asynchronous replication - as opposed to synchronous replication, which has been proposed for inclusion in PostgreSQL 9.1. If the master crashes just an instant after processing a COMMIT statement, the client will believe that the transaction has committed, but the slave won't know about it. If the slave is then promoted to become the new master, the transaction is gone forever.
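For anyone who hasn't set it up yet, the 9.0-style streaming replication configuration is pretty small. This is only a sketch (parameter names are as of 9.0, the host and user names are made up, and you also need a pg_hba.conf entry permitting replication connections):

On the master, in postgresql.conf:
  wal_level = hot_standby
  max_wal_senders = 3
  wal_keep_segments = 32

On the standby, in recovery.conf:
  standby_mode = 'on'
  primary_conninfo = 'host=master.example.com user=replication'

None of this changes the asynchronous nature of the thing, which is the point above: the standby can always be a little behind the master.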
Friday, January 14, 2011
Why SQL/MED is Cool
One of the big patches that is in the works for PostgreSQL 9.1 -- and will hopefully, though not certainly, make the cut -- is a series of patches that implement basic SQL/MED functionality for PostgreSQL. What is SQL/MED and why should you care?
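To make that concrete, here's roughly what basic SQL/MED usage looks like with the file_fdw module that is part of this work (a sketch; the file path is just an example, and the syntax shown is the form that ended up in 9.1):

CREATE EXTENSION file_fdw;
CREATE SERVER files FOREIGN DATA WRAPPER file_fdw;
CREATE FOREIGN TABLE words (word text)
  SERVER files
  OPTIONS (filename '/usr/share/dict/words', format 'text');
SELECT count(*) FROM words;

The interesting part, of course, is not flat files but the prospect of pointing the same machinery at other databases, which is what the rest of the patch series is building toward.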
Tuesday, January 04, 2011
PostgreSQL 9.1: Big Patches
About three weeks ago, I wrote a blog post about the forthcoming end of the PostgreSQL 9.1 development cycle, and the many large and important features for which we have patches outstanding. Since we now have just 11 days to go before the beginning of the fourth and final CommitFest, this seems like a good time to revisit the progress we've made. Here again is the list of features from my previous post: