Wednesday, June 14, 2023

The PostgreSQL Documentation and the Limitations of Community

In my opinion, the PostgreSQL documentation is simultaneously excellent and fairly poor, and both its excellence and its shortcomings are direct results of the process by which the documentation is produced. The PostgreSQL documentation is stored in the same git repository as the source code, and anyone who patches the source code so as to change documented behavior must also patch the documentation to match.

This means that nearly all documentation updates are made by the developer who is most familiar with what is changing in the code, or sometimes by another developer who has studied those changes closely. Therefore, the documentation is usually extremely accurate. Sure, there are oversights, but it would be incredible to discover that some PostgreSQL command has a documented option which doesn't actually exist, or that a parameter which is documented to take a string argument actually takes an integer or a Boolean. Typically, the descriptions of what SQL statements do and how that behavior is changed by parameter settings or options passed to the command itself are crisp and precise.

We're particularly good at tables and lists. Every SQL command that exists is listed in the documentation index, and anybody who adds a new SQL command will add it to the list, and write a documentation page for it that looks just like all of the other ones. Every configuration parameter that exists is listed in the documentation with a description of what it does using semi-standard phraseology. When someone creates a new one, they're bound to add it to the list and describe it using the same phraseology that's already used for the existing parameters. Corrections to this kind of documentation are often on the level of re-alphabetizing things that have fallen out of alphabetical order, or fixing some bit of markup that wasn't done in the same way as all of the other entries. Grown men slink away in shame when it is pointed out to them that the parameter type is listed as bool when it actually should be enum.

The strengths of this process are also its weaknesses. A developer is, by definition, someone who spends the majority of their time doing development, which is to say writing code. Updating the documentation becomes a task that must be completed so that the code one has written can get committed so that one can move on to the next project and write some more code. If a change to the documentation would be beneficial but is unrelated to any particular patch, it's not likely to get done.

The results are, in a certain sense, pretty comical. Pop over to the documentation index and find the page that describes the work_mem parameter and the page that describes the CREATE TABLE command. If you're anything like me, this is actually quite easy for you, because you know that the work_mem documentation is going to be buried under "Server Configuration" and CREATE TABLE is going to be under "SQL Commands" and, knowing that, you'll have no trouble. But notice that those incredibly important chapters of the documentation get just the same amount of space in the top-level documentation index as "Background Worker Processes" and "Backup Manifest Format," just to pick a couple examples of chapters that I have personally added in faithful observance to the community process. Those equally-prominent chapters have got to be of interest to a comparatively tiny number of users.

But, you know, I was a loyal servant of the community process. I was asked to document that stuff, and I did, and I put it in the documentation in the place where it most logically seemed to go. The fact that the overall structure of the documentation probably isn't for the best is not my fault, nor is correcting it my responsibility. And it's not anyone else's responsibility, either.

Nonetheless, people do try, from time to time. Most efforts fall into one of two categories. Sometimes, someone realizes that a certain section of the documentation has become woefully out of date, often because it emphasizes whatever piece of technology existed first and needs to be rewritten to emphasize some newer innovation that is better (examples from table partitioning, pg_basebackup). Other times, people suggest structural adjustments (example). Such changes sometimes go through smoothly, but many of them prove to be controversial and take much more work to get committed than a documentation entry for a new parameter or a new SQL command.

It's not difficult to understand why this happens. If I add a new feature to do a certain thing to PostgreSQL, I am the expert on that feature. There's nobody else who knows better than I do what the documentation for that feature ought to say. My work might have shortcomings just like anyone else's, but especially if I'm just adding new entries to tables that already contain dozens or hundreds of existing entries, how much difference of opinion can there reasonably be? It's more likely that reading the documentation will cause someone to take issue with the design of the feature itself than it is that they won't like the way it's documented.

But if on the other hand I propose some change to documentation that has existed for a long time, or some kind of structural change, there's a lot more room for disagreement. Because the change isn't strictly mechanical, the right answer is a lot more subjective. And because it's a change to existing content rather than the addition of new content, many more people will be familiar with it and have opinions on how it ought to be changed, if at all. Consequently, even when some developer does take time away from writing code to try to make some larger change to the documentation, it's often an uphill battle to get anything done, and people typically have to be content with small improvements.

I hope I'm convincing you that this mixture of extreme rigor when it comes to mechanical updates and laxity when it comes to broader changes is inherent in the way that the community works. It is neither good nor bad; it just is. If we adopted some other process, it would have its own set of advantages and disadvantages, and it's anybody's guess whether we'd come out ahead. Personally, my guess is that most other things we could do would come out overall worse than what we are currently doing, so I have no particular process change to propose. However, I do think it's worth thinking about why we get the results that we do, both good and bad.

2 comments:

  1. The process is definitely discouraging for operators/advocates, which are probably the people best positioned to help make documentation improvements on subsequent releases, as working with users and out in the field you don't build up the political clout needed to affect change even though you have the knowledge / skills for presenting the information more clearly. I still think the Postgres documentation is better than most though, so the trade-offs seem worth it, but it is worth being open about what those trade-offs are.

    ReplyDelete
  2. I am weirdly happy to read that. I feel less alone advocating for a change in the pg documentation.
    I particularly appreciated the part about the patching process and you underlining the fact that it's in fact easier to document a new unknown feature, no pushback than an existing one...

    I just gave (3 times) a talk about the PostgreSQL documentation and how we could improve its quality. My point of view is from my documentation expertise. My work includes writing, reading and critiquing documentation.

    Basically in my talk, I support the idea that the PG documentation is written by expert, for expert. The missing part is the first step.

    Slides are here: https://pgday.fr/docs/2023/04_Lets-make-it-better-now-together.pdf

    And now that I did the talk...I am morally obliged to contribute...

    ReplyDelete