Monday, November 29, 2021

Collation Stability

When PostgreSQL needs to sort strings, it relies on either the operating system (by default) or the ICU collation library (if your PostgreSQL has been built with support for ICU and you have chosen to use an ICU-based collation) to tell it in what order the strings ought to be sorted. Unfortunately, operating system behaviors are confusing and inconsistent, and they change relatively frequently for reasons that most people can't understand. That's a problem for PostgreSQL users, especially PostgreSQL users who create indexes on text columns. The first step in building a btree index to sort the data, and if this sort order differs from the one used for later index lookups, data that is actually present in the index may not be found, and your queries may return wrong answers.

A user specifies the desired collation - which determines the sort ordering that is used - using the COLLATE keyword followed by a collation name. Except on Windows, collations have names like en_US, meaning English as spoken in the United States, or fr_CA, meaning French as spoken in Canada. Unfortunately, the same collation may behave differently on different operation systems, or even releases of the same operating system, or even perhaps from one run of the program to the next. glibc doesn't provide any information to PostgreSQL when the sort rules change. You just may suddenly realize, after a libc update, that your multi-terabyte index urgently needs a REINDEX, and that in the meantime queries are returning wrong answers. That's pretty unfortunate.

On Windows or FreeBSD, or when using collations provided by the ICU library, the situation is in theory slightly better, because they do expose some concept of a collation version. Unfortunately, the system Thomas Munro designed to detect changes in the collation version had enough problems that it had to be reverted. Perhaps that can be improved in the future, but it won't help Linux users unless they use ICU, and it will only tell you after the fact that your indexes are now broken. What we'd really like is to stop them from breaking in the first place, but it seems that nobody really has a great idea about how to do that. One approach would be to have PostgreSQL itself provide collation definitions, and just be very careful about versioning, but that seems like a lot of work to create and maintain. 

Jeremy Schneider created a table showing how the sort ordering produced by en_US has changed across Ubuntu versions. (Twitter discussion here.) I think it's first of all interesting that this list of differences is being derived from an experimental approach, rather than for example by reading the release notes. Most people are not going to be able or willing to run lengthy experiments to detect behavior changes in low-level operating system code. Even the experimental approach seems to require a lot more knowledge of how Unicode sorting works than I possess myself. Who is to say that the 14 patterns used by this tool cover all of the interesting cases? Certainly not I!

It's also interesting that the sort order shown is fairly inscrutable. Even though this is the en_US collation and I am a native speaker of English from the United States, I'm completely unable to understand why those strings sort as they do. Therefore, I don't accept that this collation reflects the sensibilities of actual English speakers. It's just an arbitrary rule. Such rules are needed so that arbitrary strings can be sorted, but I strongly suspect that the average PostgreSQL user would be much happier if the operating system kept using the same arbitrarily rule rather than continually changing it.

Since the glibc maintainers don't seem too interested in providing more stable collations or even information about when collation definitions change, I suspect that, in the long term, we're going to need to find an answer other than relying on glibc. It does not really matter to me whether ðŸ·B sorts before or after ðŸ·ðŸ·B or ðŸ·ðŸ·BB, but I sure want it to work the same way on Tuesday as it did on Monday.

4 comments:

  1. > I think it's first of all interesting that this list of differences is being derived from an experimental approach, rather than for example by reading the release notes.

    I do intend to put together a blog post, but you beat me to it! For the record, I (and some colleagues) started with release notes and git log messages. This was a handy place to review glibc changes online:

    https://abi-laboratory.pro/index.php?view=timeline&l=glibc

    Ultimately I felt that I needed to compliment it with the experimental approach. Two bigger reasons that come to mind:

    * First, because I wasn't confident that I was correctly identifying all impacted locales and characters from the change logs. Git blame will be more reliable, but I'll need to identify every line of code for that approach to work comprehensively.

    * Second, at the end of the day, nobody is compiling glibc themselves - we use operating systems. Every linux distribution layers their own patches on top of glibc. I'll have to review all of those patches in addition to the glibc changelogs.

    In the end, I think both approaches are important to understand things. One could say that the experimental approach is "top-down" rather than the source-code-investigation "bottom-up" approach. With very large and complicated code bases like glibc and postgresql, I never fully trust even my own reading of the code, and I usually try to verify my understanding with some kind of testing.

    ---

    By the way - MySQL (and Maria) do maintain their own collations and properly version them. As far as I know, all the major commercial databases do as well. PostgreSQL is the only major relational DB I'm aware of which uses OS collation quite like this.

    https://github.com/mysql/mysql-server/tree/8.0/share/charsets

    ReplyDelete
    Replies
    1. Interesting. Looking forward to reading your blog post when you write it; hope I didn't steal too much of your thunder.

      Delete
  2. I agree that nobody really cares too much about having the latest collation changes - especially not if they're specifically using PostgreSQL. Unfortunately for PostgreSQL users, that goes both ways: nobody else cares if collations *are* upgraded -- as Robert points out, these are obscure behavioral changes, for the most part. The view that collations can change more or less without notice is the view of glibc, as far as I can tell. I'm basing this on remarks from Carlos O'Donnell on a bug report some years back.

    I don't see any reason to believe that glibc and others will revise their view of this issue any time soon. And so it seems to me that we should just accept that, and focus on a more local solution: collations that are (at least in a certain sense) under the control of PostgreSQL.

    There will be many concerns about this kind of thing, which are no doubt legitimate. We certainly shouldn't vendor ICU in PostgreSQL itself. Surely there's some middle ground between vendoring ICU, and the current status quo.

    Here's what I have in mind: maybe we could facilitate work that happens within a given packaging ecosystem (e.g. official PGDG .deb packages). It'd be quite possible for individual packagers to "vendor" their own ICU versions. There is no reason why they couldn't support multiple versions of ICU at any given time, with their own deprecation policy. The ICU library versions all symbol names, so the requirement on the PostgreSQL side would be linking to multiple ICU versions at the same time.

    ReplyDelete
  3. And it can cause index corruption, in context of replication, if primary and standby have different version of glibc and you have to activate your standby (real case happened...)

    ReplyDelete