A user specifies the desired collation - which determines the sort ordering that is used - using the COLLATE keyword followed by a collation name. Except on Windows, collations have names like en_US, meaning English as spoken in the United States, or fr_CA, meaning French as spoken in Canada. Unfortunately, the same collation may behave differently on different operation systems, or even releases of the same operating system, or even perhaps from one run of the program to the next. glibc doesn't provide any information to PostgreSQL when the sort rules change. You just may suddenly realize, after a libc update, that your multi-terabyte index urgently needs a REINDEX, and that in the meantime queries are returning wrong answers. That's pretty unfortunate.
On Windows or FreeBSD, or when using collations provided by the ICU library, the situation is in theory slightly better, because they do expose some concept of a collation version. Unfortunately, the system Thomas Munro designed to detect changes in the collation version had enough problems that it had to be reverted. Perhaps that can be improved in the future, but it won't help Linux users unless they use ICU, and it will only tell you after the fact that your indexes are now broken. What we'd really like is to stop them from breaking in the first place, but it seems that nobody really has a great idea about how to do that. One approach would be to have PostgreSQL itself provide collation definitions, and just be very careful about versioning, but that seems like a lot of work to create and maintain.
Jeremy Schneider created a table showing how the sort ordering produced by en_US has changed across Ubuntu versions. (Twitter discussion here.) I think it's first of all interesting that this list of differences is being derived from an experimental approach, rather than for example by reading the release notes. Most people are not going to be able or willing to run lengthy experiments to detect behavior changes in low-level operating system code. Even the experimental approach seems to require a lot more knowledge of how Unicode sorting works than I possess myself. Who is to say that the 14 patterns used by this tool cover all of the interesting cases? Certainly not I!
It's also interesting that the sort order shown is fairly inscrutable. Even though this is the en_US collation and I am a native speaker of English from the United States, I'm completely unable to understand why those strings sort as they do. Therefore, I don't accept that this collation reflects the sensibilities of actual English speakers. It's just an arbitrary rule. Such rules are needed so that arbitrary strings can be sorted, but I strongly suspect that the average PostgreSQL user would be much happier if the operating system kept using the same arbitrarily rule rather than continually changing it.
Since the glibc maintainers don't seem too interested in providing more stable collations or even information about when collation definitions change, I suspect that, in the long term, we're going to need to find an answer other than relying on glibc. It does not really matter to me whether 🍷B sorts before or after 🍷🍷B or 🍷🍷BB, but I sure want it to work the same way on Tuesday as it did on Monday.