If you use psql's \l command to list all the databases in your PostgreSQL cluster, you'll notice a column in the output labelled "Collate"; on my system, that column has the value "en_US.UTF-8". This means that when, for example, you sort strings, you'll use the "English" sort order rather than the traditional "C" sort order. You might ask: what's the difference?
In the "C" sort order, all capital letters come before all lower-case letters, whereas in en_US.UTF8, a comes before A which comes before b which comes before B, and so on. In other words, every collation can have its own rules for sorting strings, consistent with the way that the people who speak that language like to alphabetize things; or at least with the way that the people who wrote the locale definitions for that language think that they like to alphabetize things. Collations are OS-dependent: some operating systems don't support them at all, while Windows has a completely different naming convention from every other operating system, and probably different behaviors as well.
Now, being able sort strings according to the rules of some particular language is definitely cool, and useful, and PostgreSQL 9.1 has some improvements in this area that allow you to change the collation for individual columns or even on the fly for one particular query, using the COLLATE keyword. What's less cool is that collation-aware comparisons are considerably slower than the traditional "C" sort order. I recently did some testing of sorting 100,000 randomly-generated strings using PostgreSQL 9.2devel and found that it took about 80% longer using the en_US than it did using "C"; and I've heard of cases where the penalty was substantially more than that (this particular test was run on a glibc-derived system, but things don't seem to be much better on BSD-derived systems, even though strcoll() is implemented quite differently there). These speed penalties not only affect the cost of sorting result-sets, but also the costs of building and accessing B-tree indexes, which can be, and by default are, collation-aware.
If you happen to need the particular sorting behavior that collation-aware sorting and comparisons provide, then you may find this price worth paying, but I suspect there are a lot of people out there who are paying it more or less accidentally and don't really care very much about the underlying sorting behavior. If, for example, all of your queries are based on equality, and you don't use greater-than or less-than tests, then it doesn't matter what collation is in use. You might as well use "C" instead of whatever your local default may be, because it's faster. On PostgreSQL 9.1+, you can do this by including COLLATE "C" in your CREATE INDEX statement, and on any recent version you can also do it by writing text_pattern_ops after the column name. Such indexes have another benefit, as well: they can be used to optimize LIKE searches against constant patterns that do not begin with a wildcard.
It would be possible to reduce the speed penalty associated with collation-aware comparisons given better operating system support. We use the C function strcoll() to compare strings, but this function requires the strings being compared to be NUL-terminated. PostgreSQL stores strings internally with no trailing NUL-byte, so we're forced to copy the input strings and add a NUL to each one before calling strcoll(). It would be much nicer for PostgreSQL if the operating system exposed a function that allowed a length for each string to be specified in bytes, and guaranteed that only that number of bytes would be read. It appears that, currently, no operating system provides such a function. Also, PostgreSQL needs any two strings that are not byte-for-byte identical to compare as unequal, regardless of the collation in use. Since strcoll() does not guarantee this on all collations on all platforms, we compare the strings using strcmp() whenever strcoll() says they are equal. Fortunately, this branch isn't taken often, but it would be better if we didn't need it at all. An OS-level API allowing us to ask "can this collation ever return equality for two not-byte-for-byte identical strings?" might be useful; or just some kind of iron-clad guarantee that such a thing will never happen. What would help even more is a faster version of strcoll(); the testing I've done seems to indicate that copying the input strings is annoying, but the biggest problem is that strcoll() itself is nowhere near as fast as strcmp(). That may be hard to fix, though - whereas providing a better API should be a simple matter of programming.
There's been some discussion of using the ICU library as an alternative to or in addition to the operating system's collation support. Some (perhaps somewhat out-of-date?) notes on that can be found at http://wiki.postgresql.org/wiki/Todo:ICU. While it appears to address at least the first of the problems mentioned in the previous paragraph, it seems that it doesn't have full support for UTF-8, which PostgreSQL uses heavily. We'd basically be replacing one kind of impedance mismatch with another, which might or might not work out to an improvement, but certainly isn't ideal.