Todo:ICU
WARNING: This page has been migrated to the PostgreSQL Wiki. Please do not edit this page or your changes may be lost!
Collation using ICU
Currently, PostgreSQL relies on the underlying operating system to provide collation support. This means that collation behaves slightly differently on each platform. Some, like the BSDs, do not support UTF-8 collation. Others, like glibc, have fairly complete collation support.
ICU is described at http://icu.sourceforge.net/
It's available under what appears to be a BSD-compatible license [[1]]
References
Status
A patch for ICU support has existed for several years in FreeBSD ports [[2]]. It only covers a small part of what is necessary.
Overview of changes required
Basically, the collation support provided by ICU needs to be used either instead of the current system support or in addition to it. Given that client programs will probably be using the system collation, ICU collation should at the very least be optional.
ICU supports a very flexible scheme for collation. In addition to providing standard collations for many languages, it also allows users to customise collations and create their own. This raises the question of how to represent the collation to the server.
The native format of ICU is UTF-16 (not to be confused with UCS-2, which doesn't support all of Unicode). PostgreSQL doesn't support this encoding at all (for various reasons), and we do want to avoid the overhead of converting every string to UTF-16 before comparison.
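To make the UTF-16 versus UCS-2 distinction concrete, here is a minimal sketch of encoding a single code point as UTF-16. Code points above U+FFFF need two 16-bit units (a surrogate pair), which is exactly what UCS-2 cannot express. The function name utf16_encode is hypothetical and is not part of ICU or PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-16.
 * Writes 1 or 2 code units to out and returns how many were written.
 * Hypothetical helper for illustration only.
 */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Only valid scalar values: at most U+10FFFF and not a surrogate. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {
        /* Basic Multilingual Plane: one unit, identical to UCS-2. */
        out[0] = (uint16_t) cp;
        return 1;
    }

    /* Supplementary planes: split the 20-bit offset into a surrogate pair. */
    cp -= 0x10000;
    out[0] = (uint16_t) (0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t) (0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For example, U+00E9 fits in one unit, while U+1F600 becomes the pair 0xD83D 0xDE00. Doing this for every character of every string before each comparison is the conversion overhead the paragraph above wants to avoid.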
Fortunately, ICU provides an alternative: iterators. An iterator is set up over a string and returns one character at a time. For non-Unicode encodings, the iterator will have to convert each character to its Unicode equivalent. PostgreSQL already contains all the necessary tables to do this.
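The iterator idea can be sketched roughly as follows for a single-byte encoding. This is a simplified illustration, not ICU's interface: ICU's own UCharIterator (the type consumed by ucol_strcollIter) has more callbacks than shown here. The type CodepointIter and the functions cp_iter_init, cp_iter_next and latin1_to_unicode are hypothetical names. LATIN1 is used because each of its byte values equals the corresponding Unicode code point; other encodings would plug in a table lookup instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CP_DONE UINT32_MAX   /* sentinel returned when the string is exhausted */

/* Hypothetical iterator over a string in a single-byte encoding. */
typedef struct {
    const unsigned char *pos;               /* next byte to read */
    const unsigned char *end;               /* one past the last byte */
    uint32_t (*to_unicode)(unsigned char);  /* per-encoding mapping */
} CodepointIter;

/* ISO 8859-1 (LATIN1): every byte value is its own Unicode code point. */
uint32_t latin1_to_unicode(unsigned char b)
{
    return b;
}

void cp_iter_init(CodepointIter *it, const char *s, size_t len,
                  uint32_t (*map)(unsigned char))
{
    assert(it != NULL && s != NULL && map != NULL);
    it->pos = (const unsigned char *) s;
    it->end = it->pos + len;
    it->to_unicode = map;
}

/* Return the next character as a Unicode code point, or CP_DONE at the end. */
uint32_t cp_iter_next(CodepointIter *it)
{
    if (it->pos == it->end)
        return CP_DONE;
    return it->to_unicode(*it->pos++);
}
```

The point of the design is that no converted copy of the string is ever materialised: the collator pulls one code point at a time, and the comparison can stop at the first difference.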
For large-scale sorting, ICU recommends converting the strings to sort keys, which can then be compared with just memcmp(). This is the same mechanism as strxfrm() in POSIX. However, PostgreSQL currently has no way to store these sort keys; see the related TODO item Rethinking datatypes.
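As an illustration of the sort-key approach, here is a minimal sketch using the POSIX strxfrm() rather than ICU, so that it needs only the C library. The helper names make_sort_key and compare_keys are hypothetical. One difference to keep in mind: keys produced by strxfrm() are NUL-terminated strings and are compared with strcmp().

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Build a sort key for s under the current LC_COLLATE locale.
 * Returns a malloc'ed, NUL-terminated key, or NULL if allocation fails.
 * Hypothetical helper for illustration only.
 */
char *make_sort_key(const char *s)
{
    assert(s != NULL);

    /* With a zero-length buffer, strxfrm() only reports the key length. */
    size_t need = strxfrm(NULL, s, 0) + 1;
    char *key = malloc(need);

    if (key != NULL)
        strxfrm(key, s, need);
    return key;
}

/* qsort() comparator over an array of precomputed keys (char *). */
int compare_keys(const void *a, const void *b)
{
    return strcmp(*(char *const *) a, *(char *const *) b);
}
```

The key is computed once per string, and every later comparison is a plain byte-wise compare whose result has the same sign as strcoll() on the original strings. The cost is the extra memory for the keys, which is why having somewhere to store them matters.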
More to come...

