PATCH: CITEXT 2.0

From: David E(dot) Wheeler <david(at)kineticode(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: PATCH: CITEXT 2.0
Date: 2008-06-28 01:22:25
Message-ID: 4013F1AE-FE1B-427B-8C23-1A5681DA297E@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Howdy,

[N.B.: I tried to send this a while ago but it didn't get delivered,
I'm assuming because, with the uncompressed patch, the email was too
big for -hackers. So this is a re-send with the patch gzip'd. Sorry
for any duplication].

Please find attached a patch adding a locale-aware, case-insensitive
text type, called citext, as a contrib module. A few notes:

* I had originally called it lctext, as it's not a true case-
insensitive type, but just converts strings to lowercase before
comparing them. I changed it to citext at Tom Lane's suggestion, to
ease compatibility for users of the original citext module on pgFoundry.

* Differences from the original citext are:
+ Locale-aware lowercasing of strings, rather than just lowercasing
ASCII characters.
+ No implicit casts from text to citext except via assignment.
+ A few more functions overloaded
+ Works with 8.3 and CVS head

* Many thanks to whoever added str_tolower() to formatting.c. If I had
known about that, I could have saved myself a lot of grief! My
original implementation for 8.3.1 had copied a lot of code from
oracle_compat.c to get things working. With this patch, I've
eliminated a whole lot of code, as I can now just call str_tolower().
So thank you for that! I'll probably keep my original in my personal
Subversion repository, but don't now about releasing it if it will be
accepted as a contrib module for 8.4.

* All comparisons simply convert the strings to be compared to
lowercase using str_tolower(). I've made no other optimizations,
though I'm sure someone with more experience with collations and such
could add them.

* The regression test uses a new module I've created, now on
pgFoundry, called pgtap. It should just work. sql/citext.sql adds
plpgsql to the database and then includes pgtap.sql, which has the
test functions in it.

* I wrote the tests assuming a collation of en_US.UTF-8. I expect it'd
work with most West European languages, and maybe all languages other
than the C locale, but I'm not sure. YMMV. If there's a way to
generalize it and still be able to test the locale awareness, that
would be great. What locales do the build farm servers use?

* In the documentation, I've pitched this type as a replacement for
the use of LOWER() in ad-hoc queries, while also stipulating that this
is not a "true" case-insensitive text type, and is furthermore less
efficient than just TEXT (though I'm sure more efficient than ad-hock
LOWER()s). I've also mentioned a few other caveats, including casts
for TEXT that don't work for citext and non-case-insensitive matching
in replace(), regexp_replace(), and a few others.

* I wrote all the code here myself, but of course used the original
citext implementation (which is case-insensitive only for ASCII
characters) for inspiration and guidance. Thanks to Donald Fraser for
that original implementation.

I've compiled the CVS checkout, run its regressions, then built and
installed the citext module (hence my discovery of the deprecation of
wstring_lower and the addition of str_tolower -- should the
declaration of the former be removed from formatting.c?), and all
tests passed as of an hour ago.

I of course welcome feedback, advice, insults, commiserations, and
just about any mode of comment on this patch. Please let me know if I
need to provide any additional information.

Best,

David

Attachment Content-Type Size
citext.patch.gz application/x-gzip 16.0 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2008-06-28 02:34:26 Re: Vacuuming leaked temp tables (once again)
Previous Message Michael Paesold 2008-06-28 00:32:50 Re: Vacuuming leaked temp tables (once again)