PATCH: CITEXT 2.0

Lists: pgsql-hackers
From: David E(dot) Wheeler <david(at)kineticode(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: PATCH: CITEXT 2.0
Date: 2008-06-28 01:22:25
Message-ID: 4013F1AE-FE1B-427B-8C23-1A5681DA297E@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Howdy,

[N.B.: I tried to send this a while ago but it didn't get delivered,
I'm assuming because, with the uncompressed patch, the email was too
big for -hackers. So this is a re-send with the patch gzip'd. Sorry
for any duplication].

Please find attached a patch adding a locale-aware, case-insensitive
text type, called citext, as a contrib module. A few notes:

* I had originally called it lctext, as it's not a true case-
insensitive type, but just converts strings to lowercase before
comparing them. I changed it to citext at Tom Lane's suggestion, to
ease compatibility for users of the original citext module on pgFoundry.

* Differences from the original citext are:
+ Locale-aware lowercasing of strings, rather than just lowercasing
ASCII characters.
+ No implicit casts from text to citext except via assignment.
+ A few more functions overloaded
+ Works with 8.3 and CVS head

* Many thanks to whoever added str_tolower() to formatting.c. If I had
known about that, I could have saved myself a lot of grief! My
original implementation for 8.3.1 had copied a lot of code from
oracle_compat.c to get things working. With this patch, I've
eliminated a whole lot of code, as I can now just call str_tolower().
So thank you for that! I'll probably keep my original in my personal
Subversion repository, but don't now about releasing it if it will be
accepted as a contrib module for 8.4.

* All comparisons simply convert the strings to be compared to
lowercase using str_tolower(). I've made no other optimizations,
though I'm sure someone with more experience with collations and such
could add them.

* The regression test uses a new module I've created, now on
pgFoundry, called pgtap. It should just work. sql/citext.sql adds
plpgsql to the database and then includes pgtap.sql, which has the
test functions in it.

* I wrote the tests assuming a collation of en_US.UTF-8. I expect it'd
work with most West European languages, and maybe all languages other
than the C locale, but I'm not sure. YMMV. If there's a way to
generalize it and still be able to test the locale awareness, that
would be great. What locales do the build farm servers use?

* In the documentation, I've pitched this type as a replacement for
the use of LOWER() in ad-hoc queries, while also stipulating that this
is not a "true" case-insensitive text type, and is furthermore less
efficient than just TEXT (though I'm sure more efficient than ad-hock
LOWER()s). I've also mentioned a few other caveats, including casts
for TEXT that don't work for citext and non-case-insensitive matching
in replace(), regexp_replace(), and a few others.

* I wrote all the code here myself, but of course used the original
citext implementation (which is case-insensitive only for ASCII
characters) for inspiration and guidance. Thanks to Donald Fraser for
that original implementation.

I've compiled the CVS checkout, run its regressions, then built and
installed the citext module (hence my discovery of the deprecation of
wstring_lower and the addition of str_tolower -- should the
declaration of the former be removed from formatting.c?), and all
tests passed as of an hour ago.

I of course welcome feedback, advice, insults, commiserations, and
just about any mode of comment on this patch. Please let me know if I
need to provide any additional information.

Best,

David

Attachment Content-Type Size
citext.patch.gz application/x-gzip 16.0 KB

From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-01 14:26:41
Message-ID: 486A3EA1.5030009@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler napsal(a):
> Howdy,
>

Howdy,

>
> Please find attached a patch adding a locale-aware, case-insensitive
> text type, called citext, as a contrib module. A few notes:

What is benefit to have this type when collation per database will be
implemented? It seems to me that its overlapped feature? Definition of collation
should allow to setup case sensitivity. Only advantage what I see there is that
you can combine case sensitive and case insensitive text in one database.
However, it will be solved when collation per column will be implemented.

Zdenek


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-01 14:38:32
Message-ID: 3736.1214923112@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> However, it will be solved when collation per column will be implemented.

Well, yeah, but that seems a very long way off. In the meantime a lot
of people use the existing pgfoundry citext module.

regards, tom lane


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-01 16:33:19
Message-ID: E37A99DD-50B6-46EE-8255-AB199D1080E4@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 1, 2008, at 07:38, Tom Lane wrote:

>> However, it will be solved when collation per column will be
>> implemented.
>
> Well, yeah, but that seems a very long way off. In the meantime a lot
> of people use the existing pgfoundry citext module.

And even more of us have written queries using LOWER(col) = LOWER(?),
which is just a PITA.

Best,

David


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-02 12:03:00
Message-ID: 486B6E74.3080309@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler napsal(a):
> On Jul 1, 2008, at 07:38, Tom Lane wrote:
>
>>> However, it will be solved when collation per column will be
>>> implemented.
>>
>> Well, yeah, but that seems a very long way off. In the meantime a lot
>> of people use the existing pgfoundry citext module.
>
> And even more of us have written queries using LOWER(col) = LOWER(?),
> which is just a PITA.
>

OK. I'm going to review your patch.

Zdenek


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-02 14:10:36
Message-ID: 2F44A283-C91C-4801-8F13-2F273AE4D2DF@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Yay!

Best,

David

Sent from my iPhone

On Jul 2, 2008, at 5:03, Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> wrote:

> David E. Wheeler napsal(a):
>> On Jul 1, 2008, at 07:38, Tom Lane wrote:
>>>> However, it will be solved when collation per column will be
>>>> implemented.
>>>
>>> Well, yeah, but that seems a very long way off. In the meantime a
>>> lot
>>> of people use the existing pgfoundry citext module.
>> And even more of us have written queries using LOWER(col) = LOWER
>> (?), which is just a PITA.
>
> OK. I'm going to review your patch.
>
> Zdenek


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-02 16:13:52
Message-ID: 486BA940.4060101@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler napsal(a):
> Howdy,

Howdy

> Please find attached a patch adding a locale-aware, case-insensitive
> text type, called citext, as a contrib module. A few notes:

I went through your code and I have following comments/questions:

1) formating

Please, do not type space before asterix:
char * lcstr, * rcstr -> char *lcstr, *rcstr

and do not put extra space in a brackets
citextcmp( left, right ) -> citextcmp(left, right)

Other seems to me OK (remove 8.2 version mention at the bottom)

2) citextcmp function returns int but citext_cmp returns int32. I think
citextcmp should returns int32 as well.

3) citext_larger, citext_smaller function have memory leak. You don't call
PG_FREE_IF_COPY on unused text.

I'm not sure if return value in these function should be duplicated or not.

4) Operator = citext_eq is not correct. See comment
http://doxygen.postgresql.org/varlena_8c.html#8621d064d14f259c594e4df3c1a64cac

There must be difference between equality and collation for example in Czech
language 'láska' and 'laská' are different word it means that 'láska' !=
'laská'. But there is no difference in collation order. See Unicode Universal
Collation Algorithm for detail.

5) There are several commented out lines in CREATE OPERATOR statement mostly
related to NEGATOR. Is there some reason for that? Also OPERATOR || has probably
wrong negator.

6) You use pgTAP for unit test. It is something what should be widely discussed.
It is new approach and I'm not correct person to say if it good or not.

I see there two problems:
At first point there is not clear licensing
(https://svn.kineticode.com/pgtap/trunk/README.pgtap).
And, It should be implemented as a general framework by my opinion not only as a
part of citext contrib module.

Please, let me know when you will have updated code and I will start deep testing.

With regards Zdenek


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-02 16:39:35
Message-ID: 200807021839.35632.peter_e@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Am Dienstag, 1. Juli 2008 schrieb Tom Lane:
> Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM> writes:
> > However, it will be solved when collation per column will be implemented.
>
> Well, yeah, but that seems a very long way off. In the meantime a lot
> of people use the existing pgfoundry citext module.

Indeed, but why isn't this code put there as well?


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-02 19:18:49
Message-ID: 486BD499.2020104@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> I went through your code and I have following comments/questions:

one more comment:

7) Hash opclass is absent. Hash opclass framework is used for hash join.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-03 00:21:13
Message-ID: 18912.1215044473@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> Am Dienstag, 1. Juli 2008 schrieb Tom Lane:
>> Well, yeah, but that seems a very long way off. In the meantime a lot
>> of people use the existing pgfoundry citext module.

> Indeed, but why isn't this code put there as well?

Well, actually, that might be the best thing to do with it. But it
would be sensible to give it a code review anyway, no?

regards, tom lane


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-03 04:38:32
Message-ID: 55AA0795-AA11-4D44-B7F6-90B3A933365F@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 2, 2008, at 09:13, Zdenek Kotala wrote:

> I went through your code and I have following comments/questions:

Thanks. I'm on a short family vacation, so it will probably be Monday
before I can submit a new patch. I got a few replies below, though.

> 1) formating
>
> Please, do not type space before asterix:
> char * lcstr, * rcstr -> char *lcstr, *rcstr
>
> and do not put extra space in a brackets
> citextcmp( left, right ) -> citextcmp(left, right)

Okay.

> Other seems to me OK (remove 8.2 version mention at the bottom)

Um, are you referring to this (at the top):

+// PostgreSQL 8.2 Magic.
+#ifdef PG_MODULE_MAGIC
+PG_MODULE_MAGIC;
+#endif

That's gotta be there (though I can of course remove the comment).

> 2) citextcmp function returns int but citext_cmp returns int32. I
> think citextcmp should returns int32 as well.

Yeah, I'm a bit confused by this. I followed what's in varlena.c on
this. The comparison functions are documented supposed to return 1, 0,
or -1, which of course would be covered by int. But varstr_cmp(),
which ultimately returns the value, returns all kinds of numbers. So,
perhaps someone could say what it's *supposed* to be?

> 3) citext_larger, citext_smaller function have memory leak. You
> don't call PG_FREE_IF_COPY on unused text.
>
> I'm not sure if return value in these function should be duplicated
> or not.

These I also duplicated from varlena.c, and I asked about a potential
memory leak in a previous email. If there's a leak here, would there
not also be a leak in varlena.c?

> 4) Operator = citext_eq is not correct. See comment http://doxygen.postgresql.org/varlena_8c.html#8621d064d14f259c594e4df3c1a64cac

So should citextcmp() call strncmp() instead of varst_cmp()? The
latter is what I saw in varlena.c.

> There must be difference between equality and collation for example
> in Czech language 'láska' and 'laská' are different word it means
> that 'láska' != 'laská'. But there is no difference in collation
> order. See Unicode Universal Collation Algorithm for detail.

I'll leave the collation stuff to the functions I call (*far* from my
specialty), but I'll add a test for this and make sure it works as
expected. Um, although, with what collation should it be tested? The
tests I wrote assume en_US.UTF-8.

> 5) There are several commented out lines in CREATE OPERATOR
> statement mostly related to NEGATOR. Is there some reason for that?

I copied it from the original citext.sql. Not sure what effect it has.

> Also OPERATOR || has probably wrong negator.

Right, good catch.

> 6) You use pgTAP for unit test. It is something what should be
> widely discussed. It is new approach and I'm not correct person to
> say if it good or not.

Sure. I created a pgFoundry project for it here:

http://pgtap.projects.postgresql.org/

> I see there two problems:
> At first point there is not clear licensing (https://svn.kineticode.com/pgtap/trunk/README.pgtap
> ).

That's a standard revised BSD license.

> And, It should be implemented as a general framework by my opinion
> not only as a part of citext contrib module.

It is. I just put the file in citext so that the tests can use it.
It's distributed on pgFoundry and maintained at the SVN URL quoted in
the file.

> Please, let me know when you will have updated code and I will start
> deep testing.

Will do.

Best,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-03 04:39:13
Message-ID: C99EBF12-0015-4689-871C-71358E10CE0D@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 2, 2008, at 09:39, Peter Eisentraut wrote:

>> Well, yeah, but that seems a very long way off. In the meantime a
>> lot
>> of people use the existing pgfoundry citext module.
>
> Indeed, but why isn't this code put there as well?

It could be, but this is *such* a common need (and few even know about
pgFoundry), that I thought it'd be worthwhile to get it distributed as
part of contrib.

Best,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-03 04:40:12
Message-ID: 10379B93-1A72-4C1C-BF71-1B11EFBFFB52@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 2, 2008, at 12:18, Teodor Sigaev wrote:

>> 7) Hash opclass is absent. Hash opclass framework is used for hash
>> join.

Thanks. I haven't seen docs on this. The original citext only supports
btree, and I don't remember reading about creating a hash opclass in
the Douglass book, though I probably missed it. Anyone got a link for
me to read to make it happen?

Thanks,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-03 04:41:22
Message-ID: 3A2F4AB1-588D-4BBF-9CDB-DE5BC33FBB1F@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 2, 2008, at 17:21, Tom Lane wrote:

>> Indeed, but why isn't this code put there as well?
>
> Well, actually, that might be the best thing to do with it. But it
> would be sensible to give it a code review anyway, no?

Obviously, I would argue that contrib is a good place for it, hence
the patch. I can say more on my reasoning if you'd like. But even if
it doesn't end up in contrib, I'm extremely grateful for the code
reviews.

Best,

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-03 05:14:14
Message-ID: 23892.1215062054@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> On Jul 2, 2008, at 09:13, Zdenek Kotala wrote:
>> Please, do not type space before asterix:
>> char * lcstr, * rcstr -> char *lcstr, *rcstr
>>
>> and do not put extra space in a brackets
>> citextcmp( left, right ) -> citextcmp(left, right)

> Okay.

Note that this sort of stuff will mostly be fixed by pg_indent,
whether or not David does anything about it. But certainly
conforming to the project style to begin with will cause less
pain to reviewers' eyeballs.

> Um, are you referring to this (at the top):

> +// PostgreSQL 8.2 Magic.
> +#ifdef PG_MODULE_MAGIC
> +PG_MODULE_MAGIC;
> +#endif

Here however is an outright bug: we do not allow // comments, because we
still support ANSI-spec compilers that don't recognize them.

> Yeah, I'm a bit confused by this. I followed what's in varlena.c on
> this. The comparison functions are documented supposed to return 1, 0,
> or -1, which of course would be covered by int. But varstr_cmp(),
> which ultimately returns the value, returns all kinds of numbers. So,
> perhaps someone could say what it's *supposed* to be?

btree cmp functions are allowed to return int32 negative, zero, or
positive. There is *not* any requirement that negative or positive
results be exactly -1 or +1. However, if you are comparing values
that are int32 or wider then you can't just return their difference
because it might overflow.

>> 3) citext_larger, citext_smaller function have memory leak. You
>> don't call PG_FREE_IF_COPY on unused text.

> These I also duplicated from varlena.c, and I asked about a potential
> memory leak in a previous email. If there's a leak here, would there
> not also be a leak in varlena.c?

The "leak" is irrelevant for larger/smaller. The only place where it's
actually useful to do PG_FREE_IF_COPY is in a btree or hash index
support function. In other cases you can assume that you're being
called in a memory context that's too short-lived for it to matter.

>> 5) There are several commented out lines in CREATE OPERATOR
>> statement mostly related to NEGATOR. Is there some reason for that?

> I copied it from the original citext.sql. Not sure what effect it has.

http://developer.postgresql.org/pgdocs/postgres/xoper-optimization.html

regards, tom lane


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-03 07:19:26
Message-ID: 486C7D7E.5080505@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Douglass book, though I probably missed it. Anyone got a link for me to
> read to make it happen?

Hash opclass is 5-times simpler that btree one :)

CREATE FUNCTION citext_hash(mchar)
RETURNS int4
AS 'MODULE_PATHNAME'
LANGUAGE C IMMUTABLE RETURNS NULL ON NULL INPUT;

CREATE OPERATOR CLASS citext_ops
DEFAULT FOR TYPE mchar USING hash AS
OPERATOR 1 = (citext, citext),
FUNCTION 1 citext_hash(citext);

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-03 07:31:06
Message-ID: 486C803A.9070003@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> CREATE FUNCTION citext_hash(*citext*)

> DEFAULT FOR TYPE *citext* USING hash AS

Oops, citext of course.

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-03 16:01:17
Message-ID: CF78D60C-8410-42AE-BB27-30A7A092E620@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 2, 2008, at 22:14, Tom Lane wrote:

> Note that this sort of stuff will mostly be fixed by pg_indent,
> whether or not David does anything about it. But certainly
> conforming to the project style to begin with will cause less
> pain to reviewers' eyeballs.

Yeah, I'll change it. I'm JAPH, so kind of made up the formatting as I
went, though I did try to copy the style in varlena.c.

>> +// PostgreSQL 8.2 Magic.
>> +#ifdef PG_MODULE_MAGIC
>> +PG_MODULE_MAGIC;
>> +#endif
>
> Here however is an outright bug: we do not allow // comments,
> because we
> still support ANSI-spec compilers that don't recognize them.

Forgot about that. I'll change it for the next version.

> btree cmp functions are allowed to return int32 negative, zero, or
> positive. There is *not* any requirement that negative or positive
> results be exactly -1 or +1. However, if you are comparing values
> that are int32 or wider then you can't just return their difference
> because it might overflow.

Thanks for the explanation. I'll make sure that they're both int32.

> The "leak" is irrelevant for larger/smaller. The only place where
> it's
> actually useful to do PG_FREE_IF_COPY is in a btree or hash index
> support function. In other cases you can assume that you're being
> called in a memory context that's too short-lived for it to matter.

So would that be for any function used by

CREATE OPERATOR CLASS citext_ops
DEFAULT FOR TYPE CITEXT USING btree AS
OPERATOR 1 < (citext, citext),
OPERATOR 2 <= (citext, citext),
OPERATOR 3 = (citext, citext),
OPERATOR 4 >= (citext, citext),
OPERATOR 5 > (citext, citext),
FUNCTION 1 citext_cmp(citext, citext);

? (And then the btree operator and function to be added, of course.)

>
>
>>> 5) There are several commented out lines in CREATE OPERATOR
>>> statement mostly related to NEGATOR. Is there some reason for that?
>
>> I copied it from the original citext.sql. Not sure what effect it
>> has.
>
> http://developer.postgresql.org/pgdocs/postgres/xoper-
> optimization.html

Thanks. Sounds like it'd be valuable to have them in there. I'll add
tests, as well.

Best,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-03 16:03:10
Message-ID: 4822F40A-617E-47CB-AD6E-0DDEAB25E15D@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 3, 2008, at 00:19, Teodor Sigaev wrote:

>> Hash opclass is 5-times simpler that btree one :)
>
> CREATE FUNCTION citext_hash(mchar)
> RETURNS int4
> AS 'MODULE_PATHNAME'
> LANGUAGE C IMMUTABLE RETURNS NULL ON NULL INPUT;
>
> CREATE OPERATOR CLASS citext_ops
> DEFAULT FOR TYPE mchar USING hash AS
> OPERATOR 1 = (citext, citext),
> FUNCTION 1 citext_hash(citext);

Thanks. What would citext_hash() look like? I don't see a text_hash()
to borrow from anywhere in src/.

Thanks,

David


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-03 16:53:15
Message-ID: 20080703165315.GB18252@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler wrote:
> On Jul 3, 2008, at 00:19, Teodor Sigaev wrote:
>
>>> Hash opclass is 5-times simpler that btree one :)
>>
>> CREATE FUNCTION citext_hash(mchar)
>> RETURNS int4
>> AS 'MODULE_PATHNAME'
>> LANGUAGE C IMMUTABLE RETURNS NULL ON NULL INPUT;
>>
>> CREATE OPERATOR CLASS citext_ops
>> DEFAULT FOR TYPE mchar USING hash AS
>> OPERATOR 1 = (citext, citext),
>> FUNCTION 1 citext_hash(citext);
>
> Thanks. What would citext_hash() look like? I don't see a text_hash() to
> borrow from anywhere in src/.

See hash_any(). I assume the difficulty is making sure that
hash("FOO") = hash("foo") ...

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-05 05:39:54
Message-ID: C563AF78-6F19-4791-A4FB-44D8C047B16F@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 3, 2008, at 09:53, Alvaro Herrera wrote:

>> Thanks. What would citext_hash() look like? I don't see a
>> text_hash() to
>> borrow from anywhere in src/.
>
> See hash_any(). I assume the difficulty is making sure that
> hash("FOO") = hash("foo") ...

Great, big help, thank you. So does this look sensible?

Datum
citext_hash(PG_FUNCTION_ARGS)
{
char *txt;
char *str;
Datum result;

txt = cilower( PG_GETARG_TEXT_PP(0) );
str = VARDATA_ANY(txt);

result = hash_any((unsigned char *) str, VARSIZE_ANY_EXHDR(txt));

/* Avoid leaking memory for toasted inputs */
PG_FREE_IF_COPY(txt, 0);
pfree( str );

return result;
}

And how might I be able to test that it actually works?

Best,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-05 05:54:45
Message-ID: 3816229E-1D37-48A6-8584-280139AA9171@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 2, 2008, at 22:14, Tom Lane wrote:

> The "leak" is irrelevant for larger/smaller. The only place where
> it's
> actually useful to do PG_FREE_IF_COPY is in a btree or hash index
> support function. In other cases you can assume that you're being
> called in a memory context that's too short-lived for it to matter.

Stupid question: for the btree index support function, is that *only*
the function referenced in the OPERATOR CLASS, or does it also apply
to functions that implement the operators in that class? IOW, do I
need to worry about memory leaks in citext_eq, citext_ne, citext_gt,
etc., or only in citext_cmp()?

Thanks,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-05 06:06:36
Message-ID: DFA4BCE4-233A-435E-BA96-70D7AD7FB330@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Replying to myself, but I've made some local changes (see other
messages) and just wanted to follow up on some of my own comments.

On Jul 2, 2008, at 21:38, David E. Wheeler wrote:

>> 4) Operator = citext_eq is not correct. See comment http://doxygen.postgresql.org/varlena_8c.html#8621d064d14f259c594e4df3c1a64cac
>
> So should citextcmp() call strncmp() instead of varst_cmp()? The
> latter is what I saw in varlena.c.

I'm guessing that the answer is "no," since varstr_cmp() uses
strncmp() internally, as appropriate to the locale. Correct?

>> There must be difference between equality and collation for example
>> in Czech language 'láska' and 'laská' are different word it means
>> that 'láska' != 'laská'. But there is no difference in collation
>> order. See Unicode Universal Collation Algorithm for detail.
>
> I'll leave the collation stuff to the functions I call (*far* from
> my specialty), but I'll add a test for this and make sure it works
> as expected. Um, although, with what collation should it be tested?
> The tests I wrote assume en_US.UTF-8.

I added this test and is passes:

SELECT isnt( 'láska'::citext, 'laská'::citext, 'Diffrent accented
characters should not be equivalent' );

>> 5) There are several commented out lines in CREATE OPERATOR
>> statement mostly related to NEGATOR. Is there some reason for that?
>
> I copied it from the original citext.sql. Not sure what effect it has.

I restored these (and one of them was wrong anyway).

>> Also OPERATOR || has probably wrong negator.
>
> Right, good catch.

Stupid question: What would the negation of || actually be? There
isn't one is, there?

Thanks!

David


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)Sun(dot)COM>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-05 09:58:04
Message-ID: 87mykwwt3n.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:

> On Jul 3, 2008, at 09:53, Alvaro Herrera wrote:
>
>>> Thanks. What would citext_hash() look like? I don't see a text_hash() to
>>> borrow from anywhere in src/.
>>
>> See hash_any(). I assume the difficulty is making sure that
>> hash("FOO") = hash("foo") ...
>
> Great, big help, thank you. So does this look sensible?
>
> txt = cilower( PG_GETARG_TEXT_PP(0) );
> str = VARDATA_ANY(txt);
>
> result = hash_any((unsigned char *) str, VARSIZE_ANY_EXHDR(txt));

I thought your data type implemented a locale dependent collation, not just
a case insensitive collation. That is, does this hash agree with your
citext_eq on strings like "foo bar" <=> "foobar" and "fooß" <=> "fooss" ?

You may have to use strxfrm

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)Sun(dot)COM>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-05 09:58:39
Message-ID: 87iqvkwt2o.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:

> do I need to worry about memory leaks in citext_eq, citext_ne, citext_gt,
> etc.,

yes

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-05 15:13:20
Message-ID: 9666.1215270800@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
>>> Also OPERATOR || has probably wrong negator.
>>
>> Right, good catch.

> Stupid question: What would the negation of || actually be? There
> isn't one is, there?

Per the docs, NEGATOR is only sensible for operators returning boolean.

regards, tom lane


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Teodor Sigaev" <teodor(at)sigaev(dot)ru>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)Sun(dot)COM>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-06 00:46:39
Message-ID: DD2B2B80-66FD-46B7-9D5B-0AF94C264E55@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 5, 2008, at 02:58, Gregory Stark wrote:

>> txt = cilower( PG_GETARG_TEXT_PP(0) );
>> str = VARDATA_ANY(txt);
>>
>> result = hash_any((unsigned char *) str, VARSIZE_ANY_EXHDR(txt));
>
> I thought your data type implemented a locale dependent collation,
> not just
> a case insensitive collation. That is, does this hash agree with your
> citext_eq on strings like "foo bar" <=> "foobar" and "fooß" <=>
> "fooss" ?

CITEXT is basically intended to replace all those queries that do
`WHERE LOWER(col) = LOWER(?)` by doing it internally. That's it. It's
locale-aware to the same extent that `LOWER()` is (and that citext 1.0
is not, since it only compares ASCII characters case-insensitively).
And I expect that it does, in fact, agree with your examples, in that
all the current tests for = and <> pass:

try=# select 'foo bar' = 'foobar';
?column?
----------
f

try=# SELECT 'fooß' = 'fooss';
?column?
----------
f

> You may have to use strxfrm

In the patch against CVS HEAD, it uses str_tolower() in formatting.c.

Best,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)Sun(dot)COM>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-06 00:46:52
Message-ID: C10A2A69-B77E-4B82-B9D7-338330045E70@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 5, 2008, at 02:58, Gregory Stark wrote:

>> do I need to worry about memory leaks in citext_eq, citext_ne,
>> citext_gt,
>> etc.,
>
> yes

Thanks.

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-06 00:47:11
Message-ID: D9C7E6BA-EE2B-42C5-B219-2604E1FE59EA@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 5, 2008, at 08:13, Tom Lane wrote:

>> Stupid question: What would the negation of || actually be? There
>> isn't one is, there?
>
> Per the docs, NEGATOR is only sensible for operators returning
> boolean.

Message received. Many thanks, Tom, as usual.

Best,

David


From: David E(dot) Wheeler <david(at)kineticode(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: PATCH: CITEXT 2.0 v2
Date: 2008-07-06 06:31:20
Message-ID: ACEC459B-CB5B-4B71-87FA-55E6A649C17C@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 27, 2008, at 18:22, David E. Wheeler wrote:

> Please find attached a patch adding a locale-aware, case-insensitive
> text type, called citext, as a contrib module.

Here is a new version of the patch, with the following changes:

* Fixed formatting to be more like core.
* Added appropriate NEGATORs to operators.
* Removed NEGATOR from the || operator.
* Added hash index function and operator class.
* The = operator now supports HASHES and MERGES.
* citext_cmp and citextcmp both return int32.
* Changed // comments to /* comments */.
* Added test confirming láska'::citext <> 'laská'::citext.
* A few other organizational, formatting, and pasto fixes.
* Updated the FAQ entry on case-insensitive queries to recommend
citext (it would, of course, need to be translated).

Stuff I was asked about but didn't change:

* citext_cmp() still uses varstr_cmp() instead of strncmp(). When I
tried the latter, everything seemed to be equivalent.
* citext_smaller() and citext_larger() don't have memory leaks, says
Tom, so I added no calls to PG_FREE_IF_COPY().

Thank you everyone for your feedback and suggestions!

Best,

David

Attachment Content-Type Size
citext2.patch.gz application/x-gzip 16.6 KB

From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 07:46:10
Message-ID: 4871C9C2.8040307@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler napsal(a):
> Replying to myself, but I've made some local changes (see other
> messages) and just wanted to follow up on some of my own comments.
>
> On Jul 2, 2008, at 21:38, David E. Wheeler wrote:
>
>>> 4) Operator = citext_eq is not correct. See comment
>>> http://doxygen.postgresql.org/varlena_8c.html#8621d064d14f259c594e4df3c1a64cac
>>>
>>
>> So should citextcmp() call strncmp() instead of varst_cmp()? The
>> latter is what I saw in varlena.c.
>
> I'm guessing that the answer is "no," since varstr_cmp() uses strncmp()
> internally, as appropriate to the locale. Correct?

You have to use varstr_cmp in citextcmp. Your code is correct, because for
< <= >= > operators you need collation sensible function.

You need to change only citext_cmp function to use strncmp() or call texteq
function.

>>> There must be difference between equality and collation for example
>>> in Czech language 'láska' and 'laská' are different word it means
>>> that 'láska' != 'laská'. But there is no difference in collation
>>> order. See Unicode Universal Collation Algorithm for detail.
>>
>> I'll leave the collation stuff to the functions I call (*far* from my
>> specialty), but I'll add a test for this and make sure it works as
>> expected. Um, although, with what collation should it be tested? The
>> tests I wrote assume en_US.UTF-8.
>
> I added this test and is passes:
>
> SELECT isnt( 'láska'::citext, 'laská'::citext, 'Diffrent accented
> characters should not be equivalent' );

I'm think that this test will work correctly for en_US.UTF-8 at any time. I
guess the test make sense only when Czech collation (cs_CZ.UTF-8) is selected,
but unfortunately, you cannot change collation during your test :(.

I think, Best solution for now is to keep the test and add comment about
recommended collation for this test.

Zdenek


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-07 14:41:59
Message-ID: 48722B37.4020901@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler napsal(a):
> On Jun 27, 2008, at 18:22, David E. Wheeler wrote:
>
>> Please find attached a patch adding a locale-aware, case-insensitive
>> text type, called citext, as a contrib module.
>
> Here is a new version of the patch, with the following changes:
>
> * Fixed formatting to be more like core.
> * Added appropriate NEGATORs to operators.
> * Removed NEGATOR from the || operator.
> * Added hash index function and operator class.
> * The = operator now supports HASHES and MERGES.
> * citext_cmp and citextcmp both return int32.
> * Changed // comments to /* comments */.
> * Added test confirming láska'::citext <> 'laská'::citext.
> * A few other organizational, formatting, and pasto fixes.
> * Updated the FAQ entry on case-insensitive queries to recommend citext
> (it would, of course, need to be translated).
>
> Stuff I was asked about but didn't change:
>
> * citext_cmp() still uses varstr_cmp() instead of strncmp(). When I
> tried the latter, everything seemed to be equivalent.

citext_eq (operator =) should use strncmp and citext_cmp() should use varstr_cmp().

> * citext_smaller() and citext_larger() don't have memory leaks, says
> Tom, so I added no calls to PG_FREE_IF_COPY().

Yeah, it is correct. FGMR does clean job for you.

However, It seems to me that code is ok now (exclude citex_eq). I think there
two open problems/questions:

1) regression test -

a) I think that regresion is not correct. It depends on en_US locale, but it
uses characters which is not in related character repertoire. It means comparing
is not defined and I guess it could generate different result on different OS -
depends on locale implementation.

b) pgTap is something new. Need make a decision if this framework is
acceptable or not.

2) contrib vs. pgFoundry

There is unresolved answer if we want to have this in contrib or not. Good to
mention that citext type will be obsoleted with full collation implementation in
a future. I personally prefer to keep it on pgFoundry because it is temporally
workaround (by my opinion), but I can live with contrib module as well.

Zdenek


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-07 15:01:35
Message-ID: 48722FCF.3040208@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zdenek Kotala wrote:
>
>
> 2) contrib vs. pgFoundry
>
> There is unresolved answer if we want to have this in contrib or not.
> Good to mention that citext type will be obsoleted with full collation
> implementation in a future. I personally prefer to keep it on
> pgFoundry because it is temporally workaround (by my opinion), but I
> can live with contrib module as well.
>
>
>

Is there a concrete plan to get to full collation support (i.e. per
column) in any known time frame? If not, then I think a citext module
would be acceptable.

What does still bother me is its performance. I'd like to know if any
measurement has been done of using citext vs. a functional index on
lower(foo).

cheers

andrew


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 17:09:44
Message-ID: 2F472DC1-CB44-476C-ACFF-C2807C21751F@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 00:46, Zdenek Kotala wrote:

> You have to use varstr_cmp in citextcmp. Your code is correct,
> because for
> < <= >= > operators you need collation sensible function.
>
> You need to change only citext_cmp function to use strncmp() or call
> texteq function.

I'm confused. What's the difference between strncmp() and
varstr_cmp()? And why would I use one in one place and the other
elsewhere? Shouldn't they be used consistently?

> I'm think that this test will work correctly for en_US.UTF-8 at any
> time. I guess the test make sense only when Czech collation
> (cs_CZ.UTF-8) is selected, but unfortunately, you cannot change
> collation during your test :(.

No, I was wondering before what locale was used for initdb on the
build farm. I mean, how are locale-aware things really tested?

> I think, Best solution for now is to keep the test and add comment
> about recommended collation for this test.

Yep, that's what it does.

Thanks,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-07 17:20:03
Message-ID: 2FB38675-96B8-4AF6-88F0-9EE271016A38@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 07:41, Zdenek Kotala wrote:

> However, It seems to me that code is ok now (exclude citex_eq). I
> think there two open problems/questions:
>
> 1) regression test -
>
> a) I think that regresion is not correct. It depends on en_US
> locale, but it uses characters which is not in related character
> repertoire. It means comparing is not defined and I guess it could
> generate different result on different OS - depends on locale
> implementation.

That I don't know about. The test requires en_US.UTF-8, at least at
this point. How are tests run on the build farm? And how else could I
ensure that comparisons are case-insensitive for non-ASCII characters
other than requiring a Unicode locale? Or is it just an issue for the
sort order tests? For those, I could potentially remove accented
characters, just as long as I'm verifying in other tests that
comparisons are case-insensitive (without worrying about collation).

> b) pgTap is something new. Need make a decision if this framework
> is acceptable or not.

Well, from the point of view of `make installcheck`, it's invisible.
I've submitted a talk proposal for pgDay.US on ptTAP. I'm happy to
discuss it further here though, if folks are interested.

> 2) contrib vs. pgFoundry
>
> There is unresolved answer if we want to have this in contrib or
> not. Good to mention that citext type will be obsoleted with full
> collation implementation in a future. I personally prefer to keep it
> on pgFoundry because it is temporally workaround (by my opinion),
> but I can live with contrib module as well.

I second what Andrew has said in reply to this issue. I'll also say
that, since people *so* often end up using `WHERE LOWER(col) =
LOWER(?)`, that it'd be very valuable to have citext in contrib,
especially since so few folks seem to even know about pgFoundry, let
alone be able to find it. I mean, look at these search results:

http://www.google.com/search?q=PostgreSQL%20case-insensitive%20text

My blog entry about this patch is hit #3. pgFoundry (and CITEXT 1) is
#7. Last time I did a query like this, it didn't turn up at all.

Belive me, I'd love for pgFoundry (or something like it) to become the
CPAN for PostgreSQL. But without some love and SEO, I don't think
that's gonna happen.

Besides, CITEXT 2 would be a PITA to maintain for both 8.3 and 8.4,
given the changes in the string comparison API in CVS.

Thanks,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-07 17:20:57
Message-ID: AA140A00-93A7-4071-AA49-FD7E47ECEFD3@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 08:01, Andrew Dunstan wrote:

> What does still bother me is its performance. I'd like to know if
> any measurement has been done of using citext vs. a functional index
> on lower(foo).

That's a good question. I can whip up a quick test by populating a
column full of data and trying both, but I'm no performance testing
expert. Anyone else know more about such performance testing who can
help out?

Many Thanks,

David


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 18:44:32
Message-ID: 20080707184432.GB4681@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler wrote:
> On Jul 7, 2008, at 00:46, Zdenek Kotala wrote:
>
>> You have to use varstr_cmp in citextcmp. Your code is correct,
>> because for
>> < <= >= > operators you need collation sensible function.
>>
>> You need to change only citext_cmp function to use strncmp() or call
>> texteq function.
>
> I'm confused. What's the difference between strncmp() and varstr_cmp()?
> And why would I use one in one place and the other elsewhere? Shouldn't
> they be used consistently?

Not necessarily -- see texteq. I think the point is that varstr_cmp()
is a lot slower.

>> I'm think that this test will work correctly for en_US.UTF-8 at any
>> time. I guess the test make sense only when Czech collation
>> (cs_CZ.UTF-8) is selected, but unfortunately, you cannot change
>> collation during your test :(.
>
> No, I was wondering before what locale was used for initdb on the build
> farm. I mean, how are locale-aware things really tested?

They aren't AFAIK.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 18:47:23
Message-ID: F5ECEEC2-6AFE-499B-8DBE-65166E2E92EB@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 11:44, Alvaro Herrera wrote:

>> I'm confused. What's the difference between strncmp() and
>> varstr_cmp()?
>> And why would I use one in one place and the other elsewhere?
>> Shouldn't
>> they be used consistently?
>
> Not necessarily -- see texteq. I think the point is that varstr_cmp()
> is a lot slower.

Then why shouldn't I use strncmp() for all comparisons?

>> No, I was wondering before what locale was used for initdb on the
>> build
>> farm. I mean, how are locale-aware things really tested?
>
> They aren't AFAIK.

Ow. Pity.

Best,

David


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 18:54:02
Message-ID: 20080707185402.GC4681@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler wrote:
> On Jul 7, 2008, at 11:44, Alvaro Herrera wrote:
>
>>> I'm confused. What's the difference between strncmp() and
>>> varstr_cmp()?
>>> And why would I use one in one place and the other elsewhere?
>>> Shouldn't
>>> they be used consistently?
>>
>> Not necessarily -- see texteq. I think the point is that varstr_cmp()
>> is a lot slower.
>
> Then why shouldn't I use strncmp() for all comparisons?

I have no idea :-) -- because it's not locale-aware perhaps.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-07 18:54:53
Message-ID: 4872667D.4030502@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler napsal(a):
> On Jul 7, 2008, at 07:41, Zdenek Kotala wrote:
>
>> However, It seems to me that code is ok now (exclude citex_eq). I
>> think there two open problems/questions:
>>
>> 1) regression test -
>>
>> a) I think that regresion is not correct. It depends on en_US locale,
>> but it uses characters which is not in related character repertoire.
>> It means comparing is not defined and I guess it could generate
>> different result on different OS - depends on locale implementation.
>
> That I don't know about. The test requires en_US.UTF-8, at least at this
> point. How are tests run on the build farm? And how else could I ensure
> that comparisons are case-insensitive for non-ASCII characters other
> than requiring a Unicode locale? Or is it just an issue for the sort
> order tests? For those, I could potentially remove accented characters,
> just as long as I'm verifying in other tests that comparisons are
> case-insensitive (without worrying about collation).

Hmm, it is complex area and I'm not sure if anybody on planet know correct
answer :-). I think that best approach is to keep it as is and when a problem
occur then it will be fixed.

>> b) pgTap is something new. Need make a decision if this framework is
>> acceptable or not.
>
> Well, from the point of view of `make installcheck`, it's invisible.
> I've submitted a talk proposal for pgDay.US on ptTAP. I'm happy to
> discuss it further here though, if folks are interested.

Yeah, it is invisible, but question is why don't put it as a framework to common
place. But it is for another discussion.

>> 2) contrib vs. pgFoundry
>>
>> There is unresolved answer if we want to have this in contrib or not.
>> Good to mention that citext type will be obsoleted with full collation
>> implementation in a future. I personally prefer to keep it on
>> pgFoundry because it is temporally workaround (by my opinion), but I
>> can live with contrib module as well.
>
> I second what Andrew has said in reply to this issue. I'll also say
> that, since people *so* often end up using `WHERE LOWER(col) =
> LOWER(?)`, that it'd be very valuable to have citext in contrib,
> especially since so few folks seem to even know about pgFoundry, let
> alone be able to find it. I mean, look at these search results:

I understand it but there is parallel project which should solve this problem
completely I guess in "close" future (2-3years). Afterward this module will be
obsolete and it will takes times to remove it from contrib. It seems to me that
have citext in contrib only for two releases is not optimal solution.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-07 18:57:38
Message-ID: 48726722.9000301@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andrew Dunstan napsal(a):
>
>
> Zdenek Kotala wrote:
>>
>>
>> 2) contrib vs. pgFoundry
>>
>> There is unresolved answer if we want to have this in contrib or not.
>> Good to mention that citext type will be obsoleted with full collation
>> implementation in a future. I personally prefer to keep it on
>> pgFoundry because it is temporally workaround (by my opinion), but I
>> can live with contrib module as well.
>>
>>
>>
>
> Is there a concrete plan to get to full collation support (i.e. per
> column) in any known time frame? If not, then I think a citext module
> would be acceptable.

Collation per database should be part of 8.4. Radek Strnad works on it as a SoC
project. He plans to continue on this work and implement full support as his
Master of Thesis in 2-3 years time frame.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 19:03:15
Message-ID: FB5EDA85-66DA-4028-8401-18F35DFFD981@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 11:54, Alvaro Herrera wrote:

>> Then why shouldn't I use strncmp() for all comparisons?
>
> I have no idea :-) -- because it's not locale-aware perhaps.

Could someone who does have an idea answer these questions:

* Does it need to be locale-aware or not?
* Should I use strncmp() or varstr_cmp() to compare strings?
* Shouldn't it use one or the other, but not both?

Sorry, I'm just confused about the "correct" thing to do here. If
someone who knows the definitive answers could weigh in, I'd be happy
to make the adjustment.

Thanks,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-07 19:06:08
Message-ID: FEC06BA2-A40D-4AA4-B087-F2E767BF9A37@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 11:54, Zdenek Kotala wrote:

> Hmm, it is complex area and I'm not sure if anybody on planet know
> correct answer :-). I think that best approach is to keep it as is
> and when a problem occur then it will be fixed.

Regression tests are really important, though.

>>> b) pgTap is something new. Need make a decision if this framework
>>> is acceptable or not.
>> Well, from the point of view of `make installcheck`, it's
>> invisible. I've submitted a talk proposal for pgDay.US on ptTAP.
>> I'm happy to discuss it further here though, if folks are interested.
>
> Yeah, it is invisible, but question is why don't put it as a
> framework to common place. But it is for another discussion.

It is in a common place as a project, on pgFoundry. Whether the core
team wants to use it for testing other parts of PostgreSQL (core or
contrib) and therefore put it in a central location is, as you say, a
separate conversation. It'd be easy to move it in such a case, of
course.

> I understand it but there is parallel project which should solve
> this problem completely I guess in "close" future (2-3years).
> Afterward this module will be obsolete and it will takes times to
> remove it from contrib. It seems to me that have citext in contrib
> only for two releases is not optimal solution.

I guess that'd be the reason to keep it on pgFoundry, but I have two
comments:

* 2-3 years is a *long* time in Internet time.

* There is on guarantee that it will be finished in that time or,
indeed ever (no disrespect to Radek Strnad, it's just there are always
unforeseen issues).

Thanks,

David


From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-07 19:10:45
Message-ID: 162867790807071210r4435f7f4iad5478eaf32d751e@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2008/7/7 Zdenek Kotala <Zdenek(dot)Kotala(at)sun(dot)com>:
> David E. Wheeler napsal(a):
>>
>> On Jul 7, 2008, at 07:41, Zdenek Kotala wrote:
>>
>>> However, It seems to me that code is ok now (exclude citex_eq). I think
>>> there two open problems/questions:
>>>
>>> 1) regression test -
>>>
>>> a) I think that regresion is not correct. It depends on en_US locale,
>>> but it uses characters which is not in related character repertoire. It
>>> means comparing is not defined and I guess it could generate different
>>> result on different OS - depends on locale implementation.
>>
>> That I don't know about. The test requires en_US.UTF-8, at least at this
>> point. How are tests run on the build farm? And how else could I ensure that
>> comparisons are case-insensitive for non-ASCII characters other than
>> requiring a Unicode locale? Or is it just an issue for the sort order tests?
>> For those, I could potentially remove accented characters, just as long as
>> I'm verifying in other tests that comparisons are case-insensitive (without
>> worrying about collation).
>
> Hmm, it is complex area and I'm not sure if anybody on planet know correct
> answer :-). I think that best approach is to keep it as is and when a
> problem occur then it will be fixed.
>

Maybe we can have some "locale" test outside our regress tests -

>>> b) pgTap is something new. Need make a decision if this framework is
>>> acceptable or not.
>>
>> Well, from the point of view of `make installcheck`, it's invisible. I've
>> submitted a talk proposal for pgDay.US on ptTAP. I'm happy to discuss it
>> further here though, if folks are interested.
>
> Yeah, it is invisible, but question is why don't put it as a framework to
> common place. But it is for another discussion.
>
>>> 2) contrib vs. pgFoundry
>>>
>>> There is unresolved answer if we want to have this in contrib or not.
>>> Good to mention that citext type will be obsoleted with full collation
>>> implementation in a future. I personally prefer to keep it on pgFoundry
>>> because it is temporally workaround (by my opinion), but I can live with
>>> contrib module as well.
>>
>> I second what Andrew has said in reply to this issue. I'll also say that,
>> since people *so* often end up using `WHERE LOWER(col) = LOWER(?)`, that
>> it'd be very valuable to have citext in contrib, especially since so few
>> folks seem to even know about pgFoundry, let alone be able to find it. I
>> mean, look at these search results:
>
> I understand it but there is parallel project which should solve this
> problem completely I guess in "close" future (2-3years). Afterward this
> module will be obsolete and it will takes times to remove it from contrib.
> It seems to me that have citext in contrib only for two releases is not
> optimal solution.
>
> Zdenek
>
>
Regards
Pavel

>
> --
> Zdenek Kotala Sun Microsystems
> Prague, Czech Republic http://sun.com/postgresql
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 19:13:38
Message-ID: 48726AE2.3070201@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler napsal(a):
> On Jul 7, 2008, at 11:54, Alvaro Herrera wrote:
>
>>> Then why shouldn't I use strncmp() for all comparisons?
>>
>> I have no idea :-) -- because it's not locale-aware perhaps.
>
> Could someone who does have an idea answer these questions:
>
> * Does it need to be locale-aware or not?
> * Should I use strncmp() or varstr_cmp() to compare strings?
> * Shouldn't it use one or the other, but not both?
>
> Sorry, I'm just confused about the "correct" thing to do here. If
> someone who knows the definitive answers could weigh in, I'd be happy to
> make the adjustment.

I'm sorry. I meant bttext()
http://doxygen.postgresql.org/varlena_8c-source.html#l01270

bttext should use in citextcmp function. It uses strcol function.

And citext_eq should be implemented as texteq:

http://doxygen.postgresql.org/varlena_8c-source.html#l01164

I'm sorry for confusion I'm exchange bttext and varstr_cmp. :(

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 19:21:39
Message-ID: 2679EDDF-7BAB-465B-866A-7B94C6975306@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 12:13, Zdenek Kotala wrote:

> I'm sorry. I meant bttext() http://doxygen.postgresql.org/varlena_8c-source.html#l01270
>
> bttext should use in citextcmp function. It uses strcol function.

I've no idea what bttext has to do with anything. Sorry if I'm being
slow here.

> And citext_eq should be implemented as texteq:
>
> http://doxygen.postgresql.org/varlena_8c-source.html#l01164
>
> I'm sorry for confusion I'm exchange bttext and varstr_cmp. :(

Okay, I see that text_cmp() uses varstr_cmp():

http://doxygen.postgresql.org/varlena_8c-source.html#l01139

While texteq() and textne() use strncmp():

http://doxygen.postgresql.org/varlena_8c-source.html#l01164
http://doxygen.postgresql.org/varlena_8c-source.html#l01187

The other operator functions, like text_lt(), use text_cmp() and
therefore varstr_cmp():

http://doxygen.postgresql.org/varlena_8c-source.html#01210

My question is: why? Shouldn't they all use the same function for
comparison? I'm happy to dupe this implementation for citext, but I
don't understand it. Should not all comparisons be executed
consistently?

Thanks,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-07 19:23:10
Message-ID: 9C3AF371-8F91-4AD8-8531-3B5B1016A2F8@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 12:10, Pavel Stehule wrote:

> Maybe we can have some "locale" test outside our regress tests -

I think that would be useful.

Best,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 19:26:05
Message-ID: D0094897-A812-416E-8E8F-3193ACC55BFC@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 12:21, David E. Wheeler wrote:

> My question is: why? Shouldn't they all use the same function for
> comparison? I'm happy to dupe this implementation for citext, but I
> don't understand it. Should not all comparisons be executed
> consistently?

Let me try to answer my own question by citing this comment:

/*
* Since we only care about equality or not-equality, we can avoid
all the
* expense of strcoll() here, and just do bitwise comparison.
*/

So, the upshot is that the = and <> operators are not locale-aware,
yes? They just do byte comparisons. Is that really the way it should
be? I mean, could there not be strings that are equivalent but have
different bytes?

Thanks,

David


From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 19:36:09
Message-ID: 162867790807071236j7a69a345nff997e9acc9f66b7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2008/7/7 David E. Wheeler <david(at)kineticode(dot)com>:
> On Jul 7, 2008, at 11:54, Alvaro Herrera wrote:
>
>>> Then why shouldn't I use strncmp() for all comparisons?
>>
>> I have no idea :-) -- because it's not locale-aware perhaps.
>
> Could someone who does have an idea answer these questions:
>
> * Does it need to be locale-aware or not?

yes!

> * Should I use strncmp() or varstr_cmp() to compare strings?
> * Shouldn't it use one or the other, but not both?
>
> Sorry, I'm just confused about the "correct" thing to do here. If someone
> who knows the definitive answers could weigh in, I'd be happy to make the
> adjustment.
>
> Thanks,
>
> David
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 19:38:07
Message-ID: 354DF721-0E52-4B8B-8CB6-C16E5B99D6EE@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 12:36, Pavel Stehule wrote:

>> * Does it need to be locale-aware or not?
>
> yes!

texteq() in varlena.c does not seem to be.

Best,

David


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 19:46:55
Message-ID: 487272AF.5070002@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler napsal(a):
> On Jul 7, 2008, at 12:21, David E. Wheeler wrote:
>
>> My question is: why? Shouldn't they all use the same function for
>> comparison? I'm happy to dupe this implementation for citext, but I
>> don't understand it. Should not all comparisons be executed consistently?
>
> Let me try to answer my own question by citing this comment:
>
> /*
> * Since we only care about equality or not-equality, we can avoid
> all the
> * expense of strcoll() here, and just do bitwise comparison.
> */
>
> So, the upshot is that the = and <> operators are not locale-aware, yes?
> They just do byte comparisons. Is that really the way it should be? I
> mean, could there not be strings that are equivalent but have different
> bytes?

Correct. The problem is complex. It works fine only for normalized string. But
postgres now assume that all utf8 strings are normalized.

If you need to implement < <= >= > operators you need to use strcol which take
care of locale collation.

See unicode collation algorithm http://www.unicode.org/reports/tr10/

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)sun(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 19:48:50
Message-ID: 162867790807071248x18a98688xfdf082806fd5cda5@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2008/7/7 David E. Wheeler <david(at)kineticode(dot)com>:
> On Jul 7, 2008, at 12:36, Pavel Stehule wrote:
>
>>> * Does it need to be locale-aware or not?
>>
>> yes!
>
> texteq() in varlena.c does not seem to be.
>

it's case sensitive and it's +/- binary compare

Regards
Pavel Stehule

> Best,
>
> David
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 20:10:14
Message-ID: 22075.1215461414@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> So, the upshot is that the = and <> operators are not locale-aware,
> yes? They just do byte comparisons. Is that really the way it should
> be? I mean, could there not be strings that are equivalent but have
> different bytes?

We intentionally force such strings to be considered non-equal.
See varstr_cmp, and if you like see the archives from back when
that was put in.

The = and <> operators are in fact consistent with the behavior of
varstr_cmp (and had better be!); they're just optimized a bit.

regards, tom lane


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org, Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 20:13:15
Message-ID: 5170D103-1110-4302-895C-5669AEA4D236@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 13:10, Tom Lane wrote:

> We intentionally force such strings to be considered non-equal.
> See varstr_cmp, and if you like see the archives from back when
> that was put in.
>
> The = and <> operators are in fact consistent with the behavior of
> varstr_cmp (and had better be!); they're just optimized a bit.

Thank you, Tom. I'll post my updated patch shortly.

In the meantime, can anyone suggest an easy way to populate a table
full of random strings so that I can do a bit of benchmarking?

Thanks,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc: pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 20:15:03
Message-ID: 8E2D49F2-E366-4504-9428-0AB6F35468FA@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 12:46, Zdenek Kotala wrote:

>> So, the upshot is that the = and <> operators are not locale-aware,
>> yes? They just do byte comparisons. Is that really the way it
>> should be? I mean, could there not be strings that are equivalent
>> but have different bytes?
>
> Correct. The problem is complex. It works fine only for normalized
> string. But postgres now assume that all utf8 strings are normalized.

I see. So binary equivalence is okay, in that case.

> If you need to implement < <= >= > operators you need to use strcol
> which take care of locale collation.

Which varstr_cmp() does, I guess. It's what textlt uses, for example.

> See unicode collation algorithm http://www.unicode.org/reports/tr10/

Wow, that looks like a fun read.

Best,

David


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)Sun(dot)COM>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 20:59:16
Message-ID: 87mykt9zrv.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:

> On Jul 7, 2008, at 12:21, David E. Wheeler wrote:
>
>> My question is: why? Shouldn't they all use the same function for
>> comparison? I'm happy to dupe this implementation for citext, but I don't
>> understand it. Should not all comparisons be executed consistently?
>
> Let me try to answer my own question by citing this comment:
>
> /*
> * Since we only care about equality or not-equality, we can avoid all
> the
> * expense of strcoll() here, and just do bitwise comparison.
> */
>
> So, the upshot is that the = and <> operators are not locale-aware, yes? They
> just do byte comparisons. Is that really the way it should be? I mean, could
> there not be strings that are equivalent but have different bytes?

There could be strings that strcoll returns 0 for even though they're not
identical. However that caused problems in Postgres so we decided that only
equal strings should actually compare equal. So if strcoll returns 0 then we
do a bytewise comparison to impose an arbitrary ordering.

Of course the obvious case of two equivalent strings with different bytes
would be two strings which differ only in case in a collation which doesn't
distinguish based on case. So you obviously can't take this route for citext.

I don't think you have to worry about the problem that cause Postgres to make
this change. IIRC it was someone comparing strings like paths and usernames
and getting false positives because they were in a Turkish locale which found
certain sequences of characters to be insignificant for ordering. Someone
who's using a citext data type has obviously decided that's precisely the kind
of behaviour they want.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's RemoteDBA services!


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>, "Zdenek Kotala" <Zdenek(dot)Kotala(at)Sun(dot)COM>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 21:05:24
Message-ID: BAC3302A-D88F-49C9-9E7D-596C347D9649@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 13:59, Gregory Stark wrote:

> Of course the obvious case of two equivalent strings with different
> bytes
> would be two strings which differ only in case in a collation which
> doesn't
> distinguish based on case. So you obviously can't take this route
> for citext.

Well, to be fair, citext isn't imposing a collation. It's just calling
str_tolower() on strings before passing them on to varstr_cmp() or
strncmp() to compare.

> I don't think you have to worry about the problem that cause
> Postgres to make
> this change. IIRC it was someone comparing strings like paths and
> usernames
> and getting false positives because they were in a Turkish locale
> which found
> certain sequences of characters to be insignificant for ordering.
> Someone
> who's using a citext data type has obviously decided that's
> precisely the kind
> of behaviour they want.

Hrm. So in your opinion, strncmp() could be used for all comparisons
by citext, rather than varstr_cmp()?

Thanks,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-07 23:24:27
Message-ID: 551B62DD-77F5-4CA3-9C6E-14E38A64EF26@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 08:01, Andrew Dunstan wrote:

> What does still bother me is its performance. I'd like to know if
> any measurement has been done of using citext vs. a functional index
> on lower(foo).

Okay, here's a start. The attached script inserts random strings of
1-10 space-delimited words into text and citext columns, and then
compares the performance of queries with and without indexes. The
output for me is as follows:

Loading words from dictionary.
Inserting into the table.

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 254.254 ms
SELECT * FROM try WHERE citext = 'food';
Time: 288.535 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.385 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 236.186 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 235.818 ms

Adding indexes...

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 1.260 ms
SELECT * FROM try WHERE citext = 'food';
Time: 277.755 ms

Test LIKE and ILIKE
SELECT * FROM try WHERE LOWER(text) LIKE LOWER('C%');
Time: 209.073 ms
SELECT * FROM try WHERE citext ILIKE 'C%';
Time: 238.430 ms
SELECT * FROM try WHERE citext LIKE 'C%';
Time: 238.685 ms
benedict%

So for some reason, after adding the indexes, the queries against the
CITEXT column aren't using them. Furthermore, the `lower(text) LIKE
lower(?)` query isn't using *its* index. Huh?

So this leaves me with two questions:

1. For what reason would the query against the citext column *not* use
the index?

2. Is there some way to get the CITEXT index to behave like a LOWER()
index, that is, so that its value is stored using the result of the
str_tolower() function, thus removing some of the overhead of
converting the values for each row fetched from the index? (Does this
question make any sense?)

Thanks,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-07 23:26:36
Message-ID: 108F85E1-3E18-4298-9371-102F5D6EDC9B@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

And here is the script. D'oh!

Thanks,

David

Attachment Content-Type Size
try.sql application/octet-stream 27 bytes
unknown_filename text/plain 2.1 KB

From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-07 23:37:00
Message-ID: F8B7A37C-45E7-416A-9AFC-714404A13627@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

No, *really*

Sheesh, sorry.

David

Attachment Content-Type Size
try.sql application/octet-stream 2.4 KB
unknown_filename text/plain 2.5 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Zdenek Kotala" <Zdenek(dot)Kotala(at)Sun(dot)COM>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-07 23:58:07
Message-ID: 25624.1215475087@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> Hrm. So in your opinion, strncmp() could be used for all comparisons
> by citext, rather than varstr_cmp()?

I thought the charter of this type was to work like lower(textcol).
If that's so, you certainly can't use strncmp, because that would result
in sort orderings totally different from lower()'s result. Even without
that argument, for most multibyte cases you'd get a pretty arbitrary,
user-unfriendly sort ordering.

regards, tom lane


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Zdenek Kotala" <Zdenek(dot)Kotala(at)Sun(dot)COM>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-08 00:03:34
Message-ID: 1D9FF20F-FFBF-4584-B6F0-D33B54B9E2CB@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 16:58, Tom Lane wrote:

> "David E. Wheeler" <david(at)kineticode(dot)com> writes:
>> Hrm. So in your opinion, strncmp() could be used for all comparisons
>> by citext, rather than varstr_cmp()?
>
> I thought the charter of this type was to work like lower(textcol).

Correct.

> If that's so, you certainly can't use strncmp, because that would
> result
> in sort orderings totally different from lower()'s result. Even
> without
> that argument, for most multibyte cases you'd get a pretty arbitrary,
> user-unfriendly sort ordering.

Now I'm confused again. :-( Whether or not I use strncmp() or
varstr_cmp(), I first lowercase the value to be compared using
str_tolower(). What Zdenek has said is, that aside, just as for the
TEXT type, I should use strncmp() for = and <>, and varstr_cmp() for
everything else. Are you saying something different?

I could use some examples to play with in order to ensure that things
are behaving as they should. I'll add regression tests for them.

Thanks,

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Zdenek Kotala" <Zdenek(dot)Kotala(at)Sun(dot)COM>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-08 00:18:40
Message-ID: 26034.1215476320@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> On Jul 7, 2008, at 16:58, Tom Lane wrote:
>> If that's so, you certainly can't use strncmp, because that would
>> result
>> in sort orderings totally different from lower()'s result. Even
>> without
>> that argument, for most multibyte cases you'd get a pretty arbitrary,
>> user-unfriendly sort ordering.

> Now I'm confused again. :-( Whether or not I use strncmp() or
> varstr_cmp(), I first lowercase the value to be compared using
> str_tolower(). What Zdenek has said is, that aside, just as for the
> TEXT type, I should use strncmp() for = and <>, and varstr_cmp() for
> everything else. Are you saying something different?

No, but you were: you proposed using strncmp for everything.

regards, tom lane


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org, "Zdenek Kotala" <Zdenek(dot)Kotala(at)Sun(dot)COM>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Subject: Re: PATCH: CITEXT 2.0
Date: 2008-07-08 00:26:24
Message-ID: 372ED376-DF77-4810-A7B5-27DD1D0BD772@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 17:18, Tom Lane wrote:

> No, but you were: you proposed using strncmp for everything.

Yes, that's right. I was trying to understand why I wouldn't use just
one or the other. And the answer turned out to be, you wouldn't,
except that strncmp() is an desirable optimization for = and <>. So
I've changed only those.

Phew, I think I'm clear now. Thanks!

DAvid


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-08 00:34:39
Message-ID: 910F6761-133F-4F67-9CA2-1E8771E638EF@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thanks to help from RhodiumToad on IRC, I got some things improved here:

On Jul 7, 2008, at 16:24, David E. Wheeler wrote:

> So for some reason, after adding the indexes, the queries against
> the CITEXT column aren't using them. Furthermore, the `lower(text)
> LIKE lower(?)` query isn't using *its* index. Huh?

I never knew what one needed to use the text_pattern_ops operator
class to index a column for use with LIKE! I had no clue. Would that
work for a citext column, too, since it's essentially the same as TEXT?

> So this leaves me with two questions:
>
> 1. For what reason would the query against the citext column *not*
> use the index?

It turns out that it did use the index if I put `SET enable_seqscan =
false;` into my script. So with RhodiumToad's direction, I added some
`RESTRICT` and `JOIN` clauses to my comparison operators (copying them
from ip4r). So now I have:

CREATE OPERATOR = (
LEFTARG = CITEXT,
RIGHTARG = CITEXT,
COMMUTATOR = =,
NEGATOR = <>,
PROCEDURE = citext_eq,
RESTRICT = eqsel,
JOIN = eqjoinsel,
HASHES,
MERGES
);

CREATE OPERATOR <> (
LEFTARG = CITEXT,
RIGHTARG = CITEXT,
NEGATOR = =,
COMMUTATOR = <>,
PROCEDURE = citext_ne,
RESTRICT = neqsel,
JOIN = neqjoinsel
);

CREATE OPERATOR < (
LEFTARG = CITEXT,
RIGHTARG = CITEXT,
NEGATOR = >=,
COMMUTATOR = >,
PROCEDURE = citext_lt,
RESTRICT = scalarltsel,
JOIN = scalarltjoinsel
);

CREATE OPERATOR <= (
LEFTARG = CITEXT,
RIGHTARG = CITEXT,
NEGATOR = >,
COMMUTATOR = <=,
PROCEDURE = citext_le,
RESTRICT = scalarltsel,
JOIN = scalarltjoinsel
);

CREATE OPERATOR >= (
LEFTARG = CITEXT,
RIGHTARG = CITEXT,
NEGATOR = <,
COMMUTATOR = <=,
PROCEDURE = citext_ge,
RESTRICT = scalargtsel,
JOIN = scalargtjoinsel
);

CREATE OPERATOR > (
LEFTARG = CITEXT,
RIGHTARG = CITEXT,
NEGATOR = <=,
COMMUTATOR = <,
PROCEDURE = citext_gt,
RESTRICT = scalargtsel,
JOIN = scalargtjoinsel
);

With this change, the index was used:

Loading words from dictionary.
Inserting into the table.

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 261.295 ms
SELECT * FROM try WHERE citext = 'food';
Time: 289.304 ms
Time: 1228.961 ms

Adding indexes...

Test =.
SELECT * FROM try WHERE LOWER(text) = LOWER('food');
Time: 2.018 ms
SELECT * FROM try WHERE citext = 'food';
Time: 0.788 ms

Seems to be faster than the LOWER() version, too, which makes me
happy. The output from EXPLAIN ANALYZE:

try=# EXPLAIN ANALYZE SELECT * FROM try WHERE citext = 'food';

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx_try_citext on try (cost=0.00..8.31 rows=1
width=119) (actual time=0.324..0.324 rows=0 loops=1)
Index Cond: (citext = 'food'::citext)
Total runtime: 0.377 ms
(3 rows)

try=# EXPLAIN ANALYZE SELECT * FROM try WHERE LOWER(text) =
LOWER('food');

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on try (cost=28.17..1336.10 rows=500 width=119)
(actual time=0.170..0.170 rows=0 loops=1)
Recheck Cond: (lower(text) = 'food'::text)
-> Bitmap Index Scan on idx_try_text (cost=0.00..28.04 rows=500
width=0) (actual time=0.168..0.168 rows=0 loops=1)
Index Cond: (lower(text) = 'food'::text)
Total runtime: 0.211 ms
(5 rows)

So my only other question related to this is:

* Are the above RESTRICT and JOIN functions the ones to use, or is
there some way to make use of those used by the TEXT type that would
be more appropriate?

> 2. Is there some way to get the CITEXT index to behave like a
> LOWER() index, that is, so that its value is stored using the result
> of the str_tolower() function, thus removing some of the overhead of
> converting the values for each row fetched from the index? (Does
> this question make any sense?)

Given the performance with an index, I think that this is moot, yes?
There is, of course, much more overhead for a table scan.

Best,

David


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-08 06:54:58
Message-ID: 20080708065458.GB16671@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jul 07, 2008 at 12:06:08PM -0700, David E. Wheeler wrote:
> I guess that'd be the reason to keep it on pgFoundry, but I have two
> comments:
>
> * 2-3 years is a *long* time in Internet time.

There have been patches over the years, but they tend not to get looked
at. Recently someone pulled up the COLLATE patch from a couple of years
ago but it didn't get much feedback either. (I can't find the link
right now).

It's disappointing that the discussions get hung up on the ICU library
when it's not required or even needed for COLLATE support. My original
patch never even mentioned it.

I note that Firebird added COLLATE using ICU a few years back now. I
think PostgreSQL is the only large DBMS to not support it.

> * There is on guarantee that it will be finished in that time or,
> indeed ever (no disrespect to Radek Strnad, it's just there are always
> unforeseen issues).

I think that with concerted coding effort it could be done in 2-3
months (judging by how long it took to write the first version). The
problem is it needs some planner kung-fu which not many people have.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-08 07:12:07
Message-ID: 48731347.8050802@sun.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout napsal(a):
> On Mon, Jul 07, 2008 at 12:06:08PM -0700, David E. Wheeler wrote:
>> I guess that'd be the reason to keep it on pgFoundry, but I have two
>> comments:
>>
>> * 2-3 years is a *long* time in Internet time.
>
> There have been patches over the years, but they tend not to get looked
> at. Recently someone pulled up the COLLATE patch from a couple of years
> ago but it didn't get much feedback either. (I can't find the link
> right now).

I know about it. I have printed your proposal on my desk. I think It is linked
from TODO list.

> It's disappointing that the discussions get hung up on the ICU library
> when it's not required or even needed for COLLATE support. My original
> patch never even mentioned it.
>
> I note that Firebird added COLLATE using ICU a few years back now. I
> think PostgreSQL is the only large DBMS to not support it.

Complete agree. Collation missing support is big problem for many users.

>> * There is on guarantee that it will be finished in that time or,
>> indeed ever (no disrespect to Radek Strnad, it's just there are always
>> unforeseen issues).
>
> I think that with concerted coding effort it could be done in 2-3
> months (judging by how long it took to write the first version). The
> problem is it needs some planner kung-fu which not many people have.

I agree that 2-3 months on fulltime is good estimation, problem is that you need
kung-fu master which has time to do it :(. What we currently have is student
which works on it in free time.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Subject: Re: PATCH: CITEXT 2.0 v2
Date: 2008-07-09 16:49:14
Message-ID: 0B48CBE7-5C4E-4FD6-93FA-6E729CD7EC75@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 7, 2008, at 12:06, David E. Wheeler wrote:

>> I understand it but there is parallel project which should solve
>> this problem completely I guess in "close" future (2-3years).
>> Afterward this module will be obsolete and it will takes times to
>> remove it from contrib. It seems to me that have citext in contrib
>> only for two releases is not optimal solution.
>
> I guess that'd be the reason to keep it on pgFoundry, but I have two
> comments:
>
> * 2-3 years is a *long* time in Internet time.
>
> * There is on guarantee that it will be finished in that time or,
> indeed ever (no disrespect to Radek Strnad, it's just there are
> always unforeseen issues).

One other thing that occurred to me yesterday: Given that the feature
will ultimately support column-level collations, I suspect that it
will be much easier to migrate CITEXT to a case-insensitive collation
(perhaps using an updated CITEXT contrib module that just does so
transparently) than to migrate application code from using LOWER() all
over the place to not using. One transition requires a change to the
schema, the other requires a change to application code, of which
there is generally a lot more.

Best,

David