Re: PATCH: CITEXT 2.0 v4

Lists: pgsql-hackers
From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: PATCH: CITEXT 2.0 v4
Date: 2008-07-16 05:23:21
Message-ID: 34AAF859-C4A5-4E6A-A20C-E836631EE32F@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Howdy,

I've attached a new patch with the latest revisions of for the citext
contrib module patch. The changes include:

* Using strlen() to pass string lengths to the comparison function,
since lowercasing the value can change the length. Per Tom Lane.
* Made citextcmp consistently return int32, per Tom Lane.
* Made the hash index function return the proper value, per Tom Lane.
* Removed the COMMENTs and GRANTs from citext.sql.in.
* Added a cast function from bpchar to citext, as suggested by Tom Lane.
* Set the storage type for CITEXT to "extended", to ensure that it will
be toastable. Per Tom Lane.
* Fixed the COMMUTATOR of <=.
* Changed the cast from citext to bpchar from implicit to assignment.
This eliminates ambiguous function resolutions.
* Eliminated superflous functions, per Tom Lane.
* Removed unnecessary `OPERATOR()` calls in NEGATORs and the like.
* Added binary in/out functions. Per Tom Lane
* Added an explicit shell type to make the output a bit quieter.
* Converted tests to pure SQL and omitted multibyte tests (though a
few remain commented-out).
* Reorganized and expanded the documentation a bit.

This version is far better than I started with, and I'm very grateful
for the feedback.

Now, I have a few remaining questions to ask, mostly just to get your
opinions:

* The README for citext 1.0 on pgFoundry says:

> I had to make a decision on casting between types for regular
> expressions and
> decided that if any parameter is of citext type then case
> insensitive applies.
> For example applying regular expressions with a varchar and a citext
> will
> produce a case-insensitive result.
>
> Having thought about this afterwards I realised that since we have
> the option
> to use case-insensitive results with regular expressions I should
> have left the
> behaviour exactly as text and then you have the best of both
> worlds... oh well
> not hard to change for any of you perfectionists!

I followed the original and made all the regex and LIKE comparisons
case-insensitive. But maybe I should not have? Especially since the
regular expression functions (e.g., regexp_replace()) and a few non-
regex functions (e.g., replace()) still don't behave case-insensitively?

* If the answer is "no", how can I make those functions behave case-
insensitively? (See the "TODO" tests.)

* Should there be any other casts? To and from name, perhaps?

Thanks!

David

Attachment Content-Type Size
citext4.patch.gz application/x-gzip 10.2 KB

From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PATCH: CITEXT 2.0 v4
Date: 2008-07-16 17:54:25
Message-ID: 15397E1D-2897-4863-A48D-7B1FB9753354@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 15, 2008, at 22:23, David E. Wheeler wrote:

> * The README for citext 1.0 on pgFoundry says:
>
>> I had to make a decision on casting between types for regular
>> expressions and
>> decided that if any parameter is of citext type then case
>> insensitive applies.
>> For example applying regular expressions with a varchar and a
>> citext will
>> produce a case-insensitive result.
>>
>> Having thought about this afterwards I realised that since we have
>> the option
>> to use case-insensitive results with regular expressions I should
>> have left the
>> behaviour exactly as text and then you have the best of both
>> worlds... oh well
>> not hard to change for any of you perfectionists!
>
> I followed the original and made all the regex and LIKE comparisons
> case-insensitive. But maybe I should not have? Especially since the
> regular expression functions (e.g., regexp_replace()) and a few non-
> regex functions (e.g., replace()) still don't behave case-
> insensitively?

I was thinking about this a bit last night and wanted to fill things
out a bit.

As a programmer, I find Donald Fraser's hindsight to be more
appealing, because at least that way I have the option to do matching
against CITEXT strings case-sensitively when I want to.

OTOH, if what we want is to have CITEXT work more like a case-
insensitive collation, then the expectation from the matching
operators and functions might be different. Does anyone have any idea
whether regex and LIKE matching against a string in a case-insensitive
collation would match case-insensitively or not? If so, then maybe the
regex and LIKE ops and funcs *should* match case-insensitively? If
not, or if only for some collations, then I would think not.

Either way, I know of no way, currently, to allow functions like
replace(), split_part(), strpos(), and translate() to match case-
insensitiely, even if we wanted to. Anyone have any ideas?

> * If the answer is "no", how can I make those functions behave case-
> insensitively? (See the "TODO" tests.)
>
> * Should there be any other casts? To and from name, perhaps?

Thanks,

David


From: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Subject: Re: PATCH: CITEXT 2.0 v4
Date: 2008-07-16 18:20:14
Message-ID: 200807161420.14824.xzilla@users.sourceforge.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wednesday 16 July 2008 13:54:25 David E. Wheeler wrote:
> On Jul 15, 2008, at 22:23, David E. Wheeler wrote:
> > * The README for citext 1.0 on pgFoundry says:
> >> I had to make a decision on casting between types for regular
> >> expressions and
> >> decided that if any parameter is of citext type then case
> >> insensitive applies.
> >> For example applying regular expressions with a varchar and a
> >> citext will
> >> produce a case-insensitive result.
> >>
> >> Having thought about this afterwards I realised that since we have
> >> the option
> >> to use case-insensitive results with regular expressions I should
> >> have left the
> >> behaviour exactly as text and then you have the best of both
> >> worlds... oh well
> >> not hard to change for any of you perfectionists!
> >
> > I followed the original and made all the regex and LIKE comparisons
> > case-insensitive. But maybe I should not have? Especially since the
> > regular expression functions (e.g., regexp_replace()) and a few non-
> > regex functions (e.g., replace()) still don't behave case-
> > insensitively?
>
> I was thinking about this a bit last night and wanted to fill things
> out a bit.
>
> As a programmer, I find Donald Fraser's hindsight to be more
> appealing, because at least that way I have the option to do matching
> against CITEXT strings case-sensitively when I want to.
>
> OTOH, if what we want is to have CITEXT work more like a case-
> insensitive collation, then the expectation from the matching
> operators and functions might be different. Does anyone have any idea
> whether regex and LIKE matching against a string in a case-insensitive
> collation would match case-insensitively or not? If so, then maybe the
> regex and LIKE ops and funcs *should* match case-insensitively? If
> not, or if only for some collations, then I would think not.
>
> Either way, I know of no way, currently, to allow functions like
> replace(), split_part(), strpos(), and translate() to match case-
> insensitiely, even if we wanted to. Anyone have any ideas?
>
> > * If the answer is "no", how can I make those functions behave case-
> > insensitively? (See the "TODO" tests.)
> >
> > * Should there be any other casts? To and from name, perhaps?
>

AIUI, your propsing the following:

select 'x'::citext = 'X'::citext;
?column?
----------
t
(1 row)

select 'x'::citext ~ 'X'::citext;
?column?
----------
f
(1 row)

I understand the desire for flexibility, but the above seems wierd to me.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v4
Date: 2008-07-16 18:38:41
Message-ID: ED790C97-9248-4D3B-B382-FD07D1721568@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 16, 2008, at 11:20, Robert Treat wrote:

>> I was thinking about this a bit last night and wanted to fill things
>> out a bit.
>>
>> As a programmer, I find Donald Fraser's hindsight to be more
>> appealing, because at least that way I have the option to do matching
>> against CITEXT strings case-sensitively when I want to.
>>
>> OTOH, if what we want is to have CITEXT work more like a case-
>> insensitive collation, then the expectation from the matching
>> operators and functions might be different. Does anyone have any idea
>> whether regex and LIKE matching against a string in a case-
>> insensitive
>> collation would match case-insensitively or not? If so, then maybe
>> the
>> regex and LIKE ops and funcs *should* match case-insensitively? If
>> not, or if only for some collations, then I would think not.
>>
>> Either way, I know of no way, currently, to allow functions like
>> replace(), split_part(), strpos(), and translate() to match case-
>> insensitiely, even if we wanted to. Anyone have any ideas?
>>
>>> * If the answer is "no", how can I make those functions behave case-
>>> insensitively? (See the "TODO" tests.)
>>>
>>> * Should there be any other casts? To and from name, perhaps?
>>
>
> AIUI, your propsing the following:
>
> select 'x'::citext = 'X'::citext;
> ?column?
> ----------
> t
> (1 row)
>
> select 'x'::citext ~ 'X'::citext;
> ?column?
> ----------
> f
> (1 row)
>
> I understand the desire for flexibility, but the above seems wierd
> to me.

That's what Donald Fraser suggested, and I see some value in that, but
wanted to get some other opinions. And you're right, that does seem a
bit weird.

The trouble is that, right now:

template1=# select regexp_replace( 'fxx'::citext, 'X'::citext, 'o');
regexp_replace
----------------
fxx
(1 row)

So there's an inconsistency there. I don't know how to make that work
case-insensitively.

Best,

David


From: Michael Paesold <mpaesold(at)gmx(dot)at>
To: David E(dot) Wheeler <david(at)kineticode(dot)com>
Cc: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v4
Date: 2008-07-17 10:45:42
Message-ID: 2DF54270-10B1-4687-869F-4AD77D3A7CB8@gmx.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Am 16.07.2008 um 20:38 schrieb David E. Wheeler:
>
> The trouble is that, right now:
>
> template1=# select regexp_replace( 'fxx'::citext, 'X'::citext, 'o');
> regexp_replace
> ----------------
> fxx
> (1 row)
>
> So there's an inconsistency there. I don't know how to make that
> work case-insensitively.

Wouldn't it be possible to create a variant of regexp_replace, i.e.
regexp_replace(citext,citext,text), which would again lower-case the
first two arguments before passing the input to
regexp_replace(text,text,text)?

Best Regards
Michael Paesold


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Michael Paesold <mpaesold(at)gmx(dot)at>
Cc: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v4
Date: 2008-07-17 17:07:31
Message-ID: 80B7A7E8-CABD-4663-B13E-2B39324F89CD@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 17, 2008, at 03:45, Michael Paesold wrote:

> Wouldn't it be possible to create a variant of regexp_replace, i.e.
> regexp_replace(citext,citext,text), which would again lower-case the
> first two arguments before passing the input to
> regexp_replace(text,text,text)?

Sure, but then you end up with this:

template1=# select regexp_replace( 'Fxx'::citext, 'X'::citext, 'o');
regexp_replace
----------------
foo
(1 row)

Which is just wrong. I'm going to look at the regex C functions today
and see if there's an easy way to just always pass them the 'i' flag,
which would do the trick. That still won't help replace(),
split_part(), or translate(), however.

Best,

David


From: Michael Paesold <mpaesold(at)gmx(dot)at>
To: David E(dot) Wheeler <david(at)kineticode(dot)com>
Cc: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v4
Date: 2008-07-18 08:39:23
Message-ID: B33BE1AA-D015-4428-A93D-8B305AE9050B@gmx.at
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler writes:

> On Jul 17, 2008, at 03:45, Michael Paesold wrote:
>
>> Wouldn't it be possible to create a variant of regexp_replace, i.e.
>> regexp_replace(citext,citext,text), which would again lower-case
>> the first two arguments before passing the input to
>> regexp_replace(text,text,text)?
>
> Sure, but then you end up with this:
>
> template1=# select regexp_replace( 'Fxx'::citext, 'X'::citext, 'o');
> regexp_replace
> ----------------
> foo
> (1 row)

Yeah, you are right, I see. :-)

> Which is just wrong. I'm going to look at the regex C functions
> today and see if there's an easy way to just always pass them the
> 'i' flag, which would do the trick. That still won't help replace(),
> split_part(), or translate(), however.

Calling regex functions with the case-insensitivity option would be
great. It should also be possible to rewrite replace() into
regexp_replace() by first escaping the regex meta characters.

Actually re-implementing those functions in a case insensitive way
would still be an option, but of course some amount of work. The
question is, how much use case there is.

Best Regards
Michael Paesold


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Michael Paesold <mpaesold(at)gmx(dot)at>
Cc: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v4
Date: 2008-07-18 16:53:31
Message-ID: D490366D-19C5-4126-8869-9C20A6B2E42D@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 18, 2008, at 01:39, Michael Paesold wrote:

> Calling regex functions with the case-insensitivity option would be
> great. It should also be possible to rewrite replace() into
> regexp_replace() by first escaping the regex meta characters.
>
> Actually re-implementing those functions in a case insensitive way
> would still be an option, but of course some amount of work. The
> question is, how much use case there is.

Not much for me. I might use the regex functions, but would be happy
to manually pass the "i" flag.

However, if someone with a lot more C and Pg core knowledge wanted to
sit down with me for a couple hours next week and help me bang out
these functions, that would be great. I'd love to have the
implementation be that much more complete.

I do believe that, as it stands now in the v4 patch, citext is pretty
close to ready, and certainly commit-able.

Thanks,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Michael Paesold <mpaesold(at)gmx(dot)at>, Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>
Subject: Re: PATCH: CITEXT 2.0 v4
Date: 2008-07-21 18:55:51
Message-ID: 70AA56DA-5789-41F9-AC24-FE2F219F1017@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 18, 2008, at 09:53, David E. Wheeler wrote:

> However, if someone with a lot more C and Pg core knowledge wanted
> to sit down with me for a couple hours next week and help me bang
> out these functions, that would be great. I'd love to have the
> implementation be that much more complete.

I've implemented fixes for the regexp_* functions and strpos() in pure
SQL, like so:

CREATE OR REPLACE FUNCTION regexp_matches( citext, citext ) RETURNS
TEXT[] AS '
SELECT regexp_matches( $1::text, $2::text, ''i'' );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_matches( citext, citext, text )
RETURNS TEXT[] AS '
SELECT regexp_matches( $1::text, $2::text, CASE WHEN strpos($3,
''c'') = 0 THEN $3 || ''i'' ELSE $3 END );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_replace( citext, citext, text )
returns TEXT AS '
SELECT regexp_replace( $1::text, $2::text, $3, ''i'');
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_replace( citext, citext, text,
text ) returns TEXT AS '
SELECT regexp_replace( $1::text, $2::text, $3, CASE WHEN
strpos($4, ''c'') = 0 THEN $4 || ''i'' ELSE $4 END);
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_array( citext, citext )
RETURNS TEXT[] AS '
SELECT regexp_split_to_array( $1::text, $2::text, ''i'' );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_array( citext, citext,
text ) RETURNS TEXT[] AS '
SELECT regexp_split_to_array( $1::text, $2::text, CASE WHEN
strpos($3, ''c'') = 0 THEN $3 || ''i'' ELSE $3 END );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_table( citext, citext )
RETURNS SETOF TEXT AS '
SELECT regexp_split_to_table( $1::text, $2::text, ''i'' );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION regexp_split_to_table( citext, citext,
text ) RETURNS SETOF TEXT AS '
SELECT regexp_split_to_table( $1::text, $2::text, CASE WHEN
strpos($3, ''c'') = 0 THEN $3 || ''i'' ELSE $3 END );
' LANGUAGE SQL IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION strpos( citext, citext ) RETURNS INT AS '
SELECT strpos( LOWER( $1::text ), LOWER( $2::text ) );
' LANGUAGE SQL IMMUTABLE STRICT;

Not so bad, though it'd be nice to have C functions that just did
these things. Still not case-insensitive are:

-- replace()
-- split_part()
-- translate()

So, anyone at OSCON this week want to help me with these? Or to
convert the above functions to C? Greg? Bruce?

Thanks,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Michael Paesold <mpaesold(at)gmx(dot)at>
Cc: Robert Treat <xzilla(at)users(dot)sourceforge(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: PATCH: CITEXT 2.0 v4
Date: 2008-07-22 20:54:20
Message-ID: 65071C7E-0CE0-489C-B294-D1C6DAD49D16@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul 18, 2008, at 01:39, Michael Paesold wrote:

> Calling regex functions with the case-insensitivity option would be
> great. It should also be possible to rewrite replace() into
> regexp_replace() by first escaping the regex meta characters.
>
> Actually re-implementing those functions in a case insensitive way
> would still be an option, but of course some amount of work. The
> question is, how much use case there is.

I've figured out how to make all the functions work using SQL function
workarounds, converting things and re-dispatching to the text versions
as appropriate. They work quite well, and can be converted to C later
if that becomes a requirement.

Meanwhile, on the question of whether or not regular expression and
LIKE comparisons *should* match case-insensitively, I have a couple
more observations:

* Thinking about how a true case-insensitive collation would work, I'm
quite certain that it would match case-insensitively. Anything else
would just be unexpected, because in a case-insensitive collation,
lowercase characters are, in practice, identical to uppercase
characters. As far as matching is concerned, there is no difference
between them. So the matching operators and functions against CITEXT
should follow that assumption.

* I tried a few matches on MySQL, where the collation is case-
insensitive by default, and it confirms my impression:

mysql> select 'Foo' regexp 'o$';
+-------------------+
| 'Foo' regexp 'o$' |
+-------------------+
| 1 |
+-------------------+
1 row in set (0.00 sec)

mysql> select 'Foo' regexp 'O$';
+-------------------+
| 'Foo' regexp 'O$' |
+-------------------+
| 1 |
+-------------------+
1 row in set (0.00 sec)

mysql> select 'Foo' like '%o';
+-----------------+
| 'Foo' like '%o' |
+-----------------+
| 1 |
+-----------------+
1 row in set (0.00 sec)

mysql> select 'Foo' like '%O';
+-----------------+
| 'Foo' like '%O' |
+-----------------+
| 1 |
+-----------------+
1 row in set (0.00 sec)

I'll grant that MySQL may not be the best model for how things should
work, but it's something, at least. Anyone else got access to another
database with case-insensitive collations to see how LIKE and regular
expressions work?

Thanks,

David