Quick Links

Re: PATCH: CITEXT 2.0 v3

Lists:	pgsql-hackers

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	PATCH: CITEXT 2.0 v3
Date:	2008-07-08 01:03:56
Message-ID:	890EA230-DA04-4D65-996F-5E7107690BE8@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Attached is a new version of a patch to add a CITEXT contrib module.
Changes since v2:

* Optimized citext_eq() and citext_ne() (the = and <> operators,
respectively) by having them use strncmp() instead of varstr_cmp().
Per discussion.

* Added `RESTRICT` and `JOIN` clauses to the comparison operators (=,
<>, <, >, <=, >=). These improve statistics estimations, increasing
the liklihood that indices will be used.

Thanks!

David

Attachment	Content-Type	Size
citext3.patch.gz	application/x-gzip	19.2 KB

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	David E(dot) Wheeler <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-09 20:28:35
Message-ID:	8709D06C-AAFE-4712-85B0-852964369BC8@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

I guess you're all just blown away by the perfection of this patch? ;-)

Best,

David

On Jul 7, 2008, at 18:03, David E. Wheeler wrote:

> Attached is a new version of a patch to add a CITEXT contrib module.
> Changes since v2:
>
> * Optimized citext_eq() and citext_ne() (the = and <> operators,
> respectively) by having them use strncmp() instead of varstr_cmp().
> Per discussion.
>
> * Added `RESTRICT` and `JOIN` clauses to the comparison operators
> (=, <>, <, >, <=, >=). These improve statistics estimations,
> increasing the liklihood that indices will be used.
>
> Thanks!
>
> David

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-09 20:40:54
Message-ID:	20080709204054.GL3946@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

David E. Wheeler wrote:
> I guess you're all just blown away by the perfection of this patch? ;-)

The problem is that we're in the middle of a commitfest, so everybody is
busy reviewing other patches (in theory at least).

One thing that jumps at me is pgTAP usage, as Zdenek said. I understand
that it's neat and all that, but we can't include the tests because they
won't run unless one installs pgTAP which seems a nonstarter. So if you
want the tests in the repository along the rest of the stuff, they
really should use pg_regress.

It's not even difficult to use. Have a look at contrib/ltree/sql and
contrib/ltree/expected for examples.

If you want to push for pgTAP in core, that's fine, but it's a separate
discussion.

The other possibility being, of course, that you are proposing citext to
live on pgFoundry.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-09 20:50:44
Message-ID:	BD40FA86-AD63-4331-96C8-EFD3F8E82C0B@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 9, 2008, at 13:40, Alvaro Herrera wrote:

> The problem is that we're in the middle of a commitfest, so
> everybody is
> busy reviewing other patches (in theory at least).

Of course.

> One thing that jumps at me is pgTAP usage, as Zdenek said. I
> understand
> that it's neat and all that, but we can't include the tests because
> they
> won't run unless one installs pgTAP which seems a nonstarter. So if
> you
> want the tests in the repository along the rest of the stuff, they
> really should use pg_regress.

It does use pg_regress. The test just load the included pgtap.sql file
to get the tap functions, and then away they go. If you run `make
installcheck` it works.

> It's not even difficult to use. Have a look at contrib/ltree/sql and
> contrib/ltree/expected for examples.
>
> If you want to push for pgTAP in core, that's fine, but it's a
> separate
> discussion.

Agreed. I've sent a couple of messages in a thread about that, the
latest this morning.

> The other possibility being, of course, that you are proposing
> citext to
> live on pgFoundry.

I'm not, actually. I mean, I have an updated version for 8.3, but it'd
be quite a pita to maintain them both, since the api for lowercasing
text is so much simpler in 8.4.

Best,

David

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-09 21:01:03
Message-ID:	20080709210103.GM3946@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

David E. Wheeler wrote:
> On Jul 9, 2008, at 13:40, Alvaro Herrera wrote:

>> One thing that jumps at me is pgTAP usage, as Zdenek said. I
>> understand that it's neat and all that, but we can't include the
>> tests because they won't run unless one installs pgTAP which seems a
>> nonstarter. So if you want the tests in the repository along the
>> rest of the stuff, they really should use pg_regress.
>
> It does use pg_regress. The test just load the included pgtap.sql file
> to get the tap functions, and then away they go. If you run `make
> installcheck` it works.

Uh-huh, OK, that's fine then I guess.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

From:	Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-11 19:37:31
Message-ID:	4877B67B.6010009@sun.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

David E. Wheeler napsal(a):
> Attached is a new version of a patch to add a CITEXT contrib module.
> Changes since v2:
>
> * Optimized citext_eq() and citext_ne() (the = and <> operators,
> respectively) by having them use strncmp() instead of varstr_cmp(). Per
> discussion.
>
> * Added `RESTRICT` and `JOIN` clauses to the comparison operators (=,
> <>, <, >, <=, >=). These improve statistics estimations, increasing the
> liklihood that indices will be used.

I'm sorry for late response. I'm little bit busy this week. I quickly look on
your latest changes and it seems to me that everything is OK. I'm going to
change status to ready for commit. After discussion with Pavel Stehule I changed
meaning about pgFoundry vs. contrib. Contrib is OK for me.

Maybe some native speaker should review documentation.

Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-11 19:52:54
Message-ID:	C4DF10C4-C6ED-4843-8464-CCA8EB3CC7A9@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 11, 2008, at 12:37, Zdenek Kotala wrote:

> I'm sorry for late response. I'm little bit busy this week. I
> quickly look on your latest changes and it seems to me that
> everything is OK. I'm going to change status to ready for commit.
> After discussion with Pavel Stehule I changed meaning about
> pgFoundry vs. contrib. Contrib is OK for me.

Thank you, Zdenek. Have you had a chance to try citext yet? Or did you
just read the source?

Best,

David

From:	Zdenek Kotala <Zdenek(dot)Kotala(at)Sun(dot)COM>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-11 20:02:19
Message-ID:	4877BC4B.7030801@sun.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

David E. Wheeler napsal(a):
> On Jul 11, 2008, at 12:37, Zdenek Kotala wrote:
>
>> I'm sorry for late response. I'm little bit busy this week. I quickly
>> look on your latest changes and it seems to me that everything is OK.
>> I'm going to change status to ready for commit. After discussion with
>> Pavel Stehule I changed meaning about pgFoundry vs. contrib. Contrib
>> is OK for me.
>
> Thank you, Zdenek. Have you had a chance to try citext yet? Or did you
> just read the source?

I tested version two on Solaris/SPARC and Sun studio compiler. I checked last
version only quickly (comparing your changes).

With regards Zdenek

--
Zdenek Kotala Sun Microsystems
Prague, Czech Republic http://sun.com/postgresql

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-11 21:31:58
Message-ID:	27083F56-6EEA-47BB-ABA5-E905D27AC713@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 11, 2008, at 13:02, Zdenek Kotala wrote:

>> Thank you, Zdenek. Have you had a chance to try citext yet? Or did
>> you just read the source?
>
> I tested version two on Solaris/SPARC and Sun studio compiler. I
> checked last version only quickly (comparing your changes).

Thanks.

I just updated my performance test script (attached) by increasing the
number of rows tested by an order of magnitude. So now it creates
1,000,000 rows, queries them, adds indexes, and then queries them
again. Unfortunately, CITEXT seems to have a memory leak somewhere,
because when I index the CITEXT column, it fails with "ERROR: out of
memory". So obviously something's not getting cleaned up. Here's the
btree indexing function:

Datum
citext_cmp(PG_FUNCTION_ARGS)
{
text *left = PG_GETARG_TEXT_PP(0);
text *right = PG_GETARG_TEXT_PP(1);
int32 result;

result = citextcmp(left, right);

PG_FREE_IF_COPY(left, 0);
PG_FREE_IF_COPY(right, 1);

PG_RETURN_INT32(result);
}

And here's citextcmp():

citextcmp (text *left, text *right)
{
char *lcstr, *rcstr;
int result;

lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left));
rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right));

result = varstr_cmp(
lcstr,
VARSIZE_ANY_EXHDR(left),
rcstr,
VARSIZE_ANY_EXHDR(right)
);

pfree(lcstr);
pfree(rcstr);
return result;
}

Can anyone see where I'm failing to free up memory? Might it be in
some other function?

Thanks!

David

Attachment	Content-Type	Size
try.sql	application/octet-stream	2.4 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-11 23:45:44
Message-ID:	444.1215819944@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> lcstr = str_tolower(VARDATA_ANY(left), VARSIZE_ANY_EXHDR(left));
> rcstr = str_tolower(VARDATA_ANY(right), VARSIZE_ANY_EXHDR(right));

> result = varstr_cmp(
> lcstr,
> VARSIZE_ANY_EXHDR(left),
> rcstr,
> VARSIZE_ANY_EXHDR(right)
> );

Not sure about a memory leak, but this code is clearly wrong if
str_tolower results in a change of physical string length (clearly
possible in Turkish, for instance, and probably in some other
locales too). You need to do strlen's on the lowercased strings.

regards, tom lane

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-11 23:49:10
Message-ID:	C4D10B4C-DCF7-4377-8E82-A1FC95B75BC9@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 11, 2008, at 16:45, Tom Lane wrote:

> Not sure about a memory leak, but this code is clearly wrong if
> str_tolower results in a change of physical string length (clearly
> possible in Turkish, for instance, and probably in some other
> locales too). You need to do strlen's on the lowercased strings.

Ah, great point. Thanks.

Anyone else got an idea on the memory leak?

Best,

David

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-12 02:18:12
Message-ID:	4164.1215829092@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> Unfortunately, CITEXT seems to have a memory leak somewhere,

I tried it here and didn't see any apparent memory leak, on two
different systems. What platform are you testing on, and with
what encoding/locale settings?

regards, tom lane

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-12 02:35:42
Message-ID:	8582534D-277E-4E60-B2A0-0617AD3CBD3C@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 11, 2008, at 19:18, Tom Lane wrote:

> I tried it here and didn't see any apparent memory leak, on two
> different systems. What platform are you testing on, and with
> what encoding/locale settings?

PostgreSQL 8.3.3 on i386-apple-darwin9.4.0, compiled by GCC i686-
apple-darwin9-gcc-4.0.1 (G

That's Mac OS X 10.5.4 "Leopard." Using en_US.UTF-8. This is using my
version for 8.3.3 from svn.

https://svn.kineticode.com/citext/trunk/

I'll give it a try myself with the patch against CVS in a bit.

Thanks,

David

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-12 03:10:44
Message-ID:	4861.1215832244@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> On Jul 11, 2008, at 19:18, Tom Lane wrote:
>> I tried it here and didn't see any apparent memory leak, on two
>> different systems. What platform are you testing on, and with
>> what encoding/locale settings?

> That's Mac OS X 10.5.4 "Leopard." Using en_US.UTF-8.

Oh, got one of those too ... [ time passes ... ] Nope, no leak
seen here either, though you could possibly fry an egg on my laptop
right now ...

Also, I notice that the Mac is the *only* one of the three systems
on which your regression tests pass. You're depending way too much
on platform-specific behavior there, I fear.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-12 03:22:58
Message-ID:	5047.1215832978@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Here's an updated version of citext.c. Some changes cosmetic, some
not so much ...

regards, tom lane

Attachment	Content-Type	Size
unknown_filename	text/plain	6.0 KB

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-12 04:27:15
Message-ID:	DB091FF6-925E-483A-B072-0067A255A9C8@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 11, 2008, at 20:10, Tom Lane wrote:

> Oh, got one of those too ... [ time passes ... ] Nope, no leak
> seen here either, though you could possibly fry an egg on my laptop
> right now ...

Yeah, gotta love that, right?

So the issue must be with my version for 8.3.x, meaning, likely, my
custom cilower() function. I'll have another look at it. Thanks for
giving it a whirl on your end.

> Also, I notice that the Mac is the *only* one of the three systems
> on which your regression tests pass. You're depending way too much
> on platform-specific behavior there, I fear.

Hrm. I wonder what that could be. I'll have to give it a shot on my
Ubuntu box and find out. Thanks.

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-12 04:32:45
Message-ID:	ADB0A9C1-044C-4783-974E-21002AC58941@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 11, 2008, at 20:22, Tom Lane wrote:

> Here's an updated version of citext.c. Some changes cosmetic, some
> not so much ...

Thanks! Good catch on my missing all of the s/int/int32/ stuff related
to citextcmp(). Stupid oversites on my part. I'll submit a new patch
with these changes shortly.

Best,

David

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-12 19:17:59
Message-ID:	23294.1215890279@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

BTW, I looked at the SQL file (citext.sql.in) a bit. Some comments:

* An explicit comment explaining that you're piggybacking on the I/O
functions (and some others) for "text" wouldn't be out of place.

* Lose the GRANT EXECUTEs on the I/O functions; they're redundant.
(If you needed them, you'd need them on a lot more than these two.)
I'd be inclined to lose the COMMENTs on the functions too; again
these are about the least useful ones to comment on out of the
whole module.

* You should provide binary I/O (send/receive) functions, if you want
this to be an industrial-strength module. It's easy since you can
piggyback on text's.

* The type declaration needs to say storage = extended, else the type
won't be toastable.

* The cast from bpchar to citext cannot be WITHOUT FUNCTION;
it needs to invoke rtrim1. Compare the bpchar to text cast.

* <= is surely not its own commutator. You might try running the
opr_sanity regression test on this module to see if it finds any
other silliness. (Procedure: insert the citext definition script
into the serial_schedule list just ahead of opr_sanity, run tests,
make sure you understand the reason for any diffs in the opr_sanity
result. There will be at least one from the uses of text-related
functions for citext.)

* I think you can and should drop all of the "size" functions and
a lot of the "miscellaneous" functions: anyplace where the implicit
coercion to text would serve, you don't need a piggyback function,
and introducing one just creates risks of
can't-resolve-ambiguous-function failures. The overloaded miscellaneous
functions are only justifiable to the extent that it's important to
preserve "citext-ness" of the result of a function, which seems at
best dubious for many of these. It is likewise pretty pointless AFAICS
to introduce regex functions taking citext pattern arguments.

* Don't use the OPERATOR() notation when you don't need to.
It's just clutter.

regards, tom lane

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-12 21:50:24
Message-ID:	EAE94ABE-4926-4660-9E01-33071648A13F@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 12, 2008, at 12:17, Tom Lane wrote:

> BTW, I looked at the SQL file (citext.sql.in) a bit. Some comments:

Thanks a million for these, Tom. I greatly appreciate it.

> * An explicit comment explaining that you're piggybacking on the I/O
> functions (and some others) for "text" wouldn't be out of place.

I've added SQL comments. Were you talking about a COMMENT?

> * Lose the GRANT EXECUTEs on the I/O functions; they're redundant.
> (If you needed them, you'd need them on a lot more than these two.)
> I'd be inclined to lose the COMMENTs on the functions too; again
> these are about the least useful ones to comment on out of the
> whole module.

I wondered about that; those were copied from CITEXT 1. I've removed
all GRANTs and COMMENTs.

> * You should provide binary I/O (send/receive) functions, if you want
> this to be an industrial-strength module. It's easy since you can
> piggyback on text's.

I'm confused. Is that not what the citextin and citextout functions are?

> * The type declaration needs to say storage = extended, else the type
> won't be toastable.

Ah, good, thanks.

> * The cast from bpchar to citext cannot be WITHOUT FUNCTION;
> it needs to invoke rtrim1. Compare the bpchar to text cast.

Where do I find that? I have trouble finding the SQL that creates the
core types. :-(

> * <= is surely not its own commutator.

Changed to >=.

> You might try running the
> opr_sanity regression test on this module to see if it finds any
> other silliness. (Procedure: insert the citext definition script
> into the serial_schedule list just ahead of opr_sanity, run tests,
> make sure you understand the reason for any diffs in the opr_sanity
> result. There will be at least one from the uses of text-related
> functions for citext.)

Thanks. Added to my list.

> * I think you can and should drop all of the "size" functions and
> a lot of the "miscellaneous" functions: anyplace where the implicit
> coercion to text would serve, you don't need a piggyback function,
> and introducing one just creates risks of
> can't-resolve-ambiguous-function failures. The overloaded
> miscellaneous
> functions are only justifiable to the extent that it's important to
> preserve "citext-ness" of the result of a function, which seems at
> best dubious for many of these. It is likewise pretty pointless
> AFAICS
> to introduce regex functions taking citext pattern arguments.

I added most of those as I wrote tests and they failed to find the
functions. Once I added the functions, they worked. But I'll do an
audit to make sure that I didn't inadvertantly leave in any unneeded
ones (I'm happy to have less code :-)).

> * Don't use the OPERATOR() notation when you don't need to.
> It's just clutter.

Sorry, don't know what you're referring to here. CREATE OPERATOR
appears to require parens…

Thanks for the great feedback! I'll work on getting things all
straightened out and a new patch in tonight.

Best,

David

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-12 21:57:27
Message-ID:	26072.1215899847@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

There was some discussion earlier about whether the proposed regression
tests for citext are suitable for use in contrib or not. After playing
with them for awhile, I have to come down very firmly on the side of
"not". I have these gripes:

1. The style is gratuitously different from every other regression test
in the system. This is not a good thing. If it were an amazingly
better way to do things, then maybe, but as far as I can tell the style
pgTAP forces on you is really pretty darn poorly suited for SQL tests.
You have to contort what could naturally be expressed in SQL as a table
result into a scalar. Plus it's redundant with the expected-output file.

2. It's ridiculously slow; at least a factor of ten slower than doing
equivalent tests directly in SQL. This is a very bad thing. Speed of
regression tests matters a lot to those of us who run them a dozen times
per day --- and I do not wish to discourage any developers who don't
work that way from learning better habits ;-)

Because of #1 and #2 I find the use of pgTAP to be a nonstarter.
I have a couple of gripes about the substance of the tests as well:

3. What you are mostly testing is not the behavior of citext itself,
but the behavior of the underlying strcoll function. This is pretty
much doomed to failure, first because the regression tests are intended
to work in multiple locales (and we are *not* giving that up for
citext), and second because the behavior of strcoll isn't all that
portable across platforms even given allegedly similar locale settings
(as we already found out yesterday). Sadly, I think you have to give up
attempts to check the interesting multibyte cases and confine yourself
to tests using ASCII strings.

4. A lot of the later test cases are equally uselessly testing whether
piggybacking over text functions works. The odds of ever finding
anything with those tests are not distinguishable from zero (unless the
underlying text function is busted, which is not your responsibility to
test). So I don't see any point in putting them into the standard
regression package. (What maybe *would* be useful to test, but you
didn't, is whether the result of a function is considered citext rather
than text.)

I suggest something more like the attached as a suitable regression
test. I got bored about halfway through and started to skim, so there
might be a few tests toward the end of the original set that would be
worth transposing into this one, but nothing jumped out at me.

regards, tom lane

Attachment	Content-Type	Size
unknown_filename	text/plain	3.0 KB

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-12 22:13:13
Message-ID:	A2AA4B32-A916-497C-9C2B-DC69B8BF952A@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 12, 2008, at 14:57, Tom Lane wrote:

> There was some discussion earlier about whether the proposed
> regression
> tests for citext are suitable for use in contrib or not. After
> playing
> with them for awhile, I have to come down very firmly on the side of
> "not". I have these gripes:

You're nothing if not thorough, Tom.

> 1. The style is gratuitously different from every other regression
> test
> in the system. This is not a good thing. If it were an amazingly
> better way to do things, then maybe, but as far as I can tell the
> style
> pgTAP forces on you is really pretty darn poorly suited for SQL tests.
> You have to contort what could naturally be expressed in SQL as a
> table
> result into a scalar. Plus it's redundant with the expected-output
> file.

Sure. I wasn't aware of the existing regression test methodology when
I wrote pgTAP and those tests. I'm fine to change them and use pgTAP
for testing my app code, rather than PostgreSQL itself.

> 2. It's ridiculously slow; at least a factor of ten slower than doing
> equivalent tests directly in SQL. This is a very bad thing. Speed of
> regression tests matters a lot to those of us who run them a dozen
> times
> per day --- and I do not wish to discourage any developers who don't
> work that way from learning better habits ;-)

Hrm. I'm wonder why it's so slow? The test functions don't really do a
lot. Anyway, I agree that they should perform well.

> Because of #1 and #2 I find the use of pgTAP to be a nonstarter.
> I have a couple of gripes about the substance of the tests as well:
>
> 3. What you are mostly testing is not the behavior of citext itself,
> but the behavior of the underlying strcoll function. This is pretty
> much doomed to failure, first because the regression tests are
> intended
> to work in multiple locales (and we are *not* giving that up for
> citext), and second because the behavior of strcoll isn't all that
> portable across platforms even given allegedly similar locale settings
> (as we already found out yesterday).

Yes, I *just* ran the tests on Ubuntu and opened my mail to ask about
the bizarre differences when I saw this message. Thanks for answering
my question before I asked it. :-)

> Sadly, I think you have to give up
> attempts to check the interesting multibyte cases and confine yourself
> to tests using ASCII strings.

Grr. Kind of defeats the purpose. Is there no infrastructure for
testing multibyte functionality? Are test database clusters all built
using SQL_ASCII and the C locale?

> 4. A lot of the later test cases are equally uselessly testing whether
> piggybacking over text functions works. The odds of ever finding
> anything with those tests are not distinguishable from zero (unless
> the
> underlying text function is busted, which is not your responsibility
> to
> test). So I don't see any point in putting them into the standard
> regression package. (What maybe *would* be useful to test, but you
> didn't, is whether the result of a function is considered citext
> rather
> than text.)

Well, I was doing test-driven development: I wrote the tests to ensure
that I had added the functions for CITEXT properly, and when they
passed, I could move on. As a unit tester, it'd feel weird for me to
drop these. I'm not testing the underlying functions; I'm making sure
that they work properly with CITEXT.

> I suggest something more like the attached as a suitable regression
> test. I got bored about halfway through and started to skim, so there
> might be a few tests toward the end of the original set that would be
> worth transposing into this one, but nothing jumped out at me.

Thanks! I'll work this in.

Best,

David

PS: I confirmed late yesterday that the memory leak I saw was with my
version for 8.3, where I had my own str_tolower(). It clearly has a
leak that I'll have to fix, but it has no bearing on the contribution
to 8.4, which has no such leak. Thanks for running the test yourself
to confirm.

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-13 03:30:33
Message-ID:	DCE867B9-C3CC-4368-956F-95FD9769D43B@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 12, 2008, at 15:13, David E. Wheeler wrote:

>> Sadly, I think you have to give up
>> attempts to check the interesting multibyte cases and confine
>> yourself
>> to tests using ASCII strings.
>
> Grr. Kind of defeats the purpose. Is there no infrastructure for
> testing multibyte functionality? Are test database clusters all
> built using SQL_ASCII and the C locale?

I just tried to take your modified tests and add multibyte tests that
run only on OS X with en_US.UTF-8. They worked like this:

CREATE OR REPLACE FUNCTION test_multibyte ()
RETURNS BOOLEAN AS $$
SELECT version() ~ 'apple-darwin'
AND (select setting from pg_settings where name = 'lc_collate')
= 'en_US.UTF-8';
$$ LANGUAGE SQL IMMUTABLE;

SELECT 'À'::citext = 'À'::citext WHERE test_multibyte() = true;
SELECT 'À'::citext = 'à'::citext WHERE test_multibyte() = true;

But then I realized that this would change the expected output
depending on the platform, and thus make the tests fail. This is one
reason why the inflexibility of the existing regression test model is
a drag: it limits you to testing only what works everywhere!

Grrr.

I'll remove all the multibyte character tests, but I have to say that,
as a result, the CITEXT 1 module would likely pass all such tests,
but it still isn't locale-aware. How can one add regressions to ensure
that something truly is locale-aware?

Thanks,

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-13 04:44:18
Message-ID:	A8BF06B7-B908-4C45-8C3E-8669994DB928@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 12, 2008, at 14:50, David E. Wheeler wrote:

>> * An explicit comment explaining that you're piggybacking on the I/O
>> functions (and some others) for "text" wouldn't be out of place.
>
> I've added SQL comments. Were you talking about a COMMENT?
>
>> * Lose the GRANT EXECUTEs on the I/O functions; they're redundant.
>> (If you needed them, you'd need them on a lot more than these two.)
>> I'd be inclined to lose the COMMENTs on the functions too; again
>> these are about the least useful ones to comment on out of the
>> whole module.
>
> I wondered about that; those were copied from CITEXT 1. I've removed
> all GRANTs and COMMENTs.
>
>> * You should provide binary I/O (send/receive) functions, if you want
>> this to be an industrial-strength module. It's easy since you can
>> piggyback on text's.
>
> I'm confused. Is that not what the citextin and citextout functions
> are?
>
>> * The type declaration needs to say storage = extended, else the type
>> won't be toastable.
>
> Ah, good, thanks.
>
>> * The cast from bpchar to citext cannot be WITHOUT FUNCTION;
>> it needs to invoke rtrim1. Compare the bpchar to text cast.
>
> Where do I find that? I have trouble finding the SQL that creates
> the core types. :-(

Duh, you just told me. Added, thanks.

>> * <= is surely not its own commutator.
>
> Changed to >=.
>
>> You might try running the
>> opr_sanity regression test on this module to see if it finds any
>> other silliness. (Procedure: insert the citext definition script
>> into the serial_schedule list just ahead of opr_sanity, run tests,
>> make sure you understand the reason for any diffs in the opr_sanity
>> result. There will be at least one from the uses of text-related
>> functions for citext.)
>
> Thanks. Added to my list.
>
>> * I think you can and should drop all of the "size" functions and
>> a lot of the "miscellaneous" functions: anyplace where the implicit
>> coercion to text would serve, you don't need a piggyback function,
>> and introducing one just creates risks of
>> can't-resolve-ambiguous-function failures. The overloaded
>> miscellaneous
>> functions are only justifiable to the extent that it's important to
>> preserve "citext-ness" of the result of a function, which seems at
>> best dubious for many of these. It is likewise pretty pointless
>> AFAICS
>> to introduce regex functions taking citext pattern arguments.
>
> I added most of those as I wrote tests and they failed to find the
> functions. Once I added the functions, they worked. But I'll do an
> audit to make sure that I didn't inadvertantly leave in any unneeded
> ones (I'm happy to have less code :-)).

Of the size functions, I was able to remove only this one and keep all
of my pgTAP tests passing:

CREATE FUNCTION textlen(citext)
RETURNS int4 AS 'textlen'
LANGUAGE 'internal' IMMUTABLE STRICT;

When I deleted any of the others, I got errors like this:

psql:sql/citext.sql:865: ERROR: function length(citext) is not unique
LINE 1: SELECT is( length( name ), length(name::text), 'length("' ||...
^
HINT: Could not choose a best candidate function. You might need to
add explicit type casts.

I think you can see now why I wrote the tests: I wanted to ensure that
CITEXT really does work (almost) just like TEXT.

I was able to eliminate *all* of the miscellaneous functions, but the
upper() and lower() functions now return TEXT instead of CITEXT, which
I don't think is exactly what we want, is it? For now, I'e left
upper() and lower() in. It just seems like more expected functionality.

>> * Don't use the OPERATOR() notation when you don't need to.
>> It's just clutter.
>
> Sorry, don't know what you're referring to here. CREATE OPERATOR
> appears to require parens…

Best,

David

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-13 17:16:06
Message-ID:	11124.1215969366@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> When I deleted any of the others, I got errors like this:

> psql:sql/citext.sql:865: ERROR: function length(citext) is not unique
> LINE 1: SELECT is( length( name ), length(name::text), 'length("' ||...
> ^
> HINT: Could not choose a best candidate function. You might need to
> add explicit type casts.

Hmm. I think what that actually means is that the cast from citext to
bpchar should be AS ASSIGNMENT rather than IMPLICIT. What is happening
is that the system can't figure out whether to use length(text) or
length(bpchar) when presented with a citext argument. I had been
thinking yesterday that it would automatically prefer length(text)
because text is a "preferred type", but after tracing through it I see
that that doesn't happen because citext is not thought to be of the
string category. (We really need a way to let user-defined types
specify their category...)

The fact that you need all these piggyback functions is a red flag
because what it implies is that citext will not work nicely for any
situation where both text and bpchar functions have been provided
--- and that includes user-added functions, so it's hopeless to think
that you can get to a solution this way. Downgrading the cast seems
like the right thing to me.

The implicit cast to varchar is a bit worrisome because it creates the
same issue if someone has provided both varchar and text versions of a
function. However, that seems a bit pointless given the lack of
semantic difference, and I suspect that a lot of user-written functions
come only in varchar variants --- so on balance my recommendation is to
keep that one as implicit.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-13 17:19:56
Message-ID:	11227.1215969596@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> On Jul 12, 2008, at 12:17, Tom Lane wrote:
>> * You should provide binary I/O (send/receive) functions, if you want
>> this to be an industrial-strength module. It's easy since you can
>> piggyback on text's.

> I'm confused. Is that not what the citextin and citextout functions are?

No, those are text I/O. You need analogues of textsend and textrecv
too.

>> You might try running the
>> opr_sanity regression test on this module to see if it finds any
>> other silliness. (Procedure: insert the citext definition script
>> into the serial_schedule list just ahead of opr_sanity, run tests,
>> make sure you understand the reason for any diffs in the opr_sanity
>> result. There will be at least one from the uses of text-related
>> functions for citext.)

> Thanks. Added to my list.

BTW, actually a better idea would be to put citext.sql at the front of
the list and just run the whole main regression series with it present.
typ_sanity and oidjoins might possibly find issues too.

>> * Don't use the OPERATOR() notation when you don't need to.
>> It's just clutter.

> Sorry, don't know what you're referring to here.

Some (not all) of your CREATE OPERATOR commands have things like

NEGATOR = OPERATOR(!~),

which seems unnecessary, and is certainly inconsistent.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-13 17:31:10
Message-ID:	11482.1215970270@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> On Jul 12, 2008, at 14:57, Tom Lane wrote:
>> Sadly, I think you have to give up
>> attempts to check the interesting multibyte cases and confine yourself
>> to tests using ASCII strings.

> Grr. Kind of defeats the purpose. Is there no infrastructure for
> testing multibyte functionality?

There's some stuff under src/test/locale and src/test/mb, though it
doesn't get exercised regularly. The contrib tests are a particularly
bad place for trying to do any locale-dependent testing, because we
only support "make installcheck" which means there is no way to set
the database locale --- you *have to* work with whatever locale is
predetermined by the postmaster you're pointed at.

I don't recall the reason for not supporting "make check" in the contrib
modules; maybe it was just that preparing a test installation for each
contrib module sounded too slow? In any case, what you'd need to pursue
is having some additional tests available that are not executed by the
standard contrib regression test sequence.

(If we get to having per-database collations in 8.4 then integrating a
test with a non-default collation would get a lot easier, of course;
but for the moment I'm afraid you've got to work with what's there.)

regards, tom lane

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-13 22:49:32
Message-ID:	8FFA9421-A61D-48F1-A4CC-70C43425F8F6@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 13, 2008, at 10:16, Tom Lane wrote:

> Hmm. I think what that actually means is that the cast from citext to
> bpchar should be AS ASSIGNMENT rather than IMPLICIT. What is
> happening
> is that the system can't figure out whether to use length(text) or
> length(bpchar) when presented with a citext argument. I had been
> thinking yesterday that it would automatically prefer length(text)
> because text is a "preferred type", but after tracing through it I see
> that that doesn't happen because citext is not thought to be of the
> string category. (We really need a way to let user-defined types
> specify their category...)

That'd be nice.

> The fact that you need all these piggyback functions is a red flag
> because what it implies is that citext will not work nicely for any
> situation where both text and bpchar functions have been provided
> --- and that includes user-added functions, so it's hopeless to think
> that you can get to a solution this way. Downgrading the cast seems
> like the right thing to me.

Yes, that works for me. I've downgraded it and can now remove the size
functions and all the tests still pass.

> The implicit cast to varchar is a bit worrisome because it creates the
> same issue if someone has provided both varchar and text versions of a
> function. However, that seems a bit pointless given the lack of
> semantic difference, and I suspect that a lot of user-written
> functions
> come only in varchar variants --- so on balance my recommendation is
> to
> keep that one as implicit.

Yes, I agree. Thanks for tracing this out, Tom, it never would have
ocurred to me -- and now I know more than I did before!

Best,

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-13 22:51:54
Message-ID:	CD046CA0-31F7-4718-8930-AEDEEA057E39@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 13, 2008, at 10:19, Tom Lane wrote:

> "David E. Wheeler" <david(at)kineticode(dot)com> writes:
>> On Jul 12, 2008, at 12:17, Tom Lane wrote:
>>> * You should provide binary I/O (send/receive) functions, if you
>>> want
>>> this to be an industrial-strength module. It's easy since you can
>>> piggyback on text's.
>
>> I'm confused. Is that not what the citextin and citextout functions
>> are?
>
> No, those are text I/O. You need analogues of textsend and textrecv
> too.

Okay.

>> Thanks. Added to my list.
>
> BTW, actually a better idea would be to put citext.sql at the front of
> the list and just run the whole main regression series with it
> present.
> typ_sanity and oidjoins might possibly find issues too.

Also added to my list. :-)

> Some (not all) of your CREATE OPERATOR commands have things like
>
> NEGATOR = OPERATOR(!~),
>
> which seems unnecessary, and is certainly inconsistent.

Oh, I hadn't even noticed those; I'd just copied them from citext 1.
Fixed!

Best,

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-13 23:06:09
Message-ID:	C7082EE0-A309-43B5-8AB0-6CBF7955312D@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 13, 2008, at 10:19, Tom Lane wrote:

>> I'm confused. Is that not what the citextin and citextout functions
>> are?
>
> No, those are text I/O. You need analogues of textsend and textrecv
> too.

Should those return bytea and citext, respectively? IOW, are these
right?

CREATE OR REPLACE FUNCTION citextrecv(bytea)
RETURNS citext
AS 'textrecv'
LANGUAGE 'internal' IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION citextsend(citext)
RETURNS bytea
AS 'textsend'
LANGUAGE 'internal' IMMUTABLE STRICT;

Thanks,

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-13 23:13:50
Message-ID:	01FB4A75-2F28-4453-A6C0-E530D5A18B82@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 13, 2008, at 16:06, David E. Wheeler wrote:

> Should those return bytea and citext, respectively? IOW, are these
> right?
>
> CREATE OR REPLACE FUNCTION citextrecv(bytea)
> RETURNS citext
> AS 'textrecv'
> LANGUAGE 'internal' IMMUTABLE STRICT;
>
> CREATE OR REPLACE FUNCTION citextsend(citext)
> RETURNS bytea
> AS 'textsend'
> LANGUAGE 'internal' IMMUTABLE STRICT;

To judge by the User-Defined Types docs, I was close. :-) I just
changed the argument to citextrecv to "internal".

Thanks,

David

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 01:17:35
Message-ID:	18756.1215998255@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> To judge by the User-Defined Types docs, I was close. :-) I just
> changed the argument to citextrecv to "internal".

Right. The APIs for send and recv aren't inverses. (Oh, the sins
we'll commit in the name of performance ...)

regards, tom lane

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 06:37:36
Message-ID:	06A41272-86BF-43D0-A21C-7B514EB73E74@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 13, 2008, at 10:19, Tom Lane wrote:

>>> You might try running the
>>> opr_sanity regression test on this module to see if it finds any
>>> other silliness. (Procedure: insert the citext definition script
>>> into the serial_schedule list just ahead of opr_sanity, run tests,
>>> make sure you understand the reason for any diffs in the opr_sanity
>>> result. There will be at least one from the uses of text-related
>>> functions for citext.)
>
>> Thanks. Added to my list.
>
> BTW, actually a better idea would be to put citext.sql at the front of
> the list and just run the whole main regression series with it
> present.
> typ_sanity and oidjoins might possibly find issues too.

Um, stupid question (sorry, I'm not familiar with how the regression
tests are organized): serial_schedule doesn't look like a file into
which I can insert an SQL file. How would I do that?

Thanks,

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 06:39:44
Message-ID:	B23CE93F-D504-4D9A-B817-9F81A7273A98@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 13, 2008, at 10:31, Tom Lane wrote:

>> Grr. Kind of defeats the purpose. Is there no infrastructure for
>> testing multibyte functionality?
>
> There's some stuff under src/test/locale and src/test/mb, though it
> doesn't get exercised regularly. The contrib tests are a particularly
> bad place for trying to do any locale-dependent testing, because we
> only support "make installcheck" which means there is no way to set
> the database locale --- you *have to* work with whatever locale is
> predetermined by the postmaster you're pointed at.
>
> I don't recall the reason for not supporting "make check" in the
> contrib
> modules; maybe it was just that preparing a test installation for each
> contrib module sounded too slow? In any case, what you'd need to
> pursue
> is having some additional tests available that are not executed by the
> standard contrib regression test sequence.
>
> (If we get to having per-database collations in 8.4 then integrating a
> test with a non-default collation would get a lot easier, of course;
> but for the moment I'm afraid you've got to work with what's there.)

Could I supply two comparison files, one for Mac OS X with en_US.UTF-8
and one for everything else, as described in the last three paragraphs
here?

http://www.postgresql.org/docs/current/static/regress-variant.html

That way, I can at least make sure that the multibyte stuff *does*
work. Even if it's tested on only one platform, the purpose is not to
test a particular collation, but to test that CITEXT is actually
sensitive to locale.

Thanks,

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 07:21:41
Message-ID:	BF661CB9-0AFE-4BFD-94FD-F9B61C2F846D@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 12, 2008, at 14:57, Tom Lane wrote:

I'd like to keep these tests, since they ensure not just that the
functions work but that they work with citext. Given what we found
with length() and friends not working when there was an implicit cast
to bpchar, I think it's valuable to continue to ensure that these
functions work as expected with citext. Otherwise someone in the
future might come along and make the cast to bpchar implicit again,
and no tests would fail to tell him/her otherwise.

These tests make good regressions.

Thanks,

David

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 13:05:23
Message-ID:	487B4F13.9050807@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

David E. Wheeler wrote:
>>
>> (If we get to having per-database collations in 8.4 then integrating a
>> test with a non-default collation would get a lot easier, of course;
>> but for the moment I'm afraid you've got to work with what's there.)
>
> Could I supply two comparison files, one for Mac OS X with en_US.UTF-8
> and one for everything else, as described in the last three paragraphs
> here?
>
> http://www.postgresql.org/docs/current/static/regress-variant.html
>
> That way, I can at least make sure that the multibyte stuff *does*
> work. Even if it's tested on only one platform, the purpose is not to
> test a particular collation, but to test that CITEXT is actually
> sensitive to locale.
>
>

You can certainly add any tests you like. But the buildfarm does all its
regression tests using an install done with --no-locale. Any test that
requires some locale to work, or make sense, should probably not be in
your Makefile's REGRESS target.

Some time ago I raised the question of doing locale- and encoding-aware
regression tests, and IIRC the consensus seemed to be that it was not
worth the effort, but maybe we need to revisit that. We certainly seem
to be doing a lot more work now to get Postgres to where it needs to be
w.r.t. locale support, and we've cleaned up a lot of the encoding issues
we had, so maybe we need to reflect that in our testing regime more.

cheers

andrew

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	testing locales and encodings
Date:	2008-07-14 13:49:09
Message-ID:	20080714134909.GC4050@alvh.no-ip.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Andrew Dunstan wrote:

> Some time ago I raised the question of doing locale- and encoding-aware
> regression tests, and IIRC the consensus seemed to be that it was not
> worth the effort, but maybe we need to revisit that. We certainly seem
> to be doing a lot more work now to get Postgres to where it needs to be
> w.r.t. locale support, and we've cleaned up a lot of the encoding issues
> we had, so maybe we need to reflect that in our testing regime more.

I agree. Maybe we need to unearth the src/test/mb or src/test/locale;
perhaps have the buildfarm run those tests when they get to a useful
point.

I notice that src/test/locale is not VPATH-aware, and I'm unsure if it's
got a clear distinction of locales vs. encodings (it calls koi8-r a
locale).

src/test/mb is not VPATH aware either (and it doesn't have a Makefile at
all), and is now broken by the fact that createdb refuses to create a
database with mismatching encoding.

$ LC_ALL=C sh mbregress.sh
euc_jp .. createdb: database creation failed: ERROR: encoding EUC_JP does not match server's locale fr_CA.UTF-8
DETAIL: The server's LC_CTYPE setting requires encoding UTF8.
failed
sjis .. failed
euc_kr .. createdb: database creation failed: ERROR: encoding EUC_KR does not match server's locale fr_CA.UTF-8
DETAIL: The server's LC_CTYPE setting requires encoding UTF8.
failed
euc_cn .. createdb: database creation failed: ERROR: encoding EUC_CN does not match server's locale fr_CA.UTF-8
DETAIL: The server's LC_CTYPE setting requires encoding UTF8.
failed
euc_tw .. createdb: database creation failed: ERROR: encoding EUC_TW does not match server's locale fr_CA.UTF-8
DETAIL: The server's LC_CTYPE setting requires encoding UTF8.
failed
big5 .. failed
utf8 .. ok
mule_internal .. createdb: database creation failed: ERROR: encoding MULE_INTERNAL does not match server's locale fr_CA.UTF-8
DETAIL: The server's LC_CTYPE setting requires encoding UTF8.
failed

So this whole area would seem to need a lot of love ...

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 14:08:36
Message-ID:	7652.1216044516@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> On Jul 13, 2008, at 10:19, Tom Lane wrote:
>> BTW, actually a better idea would be to put citext.sql at the front of
>> the list and just run the whole main regression series with it
>> present.
>> typ_sanity and oidjoins might possibly find issues too.

> Um, stupid question (sorry, I'm not familiar with how the regression
> tests are organized): serial_schedule doesn't look like a file into
> which I can insert an SQL file. How would I do that?

You're really making it into another test. Just copy the citext.sql
file into the sql/ subdirectory and add a "citext" entry to the schedule
file.

The last time I did this, I had to at least "touch" a corresponding
expected file (expected/citext.out) or pg_regress would go belly-up
without running the rest of the tests. There's no special need to
make the expected file match, of course, since you can just ignore
the "failure" report from that one test.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 14:24:50
Message-ID:	7889.1216045490@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> Could I supply two comparison files, one for Mac OS X with en_US.UTF-8
> and one for everything else, as described in the last three paragraphs
> here?

The fallacy in that proposal is the assumption that there are only two
behaviors out there. Let me recalibrate your thoughts a bit: so far
I have tried citext on three different machines (Mac, Fedora 8, HPUX),
and I got three different answers from those tests. That's despite
endeavoring to make the database locales match ... which is less than
trivial in itself because they use three slightly different spellings of
"en_US.UTF8".

Given that you were more or less deliberately testing corner cases,
I think it's quite likely that the number of observable reactions from
N platforms would be more nearly O(N) than O(1).

In the real world, to the extent that we are able to control the locale
of the regression tests, we make it "C" --- and to a large extent we
can't control it at all, which means you have another uncontrolled
variable besides platform. So in the current universe there is
absolutely no value in submitting locale-specific tests for a contrib
module.

I see some discussion in the thread about improving the situation, but
until we are able to decouple database locale from environment locale,
I doubt we'll be able to do a whole lot about automating this kind
of test. There are too many variables at the moment.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 14:26:58
Message-ID:	7928.1216045618@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> On Jul 12, 2008, at 14:57, Tom Lane wrote:
>> 4. A lot of the later test cases are equally uselessly testing whether
>> piggybacking over text functions works.

> I'd like to keep these tests, since they ensure not just that the
> functions work but that they work with citext.

It might be reasonable to test a couple of them for that purpose.
If your agenda is "test every function in the system that comes or
might come in a bpchar variant", I think that's pointless.

regards, tom lane

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 17:36:48
Message-ID:	14C29177-1351-4A8B-A840-099C0EC14E49@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 14, 2008, at 06:05, Andrew Dunstan wrote:

> You can certainly add any tests you like. But the buildfarm does all
> its regression tests using an install done with --no-locale. Any
> test that requires some locale to work, or make sense, should
> probably not be in your Makefile's REGRESS target.

Well, unless the display of the characters would be broken (the build
farm databases use UTF-8, no?), the output would look like this, which
should more or less work (I'm hoping):

SELECT citext_larger( 'Â'::citext, 'ç'::citext ) = 'ç' AS t WHERE
false and test_multibyte();
t
---
(0 rows)

> Some time ago I raised the question of doing locale- and encoding-
> aware regression tests, and IIRC the consensus seemed to be that it
> was not worth the effort, but maybe we need to revisit that. We
> certainly seem to be doing a lot more work now to get Postgres to
> where it needs to be w.r.t. locale support, and we've cleaned up a
> lot of the encoding issues we had, so maybe we need to reflect that
> in our testing regime more.

Makes sense to me.

Best,

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 17:48:44
Message-ID:	EC8BD896-825A-4098-9A6E-6024DBF28078@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 14, 2008, at 07:24, Tom Lane wrote:

> "David E. Wheeler" <david(at)kineticode(dot)com> writes:
>> Could I supply two comparison files, one for Mac OS X with
>> en_US.UTF-8
>> and one for everything else, as described in the last three
>> paragraphs
>> here?
>
> The fallacy in that proposal is the assumption that there are only two
> behaviors out there.

Well, no, that's not the assumption at all. The assumption is that the
type works properly with multibyte characters under multibyte-aware
locales. So I want to have tests to ensure that such is true by having
multibyte characters run under a very specific locale and platform. I
don't really care what platform or locale; the point is to make sure
that the type is actually multibyte-aware.

> Let me recalibrate your thoughts a bit: so far
> I have tried citext on three different machines (Mac, Fedora 8, HPUX),
> and I got three different answers from those tests. That's despite
> endeavoring to make the database locales match ... which is less than
> trivial in itself because they use three slightly different
> spellings of
> "en_US.UTF8".

<rant>
This is a truly pitiful state of affairs. Rhetorical question: Why is
there no standardization of locales? I'm sure there are a lot of
opinions out there (should all uppercase chars should precede all
lowercase chars or be mixed in with lowercase chars), but I should
think that, in this day and age, there would be some sort of standard
defining locales and how they work -- and to allow such opinions to be
expressed by different locales, not in the same locale names on
different platforms.
</rant>

> Given that you were more or less deliberately testing corner cases,
> I think it's quite likely that the number of observable reactions from
> N platforms would be more nearly O(N) than O(1).

To me they're not corner cases. To me it is just, "given a specific
platform/locale, does CITEXT respect the locale's rules?" I don't care
to test all platforms and locales (I'm not *that* stupid :-)).

> In the real world, to the extent that we are able to control the
> locale
> of the regression tests, we make it "C" --- and to a large extent we
> can't control it at all, which means you have another uncontrolled
> variable besides platform. So in the current universe there is
> absolutely no value in submitting locale-specific tests for a contrib
> module.

Then how do we know that it will continue to be locale-aware over
time? Someone could replace the comparison function with one that just
lowercases ASCII characters, like CITEXT 1 does, and no tests would
fail. How do you prevent that from happening without being hyper-
vigilant (and never leaving the project, I might add)?

> I see some discussion in the thread about improving the situation, but
> until we are able to decouple database locale from environment locale,
> I doubt we'll be able to do a whole lot about automating this kind
> of test. There are too many variables at the moment.

Is the decoupling of database locale from environment locale likely to
happen anytime soon? Now that I've written CITEXT, I dare say that
such might become my top-desired feature (aside from replication).

Thanks for the discussion, much appreciated, and I'm learning a ton. I
retain the right to be opinionated, however. ;-)

Best,

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 17:50:29
Message-ID:	C8004B30-B247-475F-8DCD-A69FC971A60F@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 14, 2008, at 07:26, Tom Lane wrote:

>> I'd like to keep these tests, since they ensure not just that the
>> functions work but that they work with citext.
>
> It might be reasonable to test a couple of them for that purpose.
> If your agenda is "test every function in the system that comes or
> might come in a bpchar variant", I think that's pointless.

Or a varchar variant, or where such a variant might be added in the
future. To my mind, it's important to have good coverage in my unit
tests to ensure that things continue to work exactly the same over time.

So, since the tests are already written, and are unlikely to add more
than a few milliseconds to test runtime, can you at least agree that
such tests are harmless?

Updated patch later today.

Thanks,

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: testing locales and encodings
Date:	2008-07-14 17:51:21
Message-ID:	0C79A16A-63C2-4786-842A-CCCE7647CD27@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 14, 2008, at 06:49, Alvaro Herrera wrote:

> So this whole area would seem to need a lot of love ...

Do the tests control for platforms, as well, since locales with the
same spellings can vary on different platforms? Or even on different
versions of the same platforms?

Thanks,

David

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 17:55:07
Message-ID:	487B92FB.3050003@dunslane.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

David E. Wheeler wrote:
> On Jul 14, 2008, at 06:05, Andrew Dunstan wrote:
>
>> You can certainly add any tests you like. But the buildfarm does all
>> its regression tests using an install done with --no-locale. Any test
>> that requires some locale to work, or make sense, should probably not
>> be in your Makefile's REGRESS target.
>
> Well, unless the display of the characters would be broken (the build
> farm databases use UTF-8, no?),

No.

--no-locale is the same as --locale=C which means the encoding is SQL_ASCII.

cheers

andrew

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 17:57:34
Message-ID:	E778558F-21CA-4E58-BF94-215448DB09E3@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 14, 2008, at 10:55, Andrew Dunstan wrote:

> No.
>
> --no-locale is the same as --locale=C which means the encoding is
> SQL_ASCII.

I've used --no-locale --encoding UTF-8 many times. But if the
regressions run with SQL_ASCII…well, I'm just amazed that there
haven't been more unexpected side-effects to source code changes
without controlling for such variables in tests. Seems like a lot to
keep in one's head.

Anyway, are tests for contrib modules run on the build farms? If so, I
could still potentially get around this issue by turning off ECHO.

Best,

David

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 17:58:35
Message-ID:	28829.1216058315@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> Well, unless the display of the characters would be broken (the build
> farm databases use UTF-8, no?),

No, they use C.

regards, tom lane

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 18:00:57
Message-ID:	13F3AE93-087B-45FF-B253-118EED9142BD@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 14, 2008, at 10:58, Tom Lane wrote:

>> Well, unless the display of the characters would be broken (the build
>> farm databases use UTF-8, no?),
>
> No, they use C.

Um, you mean SQL_ASCII?

Best,

David

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 18:06:52
Message-ID:	28971.1216058812@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> On Jul 14, 2008, at 07:24, Tom Lane wrote:
>> The fallacy in that proposal is the assumption that there are only two
>> behaviors out there.

> Well, no, that's not the assumption at all. The assumption is that the
> type works properly with multibyte characters under multibyte-aware
> locales. So I want to have tests to ensure that such is true by having
> multibyte characters run under a very specific locale and platform.

[ shrug... ] Seems pretty useless to me: we already know that it works
for you. The point of a regression test in my mind is to make sure that
it works for everybody. Given the platform variations involved in
strcoll's behavior, the definition of "works for everybody" is going to
be pretty darn narrowly circumscribed anyway, and thus I don't have a
big problem with restricting the tests to ASCII cases.

Let me put it another way: if we test on another platform and find out
that strcoll's behavior is different there, are you going to fix that
version of strcoll? No, you're not. So you might as well just test the
behavior of the code that's actually under your control.

regards, tom lane

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 18:10:02
Message-ID:	E1098A2B-10BC-4F88-8D84-81E3CBE3D983@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 14, 2008, at 11:06, Tom Lane wrote:

> [ shrug... ] Seems pretty useless to me: we already know that it
> works
> for you. The point of a regression test in my mind is to make sure
> that
> it works for everybody. Given the platform variations involved in
> strcoll's behavior, the definition of "works for everybody" is going
> to
> be pretty darn narrowly circumscribed anyway, and thus I don't have a
> big problem with restricting the tests to ASCII cases.

Neither do I, as long as there is *some* context to ensure that the
type remains locale-aware. We only know that it works for me because
I've written the tests.

> Let me put it another way: if we test on another platform and find out
> that strcoll's behavior is different there, are you going to fix that
> version of strcoll? No, you're not. So you might as well just test
> the
> behavior of the code that's actually under your control.

You don't seem to understand what I'm suggesting: I'm not talking
about testing strcoll. I'm talking about making sure that citext
*uses* strcoll. Whether or not strcoll actually works properly is not
my concern.

Best,

David

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 18:49:06
Message-ID:	29758.1216061346@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> On Jul 14, 2008, at 11:06, Tom Lane wrote:
>> Let me put it another way: if we test on another platform and find out
>> that strcoll's behavior is different there, are you going to fix that
>> version of strcoll? No, you're not. So you might as well just test
>> the
>> behavior of the code that's actually under your control.

> You don't seem to understand what I'm suggesting: I'm not talking
> about testing strcoll. I'm talking about making sure that citext
> *uses* strcoll.

This seems pointless, as well as essentially impossible to enforce by
black-box testing. Nobody is likely to consider a patch that removes
such obviously basic functionality of the module. I think you're
worrying about testing the wrong things --- and as evidence I'd note the
serious errors that slipped by your testing (including the fact that the
last submitted version would still claim that 'a' = 'ab'). What we
generally ask a regression test to do is catch portability problems
(which is certainly a real-enough issue here, but we don't know how
to address it well) or catch foreseeable breakage (such as someone
introducing a cast that breaks resolution of citext function calls).
The methodology can't catch everything, and trying to push it beyond
what it can do is just a time sink for effort that can more usefully
be spent other ways, such as code-reading.

regards, tom lane

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 19:04:55
Message-ID:	7D3F0762-F324-4F9D-9ECF-02AACC894AD7@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 14, 2008, at 11:49, Tom Lane wrote:

>> You don't seem to understand what I'm suggesting: I'm not talking
>> about testing strcoll. I'm talking about making sure that citext
>> *uses* strcoll.
>
> This seems pointless, as well as essentially impossible to enforce by
> black-box testing. Nobody is likely to consider a patch that removes
> such obviously basic functionality of the module.

Never underestimate the human capacity for incompetence. One of my
mottos.

> I think you're
> worrying about testing the wrong things --- and as evidence I'd note
> the
> serious errors that slipped by your testing (including the fact that
> the
> last submitted version would still claim that 'a' = 'ab').

Um, say what? I had that problem at one time, but it was when I was
writing my own lower() function, which was shitty (I wasn't creating a
nul-terminated string). I don't recall that being in any patch I sent
to the list, and indeed you wrote that very test in the revised test
script you sent in.

At any rate, I make no claims that my tests properly covered
everything. There is a lot I didn't know to test, but I'm learning
more all the time. That's why this code review has been so immensely
valuable, both for me and for CITEXT.

> What we
> generally ask a regression test to do is catch portability problems
> (which is certainly a real-enough issue here, but we don't know how
> to address it well) or catch foreseeable breakage (such as someone
> introducing a cast that breaks resolution of citext function calls).
> The methodology can't catch everything, and trying to push it beyond
> what it can do is just a time sink for effort that can more usefully
> be spent other ways, such as code-reading.

I guess what I'm thinking of is not regression tests but unit tests.
And with unit testing, one of the goals is coverage. Good coverage has
saved my ass many times in the past, even catching changes that, in
hindsight, should obviously never have been attempted. Perhaps it'd be
worth thinking about creating a project for unit testing PostgreSQL in
controlled environments? Maybe having tests that can contain proper
conditionals?

Anyway, it seems that, given the limitations of the current testing
system with regard to testing multibyte characters, the issue is moot.
I'll throw in a few tests that test multibyte characters, but I'll
comment them out and just uncomment them for individual test runs on
my box for the purposes of my own sanity going forward.

Thanks,

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-14 20:11:59
Message-ID:	3DBB7578-3FE4-4FB6-A594-313BA7C7BD8D@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 12, 2008, at 15:13, David E. Wheeler wrote:

>> 2. It's ridiculously slow; at least a factor of ten slower than doing
>> equivalent tests directly in SQL. This is a very bad thing. Speed
>> of
>> regression tests matters a lot to those of us who run them a dozen
>> times
>> per day --- and I do not wish to discourage any developers who don't
>> work that way from learning better habits ;-)
>
> Hrm. I'm wonder why it's so slow? The test functions don't really do
> a lot. Anyway, I agree that they should perform well.

Just as an FYI, I've just moved all the tests to regular SQL instead
of using pgTAP. The difference in runtime is:

psql -Xd try -f sql/citext.sql 0.03s user 0.02s system 19% cpu 0.253
total
psql -Xd try -f sql/citext.sql 0.03s user 0.02s system 4% cpu 1.298
total

So it's close to a factor of five, though subtract .125 for the time
to load the pgTAP functions. The pgTAP tests *are* doing a lot more
work, but I'm sure that they could be made a lot more efficient,
though of course the TAP functions will always introduce some
overhead. One just needs to decide whether the tradeoffs are worth it.

Best,

David

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-15 05:09:27
Message-ID:	97D27121-AAF8-4FDB-86A9-8D6716323E62@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 14, 2008, at 07:08, Tom Lane wrote:

> You're really making it into another test. Just copy the citext.sql
> file into the sql/ subdirectory and add a "citext" entry to the
> schedule
> file.
>
> The last time I did this, I had to at least "touch" a corresponding
> expected file (expected/citext.out) or pg_regress would go belly-up
> without running the rest of the tests. There's no special need to
> make the expected file match, of course, since you can just ignore
> the "failure" report from that one test.

Okay, I copied citext.sql into src/test/regress/sql and then added
"test: citext" to the top of src/test/regress/serial_schedule. Then I
ran `make check`. All tests passed, but I don't think that citext was
tested.

Do I need to install the server, build a cluster, and run `make
installcheck`?

Thanks for the hand-holding.

Best,

David

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-15 14:09:08
Message-ID:	18554.1216130948@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> Okay, I copied citext.sql into src/test/regress/sql and then added
> "test: citext" to the top of src/test/regress/serial_schedule. Then I
> ran `make check`. All tests passed, but I don't think that citext was
> tested.
> Do I need to install the server, build a cluster, and run `make
> installcheck`?

Yeah, probably. I don't think the "make check" path will support it
because it doesn't install contrib into the temp installation.
(You'd also need to have put the extra entry in parallel_schedule
not serial_schedule, but it's not gonna work anyway.)

regards, tom lane

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-15 18:07:49
Message-ID:	AB235355-40E7-4B4D-8238-BC48AB6F5C28@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 15, 2008, at 07:09, Tom Lane wrote:

> Yeah, probably. I don't think the "make check" path will support it
> because it doesn't install contrib into the temp installation.
> (You'd also need to have put the extra entry in parallel_schedule
> not serial_schedule, but it's not gonna work anyway.)

Well now that was cool to see. I got some failures, of course, but
nothing stands out to me as an obvious bug. I attach the diffs file
(with the citext.sql failure removed) for your perusal. What would be
the best way for me to resolve those permission issues? Or do they
matter for sanity-checking citext?

Thanks,

David

Attachment	Content-Type	Size
regression.diffs	application/octet-stream	8.9 KB

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-15 19:56:15
Message-ID:	6929.1216151775@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> Well now that was cool to see. I got some failures, of course, but
> nothing stands out to me as an obvious bug. I attach the diffs file
> (with the citext.sql failure removed) for your perusal. What would be
> the best way for me to resolve those permission issues?

Don't run the tests in a read-only directory, perhaps.

> Or do they matter for sanity-checking citext?

Hard to tell --- I'd suggest trying to get a clean run. As for what you
have, the first diff hunk suggests you've got the wrong function
properties for citextsend/citextrecv.

regards, tom lane

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-16 00:45:00
Message-ID:	34A73036-34A1-4C19-B447-F1752F3F3129@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 15, 2008, at 12:56, Tom Lane wrote:

> Don't run the tests in a read-only directory, perhaps.

Yes, I changed the owner to the postgres system user and that did the
trick.

>
>> Or do they matter for sanity-checking citext?
>
> Hard to tell --- I'd suggest trying to get a clean run. As for what
> you
> have, the first diff hunk suggests you've got the wrong function
> properties for citextsend/citextrecv.

Here's the new diff:

*** ./expected/opr_sanity.out Mon Jul 14 21:55:49 2008
--- ./results/opr_sanity.out Tue Jul 15 17:41:03 2008
***************
*** 87,94 ****
p1.provolatile != p2.provolatile OR
p1.pronargs != p2.pronargs);
oid | proname | oid | proname
! -----+---------+-----+---------
! (0 rows)

-- Look for uses of different type OIDs in the argument/result type
fields
-- for different aliases of the same built-in function.
***************
*** 110,117 ****
prorettype | prorettype
------------+------------
25 | 1043
1114 | 1184
! (2 rows)

SELECT DISTINCT p1.proargtypes[0], p2.proargtypes[0]
FROM pg_proc AS p1, pg_proc AS p2
--- 112,120 ----
prorettype | prorettype
------------+------------
25 | 1043
+ 25 | 87255
1114 | 1184
! (3 rows)

SELECT DISTINCT p1.proargtypes[0], p2.proargtypes[0]
FROM pg_proc AS p1, pg_proc AS p2
***************
*** 124,133 ****
-------------+-------------
25 | 1042
25 | 1043
1114 | 1184
1560 | 1562
2277 | 2283
! (5 rows)

SELECT DISTINCT p1.proargtypes[1], p2.proargtypes[1]
FROM pg_proc AS p1, pg_proc AS p2
--- 127,138 ----
-------------+-------------
25 | 1042
25 | 1043
+ 25 | 87255
+ 1042 | 87255
1114 | 1184
1560 | 1562
2277 | 2283
! (7 rows)

SELECT DISTINCT p1.proargtypes[1], p2.proargtypes[1]
FROM pg_proc AS p1, pg_proc AS p2
***************
*** 139,148 ****
proargtypes | proargtypes
-------------+-------------
23 | 28
1114 | 1184
1560 | 1562
2277 | 2283
! (4 rows)

SELECT DISTINCT p1.proargtypes[2], p2.proargtypes[2]
FROM pg_proc AS p1, pg_proc AS p2
--- 144,154 ----
proargtypes | proargtypes
-------------+-------------
23 | 28
+ 25 | 87255
1114 | 1184
1560 | 1562
2277 | 2283
! (5 rows)

SELECT DISTINCT p1.proargtypes[2], p2.proargtypes[2]
FROM pg_proc AS p1, pg_proc AS p2
***************
*** 305,311 ****
142 | 25 | 0 | a
142 | 1043 | 0 | a
142 | 1042 | 0 | a
! (6 rows)

-- **************** pg_operator ****************
-- Look for illegal values in pg_operator fields.
--- 311,318 ----
142 | 25 | 0 | a
142 | 1043 | 0 | a
142 | 1042 | 0 | a
! 87255 | 1042 | 0 | a
! (7 rows)

-- **************** pg_operator ****************
-- Look for illegal values in pg_operator fields.

======================================================================

So I guess my question is: what is wrong with the properties for
citextsend/citextrecv and what else might these failures be indicating
is wrong?

CREATE OR REPLACE FUNCTION citextrecv(internal)
RETURNS citext
AS 'textrecv'
LANGUAGE 'internal' IMMUTABLE STRICT;

CREATE OR REPLACE FUNCTION citextsend(citext)
RETURNS bytea
AS 'textsend'
LANGUAGE 'internal' IMMUTABLE STRICT;

CREATE TYPE citext (
INPUT = citextin,
OUTPUT = citextout,
RECEIVE = citextrecv,
SEND = citextsend,
INTERNALLENGTH = VARIABLE,
STORAGE = extended
);

Thanks,

David

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-16 03:26:18
Message-ID:	4111.1216178778@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> So I guess my question is: what is wrong with the properties for
> citextsend/citextrecv

[ checks catalogs... ] textsend and textrecv are marked STABLE not
IMMUTABLE. I am not totally sure about the reasoning offhand --- it
might be because their behavior depends on client_encoding.

> and what else might these failures be indicating
> is wrong?

I think the other diffs are okay, they just reflect the fact that you're
depending on binary equivalence of text and citext.

regards, tom lane

From:	"David E(dot) Wheeler" <david(at)kineticode(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: PATCH: CITEXT 2.0 v3
Date:	2008-07-16 04:54:47
Message-ID:	200EC059-5627-4942-8F0D-89B70C4FBF1E@kineticode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Jul 15, 2008, at 20:26, Tom Lane wrote:

> "David E. Wheeler" <david(at)kineticode(dot)com> writes:
>> So I guess my question is: what is wrong with the properties for
>> citextsend/citextrecv
>
> [ checks catalogs... ] textsend and textrecv are marked STABLE not
> IMMUTABLE. I am not totally sure about the reasoning offhand --- it
> might be because their behavior depends on client_encoding.

Thanks. Looks like maybe the xtypes docs need to be updated?

http://www.postgresql.org/docs/8.3/static/xtypes.html

Anyway, changing them to "STABLE STRICT" appears to have done the
trick (diff attached).

>> and what else might these failures be indicating
>> is wrong?
>
> I think the other diffs are okay, they just reflect the fact that
> you're
> depending on binary equivalence of text and citext.

Great, thanks. And with that, I think I'm just about ready to submit a
new version of the patch, coming up shortly.

Best,

David

Attachment	Content-Type	Size
regression.diffs	application/octet-stream	2.5 KB