Re: Question about Encoding a Custom Type

Lists: pgsql-hackers
From: David E(dot) Wheeler <david(at)kineticode(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Question about Encoding a Custom Type
Date: 2008-06-16 05:07:43
Message-ID: 95FC9074-9199-4214-93B8-51B1264DFDD5@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Howdy,

Possibly showing my ignorance here, but as I'm working on updating
citext to be locale-aware and to work on 8.3, I've run into this
peculiarity:

try=# \encoding
UTF8
try=# select setting from pg_settings where name = 'lc_collate';
setting
-------------
en_US.UTF-8
(1 row)

try=# create table try (name citext);
try=# insert into try (name) values ('aardvark'), ('AAA');
try=# select name, name = 'aaa' from try;
name | ?column?
----------+----------
aardvark | f
AAA | t
(2 rows)

try=# insert into try (name) values ('aba'), ('ABC'), ('abc');
try=# select name, name = 'aaa' from try;
name | ?column?
----------+----------
aardvark | f
AAA | t
aba | f
ABC | f
abc | f
(5 rows)

try=# insert into try (name) values ('AAAA');
try=# select name, name = 'aaa' from try;
ERROR: invalid byte sequence for encoding "UTF8": 0xf6bd
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".

I've no idea what could be different about 'AAAA' vs. any other value.
And if I do either of these:

select name, name = 'aaa'::text from try;
select name, name::text = 'aaa' from try;

It just works. I'm mystified.

My casts:

CREATE CAST (citext AS text) WITHOUT FUNCTION AS IMPLICIT;
CREATE CAST (citext AS varchar) WITHOUT FUNCTION AS IMPLICIT;
CREATE CAST (citext AS bpchar) WITHOUT FUNCTION AS IMPLICIT;
CREATE CAST (text AS citext) WITHOUT FUNCTION AS ASSIGNMENT;
CREATE CAST (varchar AS citext) WITHOUT FUNCTION AS ASSIGNMENT;
CREATE CAST (bpchar AS citext) WITHOUT FUNCTION AS ASSIGNMENT;

Question about the code? It's all here (for now):

https://svn.kineticode.com/citext/trunk/

Hrm. Fiddling a bit more, I find that this fails, too:

try=# select citext_smaller( 'aardvark'::citext,
'AARDVARKasdfasdfasdfasdf'::citext );
ERROR: invalid byte sequence for encoding "UTF8": 0xc102
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".

So I guess that something must be up with citext_smaller(). It's quite
simple, though:

PG_FUNCTION_INFO_V1(citext_smaller);

Datum citext_smaller (PG_FUNCTION_ARGS) {
text * left = PG_GETARG_TEXT_P(0);
text * right = PG_GETARG_TEXT_P(1);
PG_RETURN_TEXT_P( citextcmp( PG_ARGS ) < 0 ? left : right );
}

Context:

https://svn.kineticode.com/citext/trunk/citext.c

Anyone have any idea? Feedback would be *most* appreciated.

Thanks,

David


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Question about Encoding a Custom Type
Date: 2008-06-16 09:52:40
Message-ID: 20080616095240.GB4792@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jun 15, 2008 at 10:07:43PM -0700, David E. Wheeler wrote:
> Howdy,
>
> Possibly showing my ignorance here, but as I'm working on updating
> citext to be locale-aware and to work on 8.3, I've run into this
> peculiarity:

The only odd thing I see is the use of PG_ARGS to pass the arguments to
citextcmp. But I can't see why it would break either. Can you attach a
debugger and see where it goes wrong?

As to the comment about freeing stuff, it's usually nice if btree
comparison functions free memory because that way index rebuilds on
large tables don't run you out of memory.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Question about Encoding a Custom Type
Date: 2008-06-16 14:24:41
Message-ID: B88731EB-F654-4A0A-81E0-0FD6A340E510@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 16, 2008, at 02:52, Martijn van Oosterhout wrote:

> The only odd thing I see is the use of PG_ARGS to pass the arguments
> to
> citextcmp. But I can't see why it would break either. Can you attach a
> debugger and see where it goes wrong?

Yes, I can do that, although I'm pretty new to C (let alone gdb), so
I'm not sure exactly how to go about it. I'll try to get on IRC later
today to see if anyone can help me along.

> As to the comment about freeing stuff, it's usually nice if btree
> comparison functions free memory because that way index rebuilds on
> large tables don't run you out of memory.

Thanks. I'll add that to my list.

Best,

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: David E(dot) Wheeler <david(at)kineticode(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Question about Encoding a Custom Type
Date: 2008-06-16 17:44:55
Message-ID: 3C063566-6F6F-4FA7-AEBE-3DE4B9DFC922@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 16, 2008, at 09:24, David E. Wheeler wrote:

> On Jun 16, 2008, at 02:52, Martijn van Oosterhout wrote:
>
>> The only odd thing I see is the use of PG_ARGS to pass the
>> arguments to
>> citextcmp. But I can't see why it would break either. Can you
>> attach a
>> debugger and see where it goes wrong?
>
> Yes, I can do that, although I'm pretty new to C (let alone gdb), so
> I'm not sure exactly how to go about it. I'll try to get on IRC
> later today to see if anyone can help me along.

What's even weirder is that it can not work and then suddenly work:

try=# select citext_smaller( 'aardvark'::citext, 'AARDVARK'::citext );
ERROR: invalid byte sequence for encoding "UTF8": 0xe02483
HINT: This error can also happen if the byte sequence does not match
the encoding expected by the server, which is controlled by
"client_encoding".
try=# select citext_smaller( 'aardvark'::citext, 'AARDVARK'::citext );
citext_smaller
----------------
AARDVARK
(1 row)

WTF? Logging onto IRC now…

https://svn.kineticode.com/citext/trunk/

Best,

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Question about Encoding a Custom Type
Date: 2008-06-16 18:06:52
Message-ID: 5704.1213639612@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> What's even weirder is that it can not work and then suddenly work:

Smells like uninitialized-memory problems to me. Perhaps you are
miscalculating the length of the input data?

Are you testing in an --enable-cassert build? The memory clobber
stuff can help to make it more obvious where such problems lurk.

regards, tom lane


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Question about Encoding a Custom Type
Date: 2008-06-16 18:29:33
Message-ID: 9283E9BC-3561-428F-9F15-71461770D501@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 16, 2008, at 13:06, Tom Lane wrote:

> "David E. Wheeler" <david(at)kineticode(dot)com> writes:
>> What's even weirder is that it can not work and then suddenly work:
>
> Smells like uninitialized-memory problems to me. Perhaps you are
> miscalculating the length of the input data?

Entirely possible. Here are the two functions in which I calculate size:

char * cilower(text * arg) {
// Do I need to free anything here?
char * str = VARDATA_ANY( arg );
#ifdef USE_WIDE_UPPER_LOWER
// Have wstring_lower() do the work.
return wstring_lower( str );
# else
// Copy the string and process it.
int inex, len;
char * result;

index = 0;
len = VARSIZE(arg) - VARHDRSZ;
result = (char *) palloc( strlen( str ) + 1 );

for (index = 0; index <= len; index++) {
result[index] = tolower((unsigned char) str[index] );
}
return result;
#endif /* USE_WIDE_UPPER_LOWER */
}

int citextcmp (PG_FUNCTION_ARGS) {
// Could we do away with the varlena struct here?
text * left = PG_GETARG_TEXT_P(0);
text * right = PG_GETARG_TEXT_P(1);
char * lstr = cilower( left );
char * rstr = cilower( right );
int llen = VARSIZE_ANY_EXHDR(left);
int rlen = VARSIZE_ANY_EXHDR(right);
return varstr_cmp(lstr, llen, rstr, rlen);
}

> Are you testing in an --enable-cassert build? The memory clobber
> stuff can help to make it more obvious where such problems lurk.

I've just recompiled with --enable-cassert and --enable-debug, but got
no more information when I triggered the error, neither in psql nor in
the log. :-(

Thanks,

David


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Question about Encoding a Custom Type
Date: 2008-06-16 18:41:58
Message-ID: 20080616184158.GH4792@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 16, 2008 at 01:29:33PM -0500, David E. Wheeler wrote:
> >Smells like uninitialized-memory problems to me. Perhaps you are
> >miscalculating the length of the input data?
>
> Entirely possible. Here are the two functions in which I calculate size:

Actually, real dumb question but: arn't you assume that text* values
are NULL terminated, because they're not...
>
> char * cilower(text * arg) {
> // Do I need to free anything here?
> char * str = VARDATA_ANY( arg );

str here is not null terminated. You need text_to_cstring or something
similar.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Question about Encoding a Custom Type
Date: 2008-06-16 19:10:22
Message-ID: 6EF5F3C6-22D2-43C1-AAD2-5AD9B8E5AE59@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 16, 2008, at 13:41, Martijn van Oosterhout wrote:

> Actually, real dumb question but: arn't you assume that text* values
> are NULL terminated, because they're not...
>>
>> char * cilower(text * arg) {
>> // Do I need to free anything here?
>> char * str = VARDATA_ANY( arg );
>
> str here is not null terminated. You need text_to_cstring or something
> similar.

Ah! That makes sense. I changed it to this:

#define GET_TEXT_STR(textp) DatumGetCString( \
DirectFunctionCall1( textout, PointerGetDatum( textp ) ) \
)

char * cilower(text * arg) {
// Do I need to free anything here?
char * str = GET_TEXT_STR( arg );
...

And now I don't get that error anymore. W00t! Many thanks.

Now I have just one more bizarre error: PostgreSQL thinks that a
citext column is not in an aggregate even when it is:

try=# CREATE AGGREGATE array_accum (anyelement) (
try(# sfunc = array_append,
try(# stype = anyarray,
try(# initcond = '{}'
try(# );
try=# CREATE TEMP TABLE srt ( name CITEXT );
try=#
try=# INSERT INTO srt (name)
try-# VALUES ('aardvark'),
try-# ('AAA'),
try-# ('â');
try=# select array_accum(name) from srt order by name;
ERROR: column "srt.name" must appear in the GROUP BY clause or be
used in an aggregate function

Um, what? Again, I'm sure I'm just missing something really stupid.
What might cause this?

Many thanks all,

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Question about Encoding a Custom Type
Date: 2008-06-16 19:38:32
Message-ID: 6957.1213645112@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> Now I have just one more bizarre error: PostgreSQL thinks that a
> citext column is not in an aggregate even when it is:
> try=# select array_accum(name) from srt order by name;
> ERROR: column "srt.name" must appear in the GROUP BY clause or be
> used in an aggregate function

> Um, what?

It's complaining about the use in ORDER BY.

regards, tom lane


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Question about Encoding a Custom Type
Date: 2008-06-16 19:45:57
Message-ID: 5613682E-89E6-4A68-83FA-B291643C1C60@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 16, 2008, at 14:38, Tom Lane wrote:

> It's complaining about the use in ORDER BY.

Okay, so stupid question: How can I get an array of the values in a
given order? I guess this works:

select array_accum(b) from ( select name from srt order by name ) AS
A(b);

Thanks,

David


From: David Fetter <david(at)fetter(dot)org>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Question about Encoding a Custom Type
Date: 2008-06-16 21:48:45
Message-ID: 20080616214845.GB2859@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 16, 2008 at 02:45:57PM -0500, David Wheeler wrote:
> On Jun 16, 2008, at 14:38, Tom Lane wrote:
>
>> It's complaining about the use in ORDER BY.
>
> Okay, so stupid question: How can I get an array of the values in a
> given order? I guess this works:
>
> select array_accum(b) from ( select name from srt order by name ) AS
> A(b);

SELECT ARRAY(SELECT name FROM srt ORDER BY name); -- also works.

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Question about Encoding a Custom Type
Date: 2008-06-17 03:04:59
Message-ID: 79F2A31C-EA8C-4A9A-885C-D1E11DB22EE9@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 16, 2008, at 16:48, David Fetter wrote:

>> select array_accum(b) from ( select name from srt order by name ) AS
>> A(b);
>
> SELECT ARRAY(SELECT name FROM srt ORDER BY name); -- also works.

Wow, somehow I'd missed that syntax over the years. Thanks David!

Best,

David