Re: Latest on CITEXT 2.0

Lists: pgsql-hackers
From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Latest on CITEXT 2.0
Date: 2008-06-25 18:47:39
Message-ID: 0E9C2D67-5F27-4747-B6F2-346BF8C2D81E@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Howdy,

I just wanted to report the latest on my pet project: implementing a
new case-insensitive text type, "citext", to be locale-aware and to
build and run on PostgreSQL 8.3. I'm not much of a C programmer (this
is only the second time I've written *anything* in C), so I also have
a few questions about my code, best practices, coverage, etc. You can
grab the latest here:

https://svn.kineticode.com/citext/trunk/

BTW, the tests in sql/citext.sql use the pgtap.sql file to run TAP
regression tests. So you can run them using `make installcheck` or
`make test`. The latter requires that pg_prove be installed; you can
get it here:

https://svn.kineticode.com/pgtap/trunk/

Anyway, I think I've got it pretty close to done. The tests cover a
lot of stuff -- nearly everything I could figure out, anyway. But
there are a few gaps.

As a result, I'd appreciate a little help with these questions, all in
the name of making this a solid data type suitable for use on
production systems:

* There seem to still be some implicit CASTS to text that I'd like to
duplicate. For example, select '192.168.1.2'::cidr::text;` works, but
`select '192.168.1.2'::cidr::citext;` does not. Where can I find the C
functions that do these casts for TEXT so that I can put them to work
for citext, too? The internal cast functions used in the old citext
distribution don't exist at all on 8.3.

* There are casts from text that I'd also like to harness for use by
citext, like `cidr(text)`. Where can I find these C functions as well?
(The upshot of this and the previous points is that I'd like citext to
be as compatible with TEXT as possible, and I just need to figure out
how to fill in the gaps in that compatibility.)

* Regular expression and LIKE comparisons using the the operators
properly work case-insensitively, but functions like replace() and
regexp_replace() do not. Should they? and if so, how can I make them
do so?

* The tests assume that LC_COLLATE is set to en_US.UTF-8. Does that
work well for standard PostgreSQL regression tests? How are locale-
sensitive tests run in core regression tests?

* As for my C programming, well, what's broken? I'm especially
concerned that I pfree variables appropriately, but I'm not at all
clear on what needs to be freed. Martijn mentioned before that btree
comparison functions free memory, but I'm such a C n00b that I don't
know what that actually means for my implementation. I'd actually
appreciate a bit of pedantry here. :-)

* Am I in fact getting an appropriate nul-terminated string in my
cilower() function using this code?

char * str = DatumGetCString(
DirectFunctionCall1( textout, PointerGetDatum( arg ) )
);

Those are all the questions I had about my implementation. I'd like to
get this thing done and released soon, so that I can be done with this
particular Yak and get back to what I'm *supposed* to be doing with my
time.

BTW, would there be any interest in this code going into contrib/ in
the distribution? I think that, if we can ensure that it works just
like LOWER() = LOWER(), but without requiring that code, then it would
be a great type to point people to to use instead of that SQL hack
(with all the usual caveats about it being locale-sensitive and not
canonically case-insensitive in the Unicode sense). If so, I'd be
happy to make whatever changes are necessary to make it fit in with
the coding and organization standards of the core and to submit it.

But please, don't expect a civarchar type from me anytime soon. ;-)

Many thanks,

David


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-06-26 10:28:22
Message-ID: 20080626102822.GB27678@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jun 25, 2008 at 11:47:39AM -0700, David E. Wheeler wrote:
> * There seem to still be some implicit CASTS to text that I'd like to
> duplicate. For example, select '192.168.1.2'::cidr::text;` works, but
> `select '192.168.1.2'::cidr::citext;` does not. Where can I find the C
> functions that do these casts for TEXT so that I can put them to work
> for citext, too? The internal cast functions used in the old citext
> distribution don't exist at all on 8.3.

Hmm, casts to/from text are somewhat "magic" in postgres. They are
implemented by calling the usual type input/output function. I have no
idea how to extend that to other types.

> * There are casts from text that I'd also like to harness for use by
> citext, like `cidr(text)`. Where can I find these C functions as well?
> (The upshot of this and the previous points is that I'd like citext to
> be as compatible with TEXT as possible, and I just need to figure out
> how to fill in the gaps in that compatibility.)

As above, they're probably not as seperate functions but a special hack
inthe casting code.

> * Regular expression and LIKE comparisons using the the operators
> properly work case-insensitively, but functions like replace() and
> regexp_replace() do not. Should they? and if so, how can I make them
> do so?

Regexes have case-insensetive modifiers, don't they? In which case I
don't think it'd be becessary.

> * As for my C programming, well, what's broken? I'm especially
> concerned that I pfree variables appropriately, but I'm not at all
> clear on what needs to be freed. Martijn mentioned before that btree
> comparison functions free memory, but I'm such a C n00b that I don't
> know what that actually means for my implementation. I'd actually
> appreciate a bit of pedantry here. :-)

When creating an index, your comparison functions are going ot be
called O(N log N) times. If they leak into a context that isn't
regularly freed you may have a problem. I'd suggest loking at how the
text comparisons do it. PG_FREE_IF_COPY() is probably a good idea
because the incoming tuples may be detoasted.

> * Am I in fact getting an appropriate nul-terminated string in my
> cilower() function using this code?
>
> char * str = DatumGetCString(
> DirectFunctionCall1( textout, PointerGetDatum( arg ) )
> );

Yes.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-06-26 15:27:36
Message-ID: 5F9B80BA-0389-4B64-B402-E0AB8863EC0C@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 26, 2008, at 03:28, Martijn van Oosterhout wrote:

> Hmm, casts to/from text are somewhat "magic" in postgres. They are
> implemented by calling the usual type input/output function. I have no
> idea how to extend that to other types.

Oh. Okay. Perhaps I won't worry about it just now, then.

> As above, they're probably not as seperate functions but a special
> hack
> inthe casting code.

Okay.

> Regexes have case-insensetive modifiers, don't they? In which case I
> don't think it'd be becessary.

They do, but replace(), split_part(), strpos(), and translate() do not.

> When creating an index, your comparison functions are going ot be
> called O(N log N) times. If they leak into a context that isn't
> regularly freed you may have a problem. I'd suggest loking at how the
> text comparisons do it. PG_FREE_IF_COPY() is probably a good idea
> because the incoming tuples may be detoasted.

Okay. I'll have a look at varlena.c, then.

>> * Am I in fact getting an appropriate nul-terminated string in my
>> cilower() function using this code?
>>
>> char * str = DatumGetCString(
>> DirectFunctionCall1( textout, PointerGetDatum( arg ) )
>> );
>
> Yes.

Great, I thought so (since it made the failures go away). Many thanks.

David


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-06-26 16:02:56
Message-ID: 25892F1D-AB99-4CBB-AA04-A5A4267ED196@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thanks a million for your answers, Martijn. I just have some more
stupid questions, if you could bear with me.

On Jun 26, 2008, at 03:28, Martijn van Oosterhout wrote:

> When creating an index, your comparison functions are going ot be
> called O(N log N) times. If they leak into a context that isn't
> regularly freed you may have a problem. I'd suggest loking at how the
> text comparisons do it. PG_FREE_IF_COPY() is probably a good idea
> because the incoming tuples may be detoasted.

Okay, I see that text_cmp in varlena doesn't use PG_FREE_IF_COPY(),
and neither do text_smaller nor text_larger (which just dispatch to
text_cmp anyway).

The operator functions *do* use PG_FREE_IF_COPY(). So I'm guessing
it's these functions you're talking about. However, my implementation
just looks like this:

Datum citext_ne (PG_FUNCTION_ARGS) {
// Fast path for different-length inputs. Okay for canonical
equivalence?
if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1)))
PG_RETURN_BOOL( 1 );
PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 );
}

I don't *thinkI any variables are copied there. citextcmp() is just
this:

int citextcmp (PG_FUNCTION_ARGS) {
// XXX These are all just references to existing structures, right?
text * left = PG_GETARG_TEXT_P(0);
text * right = PG_GETARG_TEXT_P(1);
return varstr_cmp(
cilower( left ),
VARSIZE_ANY_EXHDR(left),
cilower( right ),
VARSIZE_ANY_EXHDR(right)
);
}

Again, no copying. cilower() does copy:

int index, len;
char * result;

index = 0;
len = VARSIZE(arg) - VARHDRSZ;
result = (char *) palloc( strlen( str ) + 1 );

for (index = 0; index <= len; index++) {
result[index] = tolower((unsigned char) str[index] );
}
// XXX I don't need to pfree result if I'm returning it, right?
return result;

But the copied value is returned. Hrm…it should probably be pfreed
somewhere, yes?

So I'm wondering if I should change citextcmp to pfree values?
Something like this:

text * left = PG_GETARG_TEXT_P(0);
text * right = PG_GETARG_TEXT_P(1);
char * lcstr = cilower( left );
char * rcstr = cilower( right );

int result = varstr_cmp(
cilower( left ),
VARSIZE_ANY_EXHDR(left),
cilower( right ),
VARSIZE_ANY_EXHDR(right)
);

pfree( lcstr );
pfree( rcstr );
return result;

This is the only function that calls cilower(). And I *think* it's the
only place where values are copied or memory is allocated needing to
be freed. Does that sound right to you?

On a side note, I've implemented this pretty differently from how the
text functions are implemented in varlena.c, just to try to keep
things succinct. But I'm wondering now if I shouldn't switch back to
the style used by varlena.c, if only to keep the style the same, and
thus perhaps to increase the chances that citext would be a welcome
contrib addition. Thoughts?

Many thanks again. You're a great help to this C n00b.

Best,

David


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-06-26 16:19:56
Message-ID: 20080626161956.GC4396@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David E. Wheeler wrote:

> The operator functions *do* use PG_FREE_IF_COPY(). So I'm guessing it's
> these functions you're talking about. However, my implementation just
> looks like this:
>
> Datum citext_ne (PG_FUNCTION_ARGS) {
> // Fast path for different-length inputs. Okay for canonical
> equivalence?
> if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1)))
> PG_RETURN_BOOL( 1 );
> PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 );
> }

PG_GETARG_TEXT_P can detoast the datum, which creates a copy.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-06-26 16:38:21
Message-ID: 7BDE8C80-B9FC-459A-902C-2EDF99A5C8EF@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 26, 2008, at 09:19, Alvaro Herrera wrote:

> PG_GETARG_TEXT_P can detoast the datum, which creates a copy.

Thanks. I've just completely refactored things to look more like the
approach taken by varlena.c, both in terms of when stuff gets freed
and in terms of coding style. It's more verbose, but I feel much more
comfortable with memory management now that I'm following a known
implementation more closely. :-)

So now I've changed citextcmp to this:

static int
citextcmp (text * left, text * right)
{
char * lcstr, * rcstr;
int result;

lcstr = cilower( left );
rcstr = cilower( right );

result = varstr_cmp(
cilower( left ),
VARSIZE_ANY_EXHDR(left),
cilower( right ),
VARSIZE_ANY_EXHDR(right)
);

pfree( lcstr );
pfree( rcstr );
return result;
}

And now all of the operator functions are freeing memory using
PG_FREE_IF_COPY() like this:

Datum
citext_cmp(PG_FUNCTION_ARGS)
{
text * left = PG_GETARG_TEXT_PP(0);
text * right = PG_GETARG_TEXT_PP(1);
int32 result;

result = citextcmp(left, right);

PG_FREE_IF_COPY(left, 0);
PG_FREE_IF_COPY(right, 1);

PG_RETURN_INT32( result );
}

The only functions that don't do that are citext_smaller() and
citext_larger():

Datum
citext_smaller(PG_FUNCTION_ARGS)
{
text * left = PG_GETARG_TEXT_PP(0);
text * right = PG_GETARG_TEXT_PP(1);
text * result;

result = citextcmp(left, right) < 0 ? left : right;
PG_RETURN_TEXT_P(result);
}

This is just how varlena.c does it, but I am wondering if something
*should* be freed there.

Thanks a bunch!

Best,

David


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-06-26 17:02:19
Message-ID: 17935.1214499739@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> Datum citext_ne (PG_FUNCTION_ARGS) {
> // Fast path for different-length inputs. Okay for canonical
> equivalence?
> if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1)))
> PG_RETURN_BOOL( 1 );
> PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 );
> }

BTW, I don't think you can use that same-length optimization for
citext. There's no reason to think that upper/lowercase pairs will
have the same length all the time in multibyte encodings.

regards, tom lane


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-06-26 17:09:37
Message-ID: 7998C08A-D40B-4081-A343-1EA1B3FA7976@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 26, 2008, at 10:02, Tom Lane wrote:

> BTW, I don't think you can use that same-length optimization for
> citext. There's no reason to think that upper/lowercase pairs will
> have the same length all the time in multibyte encodings.

I was wondering about that. I had been thinking of canonically-
equivalent stings and combining marks. Doing a quick test it looks
like combining marks are not equivalent. For example, this returns
false:

SELECT 'Ä'::text = 'Ä'::text;

At least with en_US.UTF-8. Hrm. It looks like my client makes them
both canonical, so I've attached a script demonstrating this issue.

Anyway, I was aware of different byte counts for canonical
equivalence, but not for differences between upper- and lowercase
characters. I'd certainly defer to your knowledge of how these things
truly work in PostgreSQL, Tom, and can of course easily remove that
optimization. So, are your certain about this?

Many thanks,

David

Attachment Content-Type Size
try.sql application/octet-stream 34 bytes

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Subject: Re: Latest on CITEXT 2.0
Date: 2008-06-26 19:03:43
Message-ID: 200806261203.43675.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David,

> Thanks. I've just completely refactored things to look more like the
> approach taken by varlena.c, both in terms of when stuff gets freed
> and in terms of coding style. It's more verbose, but I feel much more
> comfortable with memory management now that I'm following a known
> implementation more closely. :-)

Will this be ready for the July CommitFest?

--
Josh Berkus
PostgreSQL @ Sun
San Francisco


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-06-26 19:48:27
Message-ID: B62FBAB4-DE52-4006-962D-419233AB6E30@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 26, 2008, at 12:03, Josh Berkus wrote:

> Will this be ready for the July CommitFest?

When is it due, July 1? If so, yes, it should be. I could use a close
review by someone who knows this shit a whole lot better than I do.

Thanks,

David


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-06-26 20:12:42
Message-ID: 200806261312.42877.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

David,

> When is it due, July 1? If so, yes, it should be. I could use a close
> review by someone who knows this shit a whole lot better than I do.

Well, that's what the commitfest is for. Go ahead and add yourself once you
post the new patch.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-06-26 20:59:23
Message-ID: 20960.1214513963@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"David E. Wheeler" <david(at)kineticode(dot)com> writes:
> So, are your certain about this?

See Turkish --- in that locale i and I are not an upper/lower pair,
instead they pair with some non-ASCII letters. There are likely
other cases but that's the counterexample I remember.

regards, tom lane


From: "David E(dot) Wheeler" <david(at)kineticode(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-06-27 17:24:56
Message-ID: 70324AF2-C8F7-4A38-9042-A12614D639D1@kineticode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jun 26, 2008, at 13:59, Tom Lane wrote:

> "David E. Wheeler" <david(at)kineticode(dot)com> writes:
>> So, are your certain about this?
>
> See Turkish --- in that locale i and I are not an upper/lower pair,
> instead they pair with some non-ASCII letters. There are likely
> other cases but that's the counterexample I remember.

Perfect, thank you. I was able to add a failing test and make it pass
by removing the string length optimization.

For future reference of anyone reading the list, this page has a great
description of the problem:

http://www.i18nguy.com/unicode/turkish-i18n.html

Best,

David


From: "Marko Kreen" <markokr(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-07-01 14:50:43
Message-ID: e51f66da0807010750w31aae655h553578ce7aba9d78@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/26/08, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "David E. Wheeler" <david(at)kineticode(dot)com> writes:
> > Datum citext_ne (PG_FUNCTION_ARGS) {
> > // Fast path for different-length inputs. Okay for canonical
> > equivalence?
> > if (VARSIZE(PG_GETARG_TEXT_P(0)) != VARSIZE(PG_GETARG_TEXT_P(1)))
> > PG_RETURN_BOOL( 1 );
> > PG_RETURN_BOOL( citextcmp( PG_ARGS ) != 0 );
> > }
>
> BTW, I don't think you can use that same-length optimization for
> citext. There's no reason to think that upper/lowercase pairs will
> have the same length all the time in multibyte encodings.

What about this code in current str_tolower():

/* Output workspace cannot have more codes than input bytes */
workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t));

Bug?

--
marko


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Marko Kreen" <markokr(at)gmail(dot)com>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-07-01 14:52:16
Message-ID: 4993.1214923936@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Marko Kreen" <markokr(at)gmail(dot)com> writes:
> On 6/26/08, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> BTW, I don't think you can use that same-length optimization for
>> citext. There's no reason to think that upper/lowercase pairs will
>> have the same length all the time in multibyte encodings.

> What about this code in current str_tolower():

> /* Output workspace cannot have more codes than input bytes */
> workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t));

That's working with wchars, not bytes.

regards, tom lane


From: "Marko Kreen" <markokr(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-07-01 15:13:26
Message-ID: e51f66da0807010813i6aec31bp7dc37419d5964679@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 7/1/08, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Marko Kreen" <markokr(at)gmail(dot)com> writes:
> > On 6/26/08, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
> >> BTW, I don't think you can use that same-length optimization for
> >> citext. There's no reason to think that upper/lowercase pairs will
> >> have the same length all the time in multibyte encodings.
>
> > What about this code in current str_tolower():
>
> > /* Output workspace cannot have more codes than input bytes */
> > workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t));
>
>
> That's working with wchars, not bytes.

Ah, I missed the point of char2wchar() line.

I'm rather unfamiliar with various MB API-s, sorry.

There's another thing I'm probably missing: does current code handle
multi-wchar codepoints? Or is it guaranteed they don't happen?
(Wasn't wchar_t usually 16bit value?)

--
marko


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Marko Kreen <markokr(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-07-01 15:25:07
Message-ID: 200807011525.m61FP7221773@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Marko Kreen wrote:
> On 7/1/08, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > "Marko Kreen" <markokr(at)gmail(dot)com> writes:
> > > On 6/26/08, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >
> > >> BTW, I don't think you can use that same-length optimization for
> > >> citext. There's no reason to think that upper/lowercase pairs will
> > >> have the same length all the time in multibyte encodings.
> >
> > > What about this code in current str_tolower():
> >
> > > /* Output workspace cannot have more codes than input bytes */
> > > workspace = (wchar_t *) palloc((nbytes + 1) * sizeof(wchar_t));
> >
> >
> > That's working with wchars, not bytes.
>
> Ah, I missed the point of char2wchar() line.
>
> I'm rather unfamiliar with various MB API-s, sorry.
>
> There's another thing I'm probably missing: does current code handle
> multi-wchar codepoints? Or is it guaranteed they don't happen?
> (Wasn't wchar_t usually 16bit value?)

If you want a simple example of wide character use look at
oracle_compat.c::upper() which calls str_toupper() in CVS HEAD.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Marko Kreen" <markokr(at)gmail(dot)com>
To: "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-07-01 15:33:01
Message-ID: e51f66da0807010833j1129d25dvc2dbb01e566b337f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 7/1/08, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> If you want a simple example of wide character use look at
> oracle_compat.c::upper() which calls str_toupper() in CVS HEAD.

ATM I'm looking at str_tolower/upper internal implementation.

They do:

workspace[curr_char] = towlower(workspace[curr_char]);

where workspace is wchar_t but towlower() operates on wint_t.

Is such inconsistency ok?

--
marko


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Marko Kreen" <markokr(at)gmail(dot)com>
Cc: "Bruce Momjian" <bruce(at)momjian(dot)us>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-07-01 15:36:41
Message-ID: 10993.1214926601@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Marko Kreen" <markokr(at)gmail(dot)com> writes:
> ATM I'm looking at str_tolower/upper internal implementation.
> They do:
> workspace[curr_char] = towlower(workspace[curr_char]);
> where workspace is wchar_t but towlower() operates on wint_t.

IIRC this is exactly comparable to the type situation for the
traditional <ctype.h> macros. The reason is that they are defined
to accept EOF in addition to actual char (or wchar) values.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Marko Kreen" <markokr(at)gmail(dot)com>
Cc: "David E(dot) Wheeler" <david(at)kineticode(dot)com>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-07-01 15:43:07
Message-ID: 11143.1214926987@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Marko Kreen" <markokr(at)gmail(dot)com> writes:
> There's another thing I'm probably missing: does current code handle
> multi-wchar codepoints? Or is it guaranteed they don't happen?

AFAIK we disallow multi-wchar situations (by rejecting the UTF8
combining codes).

> (Wasn't wchar_t usually 16bit value?)

Hmm. It's unsigned int on my ancient HPUX box. I think we could have a
problem on any machines whose mbstowcs doesn't support 4-byte UTF8
codes, though ... in particular, what about Windows?

regards, tom lane


From: "Marko Kreen" <markokr(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Bruce Momjian" <bruce(at)momjian(dot)us>, "David E(dot) Wheeler" <david(at)kineticode(dot)com>, "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latest on CITEXT 2.0
Date: 2008-07-01 16:07:32
Message-ID: e51f66da0807010907v2a9b3780r9e983e82440fd422@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 7/1/08, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> "Marko Kreen" <markokr(at)gmail(dot)com> writes:
> > ATM I'm looking at str_tolower/upper internal implementation.
> > They do:
> > workspace[curr_char] = towlower(workspace[curr_char]);
> > where workspace is wchar_t but towlower() operates on wint_t.
>
> IIRC this is exactly comparable to the type situation for the
> traditional <ctype.h> macros. The reason is that they are defined
> to accept EOF in addition to actual char (or wchar) values.

I read SUS v3 and there is no hint on multi-wchar anything,
so for unix systems you are right, wint_t == wchar_t.

Seems stories how Windows and Java operate have affected me too much.

Then I browsed MSDN:

http://msdn.microsoft.com/en-us/library/dtxesf6k.aspx

and they seem to strongly hint that wchar_t == 16 bits and
UTF-16 is used internally.

Probably some Windows developer should look into it
and decide if there is a #ifdef WIN32 branch needed.

--
marko