Quick Links

Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)

Lists:	pgsql-hackers

From:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 14:54:52
Message-ID:	5aa1df8a-96f5-1d14-46fd-032e32846c71@8kdata.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hi list.

As has been previously discussed (see
https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl
for instance) varlena fields cannot accept the literal 0x00 value. Sure,
you can use bytea, but this hardly a good solution. The problem seems to
be hitting some use cases, like:

- People migrating data from other databases (apart from PostgreSQL, I
don't know of any other database which suffers the same problem).
- People using drivers which use UTF-8 or equivalent encodings by
default (Java for example)

Given that 0x00 is a perfectly legal UTF-8 character, I conclude
we're strictly non-compliant. And given the general Postgres policy
regarding standards compliance and the people being hit by this, I think
it should be addressed. Specially since all the usual fixes are a real
PITA (re-parsing, re-generating strings, which is very expensive, or
dropping data).

What would it take to support it? Isn't the varlena header
propagated everywhere, which could help infer the real length of the
string? Any pointers or suggestions would be welcome.

Thanks,

Álvaro

Álvaro Hernández Tortosa

-----------
8Kdata

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 15:23:41
Message-ID:	24883.1470237821@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

=?UTF-8?Q?=c3=81lvaro_Hern=c3=a1ndez_Tortosa?= <aht(at)8kdata(dot)com> writes:
> As has been previously discussed (see
> https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl
> for instance) varlena fields cannot accept the literal 0x00 value.

Yup.

> What would it take to support it?

One key reason why that's hard is that datatype input and output
functions use nul-terminated C strings as the representation of the
text form of any datatype. We can't readily change that API without
breaking huge amounts of code, much of it not under the core project's
control.

There may be other places where nul-terminated strings would be a hazard
(mumble fgets mumble), but offhand that API seems like the major problem
so far as the backend is concerned.

There would be a slew of client-side problems as well. For example this
would assuredly break psql and pg_dump, along with every other client that
supposes that it can treat PQgetvalue() as returning a nul-terminated
string. This end of it would possibly be even worse than fixing the
backend, because so little of the affected code is under our control.

In short, the problem is not with having an embedded nul in a stored
text value. The problem is the reams of code that suppose that the
text representation of any data value is a nul-terminated C string.

regards, tom lane

From:	Kevin Grittner <kgrittn(at)gmail(dot)com>
To:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 15:47:56
Message-ID:	CACjxUsNfr6FH-1vwmVX21H6jfoK+O1ErkCfYx6UjbyZVWjhH9A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com> wrote:

> What would it take to support it?

Would it be of any value to support "Modified UTF-8"?

https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To:	Kevin Grittner <kgrittn(at)gmail(dot)com>, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 16:00:25
Message-ID:	00eb530a-f4e5-bdf4-a804-15c4ad515cb6@2ndquadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 8/3/16 11:47 AM, Kevin Grittner wrote:
> On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com> wrote:
>
>> What would it take to support it?
>
> Would it be of any value to support "Modified UTF-8"?
>
> https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

Will this work with OS libraries?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

From:	Geoff Winkless <pgsqladmin(at)geoff(dot)dj>
To:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 16:35:36
Message-ID:	CAEzk6fcgyRkxw35kRrJ9XNFXmgnxS9qAAMh1quo=iN6w+yRP5w@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 3 August 2016 at 15:54, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com> wrote:
> Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're
> strictly non-compliant.

It's perhaps worth mentioning that 0x00 is valid ASCII too, and
PostgreSQL has never stored that either.

If you want to start quoting standards, there is in fact specific
mention in the ANSI spec of null terminators in passing strings to
host languages, so if postgresql stored NULs in that way we would end
up with parameters that we couldn't pass to UDFs in a
standards-compliant way.

Geoff

From:	Craig Ringer <craig(at)2ndquadrant(dot)com>
To:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 17:16:40
Message-ID:	CAMsr+YHa3AjtBmXc2hOZaKVLbuyjWPipNknSPa_JRmdZh4X4Ww@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 3 August 2016 at 22:54, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com> wrote:

>
> Hi list.
>
> As has been previously discussed (see
> https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl
> for instance) varlena fields cannot accept the literal 0x00 value. Sure,
> you can use bytea, but this hardly a good solution. The problem seems to be
> hitting some use cases, like:
>
> - People migrating data from other databases (apart from PostgreSQL, I
> don't know of any other database which suffers the same problem).
> - People using drivers which use UTF-8 or equivalent encodings by default
> (Java for example)
>
> Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're
> strictly non-compliant. And given the general Postgres policy regarding
> standards compliance and the people being hit by this, I think it should be
> addressed. Specially since all the usual fixes are a real PITA (re-parsing,
> re-generating strings, which is very expensive, or dropping data).
>
> What would it take to support it? Isn't the varlena header propagated
> everywhere, which could help infer the real length of the string? Any
> pointers or suggestions would be welcome.

One of the bigger pain points is that our interaction with C library
collation routines for sorting uses NULL-terminated C strings. strcoll,
strxfrm, etc.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 18:10:36
Message-ID:	a7346dd0-a677-d3f2-814a-15705641f8cf@8kdata.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 03/08/16 17:23, Tom Lane wrote:
> =?UTF-8?Q?=c3=81lvaro_Hern=c3=a1ndez_Tortosa?= <aht(at)8kdata(dot)com> writes:
>> As has been previously discussed (see
>> https://www.postgresql.org/message-id/BAY7-F17FFE0E324AB3B642C547E96890%40phx.gbl
>> for instance) varlena fields cannot accept the literal 0x00 value.
> Yup.
>
>> What would it take to support it?
> One key reason why that's hard is that datatype input and output
> functions use nul-terminated C strings as the representation of the
> text form of any datatype. We can't readily change that API without
> breaking huge amounts of code, much of it not under the core project's
> control.
>
> There may be other places where nul-terminated strings would be a hazard
> (mumble fgets mumble), but offhand that API seems like the major problem
> so far as the backend is concerned.
>
> There would be a slew of client-side problems as well. For example this
> would assuredly break psql and pg_dump, along with every other client that
> supposes that it can treat PQgetvalue() as returning a nul-terminated
> string. This end of it would possibly be even worse than fixing the
> backend, because so little of the affected code is under our control.
>
> In short, the problem is not with having an embedded nul in a stored
> text value. The problem is the reams of code that suppose that the
> text representation of any data value is a nul-terminated C string.
>
> regards, tom lane

Wow. That seems like a daunting task.

I guess, then, than even implementing a new datatype based on bytea
but that would use the text IO functions to show up as text (not
send/recv) would neither work, right?

Thanks for the input,

Álvaro

Álvaro Hernández Tortosa

-----------
8Kdata

From:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
To:	Kevin Grittner <kgrittn(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 18:14:04
Message-ID:	ca3608a6-8fad-67de-d368-a52318aed3d7@8kdata.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 03/08/16 17:47, Kevin Grittner wrote:
> On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com> wrote:
>
>> What would it take to support it?
> Would it be of any value to support "Modified UTF-8"?
>
> https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
>

That's nice, but I don't think so.

The problem is that you cannot predict how people would send you
data, like when importing from other databases. I guess it may work if
Postgres would implement such UTF-8 variant and also the drivers, but
that would still require an encoding conversion (i.e., parsing every
string) to change the 0x00, which seems like a serious performance hit.

It could be worse than nothing, though!

Thanks,

Álvaro

Álvaro Hernández Tortosa

-----------
8Kdata

From:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
To:	Geoff Winkless <pgsqladmin(at)geoff(dot)dj>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 18:15:44
Message-ID:	13c36522-ca3f-5e78-3518-6615f3defb7f@8kdata.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 03/08/16 18:35, Geoff Winkless wrote:
> On 3 August 2016 at 15:54, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com> wrote:
>> Given that 0x00 is a perfectly legal UTF-8 character, I conclude we're
>> strictly non-compliant.
> It's perhaps worth mentioning that 0x00 is valid ASCII too, and
> PostgreSQL has never stored that either.

Then yes, it could also be a problem. But as of today I believe
solving the problem for UTF-8 would solve the great majority of this
embedded NUL problems.

Álvaro

Álvaro Hernández Tortosa

-----------
8Kdata

From:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
To:	Kevin Grittner <kgrittn(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 19:13:18
Message-ID:	b2f6204e-a4a1-06e9-f333-1b18477d3504@8kdata.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 03/08/16 20:14, Álvaro Hernández Tortosa wrote:
>
>
> On 03/08/16 17:47, Kevin Grittner wrote:
>> On Wed, Aug 3, 2016 at 9:54 AM, Álvaro Hernández Tortosa
>> <aht(at)8kdata(dot)com> wrote:
>>
>>> What would it take to support it?
>> Would it be of any value to support "Modified UTF-8"?
>>
>> https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
>>
>
> That's nice, but I don't think so.
>
> The problem is that you cannot predict how people would send you
> data, like when importing from other databases. I guess it may work if
> Postgres would implement such UTF-8 variant and also the drivers, but
> that would still require an encoding conversion (i.e., parsing every
> string) to change the 0x00, which seems like a serious performance hit.
>
> It could be worse than nothing, though!
>
> Thanks,
>
> Álvaro
>

It may indeed work.

According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
the encoding used in Modified UTF-8 is an (otherwise) invalid UTF-8 code
point. In short, the \u00 nul is represented (overlong encoding) by the
two-byte, 1 character sequence \uc080. These two bytes are invalid UTF-8
so should not appear in an otherwise valid UTF-8 string. Yet they are
accepted by Postgres (like if Postgres would support Modified UTF-8
intentionally). The caracter in psql does not render as a nul but as
this symbol: "삀".

Given that this works, the process would look like this:

- Parse all input data looking for bytes with hex value 0x00. If they
appear in the string, they are the null byte.
- Replace that byte with the two bytes 0xc080.
- Reverse the operation when reading.

This is OK but of course a performance hit (searching for 0x00 and
then augmenting the byte[] or whatever data structure to account for the
extra byte). A little bit of a PITA, but I guess better than fixing it
all :)

Álvaro

Álvaro Hernández Tortosa

-----------
8Kdata

From:	Geoff Winkless <pgsqladmin(at)geoff(dot)dj>
To:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
Cc:	Kevin Grittner <kgrittn(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 19:31:49
Message-ID:	CAEzk6fd89ddBdaw6mJJgo+G-NHKQX75VGh+Ssq5dq9SsOVbrMg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 3 August 2016 at 20:13, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com> wrote:
> Yet they are accepted by Postgres
> (like if Postgres would support Modified UTF-8 intentionally). The caracter
> in psql does not render as a nul but as this symbol: "삀".

Not accepted as valid utf8:

# select E'\xc0\x80';
ERROR: invalid byte sequence for encoding "UTF8": 0xc0 0x80

You would need a "modified utf8" encoding, I think.

Geoff

From:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
To:	Geoff Winkless <pgsqladmin(at)geoff(dot)dj>
Cc:	Kevin Grittner <kgrittn(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 19:36:26
Message-ID:	154e9300-a17b-2cb2-f032-3e036f94ee9c@8kdata.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 03/08/16 21:31, Geoff Winkless wrote:
> On 3 August 2016 at 20:13, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com> wrote:
>> Yet they are accepted by Postgres
>> (like if Postgres would support Modified UTF-8 intentionally). The caracter
>> in psql does not render as a nul but as this symbol: "삀".
> Not accepted as valid utf8:
>
> # select E'\xc0\x80';
> ERROR: invalid byte sequence for encoding "UTF8": 0xc0 0x80
>
> You would need a "modified utf8" encoding, I think.
>
> Geoff

Isn't the correct syntax something like:

select E'\uc080', U&'\c080';

It is a single character, 16 bit unicode sequence (see
https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html).

Álvaro

Álvaro Hernández Tortosa

-----------
8Kdata

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
Cc:	Kevin Grittner <kgrittn(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 19:41:54
Message-ID:	16003.1470253314@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

=?UTF-8?Q?=c3=81lvaro_Hern=c3=a1ndez_Tortosa?= <aht(at)8kdata(dot)com> writes:
> According to https://en.wikipedia.org/wiki/UTF-8#Codepage_layout
> the encoding used in Modified UTF-8 is an (otherwise) invalid UTF-8 code
> point. In short, the \u00 nul is represented (overlong encoding) by the
> two-byte, 1 character sequence \uc080. These two bytes are invalid UTF-8
> so should not appear in an otherwise valid UTF-8 string. Yet they are
> accepted by Postgres (like if Postgres would support Modified UTF-8
> intentionally).

Really? It sure looks to me like pg_utf8_islegal() would reject this.

We could hack it to allow the case, no doubt, but I concur with Peter's
concern that we'd have trouble with OS-level code that is strict about
what UTF8 allows. glibc, for example, is known to do very strange things
with strings that it thinks are invalid in the active encoding.

regards, tom lane

From:	Geoff Winkless <pgsqladmin(at)geoff(dot)dj>
To:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 19:42:40
Message-ID:	CAEzk6feZ3PcQhSPEitH-5_QuQNGq4FjkXiEO8+ocRQmVtj7Mig@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 3 August 2016 at 20:36, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com> wrote:
> Isn't the correct syntax something like:
>
> select E'\uc080', U&'\c080';
>
> ?
>
> It is a single character, 16 bit unicode sequence (see
> https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html).

No, what you've done there is created the three-byte utf8 sequence \xec8280

# select U&'\c080'::bytea;
bytea
----------
\xec8280

It's not a UCS2 c080, it's utf8 c080.

Geoff

From:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To:	Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 21:00:02
Message-ID:	CAEepm=2_oauRJW=K63cvbk4ye8VK70fp+Ldpcpu+uMHS6Td1CQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Aug 4, 2016 at 5:16 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> On 3 August 2016 at 22:54, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com> wrote:
>> What would it take to support it? Isn't the varlena header propagated
>> everywhere, which could help infer the real length of the string? Any
>> pointers or suggestions would be welcome.
>
>
> One of the bigger pain points is that our interaction with C library
> collation routines for sorting uses NULL-terminated C strings. strcoll,
> strxfrm, etc.

That particular bit of the problem would go away if this ever happened:

https://wiki.postgresql.org/wiki/Todo:ICU

ucoll_strcoll takes explicit lengths (though optionally accepts -1 for
null terminated mode).

http://userguide.icu-project.org/strings#TOC-Using-C-Strings:-NUL-Terminated-vs.-Length-Parameters

--
Thomas Munro
http://www.enterprisedb.com

From:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
To:	Geoff Winkless <pgsqladmin(at)geoff(dot)dj>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-03 22:31:13
Message-ID:	2b577088-91cc-87c4-bb8f-b77f574bbc2f@8kdata.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 03/08/16 21:42, Geoff Winkless wrote:
> On 3 August 2016 at 20:36, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com> wrote:
>> Isn't the correct syntax something like:
>>
>> select E'\uc080', U&'\c080';
>>
>> ?
>>
>> It is a single character, 16 bit unicode sequence (see
>> https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html).
> No, what you've done there is created the three-byte utf8 sequence \xec8280
>
> # select U&'\c080'::bytea;
> bytea
> ----------
> \xec8280
>
> It's not a UCS2 c080, it's utf8 c080.
>
> Geoff

Yes, you're absolutely right ^_^

Álvaro

Álvaro Hernández Tortosa

-----------
8Kdata

From:	Craig Ringer <craig(at)2ndquadrant(dot)com>
To:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc:	Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-04 00:22:25
Message-ID:	CAMsr+YF8ua27YmYJkOD_o+DwU_mUDjEZss9Nua9s-Wo2Qs2MOw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 4 August 2016 at 05:00, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:

> On Thu, Aug 4, 2016 at 5:16 AM, Craig Ringer <craig(at)2ndquadrant(dot)com>
> wrote:
> > On 3 August 2016 at 22:54, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>
> wrote:
> >> What would it take to support it? Isn't the varlena header
> propagated
> >> everywhere, which could help infer the real length of the string? Any
> >> pointers or suggestions would be welcome.
> >
> >
> > One of the bigger pain points is that our interaction with C library
> > collation routines for sorting uses NULL-terminated C strings. strcoll,
> > strxfrm, etc.
>
> That particular bit of the problem would go away if this ever happened:
>
> https://wiki.postgresql.org/wiki/Todo:ICU
>
> ucoll_strcoll takes explicit lengths (though optionally accepts -1 for
> null terminated mode).
>
>
> http://userguide.icu-project.org/strings#TOC-Using-C-Strings:-NUL-Terminated-vs.-Length-Parameters
>

Yep, it does. But we've made little to no progress on integration of ICU
support and AFAIK nobody's working on it right now.

I wonder how MySQL implements their collation and encoding support?

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc:	Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-04 00:40:31
Message-ID:	20160804004031.GG1702@momjian.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Aug 4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
> Yep, it does. But we've made little to no progress on integration of ICU
> support and AFAIK nobody's working on it right now.

Uh, this email from July says Peter Eisentraut will submit it in
September :-)

https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa2288925bd@2ndquadrant.com

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +

From:	Palle Girgensohn <girgen(at)pingpong(dot)net>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Devrim Gündüz <devrim(at)gunduz(dot)org>, Jakob Egger <jakob(at)eggerapps(dot)at>, Tobias Bussmann <t(dot)bussmann(at)gmx(dot)net>
Subject:	Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-10 20:42:01
Message-ID:	A4DB6CD4-F4CC-4C48-A9DC-DCBDCBD51186@pingpong.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> 4 aug. 2016 kl. 02:40 skrev Bruce Momjian <bruce(at)momjian(dot)us>:
>
> On Thu, Aug 4, 2016 at 08:22:25AM +0800, Craig Ringer wrote:
>> Yep, it does. But we've made little to no progress on integration of ICU
>> support and AFAIK nobody's working on it right now.
>
> Uh, this email from July says Peter Eisentraut will submit it in
> September :-)
>
> https://www.postgresql.org/message-id/2b833706-1133-1e11-39d9-4fa2288925bd@2ndquadrant.com

Cool.

I have brushed up my decade+ old patches [1] for ICU, so they now have support for COLLATE on columns.

https://github.com/girgen/postgres/

in branches icu/XXX where XXX is master or REL9_X_STABLE.

They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly. In this latest version, I've removed support for other encodings beside UTF-8, mostly since I don't know how to test them, but also, I see little point in supporting anything else using ICU.

I have one question for someone with knowledge about Turkish (Devrim?). This is the diff from regression tests, when running

$ gmake check EXTRA_TESTS=collate.linux.utf8 LANG=sv_SE.UTF-8

$ cat "/Users/girgen/postgresql/obj/src/test/regress/regression.diffs"
*** /Users/girgen/postgresql/postgres/src/test/regress/expected/collate.linux.utf8.out 2016-08-10 21:09:03.000000000 +0200
--- /Users/girgen/postgresql/obj/src/test/regress/results/collate.linux.utf8.out 2016-08-10 21:12:53.000000000 +0200
***************
*** 373,379 ****
SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
false
-------
! f
(1 row)

SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
--- 373,379 ----
SELECT 'Türkiye' COLLATE "tr_TR" ~* 'KI' AS "false";
false
-------
! t
(1 row)

SELECT 'bıt' ~* 'BIT' COLLATE "en_US" AS "false";
***************
*** 385,391 ****
SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
true
------
! t
(1 row)

-- The following actually exercises the selectivity estimation for ~*.
--- 385,391 ----
SELECT 'bıt' ~* 'BIT' COLLATE "tr_TR" AS "true";
true
------
! f
(1 row)

-- The following actually exercises the selectivity estimation for ~*.

======================================================================

The Linux locale behaves differently from ICU for the above (corner ?) cases. Any ideas if one is more correct than the other? I seems unclear to me. Perhaps it depends on whether the case-insensitive match is done using lower(both) or upper(both)? I haven't investigated this yet. @Devrim, is one more correct than the other?

As Thomas points out, using ucoll_strcoll it is quick, since no copying is needed. I will get some benchmarks soon.

Palle

[1] https://people.freebsd.org/~girgen/postgresql-icu/README.html

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Palle Girgensohn <girgen(at)pingpong(dot)net>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Devrim Gündüz <devrim(at)gunduz(dot)org>, Jakob Egger <jakob(at)eggerapps(dot)at>, Tobias Bussmann <t(dot)bussmann(at)gmx(dot)net>
Subject:	Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-11 01:05:19
Message-ID:	CAM3SWZRpzdphRojzOQpjq1cm4nk-5Kf9P0W5T1rzJFih=2AOig@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn <girgen(at)pingpong(dot)net> wrote:
> They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly. In this latest version, I've removed support for other encodings beside UTF-8, mostly since I don't know how to test them, but also, I see little point in supporting anything else using ICU.

Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
is not the release that introduced its use, it did expand it
significantly. I think you need to fix this, even though it isn't
actually used to sort text at present, since presumably FreeBSD builds
of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
reasonably trust the ICU equivalent of strxfrm(), so that's a missed
opportunity. (See the wiki page on the abbreviated keys issue [1] if
you don't know what I'm talking about.)

Shouldn't you really have a strxfrm() wrapper, used across the board,
including for callers outside of varlena.c? convert_string_datum() has
been calling strxfrm() for many releases now. These calls are still
used in FreeBSD builds, I would think, which seems like a bug that is
not dodged by simply not defining TRUST_STRXFRM. Isn't its assumption
that that matching the ordering used elsewhere not really hold on
FreeBSD builds?

[1] https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue
--
Peter Geoghegan

From:	Palle Girgensohn <girgen(at)pingpong(dot)net>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Devrim Gündüz <devrim(at)gunduz(dot)org>, Jakob Egger <jakob(at)eggerapps(dot)at>, Tobias Bussmann <t(dot)bussmann(at)gmx(dot)net>
Subject:	Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-11 09:15:36
Message-ID:	7D9265F9-2E7B-4136-BCA1-AAE8481561B7@pingpong.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> 11 aug. 2016 kl. 03:05 skrev Peter Geoghegan <pg(at)heroku(dot)com>:
>
> On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn <girgen(at)pingpong(dot)net> wrote:
>> They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly. In this latest version, I've removed support for other encodings beside UTF-8, mostly since I don't know how to test them, but also, I see little point in supporting anything else using ICU.
>
> Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
> is not the release that introduced its use, it did expand it
> significantly. I think you need to fix this, even though it isn't
> actually used to sort text at present, since presumably FreeBSD builds
> of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
> reasonably trust the ICU equivalent of strxfrm(), so that's a missed
> opportunity. (See the wiki page on the abbreviated keys issue [1] if
> you don't know what I'm talking about.)

My plan was to get it working without TRUST_STRXFRM first, and then add that functinality. I've made some preliminary tests using ICU:s ucol_getSortKey but I will have to test it a bit more. For now, I just expect not to trust strxfrm. It is the first iteration wrt strxfrm, the plan is to use that code base.

>
> Shouldn't you really have a strxfrm() wrapper, used across the board,
> including for callers outside of varlena.c? convert_string_datum() has
> been calling strxfrm() for many releases now. These calls are still
> used in FreeBSD builds, I would think, which seems like a bug that is
> not dodged by simply not defining TRUST_STRXFRM. Isn't its assumption
> that that matching the ordering used elsewhere not really hold on
> FreeBSD builds?

I was not aware of convert_string_datum, I will check that, thanks! Using a wrapper across the board seems like a good idea for refactoring.

>
> [1] https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue
> --
> Peter Geoghegan

From:	Palle Girgensohn <girgen(at)pingpong(dot)net>
To:	Peter Geoghegan <pg(at)heroku(dot)com>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Devrim Gündüz <devrim(at)gunduz(dot)org>, Jakob Egger <jakob(at)eggerapps(dot)at>, Tobias Bussmann <t(dot)bussmann(at)gmx(dot)net>
Subject:	Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-11 11:22:09
Message-ID:	789A2F56-0E42-409D-A840-6AF5110D6085@pingpong.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> 11 aug. 2016 kl. 11:15 skrev Palle Girgensohn <girgen(at)pingpong(dot)net>:
>
>>
>> 11 aug. 2016 kl. 03:05 skrev Peter Geoghegan <pg(at)heroku(dot)com>:
>>
>> On Wed, Aug 10, 2016 at 1:42 PM, Palle Girgensohn <girgen(at)pingpong(dot)net> wrote:
>>> They've been used for the FreeBSD ports since 2005, and have served us well. I have of course updated them regularly. In this latest version, I've removed support for other encodings beside UTF-8, mostly since I don't know how to test them, but also, I see little point in supporting anything else using ICU.
>>
>> Looks like you're not using the ICU equivalent of strxfrm(). While 9.5
>> is not the release that introduced its use, it did expand it
>> significantly. I think you need to fix this, even though it isn't
>> actually used to sort text at present, since presumably FreeBSD builds
>> of 9.5 don't TRUST_STRXFRM. Since you're using ICU, though, you could
>> reasonably trust the ICU equivalent of strxfrm(), so that's a missed
>> opportunity. (See the wiki page on the abbreviated keys issue [1] if
>> you don't know what I'm talking about.)
>
> My plan was to get it working without TRUST_STRXFRM first, and then add that functinality. I've made some preliminary tests using ICU:s ucol_getSortKey but I will have to test it a bit more. For now, I just expect not to trust strxfrm. It is the first iteration wrt strxfrm, the plan is to use that code base.

Here are some preliminary results running 10000 times comparing the same two strings in a tight loop.

ucol_strcollUTF8: -1 0.002448
strcoll: 1 0.060711
ucol_strcollIter: -1 0.009221
direct memcmp: 1 0.000457
memcpy memcmp: 1 0.001706
memcpy strcoll: 1 0.068425
nextSortKeyPart: -1 0.041011
ucnv_toUChars + getSortKey: -1 0.050379

correct answer is -1, but since we compare åasdf and äasdf with a Swedish locale, memcmp and strcoll fails of course, as espected. Direct memcmp is 5 times faster than ucol_strcollUTF8 (used in my patch), but sadly the best implementation using sort keys with ICU, nextSortKeyPart, is way slower.

startTime = getRealTime();
for ( int i = 0; i < loop; i++) {
result = ucol_strcollUTF8(collator, arg1, len1, arg2, len2, &status);
}
endTime = getRealTime();
printf("%30s: %d\t%lf\n", "ucol_strcollUTF8", result, endTime - startTime);

int sortkeysize=8;

startTime = getRealTime();
uint8_t key1[sortkeysize], key2[sortkeysize];
uint32_t sState[2], tState[2];
UCharIterator sIter, tIter;

for ( int i = 0; i < loop; i++) {
uiter_setUTF8(&sIter, arg1, len1);
uiter_setUTF8(&tIter, arg2, len2);
sState[0] = 0; sState[1] = 0;
tState[0] = 0; tState[1] = 0;
ucol_nextSortKeyPart(collator, &sIter, sState, key1, sortkeysize, &status);
ucol_nextSortKeyPart(collator, &tIter, tState, key2, sortkeysize, &status);
result = memcmp (key1, key2, sortkeysize);
}
endTime = getRealTime();
printf("%30s: %d\t%lf\n", "nextSortKeyPart", result, endTime - startTime);

But in your strxfrm code in PostgreSQL, the keys are cached, and represented as int64:s if I remember correctly, so perhaps there is still a benefit using the abbreviated keys? More testing is required, I guess...

Palle

From:	Peter Geoghegan <pg(at)heroku(dot)com>
To:	Palle Girgensohn <girgen(at)pingpong(dot)net>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Álvaro Hernández Tortosa <aht(at)8kdata(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Devrim Gündüz <devrim(at)gunduz(dot)org>, Jakob Egger <jakob(at)eggerapps(dot)at>, Tobias Bussmann <t(dot)bussmann(at)gmx(dot)net>
Subject:	Re: Improved ICU patch - WAS: Implementing full UTF-8 support (aka supporting 0x00)
Date:	2016-08-11 16:56:25
Message-ID:	CAM3SWZRVBUZ2pdcJ_Br7J+wzZv0m1m4sM=1uFgD+JWh8u8wgXw@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Aug 11, 2016 at 4:22 AM, Palle Girgensohn <girgen(at)pingpong(dot)net> wrote:
> But in your strxfrm code in PostgreSQL, the keys are cached, and represented
> as int64:s if I remember correctly, so perhaps there is still a benefit
> using the abbreviated keys? More testing is required, I guess...

ICU's ucol_nextSortKeyPart() interface is faster for this, I believe,
and works because you only need the first sizeof(Datum) bytes (4 or 8
bytes). So, you have the right idea about using it (at least for the
abbreviated key stuff), but the second last argument needs to be
sizeof(Datum).

The whole point of the abbreviated key optimization is that you can
avoid pointer chasing during each and every comparison. Your test here
is invalid because it doesn't involved the reuse of the keys. Often,
during a sort, each item has to be compared about 20 times.

I've experimented with ICU, and know it works well with this. You
really need to create a scenario with a real sort, and all the
conditions I describe.

--
Peter Geoghegan