Re: UTF8 national character data type support WIP patch and list of open issues.

From: Chapman Flack <chap(at)anastigmatix(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Boguk, Maksym" <maksymb(at)fast(dot)au(dot)fujitsu(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: UTF8 national character data type support WIP patch and list of open issues.
Date: 2023-08-27 16:56:18
Message-ID: df1325b316827117d1086cd0762a402b@anastigmatix.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Although this is a ten-year-old message, it was the one I found quickly
when looking to see what the current state of play on this might be.

On 2013-09-20 14:22, Robert Haas wrote:
> Hmm. So under that design, a database could support up to a total of
> two character sets, the one that you get when you say 'foo' and the
> other one that you get when you say n'foo'.
>
> I guess we could do that, but it seems a bit limited. If we're going
> to go to the trouble of supporting multiple character sets, why not
> support an arbitrary number instead of just two?

Because that old thread came to an end without mentioning how the
standard approaches that, it seemed worth adding, just to complete the
record.

In the draft of the standard I'm looking at (which is also around a
decade old), n'foo' is nothing but a handy shorthand for _csname'foo'
(which is a syntax we do not accept) for some particular csname that
was chosen when setting up the db.

So really, the standard contemplates letting you have columns of
arbitrary different charsets (CHAR(x) CHARACTER SET csname), and
literals of arbitrary charsets _csname'foo'. Then, as a bit of
sugar, you get to pick which two of those charsets you'd like
to have easy shorter ways of writing, 'foo' or n'foo',
CHAR or NCHAR.

The grammar for csname is kind of funky. It can be nothing but
<SQL language identifier>, which has the nice restricted form
/[A-Za-z][A-Za-z0-9_]*/. But it can also be schema-qualified,
with the schema of course being a full-fledged <identifier>.

So yeah, to fully meet this part of the standard, the parser'd
have to know that
_U&"I am a schema nameZ0021" UESCAPE 'Z'/*hi!*/.LATIN1'foo'
is a string literal, expressing foo, in a character set named
LATIN1, in some cutely-named schema.

Never a dull moment.

Regards,
-Chap

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Yugo NAGATA 2023-08-27 17:49:08 Re: Incremental View Maintenance, take 2
Previous Message Yugo NAGATA 2023-08-27 16:12:41 Re: Incremental View Maintenance, take 2