Re: charset/collation in values

Lists: pgsql-hackers
From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: charset/collation in values
Date: 2004-11-01 06:41:09
Message-ID: Pine.LNX.4.44.0411010718390.2015-100000@zigo.dhs.org

I've looked into storing charset/collation in the string values. This
means that we change varchar/text/BpChar to be structures that have a
charset oid field and a collation oid field, the rest of the Datum is the
string data.
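
A minimal sketch of the layout described above, in Python rather than backend C, purely for illustration. The 4-byte width matches PostgreSQL's Oid type; the header order, the function names and the sample OID values are assumptions, not taken from any patch.

```python
import struct
from typing import Tuple

# Assumed header: charset OID, then collation OID, each an unsigned
# 32-bit integer (PostgreSQL OIDs are 4 bytes wide).
HEADER = struct.Struct("<II")


def pack_text(charset_oid: int, collation_oid: int, data: bytes) -> bytes:
    """Prefix the raw string bytes with the two OIDs."""
    return HEADER.pack(charset_oid, collation_oid) + data


def unpack_text(value: bytes) -> Tuple[int, int, bytes]:
    """Split a packed value back into (charset_oid, collation_oid, data)."""
    charset_oid, collation_oid = HEADER.unpack_from(value)
    return charset_oid, collation_oid, value[HEADER.size:]


# Made-up OIDs, for illustration only.
payload = "räksmörgås".encode("utf-8")
value = pack_text(16384, 16385, payload)
```

Every value carries 8 bytes of header in this form, regardless of how short the string is.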

Coercibility, I think, doesn't need to go in the Datum; it can be
stored in the Nodes. Charset/collation need to be in the Datum since we
pass that into functions as arguments.

Since we are changing what's stored in the Datum, and the normal code saves
that to disk, we will end up with charset/collation stored on disk for
each value. If we want to avoid storing charset/collation both in the
column type and in each row, we would need an extra layer that transforms
the Datums before they are stored. As a first implementation it's easier
to just store everything.
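
The "extra layer" could look roughly like the following sketch: strip the per-value header before writing, and rebuild it from the column definition when reading. Everything here is an assumption made for illustration: the 8-byte header, the function names, and the idea that the caller passes in the column's charset and collation.

```python
import struct

HEADER = struct.Struct("<II")  # assumed: charset OID, collation OID


def to_disk(value: bytes, column_charset: int, column_collation: int) -> bytes:
    """Drop the in-memory header; the column type already records it."""
    charset_oid, collation_oid = HEADER.unpack_from(value)
    if (charset_oid, collation_oid) != (column_charset, column_collation):
        raise ValueError("value does not match the column's charset/collation")
    return value[HEADER.size:]


def from_disk(stored: bytes, column_charset: int, column_collation: int) -> bytes:
    """Rebuild the in-memory form from the column's charset/collation."""
    return HEADER.pack(column_charset, column_collation) + stored


# Made-up OIDs, for illustration only.
in_memory = HEADER.pack(16384, 16385) + b"abc"
on_disk = to_disk(in_memory, 16384, 16385)
```

The cost of this approach is that the in-memory and on-disk forms differ, so every read and write goes through a conversion step.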

For each type we need to have conversion functions to and from strings.
Any suggestions on how to represent these as strings, now that each value
is a string plus two oids? This is a tough one.

I have more comments/questions later on, but these are enough for one
mail.

--
/Dennis Björklund


From: Thomas Hallgren <thhal(at)mailblocks(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: charset/collation in values
Date: 2004-11-01 13:11:14
Message-ID: 418635F2.8060706@mailblocks.com

Dennis Bjorklund wrote:
> I've looked into storing charset/collation in the string values. This
> means that we change varchar/text/BpChar to be structures that have a
> charset oid field and a collation oid field, the rest of the Datum is the
> string data.
>
I think the number of charset/collation combinations will be relatively
few so perhaps it would be space efficient to maintain a table where
each combination is given an oid and have string values store that
rather than two separate oid's?
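
A rough sketch of such a table, assuming an in-memory registry that hands out one small integer per pair. The class name, method names and the charset/collation names are all hypothetical; a real implementation would presumably be a system catalog.

```python
from typing import Dict, List, Tuple


class CombinationTable:
    """Hands out one small integer id per (charset, collation) pair."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[str, str], int] = {}
        self._pairs: List[Tuple[str, str]] = []

    def id_for(self, charset: str, collation: str) -> int:
        """Return the id for a pair, registering it on first use."""
        key = (charset, collation)
        if key not in self._ids:
            self._ids[key] = len(self._pairs)
            self._pairs.append(key)
        return self._ids[key]

    def lookup(self, combo_id: int) -> Tuple[str, str]:
        """Map an id back to its (charset, collation) pair."""
        return self._pairs[combo_id]


table = CombinationTable()
utf8_sv = table.id_for("UTF8", "sv_SE")      # hypothetical names
latin1_sv = table.id_for("LATIN1", "sv_SE")
```

A string value would then store only the single id, and the pair is recovered with `lookup` when needed.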

Regards,
Thomas Hallgren


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Thomas Hallgren <thhal(at)mailblocks(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: charset/collation in values
Date: 2004-11-01 15:41:30
Message-ID: 13961.1099323690@sss.pgh.pa.us

Thomas Hallgren <thhal(at)mailblocks(dot)com> writes:
> I think the number of charset/collation combinations will be relatively
> few so perhaps it would be space efficient to maintain a table where
> each combination is given an oid and have string values store that
> rather than two separate oid's?

In fact, we should do our best to get the overhead down to 1 or 2 bytes.
Two OIDs (8 bytes) is ridiculous.

I'm not sure if 1 byte is enough or not --- there might be more than 256
charsets/collations to support. 2 ought to be plenty though.
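
The arithmetic behind those numbers, as a small sketch. It assumes fixed-width unsigned integers with no padding; the variable names are only for illustration.

```python
import struct

two_oids = struct.calcsize("<II")    # two 4-byte OIDs
one_byte_id = struct.calcsize("<B")  # single-byte combination id
two_byte_id = struct.calcsize("<H")  # two-byte combination id

# Number of distinct charset/collation combinations each id width can name.
combos_one_byte = 2 ** (8 * one_byte_id)
combos_two_bytes = 2 ** (8 * two_byte_id)

print(two_oids, one_byte_id, two_byte_id)  # 8 1 2
print(combos_one_byte, combos_two_bytes)   # 256 65536
```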

regards, tom lane


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Hallgren <thhal(at)mailblocks(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: charset/collation in values
Date: 2004-11-01 16:08:21
Message-ID: Pine.LNX.4.44.0411011703340.2015-100000@zigo.dhs.org

On Mon, 1 Nov 2004, Tom Lane wrote:

> > I think the number of charset/collation combinations will be relatively
> > few so perhaps it would be space efficient to maintain a table where
> > each combination is given an oid and have string values store that
> > rather than two separate oid's?
>
> In fact, we should do our best to get the overhead down to 1 or 2 bytes.
> Two OIDs (8 bytes) is ridiculous.

Just to be clear, we don't want to store it on disk no matter what since
it should be enough to store it once for each column. As a first solution
we could store it just to keep it simple until we have tried it out.

--
/Dennis Björklund


From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: db(at)zigo(dot)dhs(dot)org
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, thhal(at)mailblocks(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: charset/collation in values
Date: 2004-11-01 22:26:31
Message-ID: 20041102.072631.62373551.t-ishii@sra.co.jp

> On Mon, 1 Nov 2004, Tom Lane wrote:
>
> > > I think the number of charset/collation combinations will be relatively
> > > few so perhaps it would be space efficient to maintain a table where
> > > each combination is given an oid and have string values store that
> > > rather than two separate oid's?
> >
> > In fact, we should do our best to get the overhead down to 1 or 2 bytes.
> > Two OIDs (8 bytes) is ridiculous.
>
> Just to be clear, we don't want to store it on disk no matter what since
> it should be enough to store it once for each column. As a first solution
> we could store it just to keep it simple until we have tried it out.

Right. AFAIK nobody has proposed putting charsets/collations on disk.
--
Tatsuo Ishii


From: Thomas Hallgren <thhal(at)mailblocks(dot)com>
To: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, pgsql-hackers(at)postgresql(dot)org
Subject: Re: charset/collation in values
Date: 2004-11-01 22:30:49
Message-ID: thhal-0sVFiAtfw3kAR4CjMcwz0nvW8hBQAde@mailblocks.com

Tatsuo Ishii wrote:
> Right. AFAIK nobody has proposed putting charsets/collations on disk.
> --
My apologies in that case. I was reacting to Dennis's wording: "If we want to
avoid storing charset/collation both in the column type and in each row,
we would need an extra layer that transforms the Datums before they are
stored. As a first implementation it's easier to just store everything."

Regards,
Thomas Hallgren


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Cc: db(at)zigo(dot)dhs(dot)org, thhal(at)mailblocks(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: charset/collation in values
Date: 2004-11-02 03:35:19
Message-ID: 29585.1099366519@sss.pgh.pa.us

Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp> writes:
> Right. AFAIK nobody has proposed putting charsets/collations on disk.

Oh?

Personally, I'd much sooner eat those few bytes than try to impose a
regime where in-memory representation is different from on-disk.

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: charset/collation in values
Date: 2004-11-02 09:17:50
Message-ID: 200411021017.50497.peter_e@gmx.net

On Monday, 1 November 2004 07:41, Dennis Bjorklund wrote:
> For each type we need to have conversion functions to and from strings.
> Any suggestions on how to represent these as strings, now that each value
> is a string plus two oids? This is a tough one.

A collation implies a character set, so you only need to store one piece of
information anyway.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: charset/collation in values
Date: 2004-11-02 12:15:45
Message-ID: Pine.LNX.4.44.0411021308080.2015-100000@zigo.dhs.org

On Tue, 2 Nov 2004, Peter Eisentraut wrote:

> A collation implies a character set, so you only need to store one piece of
> information anyway.

No, a collation implies a character repertoire such as UCS (Unicode); it can
apply to several character sets, such as UTF8 and UTF16.

One can enumerate all combinations if one wants to, as suggested
previously.

--
/Dennis Björklund


From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: peter_e(at)gmx(dot)net
Cc: db(at)zigo(dot)dhs(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: charset/collation in values
Date: 2004-11-02 12:53:54
Message-ID: 20041102.215354.45519473.t-ishii@sra.co.jp

> On Monday, 1 November 2004 07:41, Dennis Bjorklund wrote:
> > For each type we need to have conversion functions to and from strings.
> > Any suggestions on how to represent these as strings, now that each value
> > is a string plus two oids? This is a tough one.
>
> A collation implies a character set, so you only need to store one piece of
> information anyway.

In my understanding the relation between charset and collation is
1:N. Thus storing only a collation is sufficient to determine the
charset. However a charset cannot determine a collation.
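
The 1:N relation can be sketched as a plain mapping. The catalogue below is hypothetical; the collation names only follow the utf8_sv style used elsewhere in this thread.

```python
from collections import defaultdict

# Hypothetical catalogue: each collation names exactly one charset.
COLLATION_CHARSET = {
    "utf8_sv": "UTF8",
    "utf8_en": "UTF8",
    "latin1_sv": "LATIN1",
}

# Inverting the mapping shows the 1:N direction: one charset, many collations.
charset_collations = defaultdict(list)
for collation, charset in COLLATION_CHARSET.items():
    charset_collations[charset].append(collation)
```

Looking up a collation yields a single charset, while looking up a charset yields a list, so a stored charset alone cannot pick a collation.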
--
Tatsuo Ishii


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: charset/collation in values
Date: 2004-11-02 15:36:00
Message-ID: 200411021636.00552.peter_e@gmx.net

On Tuesday, 2 November 2004 13:15, Dennis Bjorklund wrote:
> On Tue, 2 Nov 2004, Peter Eisentraut wrote:
> > A collation implies a character set, so you only need to store one piece
> > of information anyway.
>
> No, a collation implies a character repertoire such as UCS (Unicode); it can
> apply to several character sets, such as UTF8 and UTF16.

For the theoretical specification of a collation, it might suffice to know the
character repertoire. But I think in practice, the implementation of a
collation will require knowing the specific character encoding.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Cc: db(at)zigo(dot)dhs(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: charset/collation in values
Date: 2004-11-02 15:36:38
Message-ID: 200411021636.38416.peter_e@gmx.net

On Tuesday, 2 November 2004 13:53, Tatsuo Ishii wrote:
> In my understanding the relation between charset and collation is
> 1:N. Thus storing only a collation is sufficient to determine the
> charset. However a charset cannot determine a collation.

Exactly.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: charset/collation in values
Date: 2004-11-02 16:32:55
Message-ID: Pine.LNX.4.44.0411021726190.2015-100000@zigo.dhs.org

On Tue, 2 Nov 2004, Peter Eisentraut wrote:

> For the theoretical specification of a collation, it might suffice to
> know the character repertoire. But I think in practice, the
> implementation of a collation will require knowing the specific
> character encoding.

The named entity that is called a collation works for a character
repertoire. It would need to handle different charsets for that repertoire
of course. So there would be one collation called say ucs_sv and not
utf8_sv, utf16_sv, utf32_sv.
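
A minimal sketch of a repertoire-level collation, assuming the collation decodes its input to code points before comparing. The function name is hypothetical, and plain code point order stands in for real Swedish collation rules.

```python
def ucs_codepoint_key(data: bytes, encoding: str) -> tuple:
    """Decode to Unicode code points, then build a sort key from those.

    A real Swedish collation would apply language-specific rules here;
    plain code point order is only a stand-in.
    """
    return tuple(ord(ch) for ch in data.decode(encoding))


words = ["zebra", "apa", "öl"]

# The same collation function sorts values stored in either encoding.
as_utf8 = sorted((w.encode("utf-8") for w in words),
                 key=lambda b: ucs_codepoint_key(b, "utf-8"))
as_utf16 = sorted((w.encode("utf-16-le") for w in words),
                  key=lambda b: ucs_codepoint_key(b, "utf-16-le"))
```

Both lists come out in the same order once decoded, which is the sense in which one collation can serve several charsets of the same repertoire.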

Anyway, this is not a problem.

--
/Dennis Björklund


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: charset/collation in values
Date: 2004-11-02 17:31:55
Message-ID: 200411021831.55936.peter_e@gmx.net

Dennis Bjorklund wrote:
> The named entity that is called a collation works for a character
> repertoire. It would need to handle different charsets for that
> repertoire of course. So there would be one collation called say
> ucs_sv and not utf8_sv, utf16_sv, utf32_sv.

Again, theoretically, this might work, but I doubt that this is a
practical implementation. Moreover, since Unicode is more or less the
only character repertoire that has more than one encoding in use,
and neither UTF-16 nor UTF-32 can be used inside the PostgreSQL server
(embedded zero bytes etc.), this is really a nonissue.
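
The zero-byte point is easy to check with a short sketch. The sample string is arbitrary, and the little-endian variants are used only to avoid a byte order mark in the output.

```python
text = "abc"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

for name, encoded in (("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)):
    # Code that treats strings as zero-terminated would stop at the
    # first zero byte, truncating the UTF-16 and UTF-32 forms.
    print(name, encoded, "contains zero byte:", b"\x00" in encoded)
```

For plain ASCII text, the UTF-8 form has no zero bytes, while the UTF-16 and UTF-32 forms contain one or more per character.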

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: peter_e(at)gmx(dot)net
Cc: db(at)zigo(dot)dhs(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: charset/collation in values
Date: 2004-11-03 06:10:17
Message-ID: 20041103.151017.45517473.t-ishii@sra.co.jp

> Dennis Bjorklund wrote:
> > The named entity that is called a collation works for a character
> > repertoire. It would need to handle different charsets for that
> > repertoire of course. So there would be one collation called say
> > ucs_sv and not utf8_sv, utf16_sv, utf32_sv.
>
> Again, theoretically, this might work, but I doubt that this is a
> practical implementation. Moreover, since Unicode is more or less the
> only character repertoire that has more than one encoding in use,
> and neither UTF-16 nor UTF-32 can be used inside the PostgreSQL server
> (embedded zero bytes etc.), this is really a nonissue.

I agree with Peter.
--
Tatsuo Ishii