Re: UTF8 national character data type support WIP patch and list of open issues.

From: Valentine Gogichashvili <valgog(at)gmail(dot)com>
To: MauMau <maumau307(at)gmail(dot)com>
Cc: ishii(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Boguk, Maksym" <maksymb(at)fast(dot)au(dot)fujitsu(dot)com>, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: UTF8 national character data type support WIP patch and list of open issues.
Date: 2013-09-22 21:55:40
Message-ID: CAP93muVF=baHDtRs1JBPS3A85j6jRTjEUDMBjr=Voa-xFym4qg@mail.gmail.com
Lists: pgsql-hackers

>> PostgreSQL has very powerful possibilities for storing any kind of
>> encoding. So maybe it makes sense to add the ENCODING as another column
>> property, the same way a COLLATION was added?
>
> Some other people in this community suggested that. And the SQL standard
> suggests the same -- specifying a character encoding for each column:
> CHAR(n) CHARACTER SET ch.
>
>> Text operations should work automatically, as in memory all strings will
>> be converted to the database encoding.
>>
>> This approach will also open a possibility to implement custom ENCODINGs
>> for the column data storage, like snappy compression or even BSON, gobs or
>> protobufs for much more compact type storage.
>
> Thanks for your idea that sounds interesting, although I don't understand
> that well.

The idea is very simple:

CREATE DATABASE utf8_database ENCODING 'utf8';

\c utf8_database

CREATE TABLE a(
    id serial,
    -- will use ascii_to_utf8 to read and utf8_to_ascii to write
    ascii_data text ENCODING 'ascii',
    -- will use koi8_r_to_utf8 to read and utf8_to_koi8_r to write
    koi8_data text ENCODING 'koi8_r',
    -- will use bson_to_json to read and json_to_bson to write
    json_data json ENCODING 'bson'
);
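
To make the read/write conversion concrete, here is a minimal sketch in Python rather than in the backend. The names write_column and read_column are hypothetical, and the ENCODING column property itself is only a proposal, not existing PostgreSQL syntax. The sketch assumes the database encoding is UTF-8 and uses Python's standard codecs in place of the server's conversion functions.

```python
# Sketch only: simulates the proposed per-column ENCODING with Python codecs.
# write_column/read_column are hypothetical names, not PostgreSQL functions.
DATABASE_ENCODING = "utf-8"


def write_column(value: bytes, column_encoding: str) -> bytes:
    """Convert a value from the database encoding to the column's storage encoding."""
    return value.decode(DATABASE_ENCODING).encode(column_encoding)


def read_column(stored: bytes, column_encoding: str) -> bytes:
    """Convert a stored value back to the database encoding."""
    return stored.decode(column_encoding).encode(DATABASE_ENCODING)


# In memory the string is UTF-8; on disk it is KOI8-R.
in_memory = "привет".encode(DATABASE_ENCODING)
stored = write_column(in_memory, "koi8_r")
restored = read_column(stored, "koi8_r")
```

For Cyrillic text the KOI8-R form takes one byte per character where UTF-8 takes two. A value that the column encoding cannot represent (for example non-ASCII text written to the 'ascii' column) makes the write-side conversion fail, so the write path would have to report an error in that case.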

The problem with bson_to_json here is that it will probably not be possible
to write JSON in koi8_r, for example. But for now that case is not even being
considered in these discussions.

If the ENCODING machinery received not only the encoding name but also
the type OID, it should be possible to write encoders for TYPEs and arrays
of TYPEs (I had to do this using casts to bytea and protobuf to minimize
the storage size for an array of types when writing a lot of data, which
could be unpacked afterwards directly in the DB as normal database types).
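
A rough sketch of what an encoder lookup keyed by encoding name and type OID could look like, again in Python. Everything here is hypothetical: the registry, the 'packed' encoding name and the function names are made up for illustration, and a fixed-width struct layout stands in for protobuf. The only real value is 1007, the pg_type OID of int4[].

```python
import struct

INT4_ARRAY_OID = 1007  # OID of int4[] in pg_type


def pack_int4_array(values):
    """Pack a list of 32-bit integers as a little-endian count followed by the values."""
    return struct.pack(f"<I{len(values)}i", len(values), *values)


def unpack_int4_array(data):
    """Inverse of pack_int4_array: read the count, then that many 32-bit integers."""
    (count,) = struct.unpack_from("<I", data, 0)
    return list(struct.unpack_from(f"<{count}i", data, 4))


# Hypothetical registry: (encoding name, type OID) -> (encoder, decoder).
ENCODERS = {
    ("packed", INT4_ARRAY_OID): (pack_int4_array, unpack_int4_array),
}


def lookup_encoder(encoding_name, type_oid):
    """Return the (encoder, decoder) pair for this encoding and type."""
    return ENCODERS[(encoding_name, type_oid)]


encode, decode = lookup_encoder("packed", INT4_ARRAY_OID)
packed = encode([1, -2, 300000])
unpacked = decode(packed)
```

Keying the lookup on the type OID as well as the encoding name is what lets the same encoding name map to different packers for different types.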

I hope I made my point a little bit clearer.

Regards,

Valentine Gogichashvili
