Re: UTF8 national character data type support WIP patch and list of open issues.

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: maumau307(at)gmail(dot)com, laurenz(dot)albe(at)wien(dot)gv(dot)at, robertmhaas(at)gmail(dot)com, peter_e(at)gmx(dot)net, arul(at)fast(dot)au(dot)fujitsu(dot)com, stark(at)mit(dot)edu, ishii(at)postgresql(dot)org, Maksym(dot)Boguk(at)au(dot)fujitsu(dot)com, hlinnakangas(at)vmware(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: UTF8 national character data type support WIP patch and list of open issues.
Date: 2013-11-12 06:57:52
Message-ID: 20131112.155752.666523035722474275.t-ishii@sraoss.co.jp
Lists: pgsql-hackers

> I'd be much more impressed by seeing a road map for how we get to a
> useful amount of added functionality --- which, to my mind, would be
> the ability to support N different encodings in one database, for N>2.
> But even if you think N=2 is sufficient, we haven't got a road map, and
> commandeering spec-mandated syntax for an inadequate feature doesn't seem
> like a good first step. It'll just make our backwards-compatibility
> problems even worse when somebody does come up with a real solution.

I have been thinking about this for years, and I think the key idea is
to implement a "universal encoding". The universal encoding should have
the following characteristics to support N>2 encodings in a database.

1) Round-trip conversion to and from the universal encoding is lossless.

2) No mapping table is needed to convert to/from existing encodings.

Once we implement the universal encoding, other problems, such as the
"multiple encodings in pg_database" problem, can be solved easily.

Since no such universal encoding currently exists, I think the only way
is to invent one ourselves.

At this point, the design of the encoding I have in mind is:

1) A 1-byte encoding identifier + a 7-byte body (8 bytes in total). The
   encoding identifier's value is between 0x80 and 0xff, and each value
   is assigned to an existing encoding such as UTF-8, ASCII, EUC-JP and
   so on. The encodings should be limited to "database safe"
   encodings. The body holds the raw character bytes as represented in
   the existing encoding. This form is called a "word".

2) We also have a "multibyte" representation of the universal
   encoding. The first byte gives the length of the multibyte character
   (similar to the first byte of UTF-8). The second byte is the encoding
   identifier explained above. The remaining bytes hold the character
   itself, as above.

Forms #1 and #2 are logically the same and can be converted to each
other, so we can use whichever is more convenient.

Form #1 is easy to handle because each word has a fixed length (8
bytes), so it would probably be used for temporary data in memory. Form
#2 saves space and would be used for the stored data itself.
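
To make this concrete, here is a rough C sketch of the two forms (the
struct, the function name, and the assumption that the length byte
counts the two header bytes are all mine, just for illustration):

#include <assert.h>
#include <string.h>

/* Form #1: fixed-length 8-byte "word".
 *   byte 0    : encoding identifier, 0x80 - 0xff (UTF-8, EUC-JP, ...)
 *   bytes 1-7 : raw character bytes in that encoding, zero padded */
typedef struct UniversalWord
{
    unsigned char encoding_id;
    unsigned char body[7];
} UniversalWord;

/* Form #2: variable-length "multibyte" representation.
 *   byte 0    : total length of the character, header bytes included
 *   byte 1    : encoding identifier, same values as above
 *   bytes 2.. : raw character bytes in the source encoding
 * Converting form #2 to form #1 is trivial: */
static UniversalWord
mb_to_word(const unsigned char *mb)
{
    UniversalWord w = {0};
    int           len = mb[0];          /* total length, header included */

    assert(len >= 2 && len <= 2 + 7);   /* the body must fit in 7 bytes */
    w.encoding_id = mb[1];
    memcpy(w.body, mb + 2, len - 2);
    return w;
}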

If we want a table encoded in an encoding different from the database
encoding, that table is stored in the universal encoding. pg_class
should record this fact to avoid confusion about which encoding a table
is using. I think the majority of tables in a database use the same
encoding as the database encoding, and only a few tables want a
different one; the design pushes the penalty onto that minority.

If we need to join two tables that have different encodings, we need to
convert them to the same encoding (this should succeed if the encodings
are "compatible"). If the conversion fails, the join fails too.
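
For example, the decision at join time could be something like this
(the encoding identifiers and the "ASCII is compatible with everything"
rule below are only hypothetical examples, not a real implementation):

/* Hypothetical encoding identifiers in the 0x80 - 0xff range. */
enum
{
    UENC_UTF8  = 0x80,
    UENC_ASCII = 0x81,
    UENC_EUCJP = 0x82
};

/* Return the encoding both sides can be converted to, or -1 if the
 * encodings are not "compatible" and the join must fail. */
static int
common_encoding(int enc_a, int enc_b)
{
    if (enc_a == enc_b)
        return enc_a;
    /* ASCII is a subset of the other encodings, so it can be promoted. */
    if (enc_a == UENC_ASCII)
        return enc_b;
    if (enc_b == UENC_ASCII)
        return enc_a;
    return -1;                  /* incompatible: the join fails too */
}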

We could extend this technique to a design that allows each column to
have a different encoding.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
