Re: invalidly encoded strings

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: andrew(at)dunslane(dot)net, laurenz(dot)albe(at)wien(dot)gv(dot)at, pgsql-hackers(at)postgresql(dot)org
Subject: Re: invalidly encoded strings
Date: 2007-09-10 15:30:51
Message-ID: 20070911.003051.41631033.t-ishii@sraoss.co.jp
Lists: pgsql-hackers pgsql-patches

> Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> > The reason we are prepared to make an exception for Unicode is precisely
> > because the code point maps to an encoding pattern independently of
> > architecture, ISTM.
>
> Right --- there is a well-defined standard for the numerical value of
> each character in Unicode. And it's also clear what to do in
> single-byte encodings. It's not at all clear what the representation
> ought to be for other multibyte encodings. A direct transliteration
> of the byte sequence not only has endianness issues, but will have
> a weird non-dense set of valid values because of the restrictions on
> valid multibyte characters.
>
> Given that chr() has never before behaved sanely for multibyte values at
> all, extending it to Unicode code points is a reasonable extension,
> and throwing error for other encodings is reasonable too. If we ever do
> come across code-point standards for other encodings we can adopt 'em at
> that time.

I don't understand the whole discussion.

Why do you think that using the Unicode code point as the chr()
argument would avoid endianness issues? Are you going to represent
the Unicode code point as UCS-4? Then you have to specify the
endianness anyway. (See the UCS-4 standard for more details.)
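
To illustrate the UCS-4 point, here is a minimal sketch in Python (not
part of the original discussion; Python's "utf-32-be"/"utf-32-le"
codecs stand in for UCS-4 here). The integer code point itself has no
byte order, but as soon as it is serialized as a 4-byte unit a byte
order has to be chosen:

```python
# U+0259 (LATIN SMALL LETTER SCHWA) as an abstract integer code point.
code_point = 0x0259
ch = chr(code_point)

# Serialized as a 4-byte unit, the byte order must be chosen explicitly.
ucs4_be = ch.encode("utf-32-be")  # big-endian:    00 00 02 59
ucs4_le = ch.encode("utf-32-le")  # little-endian: 59 02 00 00

# UTF-8 is defined as a byte sequence, so no byte order is involved.
utf8 = ch.encode("utf-8")         # c9 99

print(ucs4_be.hex(), ucs4_le.hex(), utf8.hex())
```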

Or are you going to represent the Unicode code point as a character
string such as 'U+0259'? Then representing a code point of any encoding
as a string would avoid endianness issues just as well, and I don't see
how a Unicode code point is any better than the others.

Also, I'd like to point out that, as far as I know, every encoding has
its own code point system. For example, EUC-JP has corresponding
code point systems: ASCII, JIS X 0208 and JIS X 0212. So I don't see
why we can't use a "code point" as chr()'s argument for other encodings
(of course we would need an optional parameter specifying which
character set is assumed).
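
As a rough sketch of what such a parameter could look like (Python,
purely illustrative; the function name chr_euc_jp and the charset
names are hypothetical and not a proposal for the actual chr()
signature), a code point in one of the three character sets maps to
EUC-JP bytes as follows:

```python
def chr_euc_jp(code: int, charset: str = "ascii") -> bytes:
    """Hypothetical helper: map (charset, code point) to EUC-JP bytes.

    For the JIS sets, `code` is the two-byte JIS code, e.g. 0x3021.
    """
    if charset not in ("ascii", "jisx0208", "jisx0212"):
        raise ValueError(f"unknown character set: {charset}")

    if charset == "ascii":
        if not 0 <= code <= 0x7F:
            raise ValueError("ASCII code point out of range")
        return bytes([code])

    # Both JIS sets are 94x94 grids; each byte lies in 0x21..0x7E.
    hi, lo = code >> 8, code & 0xFF
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError("JIS code point out of range")

    if charset == "jisx0208":
        # EUC-JP sets the high bit of both bytes.
        return bytes([hi | 0x80, lo | 0x80])
    # JIS X 0212: same two bytes, prefixed with single shift SS3 (0x8F).
    return bytes([0x8F, hi | 0x80, lo | 0x80])


print(chr_euc_jp(0x41).hex())                  # ASCII 'A'
print(chr_euc_jp(0x3021, "jisx0208").hex())    # JIS X 0208 row 16, cell 1
print(chr_euc_jp(0x3021, "jisx0212").hex())    # same code in JIS X 0212
```

The sketch only checks the 94x94 range; it does not check whether a
given cell is actually assigned in the character set.
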
--
Tatsuo Ishii
SRA OSS, Inc. Japan
