Re: Server-side support of all encodings

Lists: pgsql-hackers
From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Server-side support of all encodings
Date: 2007-03-26 01:39:17
Message-ID: 20070326102148.6502.ITAGAKI.TAKAHIRO@oss.ntt.co.jp

Hello,

PostgreSQL supports SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
but we cannot use them as server encodings. Is there any reason for it?
AFAICS, we could support them just by adding a pg_xxx2wchar_with_len()
function for each.

I'd like to add server-side SJIS support for the Japanese edition of Windows.
Its native encoding is SJIS, so the C library expects to be passed SJIS
characters if we set locale='Japanese'. However, we only support EUC_JP and
UTF-8 as valid Japanese encodings.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Server-side support of all encodings
Date: 2007-03-26 02:20:27
Message-ID: 2099.1174875627@sss.pgh.pa.us

ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> PostgreSQL supports SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
> but we cannot use them as server encodings. Is there any reason for it?

Very much so --- they aren't safe ASCII-supersets, and thus for example
the parser will fail on them. Backend encodings must have the property
that all bytes of a multibyte character are >= 128.
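The rule can be illustrated with a small check. This is only a sketch: it uses Python's bundled codecs as a stand-in for the backend's conversion tables, and the helper names `is_ascii_safe` and `unsafe_chars` are made up for the illustration.

```python
def is_ascii_safe(char_bytes: bytes) -> bool:
    """True if one encoded character cannot be mistaken for ASCII.

    A single-byte character is left alone. A multibyte character passes
    only if every one of its bytes is >= 0x80 (128).
    """
    if len(char_bytes) == 1:
        return True
    return all(b >= 0x80 for b in char_bytes)


def unsafe_chars(text: str, encoding: str) -> list:
    """Return (character, hex bytes) pairs that break the rule."""
    result = []
    for ch in text:
        encoded = ch.encode(encoding)
        if not is_ascii_safe(encoded):
            result.append((ch, encoded.hex()))
    return result


if __name__ == "__main__":
    sample = "表示"
    # In Shift_JIS the second byte of "表" is 0x5c, the ASCII backslash.
    for enc in ("shift_jis", "euc_jp", "utf-8"):
        print(enc, unsafe_chars(sample, enc))
```

With this sample, only the Shift_JIS encoding should report an offending character; the EUC_JP and UTF-8 forms keep every byte of a multibyte character at 128 or above.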

regards, tom lane


From: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Server-side support of all encodings
Date: 2007-03-26 02:29:36
Message-ID: 20070326112248.6507.ITAGAKI.TAKAHIRO@oss.ntt.co.jp


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> > PostgreSQL supports SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
> > but we cannot use them as server encodings. Is there any reason for it?
>
> Very much so --- they aren't safe ASCII-supersets, and thus for example
> the parser will fail on them. Backend encodings must have the property
> that all bytes of a multibyte character are >= 128.

But then, PG_JOHAB has already infringed it. Please see johab_to_utf8.map.
Trailing bytes of JOHAB can be less than 128.
It's true that other server-supported encodings use only bytes >= 128.
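A rough way to see the difference, assuming Python's bundled `johab` and `euc_kr` codecs agree with the PostgreSQL map files for this character (the helper name `lowest_multibyte_byte` is invented for this sketch):

```python
from typing import Optional


def lowest_multibyte_byte(text: str, encoding: str) -> Optional[int]:
    """Smallest byte value found inside any multibyte character of `text`.

    Returns None if `text` has no multibyte characters in `encoding`.
    """
    lowest = None
    for ch in text:
        encoded = ch.encode(encoding)
        if len(encoded) < 2:
            continue  # single-byte characters are not the concern here
        smallest = min(encoded)
        if lowest is None or smallest < lowest:
            lowest = smallest
    return lowest


if __name__ == "__main__":
    sample = "가"  # U+AC00, the first Hangul syllable
    for enc in ("johab", "euc_kr", "utf-8"):
        value = lowest_multibyte_byte(sample, enc)
        verdict = "below 128" if value < 0x80 else "128 or above"
        print(f"{enc}: lowest byte 0x{value:02x} ({verdict})")
```

A result below 128 means the encoding places an ASCII-range byte inside a multibyte character, which is the property under discussion.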

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
Subject: Re: Server-side support of all encodings
Date: 2007-03-26 02:49:01
Message-ID: 2428.1174877341@sss.pgh.pa.us

ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Backend encodings must have the property
>> that all bytes of a multibyte character are >= 128.

> But then, PG_JOHAB has already infringed it. Please see johab_to_utf8.map.
> Trailing bytes of JOHAB can be less than 128.

In that case we must remove JOHAB from the list of allowed server
encodings. Tatsuo, can you comment on whether this is correct?

regards, tom lane


From: "Ioseph Kim" <pgsql-kr(at)postgresql(dot)or(dot)kr>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Server-side support of all encodings
Date: 2007-03-26 03:30:51
Message-ID: 001301c76f57$272d0c10$1e00a8c0@IDC

In Korea, Johab is a very old encoding.
By the way, the cp949 code page is what is actually used in most environments.

Personally speaking, a Johab server code set is not needed.
I think PostgreSQL should support UHC (cp949) as a server code set.
This feature would please many Koreans. :)
Unfortunately, the UHC code set has character sequences containing bytes
less than 128.

I tried to patch this problem, but it is not simple. I gave up. :(

----- Original Message -----
From: "ITAGAKI Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Sent: Monday, March 26, 2007 11:29 AM
Subject: Re: [HACKERS] Server-side support of all encodings

>
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> > PostgreSQL supports SJIS, BIG5, GBK, UHC and GB18030 as client
>> > encodings,
>> > but we cannot use them as server encodings. Is there any reason for
>> > it?
>>
>> Very much so --- they aren't safe ASCII-supersets, and thus for example
>> the parser will fail on them. Backend encodings must have the property
>> that all bytes of a multibyte character are >= 128.
>
> But then, PG_JOHAB has already infringed it. Please see
> johab_to_utf8.map.
> Trailing bytes of JOHAB can be less than 128.
> It's true that other server-supported encodings use only bytes >=
> 128.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center


From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Server-side support of all encodings
Date: 2007-03-26 09:42:22
Message-ID: 20070326.184222.112627760.t-ishii@sraoss.co.jp

> ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Backend encodings must have the property
> >> that all bytes of a multibyte character are >= 128.
>
> > But then, PG_JOHAB has already infringed it. Please see johab_to_utf8.map.
> > Trailing bytes of JOHAB can be less than 128.
>
> In that case we must remove JOHAB from the list of allowed server
> encodings. Tatsuo, can you comment on whether this is correct?

Sigh. From the first day JOHAB was supported (back in the 7.3 days),
it should not have been among the server encodings. JOHAB's second byte
can definitely take values of 0x41 and above. *johab*.map just reflects that
fact. I think we should remove JOHAB from the server encodings list.
I'm afraid users who have JOHAB-encoded databases will get angry, though.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Server-side support of all encodings
Date: 2007-03-26 10:34:45
Message-ID: 20070326.193445.18308031.t-ishii@sraoss.co.jp

> ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Backend encodings must have the property
> >> that all bytes of a multibyte character are >= 128.
>
> > But then, PG_JOHAB has already infringed it. Please see johab_to_utf8.map.
> > Trailing bytes of JOHAB can be less than 128.
>
> In that case we must remove JOHAB from the list of allowed server
> encodings. Tatsuo, can you comment on whether this is correct?

Sigh. From the first day JOHAB was supported (back in the 7.3 days),
it should not have been among the server encodings. JOHAB's second byte
can definitely take values of 0x41 and above. *johab*.map just reflects that
fact. I think we should remove JOHAB from the server encodings list.
I'm afraid users who have JOHAB-encoded databases will get angry, though.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
Cc: itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Server-side support of all encodings
Date: 2007-03-26 13:52:23
Message-ID: 15585.1174917143@sss.pgh.pa.us

Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp> writes:
> Sigh. From the first day JOHAB was supported (back in the 7.3 days),
> it should not have been among the server encodings. JOHAB's second byte
> can definitely take values of 0x41 and above. *johab*.map just reflects that
> fact. I think we should remove JOHAB from the server encodings list.
> I'm afraid users who have JOHAB-encoded databases will get angry, though.

I think the best way to proceed is probably to fix this in HEAD but
not back-patch it. During a dump and reload the encoding can be
corrected to something safe.

regards, tom lane


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: ishii(at)sraoss(dot)co(dot)jp, itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Server-side support of all encodings
Date: 2007-04-15 02:15:34
Message-ID: 20070415.111534.84360945.t-ishii@sraoss.co.jp

> Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp> writes:
> > Sigh. From the first day JOHAB was supported (back in the 7.3 days),
> > it should not have been among the server encodings. JOHAB's second byte
> > can definitely take values of 0x41 and above. *johab*.map just reflects that
> > fact. I think we should remove JOHAB from the server encodings list.
> > I'm afraid users who have JOHAB-encoded databases will get angry, though.
>
> I think the best way to proceed is probably to fix this in HEAD but
> not back-patch it. During a dump and reload the encoding can be
> corrected to something safe.

Ok. Shall I go ahead and remove JOHAB in HEAD?
--
Tatsuo Ishii
SRA OSS, Inc. Japan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: ishii(at)sraoss(dot)co(dot)jp, itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Server-side support of all encodings
Date: 2007-04-15 05:01:16
Message-ID: 26022.1176613276@sss.pgh.pa.us

Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
>> I think the best way to proceed is probably to fix this in HEAD but
>> not back-patch it. During a dump and reload the encoding can be
>> corrected to something safe.

> Ok. Shall I go ahead and remove JOHAB in HEAD?

+1 for me.

regards, tom lane


From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Server-side support of all encodings
Date: 2007-04-15 11:09:42
Message-ID: 20070415.200942.119856171.t-ishii@sraoss.co.jp

> Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
> >> I think the best way to proceed is probably to fix this in HEAD but
> >> not back-patch it. During a dump and reload the encoding can be
> >> corrected to something safe.
>
> > Ok. Shall I go ahead and remove JOHAB in HEAD?
>
> +1 for me.
>
> regards, tom lane

Done.

BTW, do we have to modify pg_dump or pg_restore so that it can
automatically adjust JOHAB to UTF8 (it's the only safe encoding
compatible with JOHAB)? I'm not sure it's worth the trouble. Maybe
documenting it in the release notes is enough? I guess that there are 0
users who are using JOHAB.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc: itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Server-side support of all encodings
Date: 2007-04-15 15:14:48
Message-ID: 29920.1176650088@sss.pgh.pa.us

Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
> BTW, do we have to modify pg_dump or pg_restore so that it can
> automatically adjust JOHAB to UTF8 (it's the only safe encoding
> compatible with JOHAB)? I'm not sure it's worth the trouble. Maybe
> documenting it in the release notes is enough?

Do we actually need to do anything? Dumps taken in client_encoding
JOHAB could exist regardless of the source server_encoding --- the
same is true of other client-only encodings. Such dumps should load
fine into a UTF8 server_encoding database, as long as we have the right
conversion available.

I can imagine someone wanting to take a dump in a client-only encoding
for other reasons (export of the data to somewhere else, say) so I don't
think pg_dump should try to prevent it.
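The conversion step can be pictured with a minimal sketch. It uses Python's codecs as a stand-in for the server's own conversion between client_encoding and server_encoding; the function name `transcode` and the table `t` are hypothetical.

```python
def transcode(data: bytes, source_encoding: str,
              target_encoding: str = "utf-8") -> bytes:
    """Re-encode dump text from a client-only encoding to a safe one.

    Raises UnicodeDecodeError or UnicodeEncodeError if no conversion
    exists for some character, which mirrors the "as long as we have
    the right conversion available" caveat.
    """
    return data.decode(source_encoding).encode(target_encoding)


if __name__ == "__main__":
    statement = "INSERT INTO t VALUES ('가');"
    johab_dump = statement.encode("johab")  # bytes as written in JOHAB
    utf8_form = transcode(johab_dump, "johab")
    # The text survives unchanged; only the byte representation differs.
    print(utf8_form.decode("utf-8") == statement)
```

The text content is the same on both sides; only the stored bytes change, which is why the source database's server_encoding does not matter for such a dump.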

regards, tom lane


From: "William ZHANG" <zedware(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Server-side support of all encodings
Date: 2007-06-25 01:56:28
Message-ID: f5n7ba$nmr$1@news.hub.org

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
>> PostgreSQL supports SJIS, BIG5, GBK, UHC and GB18030 as client encodings,
>> but we cannot use them as server encodings. Is there any reason for it?
>
> Very much so --- they aren't safe ASCII-supersets, and thus for example
> the parser will fail on them. Backend encodings must have the property
> that all bytes of a multibyte character are >= 128.

Sorry. I still cannot understand why backend encodings must have this
property. AFAIK, the parser treats characters as ASCII, so any multibyte
character will be treated as two or more ASCII characters. But if
the multibyte encoding does not use any special ASCII characters like
single quote ('), double quote (") and backslash (\), I think the parser
can deal with it correctly. A quick search in
src\backend\utils\mb\Unicode\*.map tells me that no encoding uses
single quote or double quote, but JOHAB, GBK, GB18030, BIG5 and SJIS
use backslash. Since pgsql no longer accepts backslash as an escape character
in identifiers (double-quoted strings) or values (single-quoted strings),
I think the parser/scanner can process multibyte characters
correctly.
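A search of this kind can be approximated without the map files. The sketch below probes Python's bundled codecs rather than PostgreSQL's *.map tables, so its results are only indicative; `trail_byte_in_use` is a name invented for the illustration, and it only looks at two-byte sequences.

```python
def trail_byte_in_use(encoding: str, trail: int) -> bool:
    """True if some two-byte sequence ending in `trail` is one character."""
    for lead in range(0x81, 0xFF):
        try:
            decoded = bytes([lead, trail]).decode(encoding)
        except UnicodeDecodeError:
            continue  # not a valid sequence in this encoding
        # Exactly one character means both bytes belong to the same
        # multibyte character, so `trail` really is a trailing byte.
        if len(decoded) == 1:
            return True
    return False


if __name__ == "__main__":
    specials = {"single quote": 0x27, "double quote": 0x22, "backslash": 0x5C}
    for enc in ("shift_jis", "big5", "gbk"):
        used = [name for name, byte in specials.items()
                if trail_byte_in_use(enc, byte)]
        print(enc, "uses as trailing byte:", used or "none")
```

Extending the `specials` table to other ASCII punctuation shows how many more bytes than the three quoting characters can turn up in trailing position.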

Thanks in advance.
William ZHANG



From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "William ZHANG" <zedware(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Server-side support of all encodings
Date: 2007-06-25 04:10:03
Message-ID: 12407.1182744603@sss.pgh.pa.us

"William ZHANG" <zedware(at)gmail(dot)com> writes:
> Sorry. I still cannot understand why backend encodings must have this
> property. AFAIK, the parser treats characters as ASCII, so any multibyte
> character will be treated as two or more ASCII characters. But if
> the multibyte encoding does not use any special ASCII characters like
> single quote ('), double quote (") and backslash (\), I think the parser
> can deal with it correctly.

You've got your attention too narrowly focused on strings inside quotes;
it's strings outside quotes that are the problem.

As an example, I see that gb18030 defines characters like 97 7e.
If someone tried to use that as a character of a SQL identifier
--- something that'd work fine for the UTF8 equivalent e6 a2 a1
--- the parser would see it as an identifier byte followed by
the operator ~.
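To make the failure concrete, here is a toy byte-oriented scanner. It is a deliberate simplification and not the backend's actual lexer: it only assumes that bytes >= 0x80 and ASCII letters, digits and underscore count as identifier bytes, and that a handful of ASCII punctuation bytes are operators. The names `toy_scan`, `is_ident_byte` and the column name `col` are made up.

```python
OPERATOR_BYTES = b"+-*/<>=~!@#%^&|`?"


def is_ident_byte(b: int) -> bool:
    """High-bit bytes, ASCII letters, digits and '_' count as identifier bytes."""
    return b >= 0x80 or b == ord("_") or chr(b).isalnum()


def toy_scan(sql: bytes) -> list:
    """Split raw bytes into ("ident", bytes) and ("op", bytes) tokens."""
    tokens = []
    i = 0
    while i < len(sql):
        if is_ident_byte(sql[i]):
            j = i
            while j < len(sql) and is_ident_byte(sql[j]):
                j += 1
            tokens.append(("ident", sql[i:j]))
            i = j
        elif sql[i] in OPERATOR_BYTES:
            tokens.append(("op", sql[i:i + 1]))
            i += 1
        else:
            i += 1  # whitespace and anything else is skipped in this toy
    return tokens


if __name__ == "__main__":
    gb18030_name = b"col\x97\x7e"   # identifier ending in the bytes 97 7e
    utf8_name = b"col\xe6\xa2\xa1"  # the same name using bytes e6 a2 a1
    print(toy_scan(gb18030_name))  # [('ident', b'col\x97'), ('op', b'~')]
    print(toy_scan(utf8_name))     # [('ident', b'col\xe6\xa2\xa1')]
```

The scanner never looks at character boundaries, so the 0x7e trailing byte is indistinguishable from a real ~ operator, while the UTF8 form stays a single identifier.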

Similarly, there are problems if we were to allow these character sets
for the pattern argument of a regular expression operator, or for any
datatype at all that can be embedded in an array constant. And for PL
languages that feed prosrc strings into external interpreters, such as
Perl or R, it gets really interesting really quickly :-(.

It is possible that some of these encodings could be allowed without
any risks, but I don't think it is worth our time to grovel through
each valid character and every possible backend situation to determine
safety. The risks are not always obvious --- see for instance the
security holes we fixed about a year ago in 8.1.4 et al --- and so
I for one would never have a lot of faith in there not being any holes.
The rule "no ASCII-aliasing characters" is a simple one that we can have
some confidence in.

regards, tom lane