Re: UTF8 national character data type support WIP patch and list of open issues.

From: Tatsuo Ishii <ishii(at)postgresql(dot)org>
To: robertmhaas(at)gmail(dot)com
Cc: ishii(at)postgresql(dot)org, maumau307(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, maksymb(at)fast(dot)au(dot)fujitsu(dot)com, hlinnakangas(at)vmware(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: UTF8 national character data type support WIP patch and list of open issues.
Date: 2013-09-21 22:29:52
Message-ID: 20130922.072952.1977066018971837040.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> I think the point here is that, at least as I understand it, encoding
> conversion and sanitization happens at a very early stage right now,
> when we first receive the input from the client. If the user sends a
> string of bytes as part of a query or bind placeholder that's not
> valid in the database encoding, it's going to error out before any
> type-specific code has an opportunity to get control. Look at
> textin(), for example. There's no encoding check there. That means
> it's already been done at that point. To make this work, someone's
> going to have to figure out what to do about *that*. Until we have a
> sketch of what the design for that looks like, I don't see how we can
> credibly entertain more specific proposals.

I don't think the bind placeholder is the case. That is processed by
exec_bind_message() in postgres.c. It has enough info about the type
of the placeholder, and I think we can easily deal with NCHAR. Same
thing can be said to COPY case.

Problem is an ordinary query (simple protocol "Q" message) as you
pointed out. Encoding conversion happens at a very early stage (note
that fast-path case has the same issue). If a query message contains,
say, SHIFT-JIS and EUC-JP, then we are going into trouble because the
encoding conversion routine (pg_client_to_server) regards that the
message from client contains only one encoding. However my question
is, does it really happen? Because there's any text editor which can
create SHIFT-JIS and EUC-JP mixed text. So my guess is, when user want
to use NCHAR as SHIFT-JIS text, the rest of query consist of either
SHIFT-JIS or plain ASCII. If so, what the user need to do is, set the
client encoding to SJIFT-JIS and everything should be fine.

Maumau, is my guess correct?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2013-09-21 22:35:41 Re: [HACKERS] VMs for Reviewers Available
Previous Message Jaime Casanova 2013-09-21 22:16:42 Re: Assertions in PL/PgSQL