Re: Patch: add conversion from pg_wchar to multibyte

Lists:	pgsql-hackers

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Patch: add conversion from pg_wchar to multibyte
Date:	2012-04-23 08:48:20
Message-ID:	CAPpHfdshcHe1ZPQhyd2xhAKnNu0VpdMPuGFtvribqJcnH0K2Ew@mail.gmail.com

Hackers,

The attached patch adds conversion from a pg_wchar string to a multibyte string.
This functionality is needed for my patch on index support for regular
expression search:
http://archives.postgresql.org/pgsql-hackers/2011-11/msg01297.php
Analyzing the conversion from multibyte to pg_wchar, I found the following
types of conversion:
1) Trivial conversion for single-byte encodings. It just adds leading zeros
to each byte.
2) Conversion from UTF-8 to Unicode.
3) Conversions from euc* encodings. They write the bytes of a character to
pg_wchar in inverse order, starting from the lowest byte (this explanation
assumes a little-endian system).
4) Conversion from the mule encoding. This conversion is unclear to me and
also seems to be lossy.

It was easy to write the inverse conversion for 1-3. I've changed conversion 4
to behave like 3. I'm not sure my change is OK, because I didn't understand
the original conversion.
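
For illustration, conversion type 1 and its inverse could look roughly like the sketch below. This is a minimal standalone example, not code from the patch: the wchar_sketch typedef (a stand-in for pg_wchar, assumed to be 32 bits wide) and the function names sb2wchar and wchar2sb are made up here.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int wchar_sketch;	/* hypothetical stand-in for pg_wchar */

/* Widen each byte into one wide character; returns the number converted. */
int
sb2wchar(const unsigned char *from, wchar_sketch *to, int len)
{
	int			cnt = 0;

	assert(from != NULL && to != NULL);
	while (len > 0 && *from)
	{
		*to++ = *from++;		/* upper three bytes become zero */
		len--;
		cnt++;
	}
	*to = 0;
	return cnt;
}

/* Inverse: keep only the lowest byte of each wide character. */
int
wchar2sb(const wchar_sketch *from, unsigned char *to, int len)
{
	int			cnt = 0;

	assert(from != NULL && to != NULL);
	while (len > 0 && *from)
	{
		*to++ = (unsigned char) (*from++ & 0xFF);
		len--;
		cnt++;
	}
	*to = 0;
	return cnt;
}
```

Both output buffers need room for len entries plus a terminating zero.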

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
wchar2mb-0.1.patch	application/octet-stream	15.6 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	ishii(at)sraoss(dot)co(dot)jp
Cc:	pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-05-21 22:37:54
Message-ID:	CAPpHfdv8_fa6FCCe_sHndDaAcdA6R28tO-A7PQJSC2joizHmpA@mail.gmail.com

Hello, Ishii-san!

We talked at PGCon about my questions on the mule to wchar
conversion. My questions about the pg_mule2wchar_with_len function are as
follows. In these parts of the code:

else if (IS_LCPRV1(*from) && len >= 3)
{
    from++;
    *to = *from++ << 16;
    *to |= *from++;
    len -= 3;
}

and

else if (IS_LCPRV2(*from) && len >= 4)
{
    from++;
    *to = *from++ << 16;
    *to |= *from++ << 8;
    *to |= *from++;
    len -= 4;
}

we skip the first byte of the original string. Are we able to restore it
from the pg_wchar?
Also, in this part of the code we're shifting the first byte by 16 bits:

if (IS_LC1(*from) && len >= 2)
{
    *to = *from++ << 16;
    *to |= *from++;
    len -= 2;
}
else if (IS_LCPRV1(*from) && len >= 3)
{
    from++;
    *to = *from++ << 16;
    *to |= *from++;
    len -= 3;
}

Why don't we shift it by 8 bits?
You can see my patch in this thread, where I propose purely mechanical
changes to this function that make the inverse conversion possible.

------
With best regards,
Alexander Korotkov.

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	aekorotkov(at)gmail(dot)com
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-05-22 07:50:29
Message-ID:	20120522.165029.1187711886221407331.t-ishii@sraoss.co.jp

Hi Alexander,

It was good seeing you in Ottawa!

> Hello, Ishii-san!
>
> We've talked on PGCon that I've questions about mule to wchar
> conversion. My questions about pg_mule2wchar_with_len function are
> following. In these parts of code:
> *
> *
> else if (IS_LCPRV1(*from) && len >= 3)
> {
> from++;
> *to = *from++ << 16;
> *to |= *from++;
> len -= 3;
> }
>
> and
>
> else if (IS_LCPRV2(*from) && len >= 4)
> {
> from++;
> *to = *from++ << 16;
> *to |= *from++ << 8;
> *to |= *from++;
> len -= 4;
> }
>
> we skip first character of original string. Are we able to restore it back
> from pg_wchar?

I think it's possible. The first characters are defined like this:

#define IS_LCPRV1(c) ((unsigned char)(c) == 0x9a || (unsigned char)(c) == 0x9b)
#define IS_LCPRV2(c) ((unsigned char)(c) == 0x9c || (unsigned char)(c) == 0x9d)

It seems IS_LCPRV1 is not used in any of the encodings PostgreSQL supports
at this point, which means there is no chance that existing
databases include LCPRV1. So you could safely ignore it.

For IS_LCPRV2, it is only used for Chinese encodings (EUC_TW and BIG5)
in backend/utils/mb/conversion_procs/euc_tw_and_big5/euc_tw_and_big5.c
and it is fixed to 0x9d. So you can always restore the value to 0x9d.

> Also in this part of code we're shifting first byte by 16 bits:
>
> if (IS_LC1(*from) && len >= 2)
> {
> *to = *from++ << 16;
> *to |= *from++;
> len -= 2;
> }
> else if (IS_LCPRV1(*from) && len >= 3)
> {
> from++;
> *to = *from++ << 16;
> *to |= *from++;
> len -= 3;
> }
>
> Why don't we shift it by 8 bits?

Because we want the first byte of the LC1 case to be placed in the second
byte of the wchar, i.e.

0th byte: always 0
1st byte: leading byte (the first byte of the multibyte)
2nd byte: always 0
3rd byte: the second byte of the multibyte

Note that we always assume that the 1st byte (called the "leading byte",
LB for short) represents the id of the character set (from 0x81 to
0xff) in MULE INTERNAL encoding. For the mapping between LB and
charsets, see pg_wchar.h.
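
The layout described above can be sketched as a pack/unpack pair. This is only an illustration under stated assumptions: wchar_sketch is a hypothetical 32-bit stand-in for pg_wchar, and lc1_pack and lc1_unpack are names invented for this example, not functions in PostgreSQL.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int wchar_sketch;	/* hypothetical stand-in for pg_wchar */

/* Pack a 2-byte LC1 sequence: leading byte at shift 16, data byte at shift 0. */
wchar_sketch
lc1_pack(unsigned char lb, unsigned char c)
{
	return ((wchar_sketch) lb << 16) | c;
}

/* Recover both bytes; because LB is kept, the packing is reversible. */
void
lc1_unpack(wchar_sketch w, unsigned char *lb, unsigned char *c)
{
	assert(lb != NULL && c != NULL);
	*lb = (unsigned char) ((w >> 16) & 0xFF);
	*c = (unsigned char) (w & 0xFF);
}
```

With a shift of 8 instead, the leading byte would land in the slot that three-byte characters use for their middle byte, so the charset id would no longer sit in the same position for every character.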

> You can see my patch in this thread where I propose purely mechanical
> changes in this function which make inverse conversion possible.
>
> ------
> With best regards,
> Alexander Korotkov.

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-05-22 10:48:11
Message-ID:	CAPpHfduQEZUV89CnDJcjnPrdDmB810O4_xLc71GbEA42Yi=40Q@mail.gmail.com

On Tue, May 22, 2012 at 11:50 AM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>
> I think it's possible. The first characters are defined like this:
>
> #define IS_LCPRV1(c) ((unsigned char)(c) == 0x9a || (unsigned char)(c) == 0x9b)
> #define IS_LCPRV2(c) ((unsigned char)(c) == 0x9c || (unsigned char)(c) == 0x9d)
>
> It seems IS_LCPRV1 is not used in any of PostgreSQL supported
> encodings at this point, that means there's 0 chance which existing
> databases include LCPRV1. So you could safely ignore it.
>
> For IS_LCPRV2, it is only used for Chinese encodings (EUC_TW and BIG5)
> in backend/utils/mb/conversion_procs/euc_tw_and_big5/euc_tw_and_big5.c
> and it is fixed to 0x9d. So you can always restore the value to 0x9d.
>
> > Also in this part of code we're shifting first byte by 16 bits:
> >
> > if (IS_LC1(*from) && len >= 2)
> > {
> > *to = *from++ << 16;
> > *to |= *from++;
> > len -= 2;
> > }
> > else if (IS_LCPRV1(*from) && len >= 3)
> > {
> > from++;
> > *to = *from++ << 16;
> > *to |= *from++;
> > len -= 3;
> > }
> >
> > Why don't we shift it by 8 bits?
>
> Because we want the first byte of LC1 case to be placed in the second
> byte of wchar. i.e.
>
> 0th byte: always 0
> 1th byte: leading byte (the first byte of the multibyte)
> 2th byte: always 0
> 3th byte: the second byte of the multibyte
>
> Note that we always assume that the 1th byte (called "leading byte":
> LB in short) represents the id of the character set (from 0x81 to
> 0xff) in MULE INTERNAL encoding. For the mapping between LB and
> charsets, see pg_wchar.h.

Thanks for your comments. They clarify a lot.
But I still don't understand how we can distinguish IS_LCPRV2 from IS_LC2.
Isn't it possible for them to produce the same pg_wchar?

------
With best regards,
Alexander Korotkov.

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	aekorotkov(at)gmail(dot)com
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-05-22 11:27:41
Message-ID:	20120522.202741.528025310545384924.t-ishii@sraoss.co.jp

> Thanks for your comments. They clarify a lot.
> But I still don't realize how can we distinguish IS_LCPRV2 and IS_LC2?
> Isn't it possible for them to produce same pg_wchar?

If LB is in 0x90 - 0x99 range, then they are LC2.
If LB is in 0xf0 - 0xff range, then they are LCPRV2.
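
As a rough sketch of that rule, assuming LB is the byte stored at shift 16 of the wide character (the enum, the wchar_sketch typedef and the function name classify_by_lb are hypothetical, introduced only for this example):

```c
#include <assert.h>

typedef unsigned int wchar_sketch;	/* hypothetical stand-in for pg_wchar */

typedef enum
{
	KIND_LC2,					/* official 2-byte charset, LB 0x90-0x99 */
	KIND_LCPRV2,				/* private 2-byte charset, LB 0xf0-0xff */
	KIND_OTHER					/* anything this sketch does not cover */
} mule_kind;

/* Classify a wide character by the byte stored at shift 16. */
mule_kind
classify_by_lb(wchar_sketch w)
{
	unsigned char lb = (unsigned char) ((w >> 16) & 0xFF);

	if (lb >= 0x90 && lb <= 0x99)
		return KIND_LC2;
	if (lb >= 0xf0)				/* an unsigned char cannot exceed 0xff */
		return KIND_LCPRV2;
	return KIND_OTHER;
}
```

Since the two ranges do not overlap, an LC2 and an LCPRV2 character cannot end up with the same value in that byte.
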
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-05-24 04:04:12
Message-ID:	CAPpHfdupQU4+YpaMLixhG2Mr6viSsHG47t2nR1PKQvqPHVDAuA@mail.gmail.com

On Tue, May 22, 2012 at 3:27 PM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:

> > Thanks for your comments. They clarify a lot.
> > But I still don't realize how can we distinguish IS_LCPRV2 and IS_LC2?
> > Isn't it possible for them to produce same pg_wchar?
>
> If LB is in 0x90 - 0x99 range, then they are LC2.
> If LB is in 0xf0 - 0xff range, then they are LCPRV2.
>

Thanks. I rewrote the inverse conversion from pg_wchar to mule. A new version
of the patch is attached.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
wchar2mb-0.2.patch	application/octet-stream	17.0 KB

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	aekorotkov(at)gmail(dot)com
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-05-29 01:27:42
Message-ID:	20120529.102742.1942410168809096450.t-ishii@sraoss.co.jp

> On Tue, May 22, 2012 at 3:27 PM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>
>> > Thanks for your comments. They clarify a lot.
>> > But I still don't realize how can we distinguish IS_LCPRV2 and IS_LC2?
>> > Isn't it possible for them to produce same pg_wchar?
>>
>> If LB is in 0x90 - 0x99 range, then they are LC2.
>> If LB is in 0xf0 - 0xff range, then they are LCPRV2.
>>
>
> Thanks. I rewrote inverse conversion from pg_wchar to mule. New version of
> patch is attached.

[forgot to cc: to the list]

I looked into your patch, especially: pg_wchar2euc_with_len(const
pg_wchar *from, unsigned char *to, int len)

I think there's a little room to improve the function.

if (*from >> 24)
{
    *to++ = *from >> 24;
    *to++ = (*from >> 16) & 0xFF;
    *to++ = (*from >> 8) & 0xFF;
    *to++ = *from & 0xFF;
    cnt += 4;
}

Since the function walks through this for every single wchar, something like:

if ((c = *from >> 24))
{
    *to++ = c;
    *to++ = (*from >> 16) & 0xFF;
    *to++ = (*from >> 8) & 0xFF;
    *to++ = *from & 0xFF;
    cnt += 4;
}

will save a few cycles (though I'm not sure whether the optimizer produces
similar code for the version above anyway).
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-06-27 19:35:56
Message-ID:	CA+TgmoY=Q6ydZb8AjKQ3cQ7mzRmL8r24GBqiPW8kFSSdeFbefw@mail.gmail.com

On Thu, May 24, 2012 at 12:04 AM, Alexander Korotkov
<aekorotkov(at)gmail(dot)com> wrote:
> Thanks. I rewrote inverse conversion from pg_wchar to mule. New version of
> patch is attached.

Review:

It looks to me like pg_wchar2utf_with_len will not work, because
unicode_to_utf8 returns its second argument unmodified - not, as your
code seems to assume, the byte following what was already written.
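
To illustrate the pitfall: when an encoder returns the start of the buffer it was handed, the caller has to advance the output pointer by the encoded length itself. The sketch below uses a hypothetical stand-in encoder (encode_utf8_sketch) that mimics that return convention. It is not unicode_to_utf8, and utf8_seq_len and wchar2utf_sketch are likewise names made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in encoder: writes c as UTF-8 into buf and returns buf itself. */
unsigned char *
encode_utf8_sketch(unsigned int c, unsigned char *buf)
{
	if (c <= 0x7F)
		buf[0] = (unsigned char) c;
	else if (c <= 0x7FF)
	{
		buf[0] = (unsigned char) (0xC0 | (c >> 6));
		buf[1] = (unsigned char) (0x80 | (c & 0x3F));
	}
	else if (c <= 0xFFFF)
	{
		buf[0] = (unsigned char) (0xE0 | (c >> 12));
		buf[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
		buf[2] = (unsigned char) (0x80 | (c & 0x3F));
	}
	else
	{
		buf[0] = (unsigned char) (0xF0 | ((c >> 18) & 0x07));
		buf[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3F));
		buf[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3F));
		buf[3] = (unsigned char) (0x80 | (c & 0x3F));
	}
	return buf;
}

/* Length of the UTF-8 sequence whose first byte is b. */
int
utf8_seq_len(unsigned char b)
{
	if ((b & 0x80) == 0)
		return 1;
	if ((b & 0xE0) == 0xC0)
		return 2;
	if ((b & 0xF0) == 0xE0)
		return 3;
	return 4;
}

/* Convert up to len code points; returns the number of bytes written. */
int
wchar2utf_sketch(const unsigned int *from, unsigned char *to, int len)
{
	int			cnt = 0;

	assert(from != NULL && to != NULL);
	while (len > 0 && *from)
	{
		int			n;

		encode_utf8_sketch(*from, to);	/* return value is just "to" */
		n = utf8_seq_len(*to);	/* so compute the advance separately */
		to += n;
		cnt += n;
		from++;
		len--;
	}
	*to = 0;
	return cnt;
}
```

Writing "to = encode_utf8_sketch(*from, to)" instead would leave the pointer where it was, and every character would overwrite the previous one.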

MULE also looks problematic. The code that you've written isn't
symmetric with the opposite conversion, unlike what you did in all
other cases, and I don't understand why. I'm also somewhat baffled by
the reverse conversion: it treats a multi-byte sequence beginning with
a byte for which IS_LCPRV1(x) returns true as invalid if there are
less than 3 bytes available, but it only reads two; similarly, for
IS_LCPRV2(x), it demands 4 bytes but converts only 3.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-01 09:11:38
Message-ID:	CAPpHfduPZMmpq9yjmd8aXQsdMiG6tCU0w0VoBugz3EwR9o4yUw@mail.gmail.com

On Wed, Jun 27, 2012 at 11:35 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> It looks to me like pg_wchar2utf_with_len will not work, because
> unicode_to_utf8 returns its second argument unmodified - not, as your
> code seems to assume, the byte following what was already written.
>

Fixed.

> MULE also looks problematic. The code that you've written isn't
> symmetric with the opposite conversion, unlike what you did in all
> other cases, and I don't understand why. I'm also somewhat baffled by
> the reverse conversion: it treats a multi-byte sequence beginning with
> a byte for which IS_LCPRV1(x) returns true as invalid if there are
> less than 3 bytes available, but it only reads two; similarly, for
> IS_LCPRV2(x), it demands 4 bytes but converts only 3.

Should we preserve the existing pg_wchar representation for the MULE encoding?
Perhaps we can modify it as in version 0.1 of the patch in order to make it
more transparent.

------
With best regards,
Alexander Korotkov.

Attachment	Content-Type	Size
wchar2mb-0.4.patch	application/octet-stream	17.0 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-02 16:12:49
Message-ID:	CA+TgmobzmAh-WF3dV==rv=ft0mcK0jwQNiHAdhb4KOL2BZQuZw@mail.gmail.com

On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>> MULE also looks problematic. The code that you've written isn't
>> symmetric with the opposite conversion, unlike what you did in all
>> other cases, and I don't understand why. I'm also somewhat baffled by
>> the reverse conversion: it treats a multi-byte sequence beginning with
>> a byte for which IS_LCPRV1(x) returns true as invalid if there are
>> less than 3 bytes available, but it only reads two; similarly, for
>> IS_LCPRV2(x), it demands 4 bytes but converts only 3.
>
> Should we save existing pg_wchar representation for MULE encoding? Probably,
> we can modify it like in 0.1 version of patch in order to make it more
> transparent.

Changing the encoding would break pg_upgrade, so -1 from me on that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-02 20:33:11
Message-ID:	CAPpHfdvcTiss1MetkZZth5yzMx=W+bqGuAAdesQ_9rQJmf7vjQ@mail.gmail.com

On Mon, Jul 2, 2012 at 8:12 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> >> MULE also looks problematic. The code that you've written isn't
> >> symmetric with the opposite conversion, unlike what you did in all
> >> other cases, and I don't understand why. I'm also somewhat baffled by
> >> the reverse conversion: it treats a multi-byte sequence beginning with
> >> a byte for which IS_LCPRV1(x) returns true as invalid if there are
> >> less than 3 bytes available, but it only reads two; similarly, for
> >> IS_LCPRV2(x), it demands 4 bytes but converts only 3.
> >
> > Should we save existing pg_wchar representation for MULE encoding? Probably,
> > we can modify it like in 0.1 version of patch in order to make it more
> > transparent.
>
> Changing the encoding would break pg_upgrade, so -1 from me on that.

I didn't realize that we store pg_wchar on disk anywhere. I thought it was
only an in-memory representation. Where do we store pg_wchar on disk?

------
With best regards,
Alexander Korotkov.

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-02 20:37:13
Message-ID:	CA+Tgmoag-V_=BMQ9Xjw4zEWMnOkwPs_Kq=t7rRyE20D5nhT3+Q@mail.gmail.com

On Mon, Jul 2, 2012 at 4:33 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> On Mon, Jul 2, 2012 at 8:12 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com>
>> wrote:
>> >> MULE also looks problematic. The code that you've written isn't
>> >> symmetric with the opposite conversion, unlike what you did in all
>> >> other cases, and I don't understand why. I'm also somewhat baffled by
>> >> the reverse conversion: it treats a multi-byte sequence beginning with
>> >> a byte for which IS_LCPRV1(x) returns true as invalid if there are
>> >> less than 3 bytes available, but it only reads two; similarly, for
>> >> IS_LCPRV2(x), it demands 4 bytes but converts only 3.
>> >
>> > Should we save existing pg_wchar representation for MULE encoding?
>> > Probably,
>> > we can modify it like in 0.1 version of patch in order to make it more
>> > transparent.
>>
>> Changing the encoding would break pg_upgrade, so -1 from me on that.
>
>
> I didn't realize that we store pg_wchar on disk somewhere. I thought it is
> only in-memory representation. Where do we store pg_wchar on disk?

OK, now I'm confused. I was thinking (incorrectly) that you were
talking about changing the multibyte encoding, which of course is
saved on disk all over the place. Changing the wchar encoding is a
different kettle of fish, and I have no idea what that would or would
not break. But I don't see why we'd want to do such a thing. We just
need to make the MB->WCHAR and WCHAR->MB transformations mirror images
of each other; why is that hard?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-02 20:46:03
Message-ID:	CAPpHfdvjejw0d5XyHoLXhvBpNiYiK_YbTN9395KGRjOMpqANPg@mail.gmail.com

On Tue, Jul 3, 2012 at 12:37 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> On Mon, Jul 2, 2012 at 4:33 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> > On Mon, Jul 2, 2012 at 8:12 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >>
> >> On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> >> >> MULE also looks problematic. The code that you've written isn't
> >> >> symmetric with the opposite conversion, unlike what you did in all
> >> >> other cases, and I don't understand why. I'm also somewhat baffled by
> >> >> the reverse conversion: it treats a multi-byte sequence beginning with
> >> >> a byte for which IS_LCPRV1(x) returns true as invalid if there are
> >> >> less than 3 bytes available, but it only reads two; similarly, for
> >> >> IS_LCPRV2(x), it demands 4 bytes but converts only 3.
> >> >
> >> > Should we save existing pg_wchar representation for MULE encoding?
> >> > Probably,
> >> > we can modify it like in 0.1 version of patch in order to make it more
> >> > transparent.
> >>
> >> Changing the encoding would break pg_upgrade, so -1 from me on that.
> >
> >
> > I didn't realize that we store pg_wchar on disk somewhere. I thought it is
> > only in-memory representation. Where do we store pg_wchar on disk?
>
> OK, now I'm confused. I was thinking (incorrectly) that you were
> talking about changing the multibyte encoding, which of course is
> saved on disk all over the place. Changing the wchar encoding is a
> different kettle of fish, and I have no idea what that would or would
> not break. But I don't see why we'd want to do such a thing. We just
> need to make the MB->WCHAR and WCHAR->MB transformations mirror images
> of each other; why is that hard?

So, I provided such a transformation in versions 0.3 and 0.4, based on the
explanation from Tatsuo Ishii. The problem is that both conversions are
nontrivial and it's not evident that they mirror each other (seeing that
they do requires some additional assumptions about the encodings, which are
not evident from the transformation itself). I thought you mentioned that
problem two messages back.

------
With best regards,
Alexander Korotkov.

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-02 22:16:42
Message-ID:	CA+TgmoaHLC6tD+88XZJmo-gJ7Ue+5d7oNKeES-5hyrUTC_LiKQ@mail.gmail.com

On Mon, Jul 2, 2012 at 4:46 PM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> So, I provided such transformation in versions 0.3 and 0.4 based on
> explanation from Tatsuo Ishii. The problem is that both conversions are
> nontrivial and it's not evident that they are mirror (understanding that
> they are mirror require some additional assumptions about encodings, not
> evident just by transformation itself). I though you mention that problem
> two message back.

Yeah, I did. I think I may be a bit confused here, so let me try to
understand this a bit better. It seems like pg_mule2wchar_with_len
uses the following algorithm:

- If the first character IS_LC1 (0x81-0x8d), decode two bytes, stored
with shifts of 16 and 0.
- If the first character IS_LCPRV1 (0x9a-0x9b), decode three bytes,
skipping the first one and storing the remaining two with shifts of 16
and 0.
- If the first character IS_LC2 (0x90-0x99), decode three bytes,
stored with shifts of 16, 8, and 0.
- If the first character IS_LCPRV2 (0x9c-0x9d), decode four bytes,
skipping the first one and storing the remaining three with offsets of
16, 8, and 0.
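
The four cases above, plus a one-byte fallback, can be sketched for a single character as follows. This is a hedged illustration rather than the PostgreSQL source: the SK_ macros copy the byte ranges quoted in this thread, while wchar_sketch, mule_decode_one and the treatment of every other byte as a single-byte character are assumptions of the example.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int wchar_sketch;	/* hypothetical stand-in for pg_wchar */

#define SK_IS_LC1(c)	((unsigned char)(c) >= 0x81 && (unsigned char)(c) <= 0x8d)
#define SK_IS_LCPRV1(c)	((unsigned char)(c) == 0x9a || (unsigned char)(c) == 0x9b)
#define SK_IS_LC2(c)	((unsigned char)(c) >= 0x90 && (unsigned char)(c) <= 0x99)
#define SK_IS_LCPRV2(c)	((unsigned char)(c) == 0x9c || (unsigned char)(c) == 0x9d)

/*
 * Decode one character from "from" (len bytes available) into *to.
 * Returns the number of input bytes consumed.
 */
int
mule_decode_one(const unsigned char *from, int len, wchar_sketch *to)
{
	assert(from != NULL && to != NULL && len > 0);

	if (SK_IS_LC1(*from) && len >= 2)
	{
		*to = ((wchar_sketch) from[0] << 16) | from[1];
		return 2;
	}
	if (SK_IS_LCPRV1(*from) && len >= 3)
	{
		/* the prefix byte from[0] is skipped, not stored */
		*to = ((wchar_sketch) from[1] << 16) | from[2];
		return 3;
	}
	if (SK_IS_LC2(*from) && len >= 3)
	{
		*to = ((wchar_sketch) from[0] << 16) |
			((wchar_sketch) from[1] << 8) | from[2];
		return 3;
	}
	if (SK_IS_LCPRV2(*from) && len >= 4)
	{
		/* the prefix byte from[0] is skipped, not stored */
		*to = ((wchar_sketch) from[1] << 16) |
			((wchar_sketch) from[2] << 8) | from[3];
		return 4;
	}
	*to = from[0];				/* assumed fallback: one byte, e.g. ASCII */
	return 1;
}
```

In both private cases the prefix byte never reaches the wide character, so the reverse direction has to reconstruct it from what remains.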

In the reverse transformation implemented by pg_wchar2mule_with_len,
if the byte stored with shift 16 IS_LC1 or IS_LC2, then we decode 2 or
3 bytes, respectively, exactly as I would expect. ASCII decoding is
also as I would expect. The case I don't understand is what happens
when the leading byte of the multibyte character was IS_LCPRV1 or
IS_LCPRV2. In that case, we ought to decode three bytes if it was
IS_LCPRV1 and four bytes if it was IS_LCPRV2, but actually it seems we
always decode 4 bytes. That implies that the IS_LCPRV1() case in
pg_mule2wchar_with_len is dead code, and that any 4 byte characters
are always of the form 0x9d 0xf? 0x?? 0x??; maybe that's what the
comment there is driving at, but it's not too clear to me.

Am I close?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	robertmhaas(at)gmail(dot)com
Cc:	aekorotkov(at)gmail(dot)com, ishii(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-02 23:33:36
Message-ID:	20120703.083336.1290159206305528932.t-ishii@sraoss.co.jp

> Yeah, I did. I think I may be a bit confused here, so let me try to
> understand this a bit better. It seems like pg_mule2wchar_with_len
> uses the following algorithm:
>
> - If the first character IS_LC1 (0x81-0x8d), decode two bytes, stored
> with shifts of 16 and 0.
> - If the first character IS_LCPRV1 (0x9a-0x9b), decode three bytes,
> skipping the first one and storing the remaining two with shifts of 16
> and 0.
> - If the first character IS_LC2 (0x90-0x99), decode three bytes,
> stored with shifts of 16, 8, and 0.
> - If the first character IS_LCPRV2 (0x9c-0x9d), decode four bytes,
> skipping the first one and storing the remaining three with offsets of
> 16, 8, and 0.

Correct.

> In the reverse transformation implemented by pg_wchar2mule_with_len,
> if the byte stored with shift 16 IS_LC1 or IS_LC2, then we decode 2 or
> 3 bytes, respectively, exactly as I would expect. ASCII decoding is
> also as I would expect. The case I don't understand is what happens
> when the leading byte of the multibyte character was IS_LCPRV1 or
> IS_LCPRV2. In that case, we ought to decode three bytes if it was
> IS_LCPRV1 and four bytes if it was IS_LCPRV2, but actually it seems we
> always decode 4 bytes. That implies that the IS_LCPRV1() case in
> pg_mule2wchar_with_len is dead code,

Yes, dead code unless we want to support the following encodings in the
future (from include/mb/pg_wchar.h):
#define LC_SISHENG              0xa0    /* Chinese SiSheng characters for
                                         * PinYin/ZhuYin (not supported) */
#define LC_IPA                  0xa1    /* IPA (International Phonetic Association)
                                         * (not supported) */
#define LC_VISCII_LOWER         0xa2    /* Vietnamese VISCII1.1 lower-case (not
                                         * supported) */
#define LC_VISCII_UPPER         0xa3    /* Vietnamese VISCII1.1 upper-case (not
                                         * supported) */
#define LC_ARABIC_DIGIT         0xa4    /* Arabic digit (not supported) */
#define LC_ARABIC_1_COLUMN      0xa5    /* Arabic 1-column (not supported) */
#define LC_ASCII_RIGHT_TO_LEFT  0xa6    /* ASCII (left half of ISO8859-1) with
                                         * right-to-left direction (not
                                         * supported) */
#define LC_LAO                  0xa7    /* Lao characters (ISO10646 0E80..0EDF) (not
                                         * supported) */
#define LC_ARABIC_2_COLUMN      0xa8    /* Arabic 1-column (not supported) */

> and that any 4 byte characters
> are always of the form 0x9d 0xf? 0x?? 0x??; maybe that's what the
> comment there is driving at, but it's not too clear to me.

Yes, that's because we only support EUC_TW and BIG5, which use
IS_LCPRV2 in the mule internal encoding, as stated in the comment.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-03 00:12:33
Message-ID:	5895.1341274353@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> In the reverse transformation implemented by pg_wchar2mule_with_len,
> if the byte stored with shift 16 IS_LC1 or IS_LC2, then we decode 2 or
> 3 bytes, respectively, exactly as I would expect. ASCII decoding is
> also as I would expect. The case I don't understand is what happens
> when the leading byte of the multibyte character was IS_LCPRV1 or
> IS_LCPRV2.

Some inspection of pg_wchar.h suggests that the IS_LCPRV1 and IS_LCPRV2
cases are unused: the file doesn't define any encoding labels that match
the byte values they accept, nor do the comments suggest that Emacs has
any such labels either. If true, it would not be much of a stretch to
believe that any code claiming to support these cases could be broken.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-03 00:55:56
Message-ID:	6653.1341276956@sss.pgh.pa.us

I wrote:
> Some inspection of pg_wchar.h suggests that the IS_LCPRV1 and IS_LCPRV2
> cases are unused: the file doesn't define any encoding labels that match
> the byte values they accept, nor do the comments suggest that Emacs has
> any such labels either.

Scratch that --- I was misled by the fond illusion that our code
wouldn't use magic hex literals for encoding labels. Stuff like this:

/* 0x9d means LCPRV2 */
if (c1 == LC_CNS11643_1 || c1 == LC_CNS11643_2 || c1 == 0x9d)

seems to me to be well below the minimum acceptable quality standards
for Postgres code.

Having said that, grepping the src/backend/utils/mb/conversion_procs/
reveals no sign that 0x9a, 0x9b, or 0x9c are used anywhere with the
meanings that the IS_LCPRV1 and IS_LCPRV2 macros assign to them.
Furthermore, AFAICS the 0x9d case is only used in euc_tw_and_big5/,
with the following byte being one of the LC_CNS11643_[3-7] constants.

Given that these constants are treading on encoding ID namespace that
Emacs upstream might someday decide to assign, I think we'd be well
advised to *not* start installing any code that thinks that 9a-9c
mean something.

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc:	aekorotkov(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-03 02:56:18
Message-ID:	CA+TgmoY-Tud3MnJTF0CFj1EhE1SYG+vQ5RNH_REX==-g=_tBRg@mail.gmail.com

On Mon, Jul 2, 2012 at 7:33 PM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:
>> Yeah, I did. I think I may be a bit confused here, so let me try to
>> understand this a bit better. It seems like pg_mule2wchar_with_len
>> uses the following algorithm:
>>
>> - If the first character IS_LC1 (0x81-0x8d), decode two bytes, stored
>> with shifts of 16 and 0.
>> - If the first character IS_LCPRV1 (0x9a-0x9b), decode three bytes,
>> skipping the first one and storing the remaining two with shifts of 16
>> and 0.
>> - If the first character IS_LC2 (0x90-0x99), decode three bytes,
>> stored with shifts of 16, 8, and 0.
>> - If the first character IS_LCPRV2 (0x9c-0x9d), decode four bytes,
>> skipping the first one and storing the remaining three with offsets of
>> 16, 8, and 0.
>
> Correct.
>
>> In the reverse transformation implemented by pg_wchar2mule_with_len,
>> if the byte stored with shift 16 IS_LC1 or IS_LC2, then we decode 2 or
>> 3 bytes, respectively, exactly as I would expect. ASCII decoding is
>> also as I would expect. The case I don't understand is what happens
>> when the leading byte of the multibyte character was IS_LCPRV1 or
>> IS_LCPRV2. In that case, we ought to decode three bytes if it was
>> IS_LCPRV1 and four bytes if it was IS_LCPRV2, but actually it seems we
>> always decode 4 bytes. That implies that the IS_LCPRV1() case in
>> pg_mule2wchar_with_len is dead code,
>
> Yes, dead code unless we want to support following encodings in the
> future(from include/mb/pg_wchar.h:
> #define LC_SISHENG 0xa0 /* Chinese SiSheng characters for
>  * PinYin/ZhuYin (not supported) */
> #define LC_IPA 0xa1 /* IPA (International Phonetic Association)
>  * (not supported) */
> #define LC_VISCII_LOWER 0xa2 /* Vietnamese VISCII1.1 lower-case (not
>  * supported) */
> #define LC_VISCII_UPPER 0xa3 /* Vietnamese VISCII1.1 upper-case (not
>  * supported) */
> #define LC_ARABIC_DIGIT 0xa4 /* Arabic digit (not supported) */
> #define LC_ARABIC_1_COLUMN 0xa5 /* Arabic 1-column (not supported) */
> #define LC_ASCII_RIGHT_TO_LEFT 0xa6 /* ASCII (left half of ISO8859-1) with
>  * right-to-left direction (not
>  * supported) */
> #define LC_LAO 0xa7 /* Lao characters (ISO10646 0E80..0EDF) (not
>  * supported) */
> #define LC_ARABIC_2_COLUMN 0xa8 /* Arabic 2-column (not supported) */
>
>> and that any 4 byte characters
>> are always of the form 0x9d 0xf? 0x?? 0x??; maybe that's what the
>> comment there is driving at, but it's not too clear to me.
>
> Yes, that's because we only support EUC_TW and BIG5, which use
> IS_LCPRV2 in the mule internal encoding, as stated in the comment.

OK. So, in that case, I suggest that if the leading byte is non-zero,
we emit 0x9d followed by the three available bytes, instead of first
testing whether the first byte is >= 0xf0. That test seems to serve
no purpose but to confuse the issue.

I further suggest that we improve the comments on the mule functions
for both wchar->mb and mb->wchar to make all this more clear.
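
The decoding rules listed above can be written down as a minimal sketch.
This is not the code from wchar.c: the SK_ names and the function
sk_mule_decode_one are hypothetical, the lead-byte ranges are the ones
quoted in this thread, and the sketch rejects anything it does not
recognize instead of passing it through.

```c
#include <stddef.h>
#include <stdint.h>

/* Lead-byte classes as quoted in this thread (assumed, not copied from pg_wchar.h). */
#define SK_IS_LC1(c)    ((c) >= 0x81 && (c) <= 0x8d)
#define SK_IS_LCPRV1(c) ((c) == 0x9a || (c) == 0x9b)
#define SK_IS_LC2(c)    ((c) >= 0x90 && (c) <= 0x99)
#define SK_IS_LCPRV2(c) ((c) == 0x9c || (c) == 0x9d)

/*
 * Decode one MULE character at s (len bytes available) into *out.
 * Returns the number of bytes consumed, or 0 if the input is truncated
 * or starts with a byte this sketch does not handle.
 */
static size_t
sk_mule_decode_one(const unsigned char *s, size_t len, uint32_t *out)
{
    if (len == 0)
        return 0;

    if (SK_IS_LC1(s[0]) && len >= 2)
    {
        /* two bytes, stored with shifts of 16 and 0 */
        *out = ((uint32_t) s[0] << 16) | s[1];
        return 2;
    }
    if (SK_IS_LCPRV1(s[0]) && len >= 3)
    {
        /* skip the private prefix; remaining two bytes at shifts 16 and 0 */
        *out = ((uint32_t) s[1] << 16) | s[2];
        return 3;
    }
    if (SK_IS_LC2(s[0]) && len >= 3)
    {
        /* three bytes, stored with shifts of 16, 8 and 0 */
        *out = ((uint32_t) s[0] << 16) | ((uint32_t) s[1] << 8) | s[2];
        return 3;
    }
    if (SK_IS_LCPRV2(s[0]) && len >= 4)
    {
        /* skip the private prefix; remaining three bytes at shifts 16, 8 and 0 */
        *out = ((uint32_t) s[1] << 16) | ((uint32_t) s[2] << 8) | s[3];
        return 4;
    }
    if (s[0] < 0x80)
    {
        /* plain ASCII */
        *out = s[0];
        return 1;
    }
    return 0;
}
```

Note that in both private cases the prefix byte is dropped, so the
reverse conversion has to reconstruct it from the byte stored at shift 16.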

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	robertmhaas(at)gmail(dot)com
Cc:	ishii(at)postgresql(dot)org, aekorotkov(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-03 06:17:47
Message-ID:	20120703.151747.1330940307954703732.t-ishii@sraoss.co.jp
Lists:	pgsql-hackers

> OK. So, in that case, I suggest that if the leading byte is non-zero,
> we emit 0x9d followed by the three available bytes, instead of first
> testing whether the first byte is >= 0xf0. That test seems to serve
> no purpose but to confuse the issue.

Probably the code should look like this (see the comment below):

    else if (lb >= 0xf0 && lb <= 0xfe)
    {
        if (lb <= 0xf4)
            *to++ = 0x9c;
        else
            *to++ = 0x9d;
        *to++ = lb;
        *to++ = (*from >> 8) & 0xff;
        *to++ = *from & 0xff;
        cnt += 4;
    }

> I further suggest that we improve the comments on the mule functions
> for both wchar->mb and mb->wchar to make all this more clear.

I have added comments about the mule internal encoding, based on my
memory and an old document found on
the web (http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).

Please take a look. BTW, it seems the conversion between multibyte and
wchar can round-trip in the case where the leading byte is LCPRV2:

If the second byte of wchar (out of 4 bytes of wchar. The first byte
is always 0x00) is in range of 0xf0 to 0xf4, then the first byte of
multibyte must be 0x9c. If the second byte of wchar is in range of
0xf5 to 0xfe, then the first byte of multibyte must be 0x9d.
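
As a self-contained illustration of that rule (a sketch only; the name
sk_wchar_to_mule_prv2 is hypothetical and this is not the committed code),
the prefix byte can be rebuilt from the byte stored at shift 16:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Emit the 4-byte MULE form of a wide character whose byte at shift 16
 * lies in 0xf0..0xfe, choosing the private prefix as described above.
 * Returns 4 on success, or 0 if w is not in that range.
 * The caller must provide room for 4 bytes in "to".
 */
static size_t
sk_wchar_to_mule_prv2(uint32_t w, unsigned char *to)
{
    unsigned char lb = (w >> 16) & 0xff;

    if (lb < 0xf0 || lb > 0xfe)
        return 0;

    to[0] = (lb <= 0xf4) ? 0x9c : 0x9d;     /* recovered prefix byte */
    to[1] = lb;
    to[2] = (w >> 8) & 0xff;
    to[3] = w & 0xff;
    return 4;
}
```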
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Attachment	Content-Type	Size
pg_wchar.h.patch	text/x-patch	1.7 KB

From:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc:	robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-03 21:41:11
Message-ID:	CAPpHfdssF4epQsghxDyyw_=8=tscHaXCU-wx14EoQwTsipvrEw@mail.gmail.com
Lists:	pgsql-hackers

On Tue, Jul 3, 2012 at 10:17 AM, Tatsuo Ishii <ishii(at)postgresql(dot)org> wrote:

> > OK. So, in that case, I suggest that if the leading byte is non-zero,
> > we emit 0x9d followed by the three available bytes, instead of first
> > testing whether the first byte is >= 0xf0. That test seems to serve
> > no purpose but to confuse the issue.
>
> Probably the code should look like this (see the comment below):
>
>     else if (lb >= 0xf0 && lb <= 0xfe)
>     {
>         if (lb <= 0xf4)
>             *to++ = 0x9c;
>         else
>             *to++ = 0x9d;
>         *to++ = lb;
>         *to++ = (*from >> 8) & 0xff;
>         *to++ = *from & 0xff;
>         cnt += 4;
>     }

It's likely we also need to assign some names to all these numbers
(0xf0, 0xf4, 0xfe, 0x9c, 0x9d). But it's hard for me to invent such names.

> > I further suggest that we improve the comments on the mule functions
> > for both wchar->mb and mb->wchar to make all this more clear.
>
> I have added comments about the mule internal encoding, based on my
> memory and an old document found on the web
> (http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).
>
> Please take a look. BTW, it seems the conversion between multibyte and
> wchar can round-trip in the case where the leading byte is LCPRV2:
>
> If the second byte of wchar (out of 4 bytes of wchar. The first byte
> is always 0x00) is in range of 0xf0 to 0xf4, then the first byte of
> multibyte must be 0x9c. If the second byte of wchar is in range of
> 0xf5 to 0xfe, then the first byte of multibyte must be 0x9d.

Should I integrate these code changes into my patch? Or would we like to
commit them first?

------
With best regards,
Alexander Korotkov.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Tatsuo Ishii <ishii(at)postgresql(dot)org>, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-03 21:49:46
Message-ID:	15520.1341352186@sss.pgh.pa.us
Lists:	pgsql-hackers

Alexander Korotkov <aekorotkov(at)gmail(dot)com> writes:
> It's likely we also need to assign some names to all these numbers
> (0xf0, 0xf4, 0xfe, 0x9c, 0x9d). But it's hard for me to invent such names.

The encoding ID byte values already have names (see pg_wchar.h), but the
private prefix bytes don't. I griped about that upthread. I agree this
code needs some basic readability cleanup that's independent of your
feature addition. It'd likely be reasonable to do that as a separate
patch.

regards, tom lane

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	robertmhaas(at)gmail(dot)com, aekorotkov(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-03 22:05:14
Message-ID:	20120704.070514.1645722301115662368.t-ishii@sraoss.co.jp
Lists:	pgsql-hackers

> I have added comments about the mule internal encoding, based on my
> memory and an old document found on
> the web (http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).

Any objection to applying my patch?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc:	robertmhaas(at)gmail(dot)com, aekorotkov(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-03 22:37:03
Message-ID:	16449.1341355023@sss.pgh.pa.us
Lists:	pgsql-hackers

Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
>> I have added comments about the mule internal encoding, based on my
>> memory and an old document found on
>> the web (http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).

> Any objection to applying my patch?

It needs a bit of copy-editing, and I think we need to do more than just
add comments: the various byte values should have #defines so that you
can grep for code usages. I'll see what I can do with it.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Tatsuo Ishii <ishii(at)postgresql(dot)org>, robertmhaas(at)gmail(dot)com, aekorotkov(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-04 04:43:26
Message-ID:	6875.1341377006@sss.pgh.pa.us
Lists:	pgsql-hackers

I wrote:
> Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
>>> I have added comments about the mule internal encoding, based on my
>>> memory and an old document found on
>>> the web (http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).

>> Any objection to applying my patch?

> It needs a bit of copy-editing, and I think we need to do more than just
> add comments: the various byte values should have #defines so that you
> can grep for code usages. I'll see what I can do with it.

I cleaned up the comments in pg_wchar.h some more, added #define
symbols for the LCPRVn marker codes, and committed it.

So far as I can see, the only LCPRVn marker code that is actually in
use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
that I can find.

I also read in the xemacs internals doc, at
http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
that XEmacs thinks the marker code for private single-byte charsets
is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
moreover they think 0x9a-0x9d are potential future official multibyte
charset codes. I don't know how we got to the current state of using
0x9a-0x9d as private charset markers, but it seems pretty inconsistent
with XEmacs.

Since only 0x9d could possibly be on-disk anywhere at the moment (unless
I'm missing something), I think we would be well advised to redefine our
marker codes thus:

LCPRV1 0x9e (only) (matches XEmacs spec)
LCPRV2 0x9d (only) (doesn't match XEmacs, but too late to change)

This would simplify and speed up the IS_LCPRVn macros, simplify the
conversions that Alexander is worried about, and get us out from under
the risk that XEmacs will assign their next three official multibyte
encoding IDs. We're still in trouble if they ever get to 0x9d, but
since that's the last code they have, I bet they will be in no hurry
to use it up.
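
To make the proposal concrete, here is a rough sketch of how the tests
would change. The SK_ names are hypothetical; the "range" forms follow the
0x9a-0x9d usage described in this thread, and the single-value forms are
only what the proposal above would look like, not committed code.

```c
/* Current usage as described in this thread: two marker codes per class. */
#define SK_IS_LCPRV1_RANGE(c) \
    ((unsigned char) (c) == 0x9a || (unsigned char) (c) == 0x9b)
#define SK_IS_LCPRV2_RANGE(c) \
    ((unsigned char) (c) == 0x9c || (unsigned char) (c) == 0x9d)

/* Proposed: exactly one marker code per class. */
#define SK_LCPRV1 0x9e      /* would match the XEmacs spec */
#define SK_LCPRV2 0x9d      /* kept for compatibility with on-disk data */

#define SK_IS_LCPRV1(c) ((unsigned char) (c) == SK_LCPRV1)
#define SK_IS_LCPRV2(c) ((unsigned char) (c) == SK_LCPRV2)
```

With a single marker per class, the wchar->mb direction would no longer
need to guess which prefix byte to emit.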

regards, tom lane

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc:	ishii(at)postgresql(dot)org, robertmhaas(at)gmail(dot)com, aekorotkov(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-04 05:03:28
Message-ID:	20120704.140328.1690321442965749275.t-ishii@sraoss.co.jp
Lists:	pgsql-hackers

> So far as I can see, the only LCPRVn marker code that is actually in
> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
> that I can find.
>
> I also read in the xemacs internals doc, at
> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
> that XEmacs thinks the marker code for private single-byte charsets
> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
> moreover they think 0x9a-0x9d are potential future official multibyte
> charset codes. I don't know how we got to the current state of using
> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
> with XEmacs.

At the time when the mule internal code was introduced to PostgreSQL,
xemacs did not have multi-encoding capability, and mule (a patch to
emacs) was the only implementation that supported multiple encodings. So I
used the specification of mule documented at the URL I gave.

> Since only 0x9d could possibly be on-disk anywhere at the moment (unless
> I'm missing something), I think we would be well advised to redefine our
> marker codes thus:
>
> LCPRV1 0x9e (only) (matches XEmacs spec)
> LCPRV2 0x9d (only) (doesn't match XEmacs, but too late to change)
>
> This would simplify and speed up the IS_LCPRVn macros, simplify the
> conversions that Alexander is worried about, and get us out from under
> the risk that XEmacs will assign their next three official multibyte
> encoding IDs. We're still in trouble if they ever get to 0x9d, but
> since that's the last code they have, I bet they will be in no hurry
> to use it up.
>
> regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc:	Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-04 21:13:29
Message-ID:	CA+TgmoawNLPP1HPHnd9VRqYF9F+ss2sFEvR1FQoqLzXjFUE-kQ@mail.gmail.com
Lists:	pgsql-hackers

On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
> [ new patch ]

With the improved comments in pg_wchar.h, it seemed clear what needed
to be done here, so I fixed up the MULE conversion and committed this.
I'd appreciate it if someone would check my work, but I think it's
right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	robertmhaas(at)gmail(dot)com
Cc:	aekorotkov(at)gmail(dot)com, ishii(at)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-05 01:04:53
Message-ID:	20120705.100453.1445303370455181867.t-ishii@sraoss.co.jp
Lists:	pgsql-hackers

> On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>> [ new patch ]
>
> With the improved comments in pg_wchar.h, it seemed clear what needed
> to be done here, so I fixed up the MULE conversion and committed this.
> I'd appreciate it if someone would check my work, but I think it's
> right.

Your commit looks good to me.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-05 23:11:59
Message-ID:	3773.1341529919@sss.pgh.pa.us
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>> [ new patch ]

> With the improved comments in pg_wchar.h, it seemed clear what needed
> to be done here, so I fixed up the MULE conversion and committed this.
> I'd appreciate it if someone would check my work, but I think it's
> right.

Hm, several of these routines seem to neglect to advance the "from"
pointer?

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc:	robertmhaas(at)gmail(dot)com, aekorotkov(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-05 23:15:38
Message-ID:	3895.1341530138@sss.pgh.pa.us
Lists:	pgsql-hackers

Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
>> So far as I can see, the only LCPRVn marker code that is actually in
>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>> that I can find.
>>
>> I also read in the xemacs internals doc, at
>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>> that XEmacs thinks the marker code for private single-byte charsets
>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>> moreover they think 0x9a-0x9d are potential future official multibyte
>> charset codes. I don't know how we got to the current state of using
>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>> with XEmacs.

> At the time when the mule internal code was introduced to PostgreSQL,
> xemacs did not have multi-encoding capability, and mule (a patch to
> emacs) was the only implementation that supported multiple encodings. So I
> used the specification of mule documented at the URL I gave.

I see. Given that upstream has decided that a simpler definition is
more appropriate, is there any reason not to follow their lead, to the
extent that we can do so without breaking existing on-disk data?

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-06 00:43:24
Message-ID:	CA+TgmoaDtJbhMgNKB8rLGP147g0DVYyEJw6ajpCsAsW3TR14bQ@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Jul 5, 2012 at 7:11 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov(at)gmail(dot)com> wrote:
>>> [ new patch ]
>
>> With the improved comments in pg_wchar.h, it seemed clear what needed
>> to be done here, so I fixed up the MULE conversion and committed this.
>> I'd appreciate it if someone would check my work, but I think it's
>> right.
>
> Hm, several of these routines seem to neglect to advance the "from"
> pointer?

Err... yeah. That's not a bug I introduced, but I should have caught
it... and it does make me wonder how well this code was tested.

Does the attached look like an appropriate fix?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment	Content-Type	Size
wchar-advance-from.patch	application/octet-stream	720 bytes

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-06 00:46:35
Message-ID:	6759.1341535595@sss.pgh.pa.us
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Jul 5, 2012 at 7:11 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Hm, several of these routines seem to neglect to advance the "from"
>> pointer?

> Err... yeah. That's not a bug I introduced, but I should have caught
> it... and it does make me wonder how well this code was tested.

> Does the attached look like an appropriate fix?

I'd be inclined to put the from++ and len-- at the bottom of each loop,
and in that order every time, just for consistency and obviousness.
But yeah, that's basically what's needed.
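
For illustration, a loop in that shape might look like the following
minimal sketch. The type sk_wchar and the function name are hypothetical
stand-ins, modeled on a single-byte conversion; this is not the patch
itself.

```c
#include <stdint.h>

typedef uint32_t sk_wchar;      /* hypothetical stand-in for pg_wchar */

/*
 * Convert up to len wide characters to single-byte characters.
 * "to" must have room for len + 1 bytes (including the terminator).
 * Returns the number of bytes written, not counting the terminator.
 */
static int
sk_wchar2single_with_len(const sk_wchar *from, unsigned char *to, int len)
{
    int cnt = 0;

    while (len > 0 && *from)
    {
        *to++ = (unsigned char) *from;
        cnt++;
        /* advance the input at the bottom of the loop, always in this order */
        from++;
        len--;
    }
    *to = 0;
    return cnt;
}
```

Without the from++ the loop would write the first input character len
times, which is the symptom the missing increments would produce.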

regards, tom lane

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc:	robertmhaas(at)gmail(dot)com, aekorotkov(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-06 00:52:02
Message-ID:	20120706.095202.1949717139594072856.t-ishii@sraoss.co.jp
Lists:	pgsql-hackers

> Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
>>> So far as I can see, the only LCPRVn marker code that is actually in
>>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>>> that I can find.
>>>
>>> I also read in the xemacs internals doc, at
>>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>>> that XEmacs thinks the marker code for private single-byte charsets
>>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>>> moreover they think 0x9a-0x9d are potential future official multibyte
>>> charset codes. I don't know how we got to the current state of using
>>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>>> with XEmacs.
>
>> At the time when the mule internal code was introduced to PostgreSQL,
>> xemacs did not have multi-encoding capability, and mule (a patch to
>> emacs) was the only implementation that supported multiple encodings. So I
>> used the specification of mule documented at the URL I gave.
>
> I see. Given that upstream has decided that a simpler definition is
> more appropriate, is there any reason not to follow their lead, to the
> extent that we can do so without breaking existing on-disk data?

Please let me spend the weekend understanding their latest spec.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Tatsuo Ishii <ishii(at)postgresql(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-06 03:49:14
Message-ID:	CA+TgmoagjtHvbhVUhTLAyVbLGkvJNAmeByP+pDJjF_oc5fxSDw@mail.gmail.com
Lists:	pgsql-hackers

On Thu, Jul 5, 2012 at 8:46 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Thu, Jul 5, 2012 at 7:11 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Hm, several of these routines seem to neglect to advance the "from"
>>> pointer?
>
>> Err... yeah. That's not a bug I introduced, but I should have caught
>> it... and it does make me wonder how well this code was tested.
>
>> Does the attached look like an appropriate fix?
>
> I'd be inclined to put the from++ and len-- at the bottom of each loop,
> and in that order every time, just for consistency and obviousness.
> But yeah, that's basically what's needed.

OK, I've committed a slightly tweaked version of that patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	tgl(at)sss(dot)pgh(dot)pa(dot)us, robertmhaas(at)gmail(dot)com, aekorotkov(at)gmail(dot)com
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-08 02:10:57
Message-ID:	20120708.111057.2187928410302833000.t-ishii@sraoss.co.jp
Lists:	pgsql-hackers

>> Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
>>>> So far as I can see, the only LCPRVn marker code that is actually in
>>>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>>>> that I can find.
>>>>
>>>> I also read in the xemacs internals doc, at
>>>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>>>> that XEmacs thinks the marker code for private single-byte charsets
>>>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>>>> moreover they think 0x9a-0x9d are potential future official multibyte
>>>> charset codes. I don't know how we got to the current state of using
>>>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>>>> with XEmacs.
>>
>>> At the time when the mule internal code was introduced to PostgreSQL,
>>> xemacs did not have multi-encoding capability, and mule (a patch to
>>> emacs) was the only implementation that supported multiple encodings. So I
>>> used the specification of mule documented at the URL I gave.
>>
>> I see. Given that upstream has decided that a simpler definition is
>> more appropriate, is there any reason not to follow their lead, to the
>> extent that we can do so without breaking existing on-disk data?
>
> Please let me spend the weekend understanding their latest spec.

This is an intermediate report on the internal multi-byte charset
implementation of the emacsen. I have read the link Tom showed. I also made
a quick scan of the xemacs-21.4.0 source code, especially
xemacs-21.4.0/src/mule-charset.h. It seems the web document is
essentially a copy of the comments in that file. I also looked into
other places in the xemacs code, and I think I can conclude that xemacs
21.4's multi-byte implementation is based on the doc on the web.

Next I looked into the emacs 24.1 source code, because I could not find any
doc regarding emacs's (not xemacs's) implementation of the internal
multi-byte charset. I found the following in src/charset.h:

/* Leading-code followed by extended leading-code. DIMENSION/COLUMN */
#define EMACS_MULE_LEADING_CODE_PRIVATE_11 0x9A /* 1/1 */
#define EMACS_MULE_LEADING_CODE_PRIVATE_12 0x9B /* 1/2 */
#define EMACS_MULE_LEADING_CODE_PRIVATE_21 0x9C /* 2/2 */
#define EMACS_MULE_LEADING_CODE_PRIVATE_22 0x9D /* 2/2 */

And these are used like this:

/* Read one non-ASCII character from INSTREAM. The character is
   encoded in `emacs-mule' and the first byte is already read in
   C. */

static int
read_emacs_mule_char (int c, int (*readbyte) (int, Lisp_Object), Lisp_Object readcharfun)
{
  :
  :
  else if (len == 3)
    {
      if (buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_11
          || buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_12)
        {
          charset = CHARSET_FROM_ID (emacs_mule_charset[buf[1]]);
          code = buf[2] & 0x7F;
        }

As far as I can tell, this is exactly how PostgreSQL
handles single-byte private character sets: they consist of 3 bytes, and
the leading byte is either 0x9a or 0x9b. Other examples regarding single
byte/multi-byte private charsets can be seen in coding.c.

As far as I can tell, emacs and xemacs employ different
implementations of multi-byte charsets regarding "private"
charsets. Emacs's is the same as PostgreSQL's, while xemacs's is not. I am
asking the original Mule author whether he knows anything about
this.

BTW, while looking into emacs's source code, I found that their charset
definitions are in lisp/international/mule-conf.el. According to the
file, several new charsets have been added. Attached is a patch to
follow their changes. It makes no changes to current behavior, since
the patch just changes some comments and unsupported charsets.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

Attachment	Content-Type	Size
pg_wchar.h.patch	text/x-patch	2.1 KB

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	tgl(at)sss(dot)pgh(dot)pa(dot)us, robertmhaas(at)gmail(dot)com, aekorotkov(at)gmail(dot)com
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-09 04:15:46
Message-ID:	20120709.131546.2272132227508407100.t-ishii@sraoss.co.jp
Lists:	pgsql-hackers

>>> Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
>>>>> So far as I can see, the only LCPRVn marker code that is actually in
>>>>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>>>>> that I can find.
>>>>>
>>>>> I also read in the xemacs internals doc, at
>>>>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>>>>> that XEmacs thinks the marker code for private single-byte charsets
>>>>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>>>>> moreover they think 0x9a-0x9d are potential future official multibyte
>>>>> charset codes. I don't know how we got to the current state of using
>>>>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>>>>> with XEmacs.
>>>
>>>> At the time when the mule internal code was introduced to PostgreSQL,
>>>> xemacs did not have multi-encoding capability, and mule (a patch to
>>>> emacs) was the only implementation that supported multiple encodings. So I
>>>> used the specification of mule documented at the URL I gave.
>>>
>>> I see. Given that upstream has decided that a simpler definition is
>>> more appropriate, is there any reason not to follow their lead, to the
>>> extent that we can do so without breaking existing on-disk data?
>>
>> Please let me spend the weekend understanding their latest spec.
>
> This is an intermediate report on the internal multi-byte charset
> implementation of the emacsen. I have read the link Tom showed. I also made
> a quick scan of the xemacs-21.4.0 source code, especially
> xemacs-21.4.0/src/mule-charset.h. It seems the web document is
> essentially a copy of the comments in that file. I also looked into
> other places in the xemacs code, and I think I can conclude that xemacs
> 21.4's multi-byte implementation is based on the doc on the web.
>
> Next I looked into the emacs 24.1 source code, because I could not find any
> doc regarding emacs's (not xemacs's) implementation of the internal
> multi-byte charset. I found the following in src/charset.h:
>
> /* Leading-code followed by extended leading-code. DIMENSION/COLUMN */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_11 0x9A /* 1/1 */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_12 0x9B /* 1/2 */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_21 0x9C /* 2/2 */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_22 0x9D /* 2/2 */
>
> And these are used like this:
>
> /* Read one non-ASCII character from INSTREAM. The character is
>    encoded in `emacs-mule' and the first byte is already read in
>    C. */
>
> static int
> read_emacs_mule_char (int c, int (*readbyte) (int, Lisp_Object), Lisp_Object readcharfun)
> {
>   :
>   :
>   else if (len == 3)
>     {
>       if (buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_11
>           || buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_12)
>         {
>           charset = CHARSET_FROM_ID (emacs_mule_charset[buf[1]]);
>           code = buf[2] & 0x7F;
>         }
>
> As far as I can tell, this is exactly how PostgreSQL
> handles single-byte private character sets: they consist of 3 bytes, and
> the leading byte is either 0x9a or 0x9b. Other examples regarding single
> byte/multi-byte private charsets can be seen in coding.c.
>
> As far as I can tell, emacs and xemacs employ different
> implementations of multi-byte charsets regarding "private"
> charsets. Emacs's is the same as PostgreSQL's, while xemacs's is not. I am
> asking the original Mule author whether he knows anything about
> this.

I got a reply from the Mule author, Kenichi Handa (the mail is in
Japanese, so I do not quote it here; if somebody wants to read
the original mail, please let me know). First of all, my understanding
of emacs's implementation is correct according to him. He did not
know about xemacs's implementation. Apparently the implementation in
xemacs was not led by the original mule author.

So which one of emacs/xemacs should we follow? My suggestion is not
to follow xemacs, and to leave the current treatment of private
leading bytes as it is, because emacs seems to be the more "right" upstream
compared with xemacs.

> BTW, while looking into emacs's source code, I found that their charset
> definitions are in lisp/international/mule-conf.el. According to the
> file, several new charsets have been added. Attached is a patch to
> follow their changes. It makes no changes to current behavior, since
> the patch just changes some comments and unsupported charsets.

If there's no objection, I would like to commit this. Objection?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	ishii(at)postgresql(dot)org
Cc:	tgl(at)sss(dot)pgh(dot)pa(dot)us, robertmhaas(at)gmail(dot)com, aekorotkov(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-10 23:23:26
Message-ID:	20120711.082326.1199398009192084540.t-ishii@sraoss.co.jp
Lists:	pgsql-hackers

>>>> Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
>>>>>> So far as I can see, the only LCPRVn marker code that is actually in
>>>>>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>>>>>> that I can find.
>>>>>>
>>>>>> I also read in the xemacs internals doc, at
>>>>>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>>>>>> that XEmacs thinks the marker code for private single-byte charsets
>>>>>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>>>>>> moreover they think 0x9a-0x9d are potential future official multibyte
>>>>>> charset codes. I don't know how we got to the current state of using
>>>>>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>>>>>> with XEmacs.
>>>>
>>>>> At the time when mule internal code was introduced to PostgreSQL,
>>>>> xemacs did not have multi encoding capability and mule (a patch to
>>>>> emacs) was the only implementation allowed to use multi encoding. So I
>>>>> used the specification of mule documented in the URL I wrote.
>>>>
>>>> I see. Given that upstream has decided that a simpler definition is
>>>> more appropriate, is there any reason not to follow their lead, to the
>>>> extent that we can do so without breaking existing on-disk data?
>>>
>>> Please let me spend the weekend understanding their latest spec.
>>
>> This is an intermediate report on the internal multi-byte charset
>> implementation of emacen. I have read the link Tom showed. I also made
>> a quick scan of the xemacs-21.4.0 source code, especially
>> xemacs-21.4.0/src/mule-charset.h. It seems the web document is
>> essentially a copy of the comments in the file. I also looked into
>> other places in the xemacs code, and I think I can conclude that xemacs
>> 21.4's multi-byte implementation is based on the doc on the web.
>>
>> Next I looked into the emacs 24.1 source code because I could not find any
>> doc regarding emacs's (not xemacs's) implementation of internal
>> multi-byte charsets. I found the following in src/charset.h:
>>
>> /* Leading-code followed by extended leading-code. DIMENSION/COLUMN */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_11 0x9A /* 1/1 */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_12 0x9B /* 1/2 */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_21 0x9C /* 2/2 */
>> #define EMACS_MULE_LEADING_CODE_PRIVATE_22 0x9D /* 2/2 */
>>
>> And these are used like this:
>>
>> /* Read one non-ASCII character from INSTREAM. The character is
>>    encoded in `emacs-mule' and the first byte is already read in
>>    C. */
>>
>> static int
>> read_emacs_mule_char (int c, int (*readbyte) (int, Lisp_Object), Lisp_Object readcharfun)
>> {
>>   :
>>   :
>>   else if (len == 3)
>>     {
>>       if (buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_11
>>           || buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_12)
>>         {
>>           charset = CHARSET_FROM_ID (emacs_mule_charset[buf[1]]);
>>           code = buf[2] & 0x7F;
>>         }
>>
>> As far as I can tell, this is exactly how PostgreSQL handles
>> single-byte private character sets: they consist of 3 bytes, and the
>> leading byte is either 0x9a or 0x9b. Other examples regarding
>> single-byte/multi-byte private charsets can be seen in coding.c.
>>
>> As far as I can tell, emacs and xemacs employ different
>> implementations of multi-byte charsets regarding "private"
>> charsets. Emacs's is the same as PostgreSQL's, while xemacs's is not.
>> I am contacting the original Mule author to ask whether he knows
>> anything about this.
>
> I got a reply from the Mule author, Kenichi Handa (the mail is in
> Japanese, so I do not quote it here; if somebody wants to read the
> original mail, please let me know). First of all, my understanding of
> emacs's implementation is correct according to him. He did not know
> about xemacs's implementation. Apparently the implementation of xemacs
> was not led by the original Mule author.
>
> So which one of emacs/xemacs should we follow? My suggestion is not to
> follow xemacs, and to leave the current treatment of private leading
> bytes as it is, because emacs seems to be the more "right" upstream
> compared with xemacs.
>
>> BTW, while looking into emacs's source code, I found that their charset
>> definitions are in lisp/international/mule-conf.el. According to the
>> file, several new charsets have been added. Included is a patch to
>> follow their changes. It makes no changes to current behavior, since
>> the patch just changes some comments and unsupported charsets.
>
> If there are no objections, I would like to commit this. Any objections?

Done, along with a comment that we follow emacs's implementation, not
xemacs's.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
Cc:	robertmhaas(at)gmail(dot)com, aekorotkov(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-11 05:07:11
Message-ID:	15039.1341983231@sss.pgh.pa.us
Lists:	pgsql-hackers

Tatsuo Ishii <ishii(at)postgresql(dot)org> writes:
> Done, along with a comment that we follow emacs's implementation, not
> xemacs's.

Well, when the preceding comment block contains five references to
xemacs and the link for more information leads to www.xemacs.org,
I don't think it's real helpful to add one sentence saying "oh
by the way we're not actually following xemacs".

I continue to think that we'd be better off to follow the xemacs
spec, as the subdivisions the emacs spec is insisting on seem like
entirely useless complication. The only possible reason for doing
it the emacs way is that it would provide room for twice as many
charset IDs ... but the present design for wchar conversion destroys
that advantage, because it requires the charset ID spaces to be
nonoverlapping anyhow. Moreover, it's not apparent to me that
charset standards are still proliferating, so I doubt that we need
any more ID space.
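
To make the overlap argument concrete, here is a rough sketch. It
assumes, purely for illustration, a wide-character packing that drops
the marker byte and keeps only the charset ID byte and the code byte of
a 3-byte private sequence. The function names and the packing layout are
hypothetical and should not be read as the actual conversion code.

```c
#include <stdint.h>

/*
 * Hypothetical packing (an assumption for this sketch): ignore the marker
 * byte seq[0], put the charset ID seq[1] in bits 16-23 and the code byte
 * seq[2] in bits 0-7.
 */
static uint32_t
pack_private_single(const unsigned char seq[3])
{
    return ((uint32_t) seq[1] << 16) | (uint32_t) seq[2];
}

/*
 * Returns 1 when the same charset ID and code byte pack to the same wide
 * character under marker 0x9a and under marker 0x9b.
 */
static int
markers_collide(unsigned char charset_id, unsigned char code)
{
    const unsigned char a[3] = {0x9a, charset_id, code};
    const unsigned char b[3] = {0x9b, charset_id, code};

    return pack_private_single(a) == pack_private_single(b);
}
```

Under this assumed packing, markers_collide() returns 1 for any charset
ID, so the two markers cannot each carry an independent ID space: the
IDs have to be kept nonoverlapping for the wide characters to stay
distinct.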

regards, tom lane

From:	Tatsuo Ishii <ishii(at)postgresql(dot)org>
To:	tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc:	ishii(at)postgresql(dot)org, robertmhaas(at)gmail(dot)com, aekorotkov(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Patch: add conversion from pg_wchar to multibyte
Date:	2012-07-11 05:23:26
Message-ID:	20120711.142326.875632511077408958.t-ishii@sraoss.co.jp
Lists:	pgsql-hackers

> Well, when the preceding comment block contains five references to
> xemacs and the link for more information leads to www.xemacs.org,
> I don't think it's real helpful to add one sentence saying "oh
> by the way we're not actually following xemacs".
>
> I continue to think that we'd be better off to follow the xemacs
> spec, as the subdivisions the emacs spec is insisting on seem like
> entirely useless complication. The only possible reason for doing
> it the emacs way is that it would provide room for twice as many
> charset IDs ... but the present design for wchar conversion destroys
> that advantage, because it requires the charset ID spaces to be
> nonoverlapping anyhow. Moreover, it's not apparent to me that
> charset standards are still proliferating, so I doubt that we need
> any more ID space.

Well, we have been following the emacs spec, not the xemacs spec, from
day 0. I don't see any value in switching to the xemacs way at this
moment, because I think the reason we support a particular encoding is
to keep supporting existing user data, not to "enhance" our internal
architecture.

If you like xemacs's spec, I think you'd better add a new encoding
rather than break data compatibility.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp