Re: patch: utf8_to_unicode (trivial)

From: Joseph Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: patch: utf8_to_unicode (trivial)
Date: 2010-08-13 07:12:44
Message-ID: AANLkTin2x3OaKFZXNpMR+Z3WBDA_3d5QNp_dRYF4JzOJ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 27, 2010 at 1:31 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Jul 24, 2010 at 10:34 PM, Joseph Adams
> <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
>> In src/include/mb/pg_wchar.h , there is a function unicode_to_utf8 ,
>> but no corresponding utf8_to_unicode .  However, there is a static
>> function called utf2ucs that does what utf8_to_unicode would do.
>>
>> I'd like this function to be available because the JSON code needs to
>> convert UTF-8 to and from Unicode codepoints, and I'm currently using
>> a separate UTF-8 to codepoint function for that.
>>
>> This patch renames utf2ucs to utf8_to_unicode and makes it public.  It
>> also fixes the version of utf2ucs in  src/bin/psql/mbprint.c so that
>> it's equivalent to the one in wchar.c .
>>
>> This is a patch against CVS HEAD for application.  It compiles and
>> tests successfully.
>>
>> Comments?  Thanks,
>
> I feel obliged to respond to this since I'm supposed to be covering your
> GSoC project while Magnus is on vacation, but I actually know very
> little about this topic.  What's undeniable, however, is that the
> two versions of utf2ucs() in the tree right now don't match.
> src/backend/utils/mb/wchar.c has:
>
>        else if ((*c & 0xf8) == 0xf0)
>
> while src/bin/psql/mbprint.c, which is otherwise identical, has:
>
>        else if ((*c & 0xf0) == 0xf0)
>
> I'm inclined to believe that your patch is right to think that the
> former version is correct, because it used to match the latter version
> until Tom Lane changed it in 2007, and I suspect he simply failed to
> update both copies.  But I'd like someone who actually understands
> what this code is doing to confirm that.
>
> http://archives.postgresql.org/pgsql-committers/2007-01/msg00293.php
>
> I suspect we need to not only fix this, but back-patch it at least to
> 8.2, which is the first release where there are two copies of this
> function.  I am not sure whether earlier releases need to be changed,
> or not.  But again, someone who understands the issues better than I
> do needs to weigh in here.
>
> In terms of making this function non-static, I'm inclined to think
> that a better approach would be to move it to src/port.  That gets rid
> of the need to have two copies in the first place.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise Postgres Company
>
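
To make the mask mismatch quoted above concrete, here is a minimal
sketch (not code from either patch; the two function names are made up
for illustration). It isolates the two tests so they can be compared on
the same lead byte. A 4-byte UTF-8 lead byte has the bit pattern
11110xxx, so the 0xf8 mask checks all five fixed bits, while the 0xf0
mask checks only the top four and therefore also accepts 0xF8 through
0xFF, which never start a 4-byte sequence.

```c
/* Hypothetical helpers, named here only for illustration. */

/* Test as in src/backend/utils/mb/wchar.c: lead byte must be 11110xxx. */
int
is_4byte_lead_strict(unsigned char b)
{
	return (b & 0xf8) == 0xf0;
}

/* Test as in src/bin/psql/mbprint.c: only checks 1111xxxx. */
int
is_4byte_lead_loose(unsigned char b)
{
	return (b & 0xf0) == 0xf0;
}
```

Both tests agree on 0xF0 through 0xF7; they differ only on 0xF8 through
0xFF, which the looser test would treat as the start of a 4-byte
sequence.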

I've attached another patch that moves utf8_to_unicode to src/port per
Robert Haas's suggestion.

This patch is not quite as elegant as the first one because it puts
platform-independent code that "belongs" in wchar.c into src/port.
It also uses unsigned int instead of pg_wchar because, if I'm not
mistaken, the pg_wchar typedef isn't available to the frontend.
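
For illustration only, a decoder with an unsigned int return type could
look roughly like the sketch below. This is not the attached patch: the
name utf8_to_unicode_sketch and the INVALID_CODEPOINT macro are
assumptions made up for this example, and the 4-byte branch uses the
0xf8 mask from wchar.c. Like the existing function, it assumes the
caller passes a complete sequence and does not validate continuation
bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sentinel for a byte that cannot start a 1-4 byte sequence. */
#define INVALID_CODEPOINT 0xffffffffu

/*
 * Decode the UTF-8 sequence starting at c into a Unicode code point.
 * The caller must guarantee that all bytes of the sequence are readable.
 */
unsigned int
utf8_to_unicode_sketch(const unsigned char *c)
{
	assert(c != NULL);

	if ((c[0] & 0x80) == 0)			/* 0xxxxxxx: one byte (ASCII) */
		return (unsigned int) c[0];
	else if ((c[0] & 0xe0) == 0xc0)		/* 110xxxxx: two bytes */
		return (unsigned int) (((c[0] & 0x1f) << 6) |
							   (c[1] & 0x3f));
	else if ((c[0] & 0xf0) == 0xe0)		/* 1110xxxx: three bytes */
		return (unsigned int) (((c[0] & 0x0f) << 12) |
							   ((c[1] & 0x3f) << 6) |
							   (c[2] & 0x3f));
	else if ((c[0] & 0xf8) == 0xf0)		/* 11110xxx: four bytes */
		return (unsigned int) (((c[0] & 0x07) << 18) |
							   ((c[1] & 0x3f) << 12) |
							   ((c[2] & 0x3f) << 6) |
							   (c[3] & 0x3f));
	else
		return INVALID_CODEPOINT;	/* not a valid lead byte */
}
```

Returning a plain unsigned int keeps the function usable from frontend
code without pulling in the pg_wchar typedef, at the cost of a slightly
less descriptive signature.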

I'm not sure whether I like the old patch better or the new one.

Joey Adams

Attachment: utf8-to-unicode-port.patch (application/octet-stream, 5.4 KB)
