Re: git: uh-oh

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Max Bowsher <maxb(at)f2s(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Haggerty <mhagger(at)alum(dot)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: git: uh-oh
Date: 2010-08-25 11:15:53
Message-ID: AANLkTim+gfwcLXKZ5JkP-5sJVFkfWdVhePhC1=Fd5OK8@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 25, 2010 at 13:03, Max Bowsher <maxb(at)f2s(dot)com> wrote:
> On 25/08/10 09:18, Magnus Hagander wrote:
>> On Wed, Aug 25, 2010 at 07:11, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>
>>>> 2. Any non-ASCII characters in, for example, contributor's names show
>>>> up differently in the two repos.  Generally, the original repo is OK
>>>> and the new repo is garbled; although I found one very old example
>>>> that went the other way.
>>>
>>> What it looks like to me is that a Latin1->UTF8 conversion has been
>>> applied to the log text.  Which might be a good idea if it all *was*
>>> Latin1, but a fair-sized percentage isn't.  Applying this conversion to
>>> UTF8 entries results in garbage, of course.  Even if this could be done
>>> reliably, I think this counts as editorializing on the historical
>>> record, and should be switched off if possible.
>>
>> I think the problem is that we have a mix of them :( git requires it to be utf8.
>>
>> cvs2git is configured to try, in order, latin1, utf8 and ascii, and
>> use whichever first returns correct result. In this case it seems it
>> does return saying things are right, because the result is valid utf8
>> - just not the utf8 we expected.
>>
>> I can give it a try the other way around - trying utf8 *before*
>> latin1, to see if that makes it better - utf8 tends to be more strict.
>
> *Every* byte sequence is valid latin1, therefore if you try latin1,
> utf8, ascii in that order, latin1 will always be used.
>
> You most likely want utf8, latin1 (no point also including ascii since
> it's a strict subset of latin1).

Yup. I re-ran it with utf8, latin1, ascii and that commit looks better now.
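
To illustrate why the order matters, here is a minimal Python sketch of a "try each encoding in turn" decoder. It is not cvs2git's actual code; the function name decode_with_fallback and the sample name are made up for illustration.

```python
def decode_with_fallback(raw, encodings):
    """Return (text, encoding) for the first encoding that decodes raw cleanly.

    Hypothetical helper for illustration only; not taken from cvs2git.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no encoding in the list could decode the input")


# A made-up contributor name containing one non-ASCII character (e-acute).
name = "Jos\u00e9"
utf8_bytes = name.encode("utf-8")      # b'Jos\xc3\xa9'
latin1_bytes = name.encode("latin-1")  # b'Jos\xe9'

# latin1 first: every byte sequence decodes, so UTF-8 input comes out garbled.
garbled, used_garbled = decode_with_fallback(utf8_bytes, ["latin1", "utf8", "ascii"])

# utf8 first: UTF-8 input decodes correctly ...
fixed, used_fixed = decode_with_fallback(utf8_bytes, ["utf8", "latin1"])

# ... and genuine Latin-1 input is rejected by utf8, then handled by latin1.
legacy, used_legacy = decode_with_fallback(latin1_bytes, ["utf8", "latin1"])
```

With latin1 first, the two UTF-8 bytes of the accented character are read as two separate Latin-1 characters, which is the kind of garbling seen in the new repo. With utf8 first, a log message written in Latin-1 will usually fail the stricter utf8 check and fall through to latin1. That is not guaranteed, though: a Latin-1 message whose high bytes happen to form valid UTF-8 would still be misread.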

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/
