8.3 to 8.4 Upgrade issues

Lists:	pgsql-hackers

From:	Rod Taylor <rod(dot)taylor(at)gmail(dot)com>
To:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	8.3 to 8.4 Upgrade issues
Date:	2010-08-10 17:21:45
Message-ID:	AANLkTikKZgWdEx3iAvc7y7A3Udph51rRM0KcMgxRSdkY@mail.gmail.com

We recently upgraded from 8.3 to 8.4 and have seen a performance
degradation that we are trying to explain. I have been asked to
get a second opinion on the cost of going from LATIN1 to UTF8
(Collation and CType) while the encoding remained SQL_ASCII.

Does anybody have experience on the cost, if any, of making this change?

Pg 8.3:
Encoding: SQL_ASCII
LC_COLLATE: en_US
LC_CTYPE: en_US

Pg 8.4:
Encoding: SQL_ASCII
Collation: en_US.UTF-8
Ctype: en_US.UTF-8

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Rod Taylor <rod(dot)taylor(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.3 to 8.4 Upgrade issues
Date:	2010-08-10 17:49:41
Message-ID:	3310.1281462581@sss.pgh.pa.us

Rod Taylor <rod(dot)taylor(at)gmail(dot)com> writes:
> Does anybody have experience on the cost, if any, of making this change?

> Pg 8.3:
> Encoding: SQL_ASCII
> LC_COLLATE: en_US
> LC_CTYPE: en_US

> Pg 8.4:
> Encoding: SQL_ASCII
> Collation: en_US.UTF-8
> Ctype: en_US.UTF-8

Well, *both* of those settings collections are fundamentally
wrong/bogus; any collation/ctype setting other than "C" is unsafe if
you've got encoding set to SQL_ASCII. But without knowing what your
platform thinks "en_US" means, it's difficult to speculate about what
the difference between them is. I suppose that your libc's default
assumption about encoding is not UTF-8, else these would be equivalent.
If it had been assuming a single-byte encoding, then telling it UTF8
instead could lead to a significant slowdown in strcoll() speed ...
but I would think that would mainly be a problem if you had a lot of
non-ASCII data, and if you did, you'd be having a lot of problems other
than just performance. Have you noticed any change in sorting behavior?

regards, tom lane
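
[Editor's sketch, not part of the original message.] One rough way to check whether strcoll() cost differs between locales on a given platform is to time the same sort under each one. The Python below is only an illustration under several assumptions: the locale names "en_US.ISO-8859-1" and "en_US.UTF-8" are guesses that may not exist on your system (missing locales are reported as None), the helper names sort_under_locale and time_sort are hypothetical, and Python compares decoded strings rather than the raw bytes a SQL_ASCII database would hand to libc, so the numbers are indicative at best.

```python
import functools
import locale
import time


def sort_under_locale(words, locale_name):
    """Sort words with strcoll() under locale_name.

    Returns None when the locale is not available on this platform.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None  # locale not installed here
    try:
        return sorted(words, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # always restore


def time_sort(words, locale_name):
    """Return seconds spent sorting under locale_name, or None if unavailable."""
    start = time.perf_counter()
    result = sort_under_locale(words, locale_name)
    elapsed = time.perf_counter() - start
    return None if result is None else elapsed


if __name__ == "__main__":
    # Pure 7-bit ASCII sample, in reverse order so the sort has work to do.
    sample = [f"item{i:06d}" for i in range(50000, 0, -1)]
    # The non-"C" names below are assumptions; adjust to what `locale -a` lists.
    for name in ("C", "en_US.ISO-8859-1", "en_US.UTF-8"):
        print(name, time_sort(sample, name))
```

Comparing the "C" timing against the others gives a first impression of how much the collation rules themselves cost, independent of PostgreSQL.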

From:	Rod Taylor <rod(dot)taylor(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.3 to 8.4 Upgrade issues
Date:	2010-08-10 23:36:13
Message-ID:	AANLkTi=fAKad5XUcj_UYjnu0noCTQ4DNTB8kqfRvvgtF@mail.gmail.com

On Tue, Aug 10, 2010 at 13:49, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Rod Taylor <rod(dot)taylor(at)gmail(dot)com> writes:
>> Does anybody have experience on the cost, if any, of making this change?
>
>> Pg 8.3:
>> Encoding: SQL_ASCII
>> LC_COLLATE: en_US
>> LC_CTYPE: en_US
>
>> Pg 8.4:
>> Encoding: SQL_ASCII
>> Collation: en_US.UTF-8
>> Ctype: en_US.UTF-8
>
> Well, *both* of those settings collections are fundamentally
> wrong/bogus; any collation/ctype setting other than "C" is unsafe if
> you've got encoding set to SQL_ASCII. But without knowing what your
> platform thinks "en_US" means, it's difficult to speculate about what
> the difference between them is. I suppose that your libc's default
> assumption about encoding is not UTF-8, else these would be equivalent.
> If it had been assuming a single-byte encoding, then telling it UTF8
> instead could lead to a significant slowdown in strcoll() speed ...
> but I would think that would mainly be a problem if you had a lot of
> non-ASCII data, and if you did, you'd be having a lot of problems other
> than just performance. Have you noticed any change in sorting behavior?

Agreed with it being an interesting choice of settings. Nearly all of
the data is 7-bit ASCII and what isn't seems to be a mix of UTF8,
LATIN1, and LATIN15.

I'm pretty sure it interpreted en_US to be LATIN1. There haven't been
any noticeable changes in sorting order that I know of.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Rod Taylor <rod(dot)taylor(at)gmail(dot)com>
Cc:	PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: 8.3 to 8.4 Upgrade issues
Date:	2010-08-10 23:56:26
Message-ID:	21669.1281484586@sss.pgh.pa.us

Rod Taylor <rod(dot)taylor(at)gmail(dot)com> writes:
> Agreed with it being an interesting choice of settings. Nearly all of
> the data is 7-bit ASCII and what isn't seems to be a mix of UTF8,
> LATIN1, and LATIN15.

> I'm pretty sure it interpreted en_US to be LATIN1. There haven't been
> any noticeable changes in sorting order that I know of.

Well, if you've got non-ASCII data that you know is not UTF8, then
setting a UTF8-dependent locale setting is a really really bad idea :-(.
You are risking not just bad performance but seriously bad misbehavior.
If you use a LATIN-n (or other single-byte-encoding) locale, the worst
that data in other encodings can do to you is sort into odd positions.
If you use a UTF8 locale and have data of other encodings, then
strcoll() can tell that you are violating the encoding spec, and on
many platforms it goes entirely berserk when you do that. glibc in
particular does not play nice with that. You didn't say what platform
this is, but if it's glibc based then you are sitting on a ticking time
bomb, and you had better dump and reinitialize in a safer locale setting
before your data gets eaten.

regards, tom lane
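
[Editor's sketch, not part of the original message.] Before choosing a safer locale it helps to know how much of the stored text is actually invalid as UTF-8. The Python below is a minimal illustration, assuming the column values have already been fetched as raw bytes; the function names classify and summarize are hypothetical. Note that "utf8" here only means the bytes decode cleanly: some LATIN1 byte sequences are also valid UTF-8, so this is a lower bound on the problem, not proof of the encoding.

```python
def classify(raw: bytes) -> str:
    """Label a byte string as 'ascii', 'utf8', or 'not-utf8'."""
    if raw.isascii():
        return "ascii"  # 7-bit data is safe under any of the locales discussed
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not-utf8"  # e.g. LATIN1 bytes; violates a UTF-8 locale's assumptions
    return "utf8"


def summarize(values):
    """Count how many values fall into each class."""
    counts = {"ascii": 0, "utf8": 0, "not-utf8": 0}
    for value in values:
        counts[classify(value)] += 1
    return counts


if __name__ == "__main__":
    rows = [b"plain", "caf\u00e9".encode("utf-8"), "caf\u00e9".encode("latin-1")]
    print(summarize(rows))
```

Any nonzero "not-utf8" count means the data contains byte sequences that a UTF-8 locale's strcoll() would see as encoding violations.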