Multilingual application, ORDER BY w/ different locales?

Lists: pgsql-hackers
From: Palle Girgensohn <girgen(at)partitur(dot)se>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Multilingual application, ORDER BY w/ different locales?
Date: 2001-11-17 16:50:38
Message-ID: 68010000.1006015838@palle.girgensohn.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi!

We are using postgres as RDBMS in an web application that is translated to
about a dozen different languages. Some users get puzzled about the sorting
order in lists, since we have to choose only one locale for all ORDER BY
queries. I am dreaming of a SET LC_COLLATE or simliar command that will
only affect my session, not all other users.

I know this is not implemented in postgres. How impossible is it to add
this feature, and what implications would pg suffer? All discussions
regarding locale problems in postgres are about LIKE indexing. For us,
collating is more important. Can we help?

/Palle


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Palle Girgensohn <girgen(at)partitur(dot)se>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multilingual application, ORDER BY w/ different locales?
Date: 2001-11-17 18:39:36
Message-ID: 22114.1006022376@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Palle Girgensohn <girgen(at)partitur(dot)se> writes:
> I am dreaming of a SET LC_COLLATE or simliar command that will
> only affect my session, not all other users.
> I know this is not implemented in postgres. How impossible is it to add
> this feature, and what implications would pg suffer?

Actually, what the SQL spec suggests is that LOCALE be attached to
individual table columns. A SET command to cause LOCALE to change
on the fly within a session is quite impractical: that would mean
that the sort ordering of existing columns changes, which would mean
that any indexes on those columns are broken.

Per-column LOCALE is on the to-do list. In my mind the main difficulty
with it is that the standard C library doesn't really support concurrent
use of multiple locales: it's built around the assumption that you set
your locale once at program startup. setlocale() is, typically, not
a fast operation. To get around this it seems we'd need to write our
own set of locale library routines, which is a daunting amount of work.

I think the last time this came up, someone mentioned that there's an
open BSD-license locale library being worked on, which possibly we could
adapt instead of reinventing this wheel for ourselves. But I don't
recall more than that. Check the archives.

regards, tom lane


From: Palle Girgensohn <girgen(at)partitur(dot)se>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multilingual application, ORDER BY w/ different
Date: 2001-11-17 19:46:03
Message-ID: 11910000.1006026363@elbas.partitur.se
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

--On Saturday, November 17, 2001 13:39:36 -0500 Tom Lane
<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Palle Girgensohn <girgen(at)partitur(dot)se> writes:
>> I am dreaming of a SET LC_COLLATE or simliar command that will
>> only affect my session, not all other users.
>> I know this is not implemented in postgres. How impossible is it to add
>> this feature, and what implications would pg suffer?
>
> Actually, what the SQL spec suggests is that LOCALE be attached to
> individual table columns. A SET command to cause LOCALE to change
> on the fly within a session is quite impractical: that would mean
> that the sort ordering of existing columns changes, which would mean
> that any indexes on those columns are broken.

OK, indexes and sort ordering are coupled, and must be? In that case, I see
the problem.

> Per-column LOCALE is on the to-do list.

My need is really to get different sorting on *the same* column, depending
on which locale the present user prefers. Collation can be quite different
in Swedish, English, German or Frencn, for example. Our users can chose the
language they prefer from a list, and since it is a web app, all languages
are used simultaneously on the same system, and since we use a database
session pool, different langs can be preferred att different times in the
same database session. So, in this case there is no need for per-column
locale; we really need to be able to shift sorting order (ORDER BY only)
"on-the-fly". I guess this is not even supported by the SQL standard, or
any other RDBMS for that matter, right?

> In my mind the main difficulty
> with it is that the standard C library doesn't really support concurrent
> use of multiple locales: it's built around the assumption that you set
> your locale once at program startup. setlocale() is, typically, not
> a fast operation. To get around this it seems we'd need to write our
> own set of locale library routines, which is a daunting amount of work.
>
> I think the last time this came up, someone mentioned that there's an
> open BSD-license locale library being worked on, which possibly we could
> adapt instead of reinventing this wheel for ourselves. But I don't
> recall more than that. Check the archives.

Thanks, I will.

Cheers,
Palle


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Palle Girgensohn <girgen(at)partitur(dot)se>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multilingual application, ORDER BY w/ different locales?
Date: 2001-11-17 19:57:23
Message-ID: 723.1006027043@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Palle Girgensohn <girgen(at)partitur(dot)se> writes:
>> Actually, what the SQL spec suggests is that LOCALE be attached to
>> individual table columns. A SET command to cause LOCALE to change
>> on the fly within a session is quite impractical: that would mean
>> that the sort ordering of existing columns changes, which would mean
>> that any indexes on those columns are broken.

> OK, indexes and sort ordering are coupled, and must be?

Well, the sort ordering of any particular index has to be well-defined,
which means that there has to be a fixed locale associated with it.

> My need is really to get different sorting on *the same* column, depending
> on which locale the present user prefers.
> ... I guess this is not even supported by the SQL standard, or
> any other RDBMS for that matter, right?

I believe SQL regards the locale as essentially a property of a
datatype, which means that in theory you should be able to cast a column
value to type text-with-locale-X and then ORDER BY that. It'd be an
on-the-fly sort, not able to exploit any indexes, but it sounds like
that's acceptable to you.

Looking at the SQL92 spec, the name they actually give to this notion
is COLLATE, not locale, but it does look like you can label a string
expression with the collation type you want it to be sorted by.

regards, tom lane


From: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Palle Girgensohn <girgen(at)partitur(dot)se>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multilingual application, ORDER BY w/ different locales?
Date: 2001-11-17 21:13:15
Message-ID: 20011117131146.I31141-100000@megazone23.bigpanda.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Sat, 17 Nov 2001, Tom Lane wrote:

> Palle Girgensohn <girgen(at)partitur(dot)se> writes:
> > My need is really to get different sorting on *the same* column, depending
> > on which locale the present user prefers.
> > ... I guess this is not even supported by the SQL standard, or
> > any other RDBMS for that matter, right?
>
> I believe SQL regards the locale as essentially a property of a
> datatype, which means that in theory you should be able to cast a column
> value to type text-with-locale-X and then ORDER BY that. It'd be an
> on-the-fly sort, not able to exploit any indexes, but it sounds like
> that's acceptable to you.

Would it be possible to make a function in plpgsql or whatever that
wrapped the collate changes and then order by that and make functional
indexes? Would the system use it?


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
Cc: Palle Girgensohn <girgen(at)partitur(dot)se>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multilingual application, ORDER BY w/ different locales?
Date: 2001-11-17 21:21:58
Message-ID: 979.1006032118@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com> writes:
> Would it be possible to make a function in plpgsql or whatever that
> wrapped the collate changes and then order by that and make functional
> indexes? Would the system use it?

IIRC, we were debating whether we should consider collation to be an
attribute of the datatype (think typmod) or an attribute of individual
values (think field added to values of textual types). In the former
case, a function like this would only work if we allowed its result to
be declared as having the right collate attribute. Which is not
impossible, but we don't currently associate any typmod with function
arguments or results, and so I'm not sure how painful it would be.
With the field-in-data-value approach it's easy to see how it would
work. But another byte or word per text value might be a high price
to pay ...

regards, tom lane


From: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Palle Girgensohn <girgen(at)partitur(dot)se>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Multilingual application, ORDER BY w/ different locales?
Date: 2001-11-18 01:04:29
Message-ID: 20011117162318.U32516-100000@megazone23.bigpanda.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On Sat, 17 Nov 2001, Tom Lane wrote:

> Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com> writes:
> > Would it be possible to make a function in plpgsql or whatever that
> > wrapped the collate changes and then order by that and make functional
> > indexes? Would the system use it?
>
> IIRC, we were debating whether we should consider collation to be an
> attribute of the datatype (think typmod) or an attribute of individual
> values (think field added to values of textual types). In the former
> case, a function like this would only work if we allowed its result to
> be declared as having the right collate attribute. Which is not
> impossible, but we don't currently associate any typmod with function
> arguments or results, and so I'm not sure how painful it would be.
> With the field-in-data-value approach it's easy to see how it would
> work. But another byte or word per text value might be a high price
> to pay ...

True. Although I wonder how things like substring would work in the
model with typmods if the collation isn't attached in any fashion to
the return values since I think the substring collation is supposed
to be the same as the input string's, whereas for something like
convert it's a different collation based on a parameter. I wonder if
as a temporary thing, you could use a function that did something
similar to strxfrm as long as you only used that for sorting purposes.


From: Hannu Krosing <hannu(at)sid(dot)tm(dot)ee>
To: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Palle Girgensohn <girgen(at)partitur(dot)se>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multilingual application, ORDER BY w/ different locales?
Date: 2001-11-18 06:56:46
Message-ID: 3BF75BAE.1070905@sid.tm.ee
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Stephan Szabo wrote:

>On Sat, 17 Nov 2001, Tom Lane wrote:
>
>>Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com> writes:
>>
>>>Would it be possible to make a function in plpgsql or whatever that
>>>wrapped the collate changes and then order by that and make functional
>>>indexes? Would the system use it?
>>>
>>IIRC, we were debating whether we should consider collation to be an
>>attribute of the datatype (think typmod) or an attribute of individual
>>values (think field added to values of textual types). In the former
>>case, a function like this would only work if we allowed its result to
>>be declared as having the right collate attribute. Which is not
>>impossible, but we don't currently associate any typmod with function
>>arguments or results, and so I'm not sure how painful it would be.
>>With the field-in-data-value approach it's easy to see how it would
>>work. But another byte or word per text value might be a high price
>>to pay ...
>>
>
>True. Although I wonder how things like substring would work in the
>model with typmods if the collation isn't attached in any fashion to
>the return values since I think the substring collation is supposed
>to be the same as the input string's, whereas for something like
>convert it's a different collation based on a parameter. I wonder if
>as a temporary thing, you could use a function that did something
>similar to strxfrm as long as you only used that for sorting purposes.
>
That would mean a new datatype that such function returns

CREATE FUNCTION text_with_collation(text,collation) RETURNS
text_with_collation

That would be sorted using the rules of that collation.

This can currently be added in contrib, but should eventually go into core.

The function itself is quite easy, but the collation is the part that
can either be done by
a) writing our own library

b) using system locale (i think that locale switching is slow in default
glibc , so the
following can be slow too
ORDER BY text_with_collation(t1,'et_EE'), text_with_collation(t1,'fr_CA')
but I doubt anybody uses it.

c) using a third party library - at least IBM has one which is almost as
big as whole postgreSQL ;)

assuming that one backend needs mostl one locale at a time, I think that
b) will be the easiest to
implement, but this will clash with current locale support if it is
compiled in so you have to be
rapidly swithcing LC_COLLATE between the default and that of the current
datum.

so what we actually need is a system that will _not_ use locale-aware
functions unless specifically
told to do so by feeding it with text_with_locale values.

---------------
Hannu

----------------
Hannu


From: Hannu Krosing <hannu(at)sid(dot)tm(dot)ee>
To: Palle Girgensohn <girgen(at)partitur(dot)se>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multilingual application, ORDER BY w/ different
Date: 2001-11-18 07:24:45
Message-ID: 3BF7623D.2030008@sid.tm.ee
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> --On Saturday, November 17, 2001 13:39:36 -0500 Tom Lane
> <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> In my mind the main difficulty
>> with it is that the standard C library doesn't really support concurrent
>> use of multiple locales: it's built around the assumption that you set
>> your locale once at program startup. setlocale() is, typically, not
>> a fast operation. To get around this it seems we'd need to write our
>> own set of locale library routines, which is a daunting amount of work.
>>
>> I think the last time this came up, someone mentioned that there's an
>> open BSD-license locale library being worked on, which possibly we could
>> adapt instead of reinventing this wheel for ourselves. But I don't
>> recall more than that. Check the archives.
>
I guess it must have been IBM's International Classes for Unicode at
http://oss.software.ibm.com/icu/

It is quite big:

Download

File Size Description
icu-1.8.1.zip
<http://oss.software.ibm.com/icu/download/1.8.1/icu-1.8.1.zip> 7.3 MB
ZIP file for Windows platforms
icu-1.8.1.tgz
<http://oss.software.ibm.com/icu/download/1.8.1/icu-1.8.1.tgz> 6.4 MB
gzipped tar archive for Unix and other platforms
icu-1.8.1-docs.zip
<http://oss.software.ibm.com/icu/download/1.8.1/icu-1.8.1-docs.zip>
1.1 MB ZIP file with the API documentation
icu-1.8.1-docs.tgz
<http://oss.software.ibm.com/icu/download/1.8.1/icu-1.8.1-docs.tgz>
0.9 MB gzipped tar archive with the API documentation

but I suspect that it would otherways be the easiest way to get a good
internationalisation support.

---------------
Hannu


From: Tatsuo Ishii <t-ishii(at)sra(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: sszabo(at)megazone23(dot)bigpanda(dot)com, girgen(at)partitur(dot)se, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Multilingual application, ORDER BY w/ different
Date: 2001-11-18 14:17:48
Message-ID: 20011118231748L.t-ishii@sra.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> IIRC, we were debating whether we should consider collation to be an
> attribute of the datatype (think typmod) or an attribute of individual
> values (think field added to values of textual types). In the former
> case, a function like this would only work if we allowed its result to
> be declared as having the right collate attribute. Which is not
> impossible, but we don't currently associate any typmod with function
> arguments or results, and so I'm not sure how painful it would be.
> With the field-in-data-value approach it's easy to see how it would
> work. But another byte or word per text value might be a high price
> to pay ...

I think the price is not so high. To give the collation info to text
data types, it's enough to store the info in the
pg_attribute. ie. only additional several bytes per column are
required, not per instance. Of course we would need to add some extra
bytes to the in-memory string data, it's just a temporary data anyway.
--
Tatsuo Ishii