Re: proposal: UTF8 to_ascii function

From: Jan Urbański <j(dot)urbanski(at)students(dot)mimuw(dot)edu(dot)pl>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>, PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: UTF8 to_ascii function
Date: 2008-08-11 14:13:15
Message-ID: 48A048FB.5030805@students.mimuw.edu.pl
Lists: pgsql-hackers

Andrew Dunstan wrote:
>
>
> Jan Urbański wrote:
>> Andrew Dunstan wrote:
>>>
>>>
>>> Pavel Stehule wrote:
>>> What you have not said is how you propose to convert UTF8 to ASCII.
>>>
>>> Currently to_ascii() converts a small number of single byte charsets
>>> to ASCII by folding the chars with high bits set, so what we get is a
>>> pure ASCII result which is safe in any server encoding, as they are
>>> all ASCII supersets.
>>>
>>> But what conversion rule will you use for the gazillions of Unicode
>>> characters?
>>>
>>> I honestly do not understand the use case for this at all.
>>
>> I do. Often clients want their searches to be
>> accented-or-language-specific letters insensitive. So searching for
>> 'łódź' returns 'lodz'. So the use case is there (in fact, the lack of
>> such facility made me consider not upgrading particular client to
>> 8.3...).
>> Or maybe there's a better way to do it?
>
> Well, my first question would be "Why aren't you using a database
> encoding that supports to_ascii()?"

Because I want UTF-8 in it ;) It's mostly LATIN2, but clients sometimes
input Cyrillic, Greek or Hebrew letters, and sometimes use Unicode
characters like (U+2026) HORIZONTAL ELLIPSIS.

I'd like to have
to_ascii(text, [error_handling]) returns text

So no bytea: to_ascii would accept text that's legal in my current
database encoding and return text in that encoding. And error_handling
would be something like:
- 'error' (the default, throw an error if a character is untranslatable
to ASCII)
- 'ignore' (omit untranslatable characters)
- 'transliterate' (do your best to transliterate the character, or leave
it as it is if impossible).

Examples would include (assuming UTF-8 database)
to_ascii('łódź') -> 'lodz'
to_ascii('china is written 中國') -> ERROR
to_ascii('china is written 中國', 'ignore') -> 'china is written '
to_ascii('china is written 中國', 'transliterate') -> 'china is written zhong guo' (in an ideal world)
to_ascii('china is written 中國', 'transliterate') -> 'china is written 中國' (in reality)

These would have the following properties:
to_ascii(X, 'ignore') is always pure ASCII data and never throws an error
to_ascii(X, 'transliterate') is sometimes non-ASCII data and never
throws an error
to_ascii(X) is always pure ASCII data when it returns, but sometimes
throws an error
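
For illustration only, the three error_handling modes could be sketched
in Python roughly as below. This is not a proposed implementation: it
assumes Unicode NFKD decomposition is an acceptable way to fold accents,
and the EXTRA_TRANSLIT table is a hypothetical, hand-written stand-in
for letters (such as 'ł') that Unicode does not decompose.

```python
import unicodedata

# Hypothetical hand-written table (an assumption of this sketch) for letters
# that have no Unicode decomposition, e.g. the Polish l with stroke.
EXTRA_TRANSLIT = {"ł": "l", "Ł": "L", "ø": "o", "Ø": "O", "đ": "d", "Đ": "D"}


def to_ascii(text, error_handling="error"):
    """Sketch of the proposed to_ascii(text, [error_handling])."""
    if error_handling not in ("error", "ignore", "transliterate"):
        raise ValueError("unknown error_handling: %r" % (error_handling,))
    out = []
    for ch in text:
        if ord(ch) < 128:
            # Already ASCII: copy through unchanged.
            out.append(ch)
            continue
        if ch in EXTRA_TRANSLIT:
            out.append(EXTRA_TRANSLIT[ch])
            continue
        # NFKD splits e.g. 'ó' into 'o' plus a combining acute accent,
        # and U+2026 HORIZONTAL ELLIPSIS into three full stops.
        decomposed = unicodedata.normalize("NFKD", ch)
        stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
        if stripped and all(ord(c) < 128 for c in stripped):
            out.append(stripped)
        elif error_handling == "error":
            raise ValueError("character %r is untranslatable to ASCII" % (ch,))
        elif error_handling == "transliterate":
            # Best effort failed: leave the character as it is.
            out.append(ch)
        # 'ignore': drop the character silently.
    return "".join(out)
```

With this sketch, to_ascii('łódź') gives 'lodz', and the 'ignore' and
'transliterate' calls on 'china is written 中國' give the "in reality"
results listed above. Anything closer to the "ideal world" result would
need a far larger mapping table than the one assumed here.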

It's something like PHP's iconv that can have //TRANSLIT or somesuch
(forgive me for giving PHP as an example...). Now I'd love to hear
people punch holes in my daydreaming design ;)

Cheers,
Jan

--
Jan Urbanski
GPG key ID: E583D7D2

ouden estin
