Re: Remove diacritical marks in SQL

Lists: pgsql-general
From: Jiří Němec <konference(at)menea(dot)cz>
To: pgsql-general(at)postgresql(dot)org
Subject: Remove diacritical marks in SQL
Date: 2007-01-12 21:16:22
Message-ID: 16651036.20070112221622@menea.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Hello,

I would like to remove diacritical marks from a string in a SQL query.
I tried to convert a UTF8 string to ASCII but it doesn't work for me.

SELECT convert('ěščřžýáíé','UTF8','SQL_ASCII')

array(1) {
["convert"]=>
string(18) "ěščřžýáíé"
}

Thanks for any advice,

J.N.


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Ji?í N?mec <konference(at)menea(dot)cz>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Remove diacritical marks in SQL
Date: 2007-01-12 21:58:36
Message-ID: 20070112215836.GA20949@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Fri, Jan 12, 2007 at 10:16:22PM +0100, Ji?í N?mec wrote:
> Hello,
>
> I would like to remove diacritical marks from a string in a SQL query.
> I tried to convert a UTF8 string to ASCII but it doesn't work for me.
>
> SELECT convert('?????ýáíé','UTF8','SQL_ASCII')

I don't think postgres has any stuff builtin for that, but other
languages (like perl) have modules to do this kind of thing. The method
is to decompose the string to normal form D, strip the diacritics, and
recompose what's left.

Hope this helps,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Michael Fuhr <mike(at)fuhr(dot)org>
To: Jiří Němec <konference(at)menea(dot)cz>, pgsql-general(at)postgresql(dot)org
Subject: Re: Remove diacritical marks in SQL
Date: 2007-01-14 16:27:21
Message-ID: 20070114162721.GA3943@winnie.fuhr.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Fri, Jan 12, 2007 at 10:58:36PM +0100, Martijn van Oosterhout wrote:
> On Fri, Jan 12, 2007 at 10:16:22PM +0100, Jiří Němec wrote:
> > I would like to remove diacritical marks from a string in a SQL query.
> > I tried to convert a UTF8 string to ASCII but it doesn't work for me.
> >
> > SELECT convert('ěščřžýáíé','UTF8','SQL_ASCII')
>
> I don't think postgres has any stuff builtin for that, but other
> languages (like perl) have modules to do this kind of thing. The method
> is to decompose the string to normal form D, strip the diacritics, and
> recompose what's left.

A technique that's been posted before might work:

SELECT to_ascii(convert('ěščřžýáíé', 'LATIN2'), 'LATIN2');
to_ascii
-----------
escrzyaie
(1 row)

to_ascii() supports only LATIN1, LATIN2, LATIN9, and WIN1250 so you
have to convert to one of those encodings first.

As Martijn suggested, you could use Perl. Here's an example with
Text::Unaccent (you'll need to use encoding names that iconv
recognizes):

CREATE FUNCTION unaccent(charset text, string text) RETURNS text AS $$
use Text::Unaccent;
return unac_string($_[0], $_[1]);
$$ LANGUAGE plperlu IMMUTABLE STRICT;

SELECT unaccent('UTF-8', 'ěščřžýáíé');
unaccent
-----------
escrzyaie
(1 row)

Here's an example that uses Unicode::Normalize to strip non-spacing
and enclosing marks:

CREATE FUNCTION unaccent(string text) RETURNS text AS $$
use Unicode::Normalize;
my $nfd_string = NFD($_[0]);
$nfd_string =~ s/[\p{Mn}\p{Me}]//g;
return NFC($nfd_string);
$$ LANGUAGE plperlu IMMUTABLE STRICT;

SELECT unaccent('ěščřžýáíé');
unaccent
-----------
escrzyaie
(1 row)

--
Michael Fuhr