Lossy character conversion to Latin-1

Lists: pgsql-general
From: John DeSoi <desoi(at)pgedit(dot)com>
To: pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Lossy character conversion to Latin-1
Date: 2006-05-30 15:19:42
Message-ID: 6E891C11-EC36-49B8-9533-A82412A814E1@pgedit.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

I have a client that only supports Latin-1 and needs to connect to a
UTF-8 database to retrieve some data. Some columns may contain
characters that have no Latin-1 equivalent. I would like to convert
these to a blank or perhaps some hex value. Is there any way to do
this in PostgreSQL without using anything other than built in
functions or pl/pgsql? It would be nice if the built in convert
function had an option to handle this rather than only generating an
error. Any pointers to an existing pl/pgsql function to perform this
conversion?

Thanks,

John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL


From: Nis Jorgensen <nis(at)superlativ(dot)dk>
To: John DeSoi <desoi(at)pgedit(dot)com>
Cc: pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Lossy character conversion to Latin-1
Date: 2006-05-31 09:17:48
Message-ID: 447D5F3C.7050801@superlativ.dk
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

John DeSoi wrote:
> I have a client that only supports Latin-1 and needs to connect to a
> UTF-8 database to retrieve some data. Some columns may contain
> characters that have no Latin-1 equivalent. I would like to convert
> these to a blank or perhaps some hex value. Is there any way to do this
> in PostgreSQL without using anything other than built in functions or
> pl/pgsql? It would be nice if the built in convert function had an
> option to handle this rather than only generating an error. Any pointers
> to an existing pl/pgsql function to perform this conversion?

You should be able to do this with regular expressions (substituting all
invalid chars) on the text columns of the result set(s). I write "should
be", since I don't remember the capabilities of pg regular expressions.

/Nis


From: John DeSoi <desoi(at)pgedit(dot)com>
To: Nis Jorgensen <nis(at)superlativ(dot)dk>
Cc: pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Lossy character conversion to Latin-1
Date: 2006-05-31 12:51:32
Message-ID: 70102AF4-E70A-4CE1-AE35-CCFC0BF61F12@pgedit.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general


On May 31, 2006, at 5:17 AM, Nis Jorgensen wrote:

> You should be able to do this with regular expressions
> (substituting all
> invalid chars) on the text columns of the result set(s). I write
> "should be", since I don't remember the capabilities of pg regular
> expressions.

I thought about this, but I don't see how to make it work to
transform every character. The substring function accepts POSIX
regular expressions, but the replace function does not. Maybe I can
put something together using substring and position.

Thanks,

John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL


From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: desoi(at)pgedit(dot)com
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Lossy character conversion to Latin-1
Date: 2006-05-31 13:26:42
Message-ID: 20060531.222642.06944432.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> I have a client that only supports Latin-1 and needs to connect to a
> UTF-8 database to retrieve some data. Some columns may contain
> characters that have no Latin-1 equivalent. I would like to convert
> these to a blank or perhaps some hex value. Is there any way to do
> this in PostgreSQL without using anything other than built in
> functions or pl/pgsql? It would be nice if the built in convert
> function had an option to handle this rather than only generating an
> error. Any pointers to an existing pl/pgsql function to perform this
> conversion?

It should be easy to write user defined funtion and define your own
CONVERSION. CREATE CONVERSION is your friend.
--
Tatsuo Ishii
SRA OSS, Inc. Japan


From: John DeSoi <desoi(at)pgedit(dot)com>
To: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Lossy character conversion to Latin-1
Date: 2006-05-31 14:35:08
Message-ID: 0091B8A8-F6BB-4C48-84CC-3F7764FB2D7A@pgedit.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general


On May 31, 2006, at 9:26 AM, Tatsuo Ishii wrote:

> It should be easy to write user defined funtion and define your own
> CONVERSION. CREATE CONVERSION is your friend.

It looks like CREATE CONVERSION requires a C function to do the
conversion. This will be used in a hosted environment -- I won't be
able to add any C language functions. Am I misunderstanding the
documentation?

John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL


From: John DeSoi <desoi(at)pgedit(dot)com>
To: Nis Jorgensen <nis(at)superlativ(dot)dk>
Cc: pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Lossy character conversion to Latin-1
Date: 2006-05-31 14:40:10
Message-ID: 19935E64-13C0-4F17-B2F9-580DC7DFCB82@pgedit.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general


On May 31, 2006, at 9:17 AM, Nis Jorgensen wrote:

> regexp_replace seems to do what you need:
>
> http://www.postgresql.org/docs/8.1/static/functions-matching.html
>
> Something like
>
> regexp_replace (field, '[^\u0000-\u00FF]', '?', 'g')

Yes! Thanks very much -- I looked at that page several times and
missed regexp_replace.

Here is an example:

=== psql 3 ===
select regexp_replace('©«¡®£§¼½¾¿ ÀÉÌÓÙÝ àéìóùý
āŹźŻżŽž ∧∨ wxyz', '[^\\u0000-\\u00FF]', '?', 'g');
regexp_replace
------------------------------------------
©«¡®£§¼½¾¿ ÀÉÌÓÙÝ àéìóùý ??????? ?? wxyz
(1 row)

John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL


From: John DeSoi <desoi(at)pgedit(dot)com>
To: Nis Jorgensen <nis(at)superlativ(dot)dk>
Cc: pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Lossy character conversion to Latin-1
Date: 2006-05-31 15:24:40
Message-ID: 3386F7FB-7100-4314-9808-562C7F630636@pgedit.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general


On May 31, 2006, at 10:40 AM, John DeSoi wrote:

> Yes! Thanks very much -- I looked at that page several times and
> missed regexp_replace.

Ok, now I know why I missed it. regexp_replace is only in PostgreSQL
8.1 and later. I'm stuck with 8.0 for hosting at the moment.

I'm sure it is not very efficient, but the plpgsql function below
does the same job for PostgreSQL versions prior to 8.1.

Thanks for the help,

John DeSoi, Ph.D.
http://pgedit.com/
Power Tools for PostgreSQL

create or replace function lossy_latin(p_text text)
returns text as $$
declare
t text;
res text := '';
ch text := substring(p_text from '[^\\u0000-\\u00FF]');
pos integer;
begin
if ch is null then
return p_text;
else
t := p_text;
loop
pos := strpos(t, ch);
res := res || substr(t, 1, pos - 1) || '?';
t := substr(t, pos + 1);
ch := substring(t from '[^\\u0000-\\u00FF]');
if ch is null then
res := res || t;
exit;
end if;
end loop;
return res;
end if;
end;
$$ language plpgsql immutable;


From: Tatsuo Ishii <ishii(at)sraoss(dot)co(dot)jp>
To: desoi(at)pgedit(dot)com
Cc: ishii(at)sraoss(dot)co(dot)jp, pgsql-general(at)postgresql(dot)org
Subject: Re: Lossy character conversion to Latin-1
Date: 2006-05-31 22:17:25
Message-ID: 20060601.071725.74748760.t-ishii@sraoss.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> > It should be easy to write user defined funtion and define your own
> > CONVERSION. CREATE CONVERSION is your friend.
>
> It looks like CREATE CONVERSION requires a C function to do the
> conversion. This will be used in a hosted environment -- I won't be
> able to add any C language functions. Am I misunderstanding the
> documentation?

You cannot write your own C function? Too bad. Maybe now is the time
to change the hosted environment...
--
Tatsuo Ishii
SRA OSS, Inc. Japan


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: John DeSoi <desoi(at)pgedit(dot)com>
Cc: Nis Jorgensen <nis(at)superlativ(dot)dk>, pgsql general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Lossy character conversion to Latin-1
Date: 2006-05-31 22:59:04
Message-ID: 447E1FB8.3060604@commandprompt.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

John DeSoi wrote:
>
> On May 31, 2006, at 10:40 AM, John DeSoi wrote:
>
>> Yes! Thanks very much -- I looked at that page several times and
>> missed regexp_replace.
>
>
> Ok, now I know why I missed it. regexp_replace is only in PostgreSQL 8.1
> and later. I'm stuck with 8.0 for hosting at the moment.
>
> I'm sure it is not very efficient, but the plpgsql function below does
> the same job for PostgreSQL versions prior to 8.1.
>

Command Prompt supports 8.1 for PostgreSQL if you need a new environment.

Sincerely,

Joshua D. Drake

> Thanks for the help,
>
>
> John DeSoi, Ph.D.
> http://pgedit.com/
> Power Tools for PostgreSQL
>
>
>
> create or replace function lossy_latin(p_text text)
> returns text as $$
> declare
> t text;
> res text := '';
> ch text := substring(p_text from '[^\\u0000-\\u00FF]');
> pos integer;
> begin
> if ch is null then
> return p_text;
> else
> t := p_text;
> loop
> pos := strpos(t, ch);
> res := res || substr(t, 1, pos - 1) || '?';
> t := substr(t, pos + 1);
> ch := substring(t from '[^\\u0000-\\u00FF]');
> if ch is null then
> res := res || t;
> exit;
> end if;
> end loop;
> return res;
> end if;
> end;
> $$ language plpgsql immutable;
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
> http://archives.postgresql.org
>

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/