Lists: | pgsql-bugs |
---|
From: | "Sergey Burladyan" <eshkinkot(at)gmail(dot)com> |
---|---|
To: | pgsql-bugs(at)postgresql(dot)org |
Subject: | BUG #4622: xpath only work in utf-8 server encoding |
Date: | 2009-01-22 13:39:00 |
Message-ID: | 200901221339.n0MDd0dE033542@wwwmaster.postgresql.org |
The following bug has been logged online:
Bug reference: 4622
Logged by: Sergey Burladyan
Email address: eshkinkot(at)gmail(dot)com
PostgreSQL version: 8.3.5
Operating system: Debian testing
Description: xpath only work in utf-8 server encoding
Details:
Hello, all!
I am trying to test parsing an XML string in an encoding other than UTF-8. It
loads correctly, but xpath(text, xml) can't handle it:
seb@seb:~/tmp/pg$ echo $LANG
ru_RU.CP1251
seb@seb:~/tmp/pg$ /usr/lib/postgresql/8.3/bin/postgres -p 5433 -k s -s -D .
LOG: система была отключена: 2009-01-22 16:30:07 MSK
LOG: autovacuum launcher started
LOG: database system is ready to accept connections
seb@seb:~$ echo $LANG
ru_RU.CP1251
seb@seb:~$ psql -h localhost -p 5433
Welcome to psql 8.3.5, the PostgreSQL interactive terminal.
Type: \copyright for distribution terms
\h for help with SQL commands
\? for help with psql commands
\g or terminate with semicolon to execute query
\q to quit
seb=# select * from (select
xml('<русский>язык</русский>')) as x(v);
v
-------------------------
<русский>язык</русский>
(1 запись)
seb=# select xpath('/русский/text()', v::xml) from (select
xml('<русский>язык</русский>')) as x(v);
ERROR: could not parse XML data
DETAIL: Entity: line 1: parser error : Input is not proper UTF-8, indicate
encoding !
Bytes: 0xF0 0xF3 0xF1 0xF1
<x><русский>язык</русский></x>
^
seb=# select name, setting from pg_settings where name like 'lc_%' or name
like '%enco%';
name | setting
-----------------+--------------
client_encoding | WIN1251
lc_collate | ru_RU.CP1251
lc_ctype | ru_RU.CP1251
lc_messages | ru_RU.CP1251
lc_monetary | ru_RU.CP1251
lc_numeric | ru_RU.CP1251
lc_time | ru_RU.CP1251
server_encoding | WIN1251
(8 rows)
With a UTF-8 server encoding it works correctly:
seb=> select xpath('/русский/text()', v::xml) from (select
xml('<русский>язык</русский>')) as x(v);
xpath
--------
{язык}
(1 запись)
seb=> select name, setting from pg_settings where name like 'lc_%' or name
like '%enco%';
name | setting
-----------------+-------------
client_encoding | UTF8
lc_collate | ru_RU.UTF-8
lc_ctype | ru_RU.UTF-8
lc_messages | ru_RU.UTF-8
lc_monetary | ru_RU.UTF-8
lc_numeric | ru_RU.UTF-8
lc_time | ru_RU.UTF-8
server_encoding | UTF8
(8 rows)
I think something is wrong here: the string is parsed correctly by xml(text),
but its result can't be passed to the xpath function.
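The bytes in the error message can be checked independently of PostgreSQL. The short Python sketch below is for illustration only and is not part of the original report. It suggests that 0xF0 0xF3 0xF1 0xF1 is simply "русс" encoded in Windows-1251, and that a strict UTF-8 decoder rejects the same bytes, which is consistent with the libxml2 complaint.

```python
# Illustration only: reproduce the byte sequence from the error message.
tag = "русский"

# Encode the tag name the way a WIN1251 server would store it.
cp1251_bytes = tag.encode("cp1251")
first_four = list(cp1251_bytes[:4])
print([hex(b) for b in first_four])

# 0xF0 announces a four-byte UTF-8 sequence, but 0xF3 is not a
# continuation byte, so strict UTF-8 decoding fails.
try:
    cp1251_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)
```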
From: | Peter Eisentraut <peter_e(at)gmx(dot)net> |
---|---|
To: | pgsql-bugs(at)postgresql(dot)org |
Cc: | "Sergey Burladyan" <eshkinkot(at)gmail(dot)com> |
Subject: | Re: BUG #4622: xpath only work in utf-8 server encoding |
Date: | 2009-01-22 21:58:49 |
Message-ID: | 200901222358.50489.peter_e@gmx.net |
On Thursday 22 January 2009 15:39:00 Sergey Burladyan wrote:
> seb=# select xpath('/русский/text()', v::xml) from (select
> xml('<русский>язык</русский>')) as x(v);
> ERROR: could not parse XML data
> DETAIL: Entity: line 1: parser error : Input is not proper UTF-8, indicate
> encoding !
> Bytes: 0xF0 0xF3 0xF1 0xF1
> <x><русский>язык</русский></x>
> ^
This raises the question: What are the rules about encoding the characters in
XPath expressions themselves? I haven't found anything about that in the
standard. Anyone know?
From: | eshkinkot <eshkinkot(at)gmail(dot)com> |
---|---|
To: | Peter Eisentraut <peter_e(at)gmx(dot)net> |
Cc: | pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: BUG #4622: xpath only work in utf-8 server encoding |
Date: | 2009-02-08 05:42:03 |
Message-ID: | 9ea8622b0902072142u76c86c30q8b433182e8cb0800@mail.gmail.com |
On 23 January 2009 at 0:58, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On Thursday 22 January 2009 15:39:00 Sergey Burladyan wrote:
>> seb=# select xpath('/русский/text()', v::xml) from (select
>> xml('<русский>язык</русский>')) as x(v);
>> ERROR: could not parse XML data
>> DETAIL: Entity: line 1: parser error : Input is not proper UTF-8, indicate
>> encoding !
>> Bytes: 0xF0 0xF3 0xF1 0xF1
>> <x><русский>язык</русский></x>
>> ^
> This raises the question: What are the rules about encoding the characters in
> XPath expressions themselves? I haven't found anything about that in the
> standard. Anyone know?
PostgreSQL does not use libxml2's internal encoding support and strips the
XML encoding declaration from the XML body, so I think there is no choice:
by default, libxml2 needs the data in its internal encoding, UTF-8, anyway.
I am not sure about the XML standard, but maybe the libxml2 documentation
can help to resolve this issue? See http://xmlsoft.org/encoding.html
"What does this mean in practice for the libxml2 user:
* xmlChar, the libxml2 data type is a byte, those bytes must be
assembled as UTF-8 valid strings. The proper way to terminate an
xmlChar * string is simply to append 0 byte, as usual.
* One just need to make sure that when using chars outside the ASCII
set, the values has been properly converted to UTF-8"
I understand this as: all xmlChar strings must be in UTF-8 encoding,
no matter what the encoding of the XML body is.
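Read this way, every string handed to libxml2 would need to be converted from the server encoding to UTF-8 first, including the XPath expression itself. The Python sketch below only illustrates that idea. It is not the attached patch, and `to_utf8` is a hypothetical helper name, not a PostgreSQL or libxml2 function.

```python
def to_utf8(data: bytes, server_encoding: str) -> bytes:
    """Re-encode bytes from the server encoding to UTF-8 (hypothetical helper)."""
    return data.decode(server_encoding).encode("utf-8")

# The document and the XPath expression as a WIN1251 server would hold them.
doc = "<русский>язык</русский>".encode("cp1251")
path = "/русский/text()".encode("cp1251")

# Convert both before they would be passed on as xmlChar strings.
doc_utf8 = to_utf8(doc, "cp1251")
path_utf8 = to_utf8(path, "cp1251")

# After conversion, both strings decode cleanly as UTF-8.
print(doc_utf8.decode("utf-8"))
print(path_utf8.decode("utf-8"))
```

The point of the sketch is that converting only the document is not enough: the XPath expression contains the same non-ASCII element name and needs the same treatment.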
I tried to fix this issue for the xpath function; see the attached patch.
By the way, contrib/xml2 also has this issue.
Attachment | Content-Type | Size |
---|---|---|
fix-xpath-encoding.patch | text/x-diff | 5.4 KB |