Re: Filtering dictionaries support and unaccent dictionary

Lists: pgsql-hackers
From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Filtering dictionaries support and unaccent dictionary
Date: 2009-07-14 19:12:28
Message-ID: Pine.LNX.4.64.0907142308110.8065@sn.sai.msu.ru

Hi there,

we'd like to introduce support for filtering dictionaries in text search,
along with a new contrib module, unaccent, which provides a useful example
of a filtering dictionary. It finally solves the known problem of
incorrect headline generation for text with accents.
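
For readers unfamiliar with the idea, here is a minimal sketch of how a filtering dictionary behaves in a dictionary chain. It is not the patch's code: the names unaccent_filter, simple_dict and lexize_chain are hypothetical, and accent stripping is approximated with Unicode decomposition rather than the module's rules file. The point is only that a filtering dictionary rewrites the token and hands it on to the next dictionary instead of ending the lookup.

```python
import unicodedata
from typing import Callable, List, Optional, Tuple

# A dictionary returns None for "token not recognised", or
# (lexeme, is_filter). These types are assumptions for illustration.
LexizeResult = Optional[Tuple[str, bool]]
Dictionary = Callable[[str], LexizeResult]


def unaccent_filter(token: str) -> LexizeResult:
    """Hypothetical filtering dictionary: strip combining accents."""
    decomposed = unicodedata.normalize("NFD", token)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    if stripped == token:
        return None            # nothing to change, let the next dictionary try
    return (stripped, True)    # is_filter=True: pass the new token onwards


def simple_dict(token: str) -> LexizeResult:
    """Hypothetical ordinary dictionary: lowercase and accept everything."""
    return (token.lower(), False)


def lexize_chain(token: str, dictionaries: List[Dictionary]) -> Optional[str]:
    """Run the token through the chain, honouring filtering dictionaries."""
    for dictionary in dictionaries:
        result = dictionary(token)
        if result is None:
            continue
        lexeme, is_filter = result
        if is_filter:
            token = lexeme     # replace the token and keep going
            continue
        return lexeme          # an ordinary dictionary ends the lookup
    return None
```

With the filter placed first, an accented and an unaccented spelling end up as the same lexeme; without it, the accent survives into the index.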

The module also provides unaccent() functions, which are simple
wrappers around the unaccent dictionary.

Regards,
Oleg

PS. I hope it's not too late for the July commitfest!

_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83

Attachment Content-Type Size
unaccent.gz application/octet-stream 6.5 KB

From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-07-15 11:19:33
Message-ID: 20090715111933.GA4551@alvh.no-ip.org

Oleg Bartunov wrote:

Hi,

> we'd like to introduce filtering dictionaries support for text search
> and new contrib module unaccent, which provides useful example of
> filtering dictionary. It finally solves the known problem of incorrect
> generation of headlines of text with accents.

I'm curious about the pg_regress change ... is it really necessary?

AFAICS the changes to the core code are very small; I wonder if you
should commit them separately, i.e. without the contrib module, and add
that one in another commit.

As for the contrib module, I think it could use a lot more function
header comments! Also, it would be great if it could be used separately
from tsearch, i.e. that it provided a function unaccent(text) returns
text that unaccented arbitrary strings (I guess it would use the default
tsconfig).

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-07-28 15:01:30
Message-ID: 4A6F12CA.2020704@sigaev.ru

> I'm curious about the pg_regress change ... is it really necessary?

Testing the unaccent dictionary requires inputting accented characters, and not
all encodings allow that. UTF8 does, but it isn't compatible with a lot of
locales. So --no-locale should be propagated to the CREATE DATABASE command, as
is already done for the encoding.

> AFAICS the changes to the core code are very small; I wonder if you
> should commit it separately i.e. without the contrib module, and add the
> that one in another commit.
I split the patch into two parts:
filter_dictionary-0.1.gz - core changes, including pg_regress changes
unaccent-0.5.gz - contrib module

Also, I added some comments to the code and made cosmetic changes to the docs.

> As for the contrib module, I think it could use a lot more function
> header comments! Also, it would be great if it could be used separately
> from tsearch, i.e. that it provided a function unaccent(text) returns
> text that unaccented arbitrary strings (I guess it would use the default
> tsconfig).
Umm? The module provides unaccent(text) and unaccent(regdictionary, text) functions.
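
A rough model of the two call forms, for illustration only. This is a Python sketch, not the module's code: the rules argument stands in for the regdictionary argument, and DEFAULT_RULES is a tiny hand-written table assumed here in place of the real rules file.

```python
from typing import Dict

# Tiny illustrative rule table (an assumption; the real dictionary loads
# its rules from a file shipped with the module).
DEFAULT_RULES: Dict[str, str] = {
    "\u00c0": "A",  # À
    "\u00e9": "e",  # é
    "\u00f4": "o",  # ô
    "\u00fc": "u",  # ü
}


def unaccent(text: str, rules: Dict[str, str] = DEFAULT_RULES) -> str:
    """Replace each character that has a rule; leave the rest untouched.

    Calling with one argument mimics unaccent(text); passing an explicit
    rules table mimics unaccent(regdictionary, text).
    """
    return "".join(rules.get(ch, ch) for ch in text)
```

Under these assumptions, the one-argument call uses the default table, and an empty table leaves the input unchanged.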

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

Attachment Content-Type Size
unaccent-0.5.gz application/x-tar 5.9 KB
filter_dictionary-0.1.gz application/x-tar 1.0 KB

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-07-29 12:38:56
Message-ID: 200907291538.56273.peter_e@gmx.net

On Tuesday 14 July 2009 22:12:28 Oleg Bartunov wrote:
> we'd like to introduce filtering dictionaries support for text search
> and new contrib module unaccent, which provides useful example of
> filtering dictionary. It finally solves the known problem of
> incorrect generation of headlines of text with accents.

What is the source of the unaccent rules, and how complete is the rule set?


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-07-29 13:57:13
Message-ID: Pine.LNX.4.64.0907291756270.8065@sn.sai.msu.ru

On Wed, 29 Jul 2009, Peter Eisentraut wrote:

> On Tuesday 14 July 2009 22:12:28 Oleg Bartunov wrote:
>> we'd like to introduce filtering dictionaries support for text search
>> and new contrib module unaccent, which provides useful example of
>> filtering dictionary. It finally solves the known problem of
>> incorrect generation of headlines of text with accents.
>
> What is the source of the unaccent rules, and how complete is the rule set?

Unicode tables from unicode.org. It'd be nice if someone checked the completeness.
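
One way to approach a completeness check is to derive candidate rules from the Unicode decomposition data and compare them with the shipped rule set. The sketch below is an assumption about how that could be done, not how the rules were actually generated: build_rules and its code point range are hypothetical, and it relies on Python's bundled copy of the Unicode Character Database.

```python
import unicodedata
from typing import Dict


def build_rules(first: int = 0x00C0, last: int = 0x024F) -> Dict[str, str]:
    """Map each character in [first, last] to its base letters.

    A character gets a rule only if its canonical decomposition contains
    combining marks that can be dropped.
    """
    rules: Dict[str, str] = {}
    for codepoint in range(first, last + 1):
        char = chr(codepoint)
        decomposed = unicodedata.normalize("NFD", char)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        if base and base != char:
            rules[char] = base
    return rules
```

Note that letters such as "ø" have no canonical decomposition, so a table built this way would miss them; that is the kind of gap a completeness review would need to look for.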

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-08-01 04:35:07
Message-ID: 20090801043507.GG11098@alvh.no-ip.org

Teodor Sigaev wrote:

> >As for the contrib module, I think it could use a lot more function
> >header comments! Also, it would be great if it could be used separately
> >from tsearch, i.e. that it provided a function unaccent(text) returns
> >text that unaccented arbitrary strings (I guess it would use the default
> >tsconfig).
> Umm? Module provides unaccent(text) and unaccent(regdictionary, text) functions.

Sorry, I failed to notice. Looks good.

Isn't that function leaking "res" pointer? Also, I'm curious why you're
allocating 2*sizeof(TSLexeme) in unaccent_lexize ...

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-08-04 23:04:13
Message-ID: 603c8f070908041604p47eb1ae3lbf7f7fc78d8dbcea@mail.gmail.com

On Sat, Aug 1, 2009 at 12:35 AM, Alvaro
Herrera<alvherre(at)commandprompt(dot)com> wrote:
> Teodor Sigaev wrote:
>
>> >As for the contrib module, I think it could use a lot more function
>> >header comments!  Also, it would be great if it could be used separately
>> >from tsearch, i.e. that it provided a function unaccent(text) returns
>> >text that unaccented arbitrary strings (I guess it would use the default
>> >tsconfig).
>> Umm? Module provides unaccent(text) and unaccent(regdictionary, text) functions.
>
> Sorry, I failed to notice.  Looks good.
>
> Isn't that function leaking "res" pointer?  Also, I'm curious why you're
> allocating 2*sizeof(TSLexeme) in unaccent_lexize ...

So are we waiting for an updated version of this patch?

...Robert


From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-08-06 14:27:10
Message-ID: 4A7AE83E.3070708@sigaev.ru

> Isn't that function leaking "res" pointer? Also, I'm curious why you're
Fixed.

> allocating 2*sizeof(TSLexeme) in unaccent_lexize ...
That's part of the dictionary interface: lexize returns an array of TSLexeme, and
the last structure should have its lexeme field set to NULL.
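
To make the "array plus terminator" convention concrete, here is a small model using Python's ctypes. It is a sketch only: the structure is a simplified stand-in for the C TSLexeme, and lexize_result and read_lexemes are hypothetical helpers. It shows why a single returned lexeme needs room for two entries: one for the result and one whose lexeme pointer is NULL.

```python
import ctypes
from typing import List


class TSLexeme(ctypes.Structure):
    """Simplified stand-in for the C structure (layout assumed here)."""
    _fields_ = [
        ("nvariant", ctypes.c_uint16),
        ("flags", ctypes.c_uint16),
        ("lexeme", ctypes.c_char_p),
    ]


def lexize_result(word: bytes):
    """Return a two-element array: the lexeme, then the terminator."""
    result = (TSLexeme * 2)()   # zero-initialised, so every lexeme is NULL
    result[0].lexeme = word     # entry 1 keeps lexeme == NULL as terminator
    return result


def read_lexemes(array) -> List[bytes]:
    """Walk the array the way a caller would: stop at the NULL lexeme."""
    lexemes = []
    index = 0
    while array[index].lexeme is not None:
        lexemes.append(array[index].lexeme)
        index += 1
    return lexemes
```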

The filter_dictionary file is unchanged; it's attached only for consistency.
--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

Attachment Content-Type Size
unaccent-0.6.gz application/x-tar 5.9 KB
filter_dictionary-0.1.gz application/x-tar 1.0 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-08-07 03:46:57
Message-ID: 603c8f070908062046g3211617t4974ebe43deeb94d@mail.gmail.com

2009/8/6 Teodor Sigaev <teodor(at)sigaev(dot)ru>:
>> Isn't that function leaking "res" pointer?  Also, I'm curious why you're
>
> fixed
>
>> allocating 2*sizeof(TSLexeme) in unaccent_lexize ...
>
> That's is a dictionary's interface part: lexize returns an array of TSLexeme
> and last structure should have lexeme field NULL.
>
>
> filter_dictionary file is not changed, it's attached only for consistency.

I am not sure whether this has been formally reviewed by anyone yet;
do we think it's "Ready for Committer"?

Thanks,

...Robert


From: Jaime Casanova <jcasanov(at)systemguards(dot)com(dot)ec>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-08-07 15:44:44
Message-ID: 3073cc9b0908070844l12156403ue046f770da65e512@mail.gmail.com

On Thu, Aug 6, 2009 at 10:46 PM, Robert Haas<robertmhaas(at)gmail(dot)com> wrote:
>
> I am not sure whether this has been formally reviewed by anyone yet;
> do we think it's "Ready for Committer"?
>

I was trying to review this, but beyond confirming that it compiles
fine and passes the regression tests, I don't know how to test it.

--
Sincerely,
Jaime Casanova
PostgreSQL support and training
Systems consulting and development
Guayaquil - Ecuador
Cel. +59387171157


From: Jaime Casanova <jcasanov(at)systemguards(dot)com(dot)ec>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-08-11 05:28:24
Message-ID: 3073cc9b0908102228g5178cf83r32bf1d9f56dab7c7@mail.gmail.com

On Fri, Aug 7, 2009 at 10:44 AM, Jaime
Casanova<jcasanov(at)systemguards(dot)com(dot)ec> wrote:
> On Thu, Aug 6, 2009 at 10:46 PM, Robert Haas<robertmhaas(at)gmail(dot)com> wrote:
>>
>> I am not sure whether this has been formally reviewed by anyone yet;
>> do we think it's "Ready for Committer"?
>>
>
> i was trying to make some review of this but besides that it compiles
> fine and passes regression tests doesn't know how to test it
>

I tried to build the docs to see how to properly test this, and it seems
you have to teach contrib.sgml and bookindex.sgml about
dict-unaccent... and when I did that I got this:

"""
openjade -wall -wno-unused-param -wno-empty -wfully-tagged -D . -c
/usr/share/sgml/docbook/stylesheet/dsssl/modular/catalog -d
stylesheet.dsl -t sgml -i output-html -V html-index postgres.sgml
openjade:dict-unaccent.sgml:48:1:E: non SGML character number 128
openjade:dict-unaccent.sgml:49:1:E: non SGML character number 129
openjade:dict-unaccent.sgml:50:1:E: non SGML character number 130
openjade:dict-unaccent.sgml:51:1:E: non SGML character number 131
openjade:dict-unaccent.sgml:52:1:E: non SGML character number 132
openjade:dict-unaccent.sgml:53:1:E: non SGML character number 133
openjade:dict-unaccent.sgml:54:1:E: non SGML character number 134
openjade:dict-unaccent.sgml:116:4:E: element "B" undefined
make: *** [HTML.index] Error 1
make: *** Deleting file `HTML.index'
"""
--
Sincerely,
Jaime Casanova
PostgreSQL support and training
Systems consulting and development
Guayaquil - Ecuador
Cel. +59387171157


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Jaime Casanova <jcasanov(at)systemguards(dot)com(dot)ec>, Robert Haas <robertmhaas(at)gmail(dot)com>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-08-11 08:31:30
Message-ID: 200908111131.30725.peter_e@gmx.net

On Tuesday 11 August 2009 08:28:24 Jaime Casanova wrote:
> try to build the docs to see how to properly test this and seems like
> you have to teach contrib.sgml and bookindex.sgml about
> dict-unaccent... and when i did that i got this:
>
> """
> openjade -wall -wno-unused-param -wno-empty -wfully-tagged -D . -c
> /usr/share/sgml/docbook/stylesheet/dsssl/modular/catalog -d
> stylesheet.dsl -t sgml -i output-html -V html-index postgres.sgml
> openjade:dict-unaccent.sgml:48:1:E: non SGML character number 128
> openjade:dict-unaccent.sgml:49:1:E: non SGML character number 129
> openjade:dict-unaccent.sgml:50:1:E: non SGML character number 130
> openjade:dict-unaccent.sgml:51:1:E: non SGML character number 131
> openjade:dict-unaccent.sgml:52:1:E: non SGML character number 132
> openjade:dict-unaccent.sgml:53:1:E: non SGML character number 133
> openjade:dict-unaccent.sgml:54:1:E: non SGML character number 134
> openjade:dict-unaccent.sgml:116:4:E: element "B" undefined
> make: *** [HTML.index] Error 1
> make: *** Deleting file `HTML.index'
> """

You should escape the special characters, as well as the <b> that appears as
part of the example output, using character entities (&amp; etc.).
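
As an illustration of that kind of escaping, here is a minimal Python sketch. The helper name sgml_escape is hypothetical, it uses the HTML4 entity names from Python's html.entities table, and the numeric fallback for characters without a named entity is an assumption that may not suit every SGML toolchain.

```python
from html.entities import codepoint2name


def sgml_escape(text: str) -> str:
    """Escape markup characters and non-ASCII characters as entities."""
    pieces = []
    for char in text:
        codepoint = ord(char)
        if char in "&<>" or codepoint > 127:
            name = codepoint2name.get(codepoint)
            # Prefer a named HTML4 entity; fall back to a numeric
            # character reference (assumed acceptable, check your tools).
            pieces.append(f"&{name};" if name else f"&#{codepoint};")
        else:
            pieces.append(char)
    return "".join(pieces)
```

Run over a line of example output, this turns a literal <b> into &lt;b&gt; and an accented letter into its named entity, which is the form the doc build accepts.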


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Jaime Casanova <jcasanov(at)systemguards(dot)com(dot)ec>, Teodor Sigaev <teodor(at)sigaev(dot)ru>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-08-13 14:08:08
Message-ID: 603c8f070908130708w6cc9574by107adf62e169d54f@mail.gmail.com

On Tue, Aug 11, 2009 at 4:31 AM, Peter Eisentraut<peter_e(at)gmx(dot)net> wrote:
> On Tuesday 11 August 2009 08:28:24 Jaime Casanova wrote:
>> try to build the docs to see how to properly test this and seems like
>> you have to teach contrib.sgml and bookindex.sgml about
>> dict-unaccent... and when i did that i got this:
>>
>> """
>> openjade  -wall -wno-unused-param -wno-empty -wfully-tagged -D . -c
>> /usr/share/sgml/docbook/stylesheet/dsssl/modular/catalog -d
>> stylesheet.dsl -t sgml -i output-html -V html-index postgres.sgml
>> openjade:dict-unaccent.sgml:48:1:E: non SGML character number 128
>> openjade:dict-unaccent.sgml:49:1:E: non SGML character number 129
>> openjade:dict-unaccent.sgml:50:1:E: non SGML character number 130
>> openjade:dict-unaccent.sgml:51:1:E: non SGML character number 131
>> openjade:dict-unaccent.sgml:52:1:E: non SGML character number 132
>> openjade:dict-unaccent.sgml:53:1:E: non SGML character number 133
>> openjade:dict-unaccent.sgml:54:1:E: non SGML character number 134
>> openjade:dict-unaccent.sgml:116:4:E: element "B" undefined
>> make: *** [HTML.index] Error 1
>> make: *** Deleting file `HTML.index'
>> """
>
> You should escape the special characters, as well as the <b> that appears as
> part of the example output, using character entities (&amp; etc.).

Sounds like this patch needs a little bit of doc adjustment per the
above and is then ready for committer?

...Robert


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-08-13 14:57:13
Message-ID: Pine.LNX.4.64.0908131855430.26817@sn.sai.msu.ru

Peter,

How do I write accented characters in SGML? Is it not allowed to write them
as-is?

Oleg

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-08-13 15:07:51
Message-ID: 20090813150751.GH5909@alvh.no-ip.org

Oleg Bartunov wrote:
> Peter,
>
> how to write accented characters in sgml ? Is't not allowed to write
> them as is ?

&aacute; for á, etc. You can't use characters that aren't in Latin-1 I think.
Writing them literally is not allowed.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-08-13 18:30:45
Message-ID: 200908132130.48114.peter_e@gmx.net

On Thursday 13 August 2009 18:07:51 Alvaro Herrera wrote:
> Oleg Bartunov wrote:
> > Peter,
> >
> > how to write accented characters in sgml ? Is't not allowed to write
> > them as is ?
>
> &aacute; for á, etc. You can't use characters that aren't in Latin-1 I
> think. Writing them literally is not allowed.

It's somehow possible, but it's not as straightforward as say with XML. And
you might get into a Latin-1 vs UTF-8 mixup. At least that's what I noticed
in my limited testing the other day.


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-08-13 19:14:29
Message-ID: 200908131914.n7DJETG02802@momjian.us

Peter Eisentraut wrote:
> On Thursday 13 August 2009 18:07:51 Alvaro Herrera wrote:
> > Oleg Bartunov wrote:
> > > Peter,
> > >
> > > how to write accented characters in sgml ? Is't not allowed to write
> > > them as is ?
> >
> > &aacute; for á, etc. You can't use characters that aren't in Latin-1 I
> > think. Writing them literally is not allowed.
>
> It's somehow possible, but it's not as straightforward as say with XML. And
> you might get into a Latin-1 vs UTF-8 mixup. At least that's what I noticed
> in my limited testing the other day.

The top of release.sgml has instructions on that because that is often
something we need to do for names in release notes:

non-ASCII characters convert to HTML4 entity (&) escapes

official: http://www.w3.org/TR/html4/sgml/entities.html
one page: http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
other lists: http://www.zipcon.net/~swhite/docs/computers/browsers/entities.html
http://www.zipcon.net/~swhite/docs/computers/browsers/entities_page.html
http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references

we cannot use UTF8 because SGML Docbook
does not support it
http://www.pemberley.com/janeinfo/latin1.html#latexta

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Oleg Bartunov <oleg(at)sai(dot)msu(dot)su>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Subject: Re: Filtering dictionaries support and unaccent dictionary
Date: 2009-08-14 10:52:27
Message-ID: Pine.LNX.4.64.0908141452170.26817@sn.sai.msu.ru

Thanks, Bruce!

Oleg

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg(at)sai(dot)msu(dot)su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83