Re: pgsql: We're going to have to spell dotless i

From:	tgl(at)postgresql(dot)org (Tom Lane)
To:	pgsql-committers(at)postgresql(dot)org
Subject:	pgsql: We're going to have to spell dotless i as plain i, because
Date:	2006-09-22 15:29:05
Message-ID:	20060922152905.0D1119FB3C6@postgresql.org
Lists:	pgsql-committers pgsql-hackers

Log Message:
-----------
We're going to have to spell dotless i as plain i, because dotless i is
not in the character set supported by DocBook nor standard HTML. (Sorry
Volkan.) Also replace random character-set references by a pointer to
the actual standard.

Modified Files:
--------------
pgsql/doc/src/sgml:
release.sgml (r1.450 -> r1.451)
(http://developer.postgresql.org/cvsweb.cgi/pgsql/doc/src/sgml/release.sgml.diff?r1=1.450&r2=1.451)

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	Tom Lane <tgl(at)postgresql(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: pgsql: We're going to have to spell dotless i as plain i, because
Date:	2006-09-23 09:46:03
Message-ID:	20060923094603.GA24323@svana.org
Lists:	pgsql-committers pgsql-hackers

On Fri, Sep 22, 2006 at 12:29:05PM -0300, Tom Lane wrote:
> Log Message:
> -----------
> We're going to have to spell dotless i as plain i, because dotless i is
> not in the character set supported by DocBook nor standard HTML. (Sorry
> Volkan.) Also replace random character-set references by a pointer to
> the actual standard.

Well you could always use the HTML4 &#305; which most tools should
understand. At least browsers have good support for this kind of
entity.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc:	Tom Lane <tgl(at)postgresql(dot)org>
Subject:	Re: pgsql: We're going to have to spell dotless i as plain i, because
Date:	2006-09-23 09:54:47
Message-ID:	200609231154.48185.peter_e@gmx.net
Lists:	pgsql-committers pgsql-hackers

Martijn van Oosterhout wrote:
> Well you could always use the HTML4 &#305; which most tools should
> understand. At least browsers have good support for this kind of
> entity.

Please review the recent thread on pgsql-docs before reiterating all the
suggestions.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)postgresql(dot)org>
Subject:	Re: pgsql: We're going to have to spell dotless i as plain i, because
Date:	2006-09-23 12:19:06
Message-ID:	20060923121906.GB24323@svana.org
Lists:	pgsql-committers pgsql-hackers

On Sat, Sep 23, 2006 at 11:54:47AM +0200, Peter Eisentraut wrote:
> Martijn van Oosterhout wrote:
> > Well you could always use the HTML4 &#305; which most tools should
> > understand. At least browsers have good support for this kind of
> > entity.
>
> Please review the recent thread on pgsql-docs before reiterating all the
> suggestions.

Oh sorry, it wasn't clear from the commit entry. It's not that DocBook
doesn't support the character or that it can't be represented. It's
just not supported in the document encoding we're using.

Sorry for the noise.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)postgresql(dot)org>
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-23 12:49:02
Message-ID:	200609231249.k8NCn2x24560@momjian.us
Lists:	pgsql-committers pgsql-hackers

Martijn van Oosterhout wrote:
> On Sat, Sep 23, 2006 at 11:54:47AM +0200, Peter Eisentraut wrote:
> > Martijn van Oosterhout wrote:
> > > Well you could always use the HTML4 &#305; which most tools should
> > > understand. At least browsers have good support for this kind of
> > > entity.
> >
> > Please review the recent thread on pgsql-docs before reiterating all the
> > suggestions.
>
> Oh sorry, it wasn't clear from the commit entry. It's not that DocBook
> doesn't support the character or that it can't be represented. It's
> just not supported in the document encoding we're using.

That's not how I understand it. The document encoding is only related
to how high-bit characters are interpreted, I am told by Peter, but for
some reason the toolchain just doesn't support UTF8, even though if you
use &#305; in SGML it does come out right in HTML, but new toolchains
throw an error for it.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	Bruce Momjian <bruce(at)momjian(dot)us>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)postgresql(dot)org>
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-23 14:07:01
Message-ID:	20060923140701.GC24323@svana.org
Lists:	pgsql-committers pgsql-hackers

On Sat, Sep 23, 2006 at 08:49:02AM -0400, Bruce Momjian wrote:
> That's not how I understand it. The document encoding is only related
> to how high-bit characters are interpreted, I am told by Peter, but for
> some reason the toolchain just doesn't support UTF8, even though if you
> use &#305; in SGML it does come out right in HTML, but new toolchains
> throw an error for it.

Dunno about UTF-8, but openjade only supports one character repertoire,
and that's Unicode (under character handling in the man page).

According to the docbook reference, a way to specify the dotless i
is &inodot;

http://www.oasis-open.org/docbook/documentation/reference/html/iso-lat2.html

But it's part of Latin-2, and if your stylesheet declares latin1 as
the only valid characters, then that character is invalid, no matter
how you represent it. I was just surprised, because &inodot; has been
part of docbook since version 3, which is quite some time ago now.

So to me (more of a DocBook novice) it seems like it's the stylesheet
that's limiting you to latin1, not the docbook parser.

Anyway, the problem has been solved, so we can all get back to testing
the beta now.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-23 16:27:51
Message-ID:	25430.1159028871@sss.pgh.pa.us
Lists:	pgsql-committers pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> So to me (more of a DocBook novice) it seems like it's the stylesheet
> that's limiting you to latin1, not the docbook parser.

But the "stylesheet" in question is part of the basic docbook
infrastructure, so the above distinction is academic. (Or at least
that's what Peter stated upthread.)

To my mind the real problem is that one of the principal output formats
we are interested in is HTML, and there is no dotless-i entity in any
version of the HTML standard. I trust I need not point out again the
difference between "my browser recognizes this construct" and "it's in
the standard".
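
The distinction between a named entity and the character itself can be checked mechanically. The sketch below is an illustration, not part of the original message: it uses Python's html.entities.name2codepoint table, which lists the HTML 4.01 named entities, and the helper name html4_entity_names is made up for this example.

```python
from html.entities import name2codepoint

# name2codepoint maps the HTML 4.01 entity names to Unicode code points.
DOTLESS_I = 0x0131  # LATIN SMALL LETTER DOTLESS I


def html4_entity_names(codepoint):
    """Return the HTML 4.01 entity names defined for a code point, sorted."""
    return sorted(name for name, cp in name2codepoint.items() if cp == codepoint)


print(html4_entity_names(0x00E9))     # e with acute accent has a named entity
print(html4_entity_names(DOTLESS_I))  # dotless i has none in HTML 4.01
```

This only covers named entities; it says nothing about numeric character references, which are a separate mechanism.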

regards, tom lane

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)postgresql(dot)org>
Subject:	Re: pgsql: We're going to have to spell dotless i as plain i, because
Date:	2006-09-23 16:53:24
Message-ID:	200609231853.24619.peter_e@gmx.net
Lists:	pgsql-committers pgsql-hackers

Martijn van Oosterhout wrote:
> Oh sorry, it wasn't clear from the commit entry. It's not that
> DocBook doesn't support the character or that it can't be
> represented. It's just not supported in the document encoding we're
> using.

No, no, and no.

The reason that it doesn't work is that the document character set for
DocBook is Latin 1, so any attempt to refer to a character not in this
set is going to fail.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-23 19:56:04
Message-ID:	20060923195604.GD24323@svana.org
Lists:	pgsql-committers pgsql-hackers

On Sat, Sep 23, 2006 at 12:27:51PM -0400, Tom Lane wrote:
> To my mind the real problem is that one of the principal output formats
> we are interested in is HTML, and there is no dotless-i entity in any
> version of the HTML standard. I trust I need not point out again the
> difference between "my browser recognizes this construct" and "it's in
> the standard".

Sure there is, HTML4 includes all of Unicode, thus also the dotless-i.
They gave up assigning names to them after latin1, but numerical
references are in the standard also (decimal and hex).
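
As a small illustration of the decimal and hexadecimal forms mentioned above, the following sketch builds both numeric character references for the dotless i (U+0131) and decodes them again with Python's html.unescape. The helper name numeric_refs is hypothetical, and the snippet says nothing about how openjade or DocBook treat these references.

```python
import html

DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I


def numeric_refs(ch):
    """Return the decimal and hexadecimal character references for ch."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"


dec_ref, hex_ref = numeric_refs(DOTLESS_I)
print(dec_ref, hex_ref)

# Both forms decode back to the same character.
assert html.unescape(dec_ref) == html.unescape(hex_ref) == DOTLESS_I
```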

I created a simple docbook document on my computer with &inodot; and
ran openjade over it, and in the output file it is converted to &#305;.
Openjade knows how to generate valid character references. The input
file is attached, I compiled it with the command:

openjade -V draft-mode -wall -wno-unused-param -wno-empty -i output-html -t sgml /tmp/a.sgml

For dsl file just copy the stylesheet.dsl file in the postgresql source
tree.

Why it doesn't work in the current docs I don't know, but I think we can
rule out limitations of HTML or Docbook.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Attachment	Content-Type	Size
a.sgml	text/plain	412 bytes

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc:	Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-23 20:18:11
Message-ID:	27742.1159042691@sss.pgh.pa.us
Lists:	pgsql-committers pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> I created a simple docbook document on my computer with &inodot; and
> ran openjade over it, and in the output file it is converted to &#305;.

I experimented with that, and openjade didn't complain about it, but
it renders in my browser (Safari) as

Have the COPY command return a command tag that includes the number of rows copied (Volkan Yaz&inodot;c&inodot;)

So that hardly looks like a portable solution either.

regards, tom lane

From:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Martijn van Oosterhout <kleptog(at)svana(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-23 22:15:44
Message-ID:	20060923221544.GA4865@alvh.no-ip.org
Lists:	pgsql-committers pgsql-hackers

Tom Lane wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> > I created a simple docbook document on my computer with &inodot; and
> > ran openjade over it, and in the output file it is converted to &#305;.
>
> I experimented with that, and openjade didn't complain about it, but
> it renders in my browser (Safari) as
>
> Have the COPY command return a command tag that includes the number of rows copied (Volkan Yaz&inodot;c&inodot;)

Well, if I put a &inodot; into an HTML document and open it on my
browser (Epiphany, which is Mozilla-based), it surely looks like
verbatim &inodot;. However, if I replace it with &#305; then it looks
like a dotless i. So maybe your Openjade is not exactly the same
Martijn was using, because what I understood was that Openjade replaced
the &inodot; with &#305;, which should work.

Does your browser display it correctly if you replace manually with &#305;?

On the other hand, I don't understand why DocBook would be Latin-1 only.
What would be the point of that limitation? Some googling seems to
reveal that people indeed use other charsets, UTF-8 in particular (but
also Big5, Latin-2, etc), so apparently this isn't set in stone. (I
admit that they mainly talk about XML Docbook though).

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc:	Martijn van Oosterhout <kleptog(at)svana(dot)org>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-23 22:43:39
Message-ID:	28966.1159051419@sss.pgh.pa.us
Lists:	pgsql-committers pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> So maybe your Openjade is not exactly the same
> Martijn was using, because what I understood was that Openjade replaced
> the &inodot; with &#305;, which should work.

I think it's more likely that he was running with a non-DocBook
stylesheet (his openjade command did not explicitly select a catalog and
stylesheet the way that our Makefiles do). Or just a different version
of the stylesheet. I'm testing with whatever ships in Fedora Core 5.
I see definitions of &inodot; in some of the files under
/usr/share/sgml, but evidently none of them are included by docbook...

> Does your browser display it correctly if you replace manually with &#305;?

Doesn't really matter whether it does or not, since my gripe about that
is that DocBook rejects it.

> On the other hand, I don't understand why DocBook would be Latin-1 only.

I'm surprised too that it couldn't be easily overridden. Peter, any
idea why not?

regards, tom lane

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-24 08:20:22
Message-ID:	200609241020.23452.peter_e@gmx.net
Lists:	pgsql-committers pgsql-hackers

Alvaro Herrera wrote:
> On the other hand, I don't understand why DocBook would be Latin-1
> only. What would be the point of that limitation? Some googling
> seems to reveal that people indeed use other charsets, UTF-8 in
> particular (but also Big5, Latin-2, etc), so apparently this isn't
> set in stone. (I admit that they mainly talk about XML Docbook
> though).

DocBook SGML is Latin 1; DocBook XML, like all XML, is UCS-4.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

From:	Hannu Krosing <hannu(at)skype(dot)net>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-24 12:49:37
Message-ID:	1159102178.2917.1.camel@localhost.localdomain
Lists:	pgsql-committers pgsql-hackers

One fine day, Sun, 2006-09-24 at 10:20, Peter Eisentraut wrote:
> Alvaro Herrera wrote:
> > On the other hand, I don't understand why DocBook would be Latin-1
> > only. What would be the point of that limitation? Some googling
> > seems to reveal that people indeed use other charsets, UTF-8 in
> > particular (but also Big5, Latin-2, etc), so apparently this isn't
> > set in stone. (I admit that they mainly talk about XML Docbook
> > though).
>
> DocBook SGML is Latin 1; DocBook XML, like all XML, is UCS-4.

Are you sure it's UCS-4 ? I've always thought that XML is what is given
in <xml > tag, and utf-8 if no charset is given.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

From:	Markus Schaber <schabi(at)logix-tt(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-24 12:56:30
Message-ID:	4516807E.3030800@logix-tt.com
Lists:	pgsql-committers pgsql-hackers

Hi, Hannu,

Hannu Krosing wrote:

> Are you sure it's UCS-4 ? I've always thought that XML is what is given
> in <xml > tag, and utf-8 if no charset is given.

You have to distinguish between the supported charset, and the document
encoding.

HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

From:	David Fetter <david(at)fetter(dot)org>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Bruce Momjian <bruce(at)momjian(dot)us>
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-24 16:22:33
Message-ID:	20060924162233.GA12188@fetter.org
Lists:	pgsql-committers pgsql-hackers

On Sun, Sep 24, 2006 at 10:20:22AM +0200, Peter Eisentraut wrote:
> Alvaro Herrera wrote:
> > On the other hand, I don't understand why DocBook would be Latin-1
> > only. What would be the point of that limitation? Some googling
> > seems to reveal that people indeed use other charsets, UTF-8 in
> > particular (but also Big5, Latin-2, etc), so apparently this isn't
> > set in stone. (I admit that they mainly talk about XML Docbook
> > though).
>
> DocBook SGML is Latin 1; DocBook XML, like all XML, is UCS-4.

This sheds a new light on the XML vs. SGML thing you said before.
While it's not necessarily compelling enough to force a switch, it is
a substantive difference that we can actually see.

Cheers,
D
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter

Remember to vote!

From:	Hannu Krosing <hannu(at)skype(dot)net>
To:	Markus Schaber <schabi(at)logix-tt(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-24 20:47:09
Message-ID:	1159130829.2917.22.camel@localhost.localdomain
Lists:	pgsql-committers pgsql-hackers

One fine day, Sun, 2006-09-24 at 14:56, Markus Schaber wrote:
> Hi, Hannu,
>
> Hannu Krosing wrote:
>
> > Are you sure it's UCS-4 ? I've always thought that XML is what is given
> > in <xml > tag, and utf-8 if no charset is given.
>
> You have to distinguish between the supported charset, and the document
> encoding.

UCS-4 and UTF-8 are both encodings for UNICODE

see: http://en.wikipedia.org/wiki/UTF-32

> HTH,
> Markus
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

From:	Andrew Dunstan <andrew(at)dunslane(dot)net>
To:	Hannu Krosing <hannu(at)skype(dot)net>
Cc:	Markus Schaber <schabi(at)logix-tt(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-24 21:55:46
Message-ID:	4516FEE2.4090401@dunslane.net
Lists:	pgsql-committers pgsql-hackers

Hannu Krosing wrote:
> One fine day, Sun, 2006-09-24 at 14:56, Markus Schaber wrote:
>
>> Hi, Hannu,
>>
>> Hannu Krosing wrote:
>>
>>
>>> Are you sure it's UCS-4 ? I've always thought that XML is what is given
>>> in <xml > tag, and utf-8 if no charset is given.
>>>
>> You have to distinguish between the supported charset, and the document
>> encoding.
>>
>
> UCS-4 and UTF-8 are both encodings for UNICODE
>
> see: http://en.wikipedia.org/wiki/UTF-32
>

If we want to quote references, we should quote the XML standard. For
example, see here to see the exact charset supported by XML:
http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets.

A little lower down it defines the encodings allowed too.

cheers

andrew

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org
Cc:	Andrew Dunstan <andrew(at)dunslane(dot)net>, Hannu Krosing <hannu(at)skype(dot)net>, Markus Schaber <schabi(at)logix-tt(dot)com>
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-24 22:23:32
Message-ID:	200609250023.33111.peter_e@gmx.net
Lists:	pgsql-committers pgsql-hackers

Andrew Dunstan wrote:
> If we want to quote references, we should quote the XML standard. For
> example, see here to see the exact charset supported by XML:
> http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets.

The actual cause of the processing problems we have been seeing is the
character set definitions in the SGML declarations of the respective
document types.

For DocBook SGML 4.2:

CHARSET

  BASESET
    "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
  DESCSET
      0    9  UNUSED
      9    2  9
     11    2  UNUSED
     13    1  13
     14   18  UNUSED
     32   95  32
    127    1  UNUSED

  BASESET
    "ISO Registration Number 100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
  DESCSET
    128   32  UNUSED
    160   96  32

For XML:

CHARSET

  BASESET
    "ISO Registration Number 177//CHARSET
     ISO/IEC 10646-1:1993 UCS-4 with implementation
     level 3//ESC 2/5 2/15 4/6"
  DESCSET
        0        9  UNUSED
        9        2  9
       11        2  UNUSED
       13        1  13
       14       18  UNUSED
       32       95  32
      127        1  UNUSED
      128       32  UNUSED
      160    55136  160
    55296     2048  UNUSED  -- surrogates --
    57344     8190  57344
    65534        2  UNUSED  -- FFFE and FFFF --
    65536  1048576  65536   -- 16 planes outside BMP --
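
To make the effect of the two declarations concrete, here is a rough sketch that reads the DESCSET triples quoted above (start, count, base or UNUSED) and checks whether code point 305, the dotless i, falls in a declared range. It is an illustration only: the parser and the names declared_ranges and is_declared are made up for this example, and it does not reproduce how an SGML parser such as openjade actually processes a declaration.

```python
import re

# DESCSET entries copied from the two declarations quoted above.
SGML_DESCSET = """
0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
128 32 UNUSED
160 96 32
"""

XML_DESCSET = """
0 9 UNUSED
9 2 9
11 2 UNUSED
13 1 13
14 18 UNUSED
32 95 32
127 1 UNUSED
128 32 UNUSED
160 55136 160
55296 2048 UNUSED -- surrogates --
57344 8190 57344
65534 2 UNUSED -- FFFE and FFFF --
65536 1048576 65536 -- 16 planes outside BMP --
"""


def declared_ranges(descset):
    """Return inclusive (low, high) ranges for entries not marked UNUSED."""
    ranges = []
    for line in descset.strip().splitlines():
        fields = re.sub(r"--.*?--", "", line).split()  # drop SGML comments
        if not fields:
            continue
        start, count, base = int(fields[0]), int(fields[1]), fields[2]
        if base != "UNUSED":
            ranges.append((start, start + count - 1))
    return ranges


def is_declared(codepoint, ranges):
    """True if the code point lies inside any declared range."""
    return any(low <= codepoint <= high for low, high in ranges)


DOTLESS_I = 305  # U+0131
print(is_declared(DOTLESS_I, declared_ranges(SGML_DESCSET)))  # DocBook SGML 4.2
print(is_declared(DOTLESS_I, declared_ranges(XML_DESCSET)))   # XML
```

Under the SGML declaration the highest declared code point is 255, so 305 falls outside it; under the XML declaration it falls inside the range that starts at 160.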

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

From:	Hannu Krosing <hannu(at)skype(dot)net>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>, Markus Schaber <schabi(at)logix-tt(dot)com>
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-24 23:21:30
Message-ID:	1159140090.2917.32.camel@localhost.localdomain
Lists:	pgsql-committers pgsql-hackers

One fine day, Mon, 2006-09-25 at 00:23, Peter Eisentraut wrote:
> Andrew Dunstan wrote:
> > If we want to quote references, we should quote the XML standard. For
> > example, see here to see the exact charset supported by XML:
> > http://www.w3.org/TR/2006/REC-xml11-20060816/#charsets.
>
> The actual cause of the processing problems we have been seeing is the
> character set definitions in the SGML declarations of the respective
> document types.

I see charsets, but where are encodings defined ?

I don't think that any of our SGML documentation is actually in UCS-4
encoding.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Hannu Krosing <hannu(at)skype(dot)net>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>, Markus Schaber <schabi(at)logix-tt(dot)com>
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-24 23:38:20
Message-ID:	13213.1159141100@sss.pgh.pa.us
Lists:	pgsql-committers pgsql-hackers

Hannu Krosing <hannu(at)skype(dot)net> writes:
> I don't think that any of our SGML documentation is actually in UCS-4
> encoding.

The source files use nothing beyond plain ASCII (and should remain that
way, IMHO) so there isn't any need to inquire very far into exactly what
the toolchain thinks the "document encoding" is. The issue at hand here
is what the *output* character set is, which is to say the "document
character set" if I have the jargon right. That is the space over which
we are permitted to use &-entities.

regards, tom lane

From:	Bruce Momjian <bruce(at)momjian(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Hannu Krosing <hannu(at)skype(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>, Markus Schaber <schabi(at)logix-tt(dot)com>
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-25 00:36:16
Message-ID:	200609250036.k8P0aGF21395@momjian.us
Lists:	pgsql-committers pgsql-hackers

Tom Lane wrote:
> Hannu Krosing <hannu(at)skype(dot)net> writes:
> > I don't think that any of our SGML documentation is actually in UCS-4
> > encoding.
>
> The source files use nothing beyond plain ASCII (and should remain that
> way, IMHO) so there isn't any need to inquire very far into exactly what
> the toolchain thinks the "document encoding" is. The issue at hand here
> is what the *output* character set is, which is to say the "document
> character set" if I have the jargon right. That is the space over which
> we are permitted to use &-entities.

Just for reference, if we could support UTF8, I was hoping to add
non-Latin names as alternates to the ASCII versions, so we could have
Japanese and Russian-lettered names in the release notes. I thought it
would be a nice touch.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Hannu Krosing <hannu(at)skype(dot)net>, Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-hackers(at)postgresql(dot)org, Andrew Dunstan <andrew(at)dunslane(dot)net>, Markus Schaber <schabi(at)logix-tt(dot)com>
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-25 05:43:03
Message-ID:	20060925054303.GA23636@svana.org
Lists:	pgsql-committers pgsql-hackers

On Sun, Sep 24, 2006 at 07:38:20PM -0400, Tom Lane wrote:
> Hannu Krosing <hannu(at)skype(dot)net> writes:
> > I don't think that any of our SGML documentation is actually in UCS-4
> > encoding.
>
> The source files use nothing beyond plain ASCII (and should remain that
> way, IMHO) so there isn't any need to inquire very far into exactly what
> the toolchain thinks the "document encoding" is. The issue at hand here
> is what the *output* character set is, which is to say the "document
> character set" if I have the jargon right. That is the space over which
> we are permitted to use &-entities.

What you're talking about is generally referred to as the "character
repertoire", the abstract set of characters a document is considered to
be composed of. For example: HTML4 (and XML IIRC) explicitly defines
the "character repertoire" to be Unicode, even though the "character
encoding" may only point to a subset of the total. Any others can be
generated via the &xxx; escape syntax.

I'm surprised about the difference in installations. I didn't use your
-c option because that directory does not exist on my computer, but
maybe that's all the difference...

http://www.unicode.org/unicode/reports/tr17/

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

From:	Markus Schaber <schabi(at)logix-tt(dot)com>
To:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-25 07:48:16
Message-ID:	451789C0.4050003@logix-tt.com
Lists:	pgsql-committers pgsql-hackers

Hi, Hannu,

Hannu Krosing wrote:

>>> Are you sure it's UCS-4 ? I've always thought that XML is what is given
>>> in <xml > tag, and utf-8 if no charset is given.
>> You have to distinguish between the supported charset, and the document
>> encoding.
> UCS-4 and UTF-8 are both encodings for UNICODE
> see: http://en.wikipedia.org/wiki/UTF-32

Yes, I know.

The point I wanted to make was that the document encoding is independent
of the allowed charset (except for having to cover a subset of it).

That is what XML entities were defined for.

So even in a document using LATIN-1 as encoding, the charset still is
Unicode, giving us the possibility to use &entities; to use non-latin1
characters.
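
A minimal sketch of that separation, using Python rather than any SGML or XML tool: the text is stored as pure ASCII bytes, with each character outside ASCII written as a numeric character reference, and decoding the references gives back the full Unicode string. The sample name is the contributor's name from the release note discussed in this thread; the variable names are only for illustration.

```python
import html

name = "Volkan Yaz\u0131c\u0131"  # contains U+0131, which is outside Latin-1

# Encode to ASCII, replacing unencodable characters with &#NNN; references.
ascii_bytes = name.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)  # b'Volkan Yaz&#305;c&#305;'

# Resolving the references recovers the original Unicode text.
assert html.unescape(ascii_bytes.decode("ascii")) == name
```

The encoding (ASCII) limits which bytes appear in the file, while the set of characters the references may denote is decided separately, which is the point being made about the document character set.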

HTH,
Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

From:	Markus Schaber <schabi(at)logix-tt(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: pgsql: We're going to have to spell dotless i
Date:	2006-09-25 08:02:59
Message-ID:	45178D33.9040608@logix-tt.com
Lists:	pgsql-committers pgsql-hackers

Hi, Bruce,

Bruce Momjian wrote:

>>> I don't think that any of our SGML documentation is actually in UCS-4
>>> encoding.
>> The source files use nothing beyond plain ASCII (and should remain that
>> way, IMHO) so there isn't any need to inquire very far into exactly what
>> the toolchain thinks the "document encoding" is. The issue at hand here
>> is what the *output* character set is, which is to say the "document
>> character set" if I have the jargon right. That is the space over which
>> we are permitted to use &-entities.
>
> Just for reference, if we could support UTF8, I was hoping to add
> non-Latin names as alternates to the ASCII versions, so we could have
> Japanese and Russian-lettered names in the release notes. I thought it
> would be a nice touch.

We don't need UTF8 encoding for this. It's also possible using ASCII
encoding + &#4711; entities.

But we need the Charset to be Unicode.

HTH,
Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf. | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org