Re: Large SGML Cleanup

Lists: pgsql-docs
From: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
To: pgsql-docs(at)postgresql(dot)org
Subject: Large SGML Cleanup
Date: 2010-11-03 02:56:26
Message-ID: AANLkTi=1Sm9N3Khiued9UiMfdd_TKLimMiO9mCfHtL39@mail.gmail.com
Lists: pgsql-docs

[Resending without large attachment, looks like the previous attempt
isn't going to make it]

Hi all,

I've gone through the SGML documentation, trying to push the output
HTML towards HTML 4.01 compliance. By far the most common problem I
found was incorrect nesting of <para> nodes, which results in invalid
HTML.

A common idiom I encountered was SGML like this:

<para>
...
<simplelist>
...
</simplelist>
...
</para>

This SGML would then produce HTML which looked like this:

<p>
...
<table>
...
</table>
...
</p>

This HTML fails validation, as one isn't supposed to be stuffing
tables inside <p> nodes. The attached patch fixes all the instances of
this I could find, by closing out <para> nodes before beginning lists
and tables.

I used the w3c-markup-validator package and the web service at
validator.w3.org to test HTML validity. A handy Perl package I found
for this was WebService::Validator, which includes the example script
"validate_files_in_dir.pl" to easily validate a directory full of html
files. With this patch, the number of invalid HTML files has been
reduced to 16 from many dozens.

Patch at:
http://kupershmidt.org/pg/sgml_fixup.patch.gz

Josh


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
Cc: pgsql-docs(at)postgresql(dot)org
Subject: Re: Large SGML Cleanup
Date: 2010-11-03 03:15:26
Message-ID: 23257.1288754126@sss.pgh.pa.us
Lists: pgsql-docs

Josh Kupershmidt <schmiddy(at)gmail(dot)com> writes:
> This HTML fails validation, as one isn't supposed to be stuffing
> tables inside <p> nodes. The attached patch fixes all the instances of
> this I could find, by closing out <para> nodes before beginning lists
> and tables.

I think this isn't even worth thinking about applying, unless you can
provide a way to get future cases of that to fail during "make html".
The chances that it'll stay fixed without such a check are not
distinguishable from zero.
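
A check of the kind asked for here could take many forms. As a minimal sketch only (it is not part of any patch in this thread), the Python below scans generated HTML and reports block-level elements that are opened while a <p> is still open. The names BLOCK_TAGS, NestedBlockChecker, and find_nested_blocks are hypothetical, the tag set is an illustrative subset, and the sketch assumes the generated HTML always writes explicit </p> end tags.

```python
from html.parser import HTMLParser

# Illustrative subset of HTML 4.01 block-level elements (an assumption,
# not an exhaustive list).
BLOCK_TAGS = {"table", "ul", "ol", "dl", "pre", "div", "blockquote"}


class NestedBlockChecker(HTMLParser):
    """Record block-level start tags seen while a <p> is still open."""

    def __init__(self):
        super().__init__()
        self.p_open = False
        self.problems = []  # (line number, tag name) pairs

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TABLE> and <table> match alike.
        if self.p_open and tag in BLOCK_TAGS:
            line, _column = self.getpos()
            self.problems.append((line, tag))
        if tag == "p":
            self.p_open = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.p_open = False


def find_nested_blocks(html_text):
    """Return (line, tag) pairs for block elements nested inside <p>."""
    checker = NestedBlockChecker()
    checker.feed(html_text)
    checker.close()
    return checker.problems
```

A build rule could call find_nested_blocks on each generated file and fail when the returned list is non-empty. This is a heuristic, not a substitute for a real validator.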

An alternative that might be more workable is to fix the toolchain
so that it generates valid HTML from what is evidently perfectly
acceptable SGML.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Kupershmidt <schmiddy(at)gmail(dot)com>, pgsql-docs <pgsql-docs(at)postgresql(dot)org>
Subject: Re: Large SGML Cleanup
Date: 2010-11-03 04:12:24
Message-ID: 1288757512-sup-3380@alvh.no-ip.org
Lists: pgsql-docs

Excerpts from Tom Lane's message of Wed Nov 03 00:15:26 -0300 2010:
> Josh Kupershmidt <schmiddy(at)gmail(dot)com> writes:
> > This HTML fails validation, as one isn't supposed to be stuffing
> > tables inside <p> nodes. The attached patch fixes all the instances of
> > this I could find, by closing out <para> nodes before beginning lists
> > and tables.
>
> I think this isn't even worth thinking about applying, unless you can
> provide a way to get future cases of that to fail during "make html".
> The chances that it'll stay fixed without such a check are not
> distinguishable from zero.
>
> An alternative that might be more workable is to fix the toolchain
> so that it generates valid HTML from what is evidently perfectly
> acceptable SGML.

Maybe we could have additional commands in the "check" rule to invoke
some HTML validator.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-docs(at)postgresql(dot)org
Subject: Re: Large SGML Cleanup
Date: 2010-11-03 13:09:13
Message-ID: AANLkTi=pkprj_=Nn=UAcvDBvBrZps067qw_CD6WjqGsr@mail.gmail.com
Lists: pgsql-docs

On Tue, Nov 2, 2010 at 11:15 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Josh Kupershmidt <schmiddy(at)gmail(dot)com> writes:
>> This HTML fails validation, as one isn't supposed to be stuffing
>> tables inside <p> nodes. The attached patch fixes all the instances of
>> this I could find, by closing out <para> nodes before beginning lists
>> and tables.
>
> I think this isn't even worth thinking about applying, unless you can
> provide a way to get future cases of that to fail during "make html".
> The chances that it'll stay fixed without such a check are not
> distinguishable from zero.

I thought someone might say that :-)

That was actually going to be my next step.. I'll give it a shot.

> An alternative that might be more workable is to fix the toolchain
> so that it generates valid HTML from what is evidently perfectly
> acceptable SGML.

Yeah, that might work as well.

Josh


From: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-docs <pgsql-docs(at)postgresql(dot)org>
Subject: Re: Large SGML Cleanup
Date: 2010-11-03 13:12:28
Message-ID: AANLkTinkOBt2OkW8w6ZNO_54hnWjCv_5c4+CRcmMYa9a@mail.gmail.com
Lists: pgsql-docs

On Wed, Nov 3, 2010 at 12:12 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Maybe we could have additional commands in the "check" rule to invoke
> some HTML validator.

Yes, it would be very nice to do this once we have all the HTML output
fixed up to be valid.

The w3c-markup-validator package I installed actually installs some
CGI scripts under apache, and the Perl package I used had to
communicate with it as a web service to get validation results. Pretty
clunky IMO, and not great as a dependency for "make check", though
maybe there's a standalone validator out there.

Josh


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
Cc: pgsql-docs <pgsql-docs(at)postgresql(dot)org>
Subject: Re: Large SGML Cleanup
Date: 2010-11-03 13:14:47
Message-ID: 1288788816-sup-8169@alvh.no-ip.org
Lists: pgsql-docs

Excerpts from Josh Kupershmidt's message of Tue Nov 02 23:56:26 -0300 2010:
> [Resending without large attachment, looks like the previous attempt
> isn't going to make it]
>
> Hi all,
>
> I've gone through the SGML documentation, trying to push the output
> HTML towards HTML 4.01 compliance. By far the most common problem I
> found was incorrect nesting of <para> nodes, which results in invalid
> HTML.

I wonder if you're going to get stuck at some point due to invalid HTML
being output by the toolchain that cannot be fixed. For example, there
are <A> elements inside other <A> elements in history.html. How would
you fix that? There's also a systematic problem in bookindex.html
which is a generated file.

I wonder if it would be better to see about switching the toolchain. We
discussed jumping to XML; this was rejected in the past for no bigger
reason than not wanting to have to rename the files from .sgml to .xml.
Now that we're in Git, that's no longer a problem. If we can switch to
a better toolchain for producing the HTML, maybe these problems would go
away. (Of course, to be able to switch to XML we'd need to change a few
habits that no longer work such as using </> as closing tag).

Of course, switching to XML is going to meet a lot more resistance than
just committing these simple fixes to the SGML source.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-docs <pgsql-docs(at)postgresql(dot)org>
Subject: Re: Large SGML Cleanup
Date: 2010-11-03 13:19:47
Message-ID: 1288790275-sup-8458@alvh.no-ip.org
Lists: pgsql-docs

Excerpts from Josh Kupershmidt's message of Wed Nov 03 10:12:28 -0300 2010:

> Yes, it would be very nice to do this once we have all the HTML output
> fixed up to be valid.
>
> The w3c-markup-validator package I installed actually installs some
> CGI scripts under apache, and the Perl package I used had to
> communicate with it as a web service to get validation results. Pretty
> clunky IMO, and not great as a dependency for "make check", though
> maybe there's a standalone validator out there.

I tried with the wdg-html-validator package and it seems to be much
simpler to use. I just invoke it as "/usr/bin/validate *.html"
And yes, your patch does indeed fix a large amount of the validation
problems.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
Cc: pgsql-docs(at)postgresql(dot)org
Subject: Re: Large SGML Cleanup
Date: 2010-11-03 14:24:38
Message-ID: 1288794278.20884.2.camel@vanquo.pezone.net
Lists: pgsql-docs

On Tue, 2010-11-02 at 22:56 -0400, Josh Kupershmidt wrote:
> This HTML fails validation, as one isn't supposed to be stuffing
> tables inside <p> nodes. The attached patch fixes all the instances of
> this I could find, by closing out <para> nodes before beginning lists
> and tables.

Um, this is like moving around the C code because the compiler generates
invalid assembly code. Fix the compiler.

That said, we have the following in stylesheet.dsl:

;; Block elements are allowed in PARA in DocBook, but not in P in
;; HTML. With %fix-para-wrappers% turned on, the stylesheets attempt
;; to avoid putting block elements in HTML P tags by outputting
;; additional end/begin P pairs around them.
(define %fix-para-wrappers% #t)

So evidently someone thought of this before and put something in to
prevent some/many/most cases.
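
To illustrate the idea that stylesheet comment describes, here is a toy sketch in Python. It is not the DSSSL stylesheet's code; the function render_para and the ("block", html) tuple representation are hypothetical, chosen only to show how emitting extra end/begin P pairs keeps block elements out of <p>.

```python
def render_para(children):
    """Render a DocBook-style paragraph without nesting blocks in <p>.

    `children` is a list whose items are either inline text (str) or
    ("block", html) tuples standing in for lists, tables, and so on.
    """
    out = []
    inline = []  # inline text collected for the current <p>

    def flush():
        # Close the current <p>, skipping paragraphs that would be empty.
        text = "".join(inline).strip()
        if text:
            out.append("<p>" + text + "</p>")
        inline.clear()

    for child in children:
        if isinstance(child, tuple) and child[0] == "block":
            flush()               # end the current <p> before the block
            out.append(child[1])  # emit the block outside any <p>
        else:
            inline.append(child)
    flush()                       # wrap any trailing inline text in its own <p>
    return "\n".join(out)
```

Called as render_para(["Before ", ("block", "<table>...</table>"), " after"]), it yields a <p>, the table, and a second <p> as three siblings rather than one paragraph containing the table.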

In general, I think the more efficient way to address this overall
problem is to run the resulting HTML through tidy and be done with it.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Josh Kupershmidt <schmiddy(at)gmail(dot)com>, pgsql-docs <pgsql-docs(at)postgresql(dot)org>
Subject: Re: Large SGML Cleanup
Date: 2010-11-03 14:25:58
Message-ID: 1288794358.20884.3.camel@vanquo.pezone.net
Lists: pgsql-docs

On Wed, 2010-11-03 at 10:14 -0300, Alvaro Herrera wrote:
> I wonder if it would be better to see about switching the toolchain.
> We discussed jumping to XML;

Well, run

make xslthtml

and see if the output is better.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Josh Kupershmidt <schmiddy(at)gmail(dot)com>, pgsql-docs <pgsql-docs(at)postgresql(dot)org>
Subject: Re: Large SGML Cleanup
Date: 2010-11-03 14:38:28
Message-ID: 5983.1288795108@sss.pgh.pa.us
Lists: pgsql-docs

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> I wonder if it would be better to see about switching the toolchain.

We certainly seem to be pushing the limits of the existing toolchain
... but are there any genuinely better alternatives out there?

regards, tom lane


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Josh Kupershmidt <schmiddy(at)gmail(dot)com>, pgsql-docs <pgsql-docs(at)postgresql(dot)org>
Subject: Re: Large SGML Cleanup
Date: 2010-11-03 15:35:34
Message-ID: 1288798534.20884.12.camel@vanquo.pezone.net
Lists: pgsql-docs

On Wed, 2010-11-03 at 16:25 +0200, Peter Eisentraut wrote:
> On Wed, 2010-11-03 at 10:14 -0300, Alvaro Herrera wrote:
> > I wonder if it would be better to see about switching the toolchain.
> > We discussed jumping to XML;
>
> Well, run
>
> make xslthtml
>
> and see if the output is better.

Here is some data to entertain along the way:

make html 107.85s user 0.57s system 92% cpu 1:56.65 total

make xslthtml 13.98s user 1331.22s system 98% cpu 22:46.46 total
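
To put the two runs side by side, the arithmetic can be sketched in a few lines of Python. The figures are copied from the timings above; the helper name to_seconds is hypothetical.

```python
def to_seconds(clock):
    """Convert a 'minutes:seconds' wall-clock string to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


html_total = to_seconds("1:56.65")    # make html, wall-clock time
xslt_total = to_seconds("22:46.46")   # make xslthtml, wall-clock time

# How many times longer the XSLT build took overall.
slowdown = xslt_total / html_total

# Fraction of CPU time spent in the kernel (system) for each build.
html_sys_share = 0.57 / (107.85 + 0.57)
xslt_sys_share = 1331.22 / (13.98 + 1331.22)

print(f"wall-clock slowdown: {slowdown:.1f}x")
print(f"system-time share, make html: {html_sys_share:.1%}")
print(f"system-time share, make xslthtml: {xslt_sys_share:.1%}")
```

The XSLT build comes out roughly 11.7 times slower in wall-clock terms, and about 99% of its CPU time is system time, against well under 1% for make html.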


From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-docs(at)postgresql(dot)org
Subject: Re: Large SGML Cleanup
Date: 2010-11-03 16:52:31
Message-ID: 87zktqs628.fsf@cbbrowne.afilias-int.info
Lists: pgsql-docs

peter_e(at)gmx(dot)net (Peter Eisentraut) writes:
> In general, I think the more efficient way to address this overall
> problem is to run the resulting HTML through tidy and be done with it.

+1. I use a similar toolchain for my own web site, and one of the steps
is to run tidy on the output, which rectifies a number of issues. (I
haven't looked at just what they are in some years now :-))
--
http://www3.sympatico.ca/cbbrowne/slony.html
No lusers were harmed in the creation of this usenet article. AND I
WANT TO KNOW WHY NOT!
-- glmar0(at)twirl(dot)mcc(dot)ac(dot)uk in alt.sysadmin.recovery


From: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-docs <pgsql-docs(at)postgresql(dot)org>
Subject: Re: Large SGML Cleanup
Date: 2010-11-03 22:59:58
Message-ID: AANLkTimb-3m+LGowDpaEA3ND=R5uh1Z0YqXnnVDxQCWy@mail.gmail.com
Lists: pgsql-docs

On Wed, Nov 3, 2010 at 9:19 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> I tried with the wdg-html-validator package and it seems to be much
> simpler to use.  I just invoke it as "/usr/bin/validate *.html"
> And yes, your patch does indeed fix a large amount of the validation
> problems.

Ah yes, that works much better.

Josh


From: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-docs(at)postgresql(dot)org
Subject: Re: Large SGML Cleanup
Date: 2010-11-03 23:10:26
Message-ID: AANLkTi=+yacNLFiejf=uO1kDHKmKjfV+AhKmf=1ATcRa@mail.gmail.com
Lists: pgsql-docs

On Wed, Nov 3, 2010 at 10:24 AM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On Tue, 2010-11-02 at 22:56 -0400, Josh Kupershmidt wrote:
>> This HTML fails validation, as one isn't supposed to be stuffing
>> tables inside <p> nodes. The attached patch fixes all the instances of
>> this I could find, by closing out <para> nodes before beginning lists
>> and tables.
>
> Um, this is like moving around the C code because the compiler generates
> invalid assembly code.  Fix the compiler.

I agree with this sentiment.

> That said, we have the following in stylesheet.dsl:
>
> ;; Block elements are allowed in PARA in DocBook, but not in P in
> ;; HTML.  With %fix-para-wrappers% turned on, the stylesheets attempt
> ;; to avoid putting block elements in HTML P tags by outputting
> ;; additional end/begin P pairs around them.
> (define %fix-para-wrappers% #t)

Hrm, where is the code behind fix-para-wrappers? I don't see it inside
openjade, or anywhere inside Postgres?

> So evidently someone thought of this before and put something in to
> prevent some/many/most cases.
>
> In general, I think the more efficient way to address this overall
> problem is to run the resulting HTML through tidy and be done with it.

Hey, this actually works surprisingly well on a few files I tested.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
Cc: pgsql-docs(at)postgresql(dot)org
Subject: Re: Large SGML Cleanup
Date: 2010-11-04 14:26:32
Message-ID: 1288880792.18213.0.camel@vanquo.pezone.net
Lists: pgsql-docs

On Wed, 2010-11-03 at 19:10 -0400, Josh Kupershmidt wrote:
> Hrm, where is the code behind fix-para-wrappers? I don't see it inside
> openjade, or anywhere inside Postgres?

It's in the DSSSL stylesheet. On my system it's
at /usr/share/sgml/docbook/stylesheet/dsssl/modular/html.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Josh Kupershmidt <schmiddy(at)gmail(dot)com>, pgsql-docs <pgsql-docs(at)postgresql(dot)org>
Subject: Re: Large SGML Cleanup
Date: 2010-11-04 17:51:10
Message-ID: 1288893046-sup-7812@alvh.no-ip.org
Lists: pgsql-docs

Excerpts from Peter Eisentraut's message of Wed Nov 03 12:35:34 -0300 2010:

> Here is some data to entertain along the way:
>
> make html 107.85s user 0.57s system 92% cpu 1:56.65 total
>
> make xslthtml 13.98s user 1331.22s system 98% cpu 22:46.46 total

Ugh, that's horrible :-(

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Josh Kupershmidt <schmiddy(at)gmail(dot)com>, pgsql-docs <pgsql-docs(at)postgresql(dot)org>
Subject: Re: Large SGML Cleanup
Date: 2010-11-04 18:34:12
Message-ID: 1288895489-sup-4625@alvh.no-ip.org
Lists: pgsql-docs

Excerpts from Alvaro Herrera's message of Thu Nov 04 14:51:10 -0300 2010:
> Excerpts from Peter Eisentraut's message of Wed Nov 03 12:35:34 -0300 2010:
>
> > Here is some data to entertain along the way:
> >
> > make html 107.85s user 0.57s system 92% cpu 1:56.65 total
> >
> > make xslthtml 13.98s user 1331.22s system 98% cpu 22:46.46 total
>
> Ugh, that's horrible :-(

This seems like a bug in xsltproc. A bit of strace shows that it's
full of this stuff:

stat("/usr/lib/libxslt-plugins/nwalsh_com_xslt_ext_com_nwalsh_saxon_UnwrapLinks.so", 0x7fff86f35f40) = -1 ENOENT (No such file or directory)
stat("/usr/lib/libxslt-plugins/nwalsh_com_xslt_ext_com_nwalsh_saxon_UnwrapLinks.so", 0x7fff86f357c0) = -1 ENOENT (No such file or directory)
stat("/usr/lib/libxslt-plugins/nwalsh_com_xslt_ext_com_nwalsh_saxon_UnwrapLinks.so", 0x7fff86f35f40) = -1 ENOENT (No such file or directory)

I didn't let it finish to verify that it's really the time sink, though.
But note that user time is a lot lower than with the SGML toolchain;
it's system time that's the problem (which is suspicious in itself).
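
One way to test that suspicion is to count how often each failed lookup occurs in a saved trace. The sketch below is an illustration only: it assumes strace output in the format shown above (no PID prefix, as there would be with -f), and the names FAILED_LOOKUP and count_failed_lookups are hypothetical.

```python
import re
from collections import Counter

# Matches strace lines such as:
#   stat("/some/path", 0x7fff...) = -1 ENOENT (No such file or directory)
FAILED_LOOKUP = re.compile(r'^(\w+)\("([^"]+)".*= -1 ENOENT')


def count_failed_lookups(lines):
    """Count ENOENT failures per (syscall, path) pair in strace output."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOOKUP.match(line.strip())
        if match:
            syscall, path = match.groups()
            counts[(syscall, path)] += 1
    return counts
```

Feeding it the lines of a trace file and printing counts.most_common(10) would show whether a handful of missing plugin paths account for most of the system calls.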

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: pgsql-docs(at)postgresql(dot)org
Subject: Re: Large SGML Cleanup
Date: 2010-11-14 18:45:10
Message-ID: AANLkTikG5A7Ma9yrDAn4xEVOkVzx1pBDUQyTAu-AAn0S@mail.gmail.com
Lists: pgsql-docs

[Sorry, I'm revisiting this topic a bit late.. have had a lot of stuff
keeping me busy]

On Wed, Nov 3, 2010 at 10:24 AM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> In general, I think the more efficient way to address this overall
> problem is to run the resulting HTML through tidy and be done with it.

Running tidy -modify on the .html files in ./sgml/html/ gets the
number of invalid HTML files down to 4 from 224 [1]. I took a quick
look at a few of the resulting pages in Firefox, and couldn't see any
visual difference between the original and tidy'ed versions.
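
For anyone wanting to repeat this outside the Makefiles, a standalone sketch in Python follows. It is an illustration, not the procedure actually used for the numbers above: the function name tidy_directory and its run parameter are hypothetical, and it relies on tidy's -quiet and -modify options and on its documented exit status, where 2 means errors were found.

```python
import subprocess
from pathlib import Path


def tidy_directory(html_dir, run=subprocess.run):
    """Run `tidy -modify` on each .html file in html_dir.

    Returns the names of files for which tidy reported errors
    (exit status 2 or higher). `run` can be swapped out for testing.
    """
    failed = []
    for path in sorted(Path(html_dir).glob("*.html")):
        result = run(
            ["tidy", "-quiet", "-modify", str(path)],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # Exit status 0 means clean and 1 means warnings only.
        if result.returncode >= 2:
            failed.append(path.name)
    return failed
```

Calling tidy_directory("sgml/html") rewrites the files in place, so it is best run on a copy of the build output.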

I think it would be a good idea to add a step in ./sgml/Makefile to
run tidy on all the .html files produced. Anyone interested in doing
this?

Josh

[1] For reference, the 4 remaining invalid files are: index.html,
functions-string.html, biblio.html, and ecpg-variables.html


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Kupershmidt <schmiddy(at)gmail(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, pgsql-docs(at)postgresql(dot)org
Subject: Re: Large SGML Cleanup
Date: 2010-11-14 19:46:29
Message-ID: 7535.1289763989@sss.pgh.pa.us
Lists: pgsql-docs

Josh Kupershmidt <schmiddy(at)gmail(dot)com> writes:
> I think it would be a good idea to add a step in ./sgml/Makefile to
> run tidy on all the .html files produced. Anyone interested in doing
> this?

I'm a bit concerned about starting to depend on tidy as part of our
build process, because the upstream project seems in, um, rather un-tidy
shape. I see that it's packaged for Fedora so it wouldn't be any skin
off my own nose to get it on the machine I build the docs on. But
people who have to install it from source will be quite unhappy, because
there doesn't appear to have been an actual release in some time, maybe
not ever (the Fedora package is based on a random CVS pull, not a
published release tarball). Note the negative comments here:

http://sourceforge.net/projects/tidy/

IOW, I don't mind if you run the html files through tidy on your own
accord, but I don't think I want it done in the Makefiles.

regards, tom lane