Re: hint infrastructure setup (v3)

Lists: pgsql-patches
From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: PostgreSQL Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: hint infrastructure setup (v3)
Date: 2004-04-05 16:44:22
Message-ID: Pine.LNX.4.58.0404051843530.8826@sablons.cri.ensmp.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Dear Tom,

> No, just redefine "yyerror" as a macro that passes additional parameters.

Ok! That's just the kind of the hack-hook I was looking for!!

> > The terminals are not available, they must be kept separatly if you
> > want them. This can be done in yylex().
>
> We don't need them. Any already-shifted tokens are to the left of where
> the error is, no?

Yes and no. I agree that you don't need them for computing the follow set.
However it would help a lot to generate nicer hints. I'm looking for help
the user, and the follow set is just part of the help.

Think of typical "SELECT foo, bla, FROM xxx" (it looks stupid on a one
line query, but not so on a very large query) which is quite easier to
detect and hint about because of FROM is just after a comma.

The rule part is also a real issue. There is no easy way to translate
states into rules, to know whether we're somewhere in a "ShowStmt" or
"AlterTableStmt"...

If you're after a comma in state 1232, should you just say IDENT... I'd
rather say "user name" or "table name" if I can... Otherwise the hint
stays at a low parser level. Good hints requires to know about the
current context, and it is not easy to get that deep in the automaton.

> Yeah, I had come to the same conclusion --- state moves made without
> consuming input would need to be backed out if we wanted to report the
> complete follow set. I haven't yet looked to see how to do that.

Well, following you're previous suggestion, one can redefine the YYLEX
macro to save the current state each time a token is required.

> > As you noted, for things like "SHOW 123", the follow set basically
> > includes all keywords although you can have SHOW ALL or SHOW name.
> > So, as you suggested, you can either say "ident" as a simplification, but
>
> You're ignoring the distinction between classes of keywords. I would
> not recommend treating reserved_keywords as a subclass of ident.

Ok, I agree that it can help in this very case a little bit here because
ALL is reserved. I did not make this distinction when I played with bison.

> > (5) anything that can be done would be hardwired to one version of bison.
>
> I think this argument is completely without merit.

Possible;-)

However I don't like writing hacks that depends on other people future
behavior.

> > (b) write a new "recursive descendant" parser, and drop "gram.y"
>
> Been there, done that, not impressed with the idea. There's a reason
> why people invented parser generators...

Sure. It was just a list of options;-)

> > As a side effect of my inspection is that the automaton generated by bison
> > could be simplified if a different tradeoff between the lexer, the parser
> > and the post-processing would be chosen. Namelly, all tokens that are
> > just identifiers should be dropped and processed differently.
>
> We are not going to reserve the keywords that are presently unreserved.

Reserving keywords would help, but I would not think it could be accepted,
because it is not SQL philosophy. SQL is about 30 years also, as old
as yacc ideas... but they did not care at the time;-) When you look at
the syntax, it was designed with a recursive parser in mind.

Part of my comment was to explain "why" the generated automaton is large.
SQL is a small language, but it has a big automaton in postgresql, and
this is IMHO the explanation.

> If you can think of a reasonable way to stop treating them as separate
> tokens inside the grammar without altering the user-visible behavior,
> I'm certainly interested.

I was thinking about type names especially, and maybe others.

I join a small proof-of-concept patch to drop some tokens out of the
parser. As a results, 6 tokens, 6 rules and 9 states are removed,
and the automaton table is reduced by 438 entries (1.5%). Not too bad;-)
I think others could be dealt with similarily, or with more work.

Thanks for the discussion,

--
Fabien.

Attachment Content-Type Size
less_keywords.patch text/plain 5.9 KB

From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: hint infrastructure setup (v3)
Date: 2004-04-05 16:52:10
Message-ID: 200404051652.i35GqAN05855@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches


As another teaching aid idea, is there something that would auto-format
queries to they are more readable?

---------------------------------------------------------------------------

Fabien COELHO wrote:
>
>
>
> Dear Tom,
>
> > No, just redefine "yyerror" as a macro that passes additional parameters.
>
> Ok! That's just the kind of the hack-hook I was looking for!!
>
>
> > > The terminals are not available, they must be kept separatly if you
> > > want them. This can be done in yylex().
> >
> > We don't need them. Any already-shifted tokens are to the left of where
> > the error is, no?
>
> Yes and no. I agree that you don't need them for computing the follow set.
> However it would help a lot to generate nicer hints. I'm looking for help
> the user, and the follow set is just part of the help.
>
> Think of typical "SELECT foo, bla, FROM xxx" (it looks stupid on a one
> line query, but not so on a very large query) which is quite easier to
> detect and hint about because of FROM is just after a comma.
>
> The rule part is also a real issue. There is no easy way to translate
> states into rules, to know whether we're somewhere in a "ShowStmt" or
> "AlterTableStmt"...
>
> If you're after a comma in state 1232, should you just say IDENT... I'd
> rather say "user name" or "table name" if I can... Otherwise the hint
> stays at a low parser level. Good hints requires to know about the
> current context, and it is not easy to get that deep in the automaton.
>
>
> > Yeah, I had come to the same conclusion --- state moves made without
> > consuming input would need to be backed out if we wanted to report the
> > complete follow set. I haven't yet looked to see how to do that.
>
> Well, following you're previous suggestion, one can redefine the YYLEX
> macro to save the current state each time a token is required.
>
>
> > > As you noted, for things like "SHOW 123", the follow set basically
> > > includes all keywords although you can have SHOW ALL or SHOW name.
> > > So, as you suggested, you can either say "ident" as a simplification, but
> >
> > You're ignoring the distinction between classes of keywords. I would
> > not recommend treating reserved_keywords as a subclass of ident.
>
> Ok, I agree that it can help in this very case a little bit here because
> ALL is reserved. I did not make this distinction when I played with bison.
>
>
> > > (5) anything that can be done would be hardwired to one version of bison.
> >
> > I think this argument is completely without merit.
>
> Possible;-)
>
> However I don't like writing hacks that depends on other people future
> behavior.
>
>
> > > (b) write a new "recursive descendant" parser, and drop "gram.y"
> >
> > Been there, done that, not impressed with the idea. There's a reason
> > why people invented parser generators...
>
> Sure. It was just a list of options;-)
>
>
> > > As a side effect of my inspection is that the automaton generated by bison
> > > could be simplified if a different tradeoff between the lexer, the parser
> > > and the post-processing would be chosen. Namelly, all tokens that are
> > > just identifiers should be dropped and processed differently.
> >
> > We are not going to reserve the keywords that are presently unreserved.
>
> Reserving keywords would help, but I would not think it could be accepted,
> because it is not SQL philosophy. SQL is about 30 years also, as old
> as yacc ideas... but they did not care at the time;-) When you look at
> the syntax, it was designed with a recursive parser in mind.
>
> Part of my comment was to explain "why" the generated automaton is large.
> SQL is a small language, but it has a big automaton in postgresql, and
> this is IMHO the explanation.
>
>
> > If you can think of a reasonable way to stop treating them as separate
> > tokens inside the grammar without altering the user-visible behavior,
> > I'm certainly interested.
>
> I was thinking about type names especially, and maybe others.
>
> I join a small proof-of-concept patch to drop some tokens out of the
> parser. As a results, 6 tokens, 6 rules and 9 states are removed,
> and the automaton table is reduced by 438 entries (1.5%). Not too bad;-)
> I think others could be dealt with similarily, or with more work.
>
>
> Thanks for the discussion,
>
> --
> Fabien.

Content-Description:

[ Attachment, skipping... ]

>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: the planner will ignore your desire to choose an index scan if your
> joining column's datatypes do not match

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: PostgreSQL Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: hint infrastructure setup (v3)
Date: 2004-04-05 17:01:25
Message-ID: Pine.LNX.4.58.0404051855080.8826@sablons.cri.ensmp.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches


> As another teaching aid idea, is there something that would auto-format
> queries to they are more readable?

Not that I know of. But I guess it must exist in some interface,
maybe with Oracle or DB2.

My emacs does seem to do that. It just put colors on keywords.

If I had to put it somewhere, I would try to enhance emacs sql-mode.
But I don't like much lisp programming.

--
Fabien Coelho - coelho(at)cri(dot)ensmp(dot)fr


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: hint infrastructure setup (v3)
Date: 2004-04-05 17:37:06
Message-ID: 13372.1081186626@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> writes:
>> If you can think of a reasonable way to stop treating them as separate
>> tokens inside the grammar without altering the user-visible behavior,
>> I'm certainly interested.

> I join a small proof-of-concept patch to drop some tokens out of the
> parser.

I believe these were treated this way *specifically* because of the
keyword-is-not-an-identifier issue. SQL99 calls out most of these
as being keywords:

SQL defines predefined data types named by the following <key
word>s: CHARACTER, CHARACTER VARYING, CHARACTER LARGE OBJECT,
BINARY LARGE OBJECT, BIT, BIT VARYING, NUMERIC, DECIMAL, INTEGER,
SMALLINT, FLOAT, REAL, DOUBLE PRECISION, BOOLEAN, DATE, TIME,
TIMESTAMP, and INTERVAL. These names are used in the type

and if we don't treat them as keywords then we will have a couple of
problems. One is case-conversion issues in locales where the standard
downcasing is not an extension of ASCII (Turkish is the only one I know
of offhand). Another is that depending on where you put the renaming
that this patch removes without replacing :-(, it would be possible for
the renaming transformation to get applied to user-defined types with
similar names, or for user-defined types to unexpectedly shadow system
definitions. The former would be surprising and the latter would
violate the spec. Check the archives; I'm sure this was discussed in
the 7.3 development cycle and we concluded that treating these names
as keywords was the only practical solution.

regards, tom lane


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: hint infrastructure setup (v3)
Date: 2004-04-06 08:18:11
Message-ID: Pine.LNX.4.58.0404060901020.8826@sablons.cri.ensmp.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches


Dear Tom,

> > I join a small proof-of-concept patch to drop some tokens out of the
> > parser.
>
> I believe these were treated this way *specifically* because of the
> keyword-is-not-an-identifier issue. SQL99 calls out most of these
> as being keywords:

Well, I think that the "reserved keywords" are fine as tokens in a
lexer/parser, but think that the "unreserved keywords" should be dropped
of the token status if possible.

> and if we don't treat them as keywords then we will have a couple of
> problems. One is case-conversion issues in locales where the standard
> downcasing is not an extension of ASCII (Turkish is the only one I know
> of offhand).

Do you mean it should use an ASCII-only strcasecmp, not a possibly
"localised" version? I agree, but this is just a "proof of concept"
patch to show that you don't need so many tokens in the parser.

> Another is that depending on where you put the renaming that this patch
> removes without replacing :-(,

I do not understand your point. It seems to me that the renaming is
performed when a type name is expected? The "boolean" keyword (not token)
is translated to system "bool" type in the GenericType rule?? ???

> it would be possible for the renaming transformation to get applied to
> user-defined types with similar names, or for user-defined types to
> unexpectedly shadow system definitions.

I don't think that the patch changes the result of the parsing. It drops
*TOKENS* out of the lexer, but they are still *KEYWORDS*, although they
are not explicitly in the lexer list.

"keyword.c" deals with tokens, the file name was ill-chosen. If you think
that keywords can only be lexical tokens, then you end-up with an
automaton larger than necessary, IMVHO.

Note that the removed tokens are still "keywords" as they are treated
*especially* anyway. It is not a semantical transformation.

Also, if you don't want these names as candidate function names, they
could be filtered out at some other point in the parser. They really don't
need to be special tokens.

My point is that you can have the very same *semantical* result with a
smaller automaton if you chose a different trade-off within the
lexer/parser/post filtering. I don't want to change the language.

> The former would be surprising and the latter would violate the spec.

I'm really not sure this is the case with the patch I sent.

> Check the archives; I'm sure this was discussed in the 7.3 development
> cycle and we concluded that treating these names as keywords was the
> only practical solution.

Hmmm... I can check the archive, but I cannot see how different the
language is with the patch. Maybe there is a missing filter out, or
strcasecmp is not the right version, but no more.

I think it is a small technical issue in the parser internals, and has
nothing to do with great principles and whether this or that is a keyword.
It's about what keywords need to be tokens.

--
Fabien Coelho - coelho(at)cri(dot)ensmp(dot)fr


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
Cc: PostgreSQL Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: hint infrastructure setup (v3)
Date: 2004-04-06 13:56:13
Message-ID: 23868.1081259773@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches

Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr> writes:
>> Another is that depending on where you put the renaming that this patch
>> removes without replacing :-(,

> I do not understand your point. It seems to me that the renaming is
> performed when a type name is expected? The "boolean" keyword (not token)
> is translated to system "bool" type in the GenericType rule?? ???

I mean that you removed functionality without putting it back; the
modified parser will fail to recognize BOOLEAN as a type name at all,
because it doesn't match "bool" which is in the catalogs. (And changing
the entry to "boolean" is not a solution, it just moves the problem.)

I assume you intended to handle this by doing the substitutions in type
name lookup elsewhere in the parser, but I don't think that is a valid
solution, because there is no longer enough information. In particular
you can't any longer tell the difference between BOOLEAN and "boolean"
(with quotes), which are not the same thing --- a quoted string is never
a keyword, per spec.

Possibly a better example than boolean is the REAL => pg_catalog.float4
transformation. If a user has defined his own type named foo.real,
he ought to be able to refer to it as "real" (with quotes) and not get
messed up by the keyword transformation. I think our original
motivation for converting all these things to keywords was the
realization that pg_dump would in fact screw up and fail to dump such
a type definition correctly if "real" wasn't recognized as conflicting
with a keyword (which is what prompts pg_dump to stick quotes on).

The basic point here is that eliminating tokens as you propose will
result in small changes in behavior, none of which are good or per spec.
Making the parser automaton smaller would be nice, but not at that
price.

> My point is that you can have the very same *semantical* result with a
> smaller automaton if you chose a different trade-off within the
> lexer/parser/post filtering. I don't want to change the language.

You have not proven that you can have the same result.

regards, tom lane


From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Patches <pgsql-patches(at)postgresql(dot)org>
Subject: Re: hint infrastructure setup (v3)
Date: 2004-04-06 16:21:55
Message-ID: Pine.LNX.4.58.0404061625030.9008@sablons.cri.ensmp.fr
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-patches


Dear Tom,

> In particular you can't any longer tell the difference between BOOLEAN
> and "boolean" (with quotes), which are not the same thing --- a quoted
> string is never a keyword, per spec. [...]

Ok, so you mean that on -boolean- the lexer returns a BOOLEAN_P token, but
with -"boolean"- it returns an Ident and -boolean- as a lval. Indeed, in
such a case I cannot recognize that simply boolean vs "boolean" if they
are both idents that look the same.

As a matter of fact, this can also be fixed with some post-filtering. Say,
all quoted idents could be returned with a leading " to show it was
dquoted, and the IDENT rules in the parser could remove when it is not
needed anymore to distinguish the case.

Not beautiful, I agree, but my point is that the current number of tokens
and number of states and automaton size are not inherent to SQL but to the
way the lexing/parsing is performed in postgresql.

> The basic point here is that eliminating tokens as you propose will
> result in small changes in behavior, none of which are good or per spec.
> Making the parser automaton smaller would be nice, but not at that
> price.

Ok. I don't want to change the spec. I still stand that it can be done,
although some more twicking is required. It was just a "proof of concept",
not a patch submission. Well, a "proof of concept" must still be a proof;-)

I attach a small patch that solve the boolean vs "boolean" issue, still as
a proof of concept that it is 'doable' to preserve semantics with a
different lexer/parser balance. I don't claim that it should be applied, I
just claim that the automaton size could be smaller, especially by
shortening the "unreserved_keyword" list.

> You have not proven that you can have the same result.

Well, I passed the regression tests, but that does not indeed prove
anything, because these issues are not tested at all.

Maybe you could consider to add the "regression" part of the attached
patcht, which creates a user "boolean" type.

Anyway, my motivation is about "hints" and "advises", and that does not
help a lot to solve these issues.

--
Fabien.

Attachment Content-Type Size
less_keywords.patch text/plain 10.4 KB