Quick Links

Re: Recursive query syntax ambiguity

Lists:	pgsql-hackers

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Recursive query syntax ambiguity
Date:	2007-01-26 17:10:01
Message-ID:	87r6thaf4m.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hm, I had hoped that the DB2/ANSI syntax would only require making "WITH" a
fully reserved word, and not the other tokens it uses. Certainly for
non-recursive queries that's the case as the only other token it uses is "AS"
which is already a fully reserved word.

However to fully support the DB2/ANSI syntax we would definitely have an
ambiguity and I think we would have to make "CYCLE" a fully reserved word
which seems like a much bigger concession than "WITH". Observe the following
case:

WITH RECURSIVE foo (x,y) AS (select 1,2) SEARCH DEPTH FIRST BY x CYCLE x,y SET ...

The parser can't search arbitrarily far checking for a SET to see if the CYCLE
is a keyword or a binary operator. Even if it could things like this would be
entirely ambiguous:

WITH RECURSIVE foo (x,y) AS (select 1,2) SEARCH DEPTH FIRST BY x CYCLE x, y CYCLE y SET ...

I'm nowhere near actually implementing this functionality yet so there's no
pressing need for action. In fact I think the search clause is actually an
ANSIism that isn't supported by DB2 itself yet either.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Recursive query syntax ambiguity
Date:	2007-01-26 20:42:51
Message-ID:	20070126204251.GC24276@svana.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 26, 2007 at 05:10:01PM +0000, Gregory Stark wrote:
> However to fully support the DB2/ANSI syntax we would definitely have an
> ambiguity and I think we would have to make "CYCLE" a fully reserved word
> which seems like a much bigger concession than "WITH". Observe the following
> case:
>
> WITH RECURSIVE foo (x,y) AS (select 1,2) SEARCH DEPTH FIRST BY x CYCLE x,y SET ...
>
> The parser can't search arbitrarily far checking for a SET to see if the CYCLE
> is a keyword or a binary operator.

Er, CYCLE isn't a binary operator, and users can't make binary
operators that are words, so I'm not sure of the problem here.
I think the parser can tell that the expression ends at the word
"cycle".

Or am I missing obvious?

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Recursive query syntax ambiguity
Date:	2007-01-26 21:15:46
Message-ID:	29102.1169846146@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> Er, CYCLE isn't a binary operator, and users can't make binary
> operators that are words, so I'm not sure of the problem here.

Well, the problem typically is not being able to tell whether an
operator is supposed to be infix or postfix; hence keywords that can
terminate arbitrary expressions usually have to be reserved words.
However, now that I look at the syntax I think Greg may be misreading
it. I see

<search clause> ::=
SEARCH <recursive search order> SET <sequence column>

<recursive search order> ::=
DEPTH FIRST BY <sort specification list>
| BREADTH FIRST BY <sort specification list>

<cycle clause> ::=
CYCLE <cycle column list> SET <cycle mark column> TO <cycle
mark value> DEFAULT <non-cycle mark value> USING <path column>

<cycle column list> ::= <cycle column> [ {<comma><cycle column>}...]

<non-cycle mark value> ::= <value expression>

and so CYCLE would come *after* "SET <sequence column>" not before it.
It looks to me like we'd have to promote SET to fully reserved status,
but that probably isn't going to surprise anyone. DEFAULT and USING
already are fully reserved. I don't see anything else here that looks
like it should need to be reserved.

regards, tom lane

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "PostgreSQL Development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Recursive query syntax ambiguity
Date:	2007-01-27 00:12:20
Message-ID:	87lkjp8h0b.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> <search clause> ::=
> SEARCH <recursive search order> SET <sequence column>
>
> and so CYCLE would come *after* "SET <sequence column>" not before it.

Ah, thanks, I had glossed right over the "SET <sequence column>" bit. The SET
that I had was the "SET <cycle column>" which remains after the CYCLE keyword.

> It looks to me like we'd have to promote SET to fully reserved status,
> but that probably isn't going to surprise anyone. DEFAULT and USING
> already are fully reserved. I don't see anything else here that looks
> like it should need to be reserved.

Having fixed that everything works fine with SET and WITH being reserved
keywords. You didn't mean to say I should be able to leave WITH unreserved did
you?

Of course that was the easy part...

Implementing non-recursive common table expressions should be fairly
mechanical though I think I'll have lots of questions about how to get all the
variable references fixed up.

Non-recursive common table expressions are always non-correlated. They can
refer to previous common table expressions but only to select from them either
in the FROM clause or in subqueries. So as far as I can see they can just go
in an InitPlan (or One-Time-Plan? I'm not sure what the distinction is) and be
referred to in the same way.

Recursive queries are of course a whole lot trickier. I've been slowly
wrapping my head around them. So far I have a pretty good idea how to churn
out a typical recursive query analogous to a CONNECT BY query.

But the spec is a lot more ambitious than that. I haven't quite wrapped my
head around the idea of mutually recursive or non-linearly-recursive queries
yet.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	"Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "PostgreSQL Development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Recursive query syntax ambiguity
Date:	2007-01-27 04:49:29
Message-ID:	2163.1169873369@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> Having fixed that everything works fine with SET and WITH being reserved
> keywords. You didn't mean to say I should be able to leave WITH unreserved did
> you?

I think we'd decided that was a lost cause, unless you see a way?

> Of course that was the easy part...

Yup ;-)

regards, tom lane

From:	Gregory Stark <stark(at)enterprisedb(dot)com>
To:	"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "PostgreSQL Development" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Recursive query syntax ambiguity
Date:	2007-01-29 13:38:02
Message-ID:	878xfm6jid.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>> Having fixed that everything works fine with SET and WITH being reserved
>> keywords. You didn't mean to say I should be able to leave WITH unreserved did
>> you?
>
> I think we'd decided that was a lost cause, unless you see a way?

No, I decided it was a lost cause right at the start of this. I was just
double checking that you didn't think otherwise when you said you didn't see
anything else that needed to be reserved.

>> Of course that was the easy part...
>
> Yup ;-)

I think I've finally wrapped my head around the idea of mutually recursive and
non-linearly recursive queries. Good thing too since in thinking about them I
realized that the approach I had intended to use for plain non-recursive
queries wasn't really adequate and needs the same machinery.

The fundamental feature Postgres needs to implement this that it doesn't have
is the ability to be running the same subplan from two different contexts in
the plan simultaneously. That is, the ability to start a scan from the
beginning without disturbing a scan already in progress for that same node.

In retrospect this isn't terribly surprising since it's the same
characteristic functions need in a programming language to be safely callable
recursively. They need to be reentrant.

The normal way to make functions reentrant is to remove all local state and
make the caller responsible for providing space to allocate memory for
temporary state. But teaching every node type about passing its state up to
the parent including bundling up all the state of its children seems like too
much work. Both for me and the computer :)

In any case that would result in an n^2 algorithm for recursive queries as it
would force the executor to re-execute each node for the same record over and
over. Much as the obvious recursive fibonacci algorithm is n^2 for example.
It also would mean making non-recursive WITH clauses pointless since they
would still be executed once per call-site.

Instead I suggest we create one type of reentrant node, which memoizes its
output. It would be a kind of on-demand Materialize. It would be paired with a
second node type which would be the only type of node which can invoke it.
This RecursiveCall node would invoke the Memoize node using a special
interface in which it passes enough state for the Memoize node to seek to the
correct place in its output.

Because it memoizes its output it should never actually need to be able to
handle recursive calls. When the call depth is maximum each call site would be
able to make at most 1 call to the Memoize node. A second one would
necessarily have been stopped higher up the call chain by the memoization
table. (I've convinced myself that that's true but I should probably work out
a good proof of it before I make all this depend on it.)

There are three general cases of the Memoize node. Breadth-first, Depth-first,
and non-linearly-recursive.

In the case of breadth-first search we would want to keep two tuplestore
around, one for the current generation, and one for its parent. Once we finish
a generation we can free its parent since we know a linearly-recursive scan
won't be needing it any more.

In the case of depth-first we really want to memoize into a stack. I'm not
sure how to fit that into our current model of tuplestores. At worst though,
we just allocate a tuplestore for each generation as it appears. We can still
throw away old generations once their scans are finished, it just won't be
until quite late.

Non-linearly-recursive searches are the same case as non-recursive queries.
Memoize would just allocate one tuplestore and store tuples blindly without
any idea of "generation". It could arrange to throw away old tuples whenever
all call-sites' scans have past them. That's the same optimization Simon wants
to make in the merge-join case when the mark is moved forward.

There are a few open issues to consider. Namely, how to cost a RecursiveCall
node.

Also, if a subplan has exactly one call-site we really ought to inline it as
it will get much more reliable plans. Similarly if there are two call sites we
should consider inlining it if the cost of materializing the results (and
reading them back) is more than n-call-sites x the cost of executing the
query. I would expect That would happen with plain sequential scans for
example.

Note that we need the Memoize/RecursiveCall nodes even for non-recursive
queries. Except in the simplest cases where each common table expression only
has a single call site (in which case we could have just inlined them anyways)
we need some way to retain state and avoid reexecuting the plan multiple
times.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

From:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
To:	Gregory Stark <stark(at)enterprisedb(dot)com>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Recursive query syntax ambiguity
Date:	2007-02-05 16:18:23
Message-ID:	20070205161823.GC4811@svana.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jan 29, 2007 at 01:38:02PM +0000, Gregory Stark wrote:
> Instead I suggest we create one type of reentrant node, which memoizes its
> output. It would be a kind of on-demand Materialize. It would be paired with a
> second node type which would be the only type of node which can invoke it.
> This RecursiveCall node would invoke the Memoize node using a special
> interface in which it passes enough state for the Memoize node to seek to the
> correct place in its output.

That I beleive is the right approach. I think an equivalent way of
looking at it is a "Loop" node with an InitPlan and a StepPlan.

Initially, the Loop node executes the InitPlan to get it's initial set
of tuples, storing them like a Materialize node does. After that it
keeps calling the StepNode, where somewhere inside the it has a node
that extracts a row from the aforementioned tuplestore. It stores these
returned tuples in the tuplestore also, thus giving you recursion.

<snip>

> (I've convinced myself that that's true but I should probably work out
> a good proof of it before I make all this depend on it.)

Yeah, proving it is going to be tricky, I'm not sure what the standard
says about infinite recursion.

> There are three general cases of the Memoize node. Breadth-first, Depth-first,
> and non-linearly-recursive.

I think the the only difference between depth and bredth-first searches
is (if you consider the tuplesort to be a queue) whether the new tuples
go to the front or the back of the list. But a data structures and
algorithms book will know this.

> There are a few open issues to consider. Namely, how to cost a RecursiveCall
> node.

One note: if you know that if you get p tuples out for every tuple in
(where p<1) then the asymptotic result of 1 + p + p*p+ ... is 1/(1-p)

However, I don't know it matters. You only need to cost the plan if
there are alternate paths and given the plan structure is strongly
constrained, I'm not sure how much it matters.

> Also, if a subplan has exactly one call-site we really ought to inline it as
> it will get much more reliable plans. Similarly if there are two call sites we
> should consider inlining it if the cost of materializing the results (and
> reading them back) is more than n-call-sites x the cost of executing the
> query. I would expect That would happen with plain sequential scans for
> example.

In the case where the subplan has side-effects, you can't optimise at
all. In the case of read-committed mode, will two seq-scans always
return the same result?

Hope this helps,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc:	Gregory Stark <stark(at)enterprisedb(dot)com>, PostgreSQL Development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Recursive query syntax ambiguity
Date:	2007-02-05 16:23:58
Message-ID:	446.1170692638@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> However, I don't know it matters. You only need to cost the plan if
> there are alternate paths and given the plan structure is strongly
> constrained, I'm not sure how much it matters.

It does, since the whole thing could be a subquery, in which case there
could be options available at the outer level. I doubt we'll be able to
be really smart, but that doesn't mean we can just punt.

> In the case of read-committed mode, will two seq-scans always
> return the same result?

They definitely should, since we'll be using the same snapshot
throughout the query.

regards, tom lane