Re: Recursive query syntax ambiguity

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "PostgreSQL Development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Recursive query syntax ambiguity
Date: 2007-01-29 13:38:02
Message-ID: 878xfm6jid.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>> Having fixed that everything works fine with SET and WITH being reserved
>> keywords. You didn't mean to say I should be able to leave WITH unreserved did
>> you?
>
> I think we'd decided that was a lost cause, unless you see a way?

No, I decided it was a lost cause right at the start of this. I was just
double checking that you didn't think otherwise when you said you didn't see
anything else that needed to be reserved.

>> Of course that was the easy part...
>
> Yup ;-)

I think I've finally wrapped my head around the idea of mutually recursive and
non-linearly recursive queries. Good thing too since in thinking about them I
realized that the approach I had intended to use for plain non-recursive
queries wasn't really adequate and needs the same machinery.

The fundamental feature Postgres needs to implement this that it doesn't have
is the ability to be running the same subplan from two different contexts in
the plan simultaneously. That is, the ability to start a scan from the
beginning without disturbing a scan already in progress for that same node.

In retrospect this isn't terribly surprising since it's the same
characteristic functions need in a programming language to be safely callable
recursively. They need to be reentrant.

The normal way to make functions reentrant is to remove all local state and
make the caller responsible for providing space to allocate memory for
temporary state. But teaching every node type about passing its state up to
the parent including bundling up all the state of its children seems like too
much work. Both for me and the computer :)

In any case that would result in an n^2 algorithm for recursive queries as it
would force the executor to re-execute each node for the same record over and
over. Much as the obvious recursive fibonacci algorithm is n^2 for example.
It also would mean making non-recursive WITH clauses pointless since they
would still be executed once per call-site.

Instead I suggest we create one type of reentrant node, which memoizes its
output. It would be a kind of on-demand Materialize. It would be paired with a
second node type which would be the only type of node which can invoke it.
This RecursiveCall node would invoke the Memoize node using a special
interface in which it passes enough state for the Memoize node to seek to the
correct place in its output.

Because it memoizes its output it should never actually need to be able to
handle recursive calls. When the call depth is maximum each call site would be
able to make at most 1 call to the Memoize node. A second one would
necessarily have been stopped higher up the call chain by the memoization
table. (I've convinced myself that that's true but I should probably work out
a good proof of it before I make all this depend on it.)

There are three general cases of the Memoize node. Breadth-first, Depth-first,
and non-linearly-recursive.

In the case of breadth-first search we would want to keep two tuplestore
around, one for the current generation, and one for its parent. Once we finish
a generation we can free its parent since we know a linearly-recursive scan
won't be needing it any more.

In the case of depth-first we really want to memoize into a stack. I'm not
sure how to fit that into our current model of tuplestores. At worst though,
we just allocate a tuplestore for each generation as it appears. We can still
throw away old generations once their scans are finished, it just won't be
until quite late.

Non-linearly-recursive searches are the same case as non-recursive queries.
Memoize would just allocate one tuplestore and store tuples blindly without
any idea of "generation". It could arrange to throw away old tuples whenever
all call-sites' scans have past them. That's the same optimization Simon wants
to make in the merge-join case when the mark is moved forward.

There are a few open issues to consider. Namely, how to cost a RecursiveCall
node.

Also, if a subplan has exactly one call-site we really ought to inline it as
it will get much more reliable plans. Similarly if there are two call sites we
should consider inlining it if the cost of materializing the results (and
reading them back) is more than n-call-sites x the cost of executing the
query. I would expect That would happen with plain sequential scans for
example.

Note that we need the Memoize/RecursiveCall nodes even for non-recursive
queries. Except in the simplest cases where each common table expression only
has a single call site (in which case we could have just inlined them anyways)
we need some way to retain state and avoid reexecuting the plan multiple
times.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2007-01-29 15:42:59 Re: Phantom command ids again
Previous Message Heikki Linnakangas 2007-01-29 13:30:02 Phantom command ids again