(Mis)using the PostgreSQL planner to get estimated row counts

Lists: pgsql-sql
From: Nick Johnson <arachnid(at)notdot(dot)net>
To: pgsql-sql(at)postgresql(dot)org
Subject: (Mis)using the PostgreSQL planner to get estimated row counts
Date: 2005-05-25 07:13:25
Message-ID: FD3B7D7D-18B1-443D-B229-C235CC7BF44E@notdot.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

I'm trying to write a PostgreSQL extension to estimate the number of
rows returned by a SELECT statement. Ideally, it'd be invoked along
the lines of "SELECT estimate_row_count('SELECT foo FROM bar INNER
JOIN baz ON (id) WHERE a=b');", and would be useful for estimating
the number of pages in a search result, for example.

I've got as far as figuring out how to get an estimated row count
from a Node object for a query (mostly by looking at the code for
EXPLAIN), but there I'm a bit mired - I can't figure out how to take
a string representing an SQL statement and parse it into a Node
object I can feed to the planner.

So, a couple of questions:
1) Can anyone suggest where I should look in the source for the
requisite functions for parsing an SQL string into a Node I can feed
to the planner so I can get an estimated row count?
2) Is this something that's reasonable to do in the first place? I'm
not sure if extensions are supposed to be allowed to delve into
PostgreSQL's internals this much.
3) Are there any other gotchas around this area? For example, there's
a lot going on with Snapshots and ActiveSnapshot that I really don't
have any idea about.

-Nick Johnson


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Nick Johnson <arachnid(at)notdot(dot)net>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: (Mis)using the PostgreSQL planner to get estimated row counts
Date: 2005-05-31 16:00:13
Message-ID: 12893.1117555213@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Nick Johnson <arachnid(at)notdot(dot)net> writes:
> I'm trying to write a PostgreSQL extension to estimate the number of
> rows returned by a SELECT statement.

Instead of fooling around at the C level, why don't you just do an
EXPLAIN and parse out the first row of the result? For instance

regression=# create function estimate_row_count(text) returns text as $$
regression$# declare x record;
regression$# begin
regression$# for x in execute 'EXPLAIN ' || $1 loop
regression$# return substring(x."QUERY PLAN" from 'rows=([0-9]+) width=');
regression$# end loop;
regression$# end$$ language plpgsql strict;
CREATE FUNCTION
regression=# select estimate_row_count('select * from tenk1');
estimate_row_count
--------------------
10000
(1 row)

Of course this is subject to future breakage due to changes in the
output textual format, etc etc, but it's surely less fragile than
anything written in C will be.

Depending on what you want the results for, it might be best to ignore
any top-level LIMIT node.

regards, tom lane