Pushing restrictions down into GROUP BYs?

From: Jay Levitt <jay(dot)levitt(at)gmail(dot)com>
To: PG Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Pushing restrictions down into GROUP BYs?
Date: 2012-09-22 18:02:28
Message-ID: 505DFD34.9020206@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

A while back, I posted a pathological minimal-case query where, in order
to select one row from a users table, Postgres needed to scan the whole
users table, because the restriction was not visible to the GROUP BY.

At the time, Tom wrote:

> Don't hold your breath waiting for that to change. To do what you're
> wishing for, we'd have to treat the GROUP BY subquery as if it were an
> inner indexscan, and push a join condition into it. That's not even
> possible today. It might be possible after I get done with the
> parameterized-path stuff I've been speculating about for a couple of
> years now; but I suspect that even if it is possible, we won't do it
> for subqueries because of the planner-performance hit we'd take from
> repeatedly replanning the same subquery.

http://archives.postgresql.org/pgsql-performance/2011-11/msg00131.php

Given all the work that's been done on parameterized paths and LATERAL,
is this something to revisit yet? In 9.3, it's at least possible to
manually tweak the SRF, thanks to LATERAL. But it'd be great to allow
set-returning functions to remain blissfully unaware of their
surroundings. Modular code, Single Responsibility Principle, all that.

I guess a more general question is: Are there cases where the planner
can *use* LATERAL functionality to push down restrictions like this? (Do
LATERAL and pushdown conceptually overlap? I think maybe they do.)

Example code below - and before you say "but you could just use 'where
exists'", trust me that the original queries were much more involved :)

-----

drop schema if exists jaytest cascade;
create schema jaytest;
set search_path to jaytest;

create table questions (
id int not null primary key,
user_id int not null
);
insert into questions
select generate_series(1,1100), (random()*2000000)::int;

create table users (
id int not null primary key
);
insert into users select generate_series(1, 2000000);

vacuum freeze analyze;

create function srf() returns table (user_id int) as $$
select u.id
from users as u
group by u.id
$$ language sql stable;

-- Option 1: Use the set-returning function

explain analyze
select questions.id
from questions
join srf()
on srf.user_id = questions.user_id
where questions.id = 1;

-- Option 2: Use the equivalent of the set-returning function
-- (remove any doubts about function call overhead)

explain analyze
select questions.id
from questions
join (
select u.id as user_id
from users as u
group by u.id
) as srf1
on srf1.user_id = questions.user_id
where questions.id = 1;

-- Option 3: Now that we have LATERAL, manually
-- push the join down into the set-returning function

create function srf_lateral(questions_user_id int) returns table
(user_id int) as $$
select u.id
from users as u
where u.id = questions_user_id
group by u.id
$$ language sql stable;

explain analyze
select questions.id
from questions, lateral srf_lateral(questions.user_id)
where questions.id = 1;

drop schema jaytest cascade;

-----

On my machine, options 1 and 2 usually have runtimes of 100-400 ms, and
option 3 is about 0.04ms.

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2012-09-22 18:49:38 Re: [PATCH] lock_timeout and common SIGALRM framework
Previous Message Thom Brown 2012-09-22 17:58:04 Re: Caught up