Views with unions

Lists: pgsql-performance
From: Mariusz Czułada <manieq(at)idea(dot)net(dot)pl>
To: pgsql-performance(at)postgresql(dot)org
Subject: Views with unions
Date: 2003-02-15 23:48:13
Message-ID: 200302160048.14681.manieq@idea.net.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Hello,

While testing multi-select views I found some problems. Here are details. I have 3 tables and I created a view on them:

create view view123 as
select key, value from tab1 where key=1
union all
select key, value from tab2 where key=2
union all
select key, value from tab3 where key=3;

When querying with no conditions, I get plan:

test_db=# explain analyze select key, value from view123;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Subquery Scan view123 (cost=0.00..3.19 rows=15 width=11) (actual time=0.15..1.00 rows=15 loops=1)
-> Append (cost=0.00..3.19 rows=15 width=11) (actual time=0.14..0.80 rows=15 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..1.06 rows=5 width=11) (actual time=0.13..0.30 rows=5 loops=1)
-> Seq Scan on tab1 (cost=0.00..1.06 rows=5 width=11) (actual time=0.11..0.22 rows=5 loops=1)
Filter: ("key" = 1)
-> Subquery Scan "*SELECT* 2" (cost=0.00..1.06 rows=5 width=11) (actual time=0.07..0.22 rows=5 loops=1)
-> Seq Scan on tab2 (cost=0.00..1.06 rows=5 width=11) (actual time=0.05..0.15 rows=5 loops=1)
Filter: ("key" = 2)
-> Subquery Scan "*SELECT* 3" (cost=0.00..1.06 rows=5 width=11) (actual time=0.06..0.22 rows=5 loops=1)
-> Seq Scan on tab3 (cost=0.00..1.06 rows=5 width=11) (actual time=0.05..0.15 rows=5 loops=1)
Filter: ("key" = 3)
Total runtime: 1.57 msec
(12 rows)

But with "key = 3":

test_db# explain analyze select key, value from view123 where key=3;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Subquery Scan view123 (cost=0.00..3.22 rows=7 width=11) (actual time=0.40..0.65 rows=5 loops=1)
-> Append (cost=0.00..3.22 rows=7 width=11) (actual time=0.38..0.58 rows=5 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..1.07 rows=1 width=11) (actual time=0.18..0.18 rows=0 loops=1)
-> Seq Scan on tab1 (cost=0.00..1.07 rows=1 width=11) (actual time=0.17..0.17 rows=0 loops=1)
Filter: (("key" = 1) AND ("key" = 3))
-> Subquery Scan "*SELECT* 2" (cost=0.00..1.07 rows=1 width=11) (actual time=0.11..0.11 rows=0 loops=1)
-> Seq Scan on tab2 (cost=0.00..1.07 rows=1 width=11) (actual time=0.11..0.11 rows=0 loops=1)
Filter: (("key" = 2) AND ("key" = 3))
-> Subquery Scan "*SELECT* 3" (cost=0.00..1.07 rows=5 width=11) (actual time=0.08..0.25 rows=5 loops=1)
-> Seq Scan on tab3 (cost=0.00..1.07 rows=5 width=11) (actual time=0.06..0.18 rows=5 loops=1)
Filter: (("key" = 3) AND ("key" = 3))
Total runtime: 1.22 msec
(12 rows)

I would expect, that false filters, like (("key" = 1) AND ("key" = 3)) will make table full scan unnecessary. So I expected plan like:

test_db# explain analyze select key, value from view123 where key=3;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Subquery Scan view123 (cost=0.00..3.22 rows=7 width=11) (actual time=0.40..0.65 rows=5 loops=1)
-> Append (cost=0.00..3.22 rows=7 width=11) (actual time=0.38..0.58 rows=5 loops=1)
-> Subquery Scan "*SELECT* 1" (cost=0.00..1.07 rows=1 width=11) (actual time=0.18..0.18 rows=0 loops=1)
-> Result (cost=0.00..0.00 rows=0 width=11) (actual time=0.01..0.01 rows=0 loops=1)
^^^^^^^^^^^ my change
Filter: (("key" = 1) AND ("key" = 3)) [always false]
^^^^^^^^^^^ my change
-> Subquery Scan "*SELECT* 2" (cost=0.00..1.07 rows=1 width=11) (actual time=0.11..0.11 rows=0 loops=1)
-> Result (cost=0.00..0.00 rows=0 width=11) (actual time=0.01..0.01 rows=0 loops=1)
^^^^^^^^^^^ my change
Filter: (("key" = 2) AND ("key" = 3)) [always false]
^^^^^^^^^^^ my change
-> Subquery Scan "*SELECT* 3" (cost=0.00..1.07 rows=5 width=11) (actual time=0.08..0.25 rows=5 loops=1)
-> Seq Scan on tab3 (cost=0.00..1.07 rows=5 width=11) (actual time=0.06..0.18 rows=5 loops=1)
Filter: (("key" = 3) AND ("key" = 3))
Total runtime: 1.22 msec
(12 rows)

No "Seq Scan" on tables where filter is false.

I realize that's how it works now, but:

a) is there any way to avoid such scans?
b) is it possible (or in TODO) to optimize for such cases?

Regards,

Mariusz Czułada


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Mariusz Czułada <manieq(at)idea(dot)net(dot)pl>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Views with unions
Date: 2003-02-16 03:54:33
Message-ID: 200302151954.33740.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Mariusz,

> While testing multi-select views I found some problems. Here are details. I
> have 3 tables and I created a view on them:

What version of PostgreSQL are you using? UNION views optimized extremely
poorly through 7.2.4; things have been improved in 7.3

--
Josh Berkus
Aglio Database Solutions
San Francisco


From: Mariusz Czułada <manieq(at)idea(dot)net(dot)pl>
To: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Views with unions
Date: 2003-02-16 08:19:06
Message-ID: 200302160919.07970.manieq@idea.net.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Hi,

Dnia nie 16. lutego 2003 04:54, Josh Berkus napisał:
>
> What version of PostgreSQL are you using? UNION views optimized extremely
> poorly through 7.2.4; things have been improved in 7.3

PostgreSQL 7.3 on sparc-sun-solaris2.9, compiled by GCC gcc (GCC) 3.2 (self
compiled on SunBlade 100).

Mariusz


From: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Mariusz Czułada <manieq(at)idea(dot)net(dot)pl>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Views with unions
Date: 2003-02-16 18:05:34
Message-ID: 20030216095555.K94589-100000@megazone23.bigpanda.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

On Sat, 15 Feb 2003, Josh Berkus wrote:

> Mariusz,
>
> > While testing multi-select views I found some problems. Here are details. I
> > have 3 tables and I created a view on them:
>
> What version of PostgreSQL are you using? UNION views optimized extremely
> poorly through 7.2.4; things have been improved in 7.3

Yeah, but I think what he's hoping is that it'll notice that
"key=1 and key=3" would be noticed as a false condition so that it doesn't
scan those tables since a row presumably can't satisify both. The question
would be, is the expense of checking the condition for all queries
greater than the potential gain for these sorts of queries. In addition,
you'd have to be careful to make it work correctly with operator
overloading, since someone could make operators whose semantics in
cross-datatype comparisons are wierd.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Mariusz Czułada <manieq(at)idea(dot)net(dot)pl>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Views with unions
Date: 2003-02-16 18:51:18
Message-ID: 28375.1045421478@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com> writes:
> Yeah, but I think what he's hoping is that it'll notice that
> "key=1 and key=3" would be noticed as a false condition so that it doesn't
> scan those tables since a row presumably can't satisify both. The question
> would be, is the expense of checking the condition for all queries
> greater than the potential gain for these sorts of queries.

Yes, this is the key point: we won't put in an optimization that wins on
a small class of queries unless there is no material cost added for
planning cases where it doesn't apply.

> In addition, you'd have to be careful to make it work correctly with
> operator overloading, since someone could make operators whose
> semantics in cross-datatype comparisons are wierd.

In practice we would restrict such deductions to mergejoinable =
operators, which are sufficiently semantics-constrained that I think
you can treat equality at face value.

Actually, in CVS tip we are on the hairy edge of being able to do this:
generate_implied_equalities() actually detects that the given conditions
imply that two constants are equal. But it doesn't do anything with the
knowledge, because I couldn't figure out just what to do --- it's not
always correct to add a "WHERE false" constraint to the top level, but
where exactly do we add it? Exactly which relations are guaranteed to
produce zero rows in such a case? (When there are outer joins in the
picture, zero rows out of some relations doesn't mean zero rows out
overall.) And how do we exploit that knowledge once we've got it?
It'd be a pretty considerable amount of work to optimize a plan tree
fully for this sort of thing (eg, suppressing unnecessary joins), and
I doubt it's worth the trouble.

regards, tom lane


From: Mariusz Czułada <manieq(at)idea(dot)net(dot)pl>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Views with unions
Date: 2003-02-16 20:27:31
Message-ID: 200302162127.31324.manieq@idea.net.pl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Dnia nie 16. lutego 2003 19:51, Tom Lane napisał:
> Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com> writes:
> > Yeah, but I think what he's hoping is that it'll notice that
> > "key=1 and key=3" would be noticed as a false condition so that it
> > doesn't scan those tables since a row presumably can't satisify both. The

Yes, that is what I expected.

>
> Yes, this is the key point: we won't put in an optimization that wins on
> a small class of queries unless there is no material cost added for
> planning cases where it doesn't apply.
>
> > In addition, you'd have to be careful to make it work correctly with
> > operator overloading, since someone could make operators whose
> > semantics in cross-datatype comparisons are wierd.
>
> It'd be a pretty considerable amount of work to optimize a plan tree
> fully for this sort of thing (eg, suppressing unnecessary joins), and
> I doubt it's worth the trouble.

Ok, perhaps I should give some explaination about my case.

We are gathering lots of log data in a few tables. Each table grows by some
300.000...500.000 rows a day. With average row size of 100 bytes we get up to
50MB of data per day. Keeping data for 1 year only gives us some 18GB per
table. Also, in each table there is a field with very low cardinality (5..20
unique values). This field appears in most of our queries to the table, in
'where' clause (mostly key_field = 5, some times key_field in (1,2,3)).

What I was thinking of is to implement some kind of horizontal table
partitioning. I wanted to split physical storage of data to few smaller
tables. In my case it could be come 12 subtables, 1..2 GB each. Now, with
'union-all' view (and lots of rules, of course) I could simultate partitioned
table as Oracle implements it. IMHO while querying this view (supertable) for
one or few 'key_field' values it should be much faster for scan 5 GB of 3
partitions (subtables) than 18GB for one big table.

I realize it is not the only solution. Perhaps it could be implemented by a
function taking key_filed value and returning all rows from proper table
(p[lus functions for insert/update/delete). Perhaps application code (log
feeder and report module) could be recoded to know about splitted tables.
Still I think it is 'elegant' and clear.

I wait for your comments,

Mariusz Czulada