From: Bradley Baetz <bbaetz(at)student(dot)usyd(dot)edu(dot)au>
To: pgsql-sql(at)postgresql(dot)org
Subject:
Date: 2002-02-23 10:21:07
Message-ID: Pine.LNX.4.44.0202232045320.2350-100000@tomato.home
Lists: pgsql-hackers pgsql-sql

I'm having some problems with query optimisation, using postgresql 7.2.

Basically, I'm hacking on porting bugzilla to pgsql, and I'm trying
various queries on a very small subset of the schema:

CREATE TABLE bugs (
    bug_id serial NOT NULL,
    assigned_to integer NOT NULL,
    reporter integer NOT NULL,
    PRIMARY KEY (bug_id)
);
CREATE INDEX bugs_assigned_to_idx ON bugs(assigned_to);
CREATE INDEX bugs_reporter_idx ON bugs(reporter);

CREATE TABLE cc (
    bug_id integer NOT NULL,
    who integer NOT NULL
);
CREATE INDEX cc_who_idx ON cc(who);
CREATE UNIQUE INDEX cc_bug_id_who_idx ON cc(bug_id,who);

I've used a perl script to insert 25000 users, 10000 bugs, and 30000 cc
entries randomly, and I then ran ANALYZE on the database. The script just
uses rand (plus some hashes to check that uniqueness still holds), so the
data should be fairly evenly distributed.
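For illustration, a sketch of such a generator (in Python rather than the
original perl, which isn't shown here, so the structure is an assumption -
the only stated requirements are rand plus uniqueness checks):

```python
import random

# Sizes as described: 25000 users, 10000 bugs, 30000 cc entries.
N_USERS, N_BUGS, N_CC = 25000, 10000, 30000

# Each bug gets a random assigned_to and reporter (uniform, like rand).
bugs = [(bug_id,
         random.randint(1, N_USERS),   # assigned_to
         random.randint(1, N_USERS))   # reporter
        for bug_id in range(1, N_BUGS + 1)]

# A set plays the role of the hashes: retry on collision so the
# unique index on cc(bug_id, who) still holds.
cc = set()
while len(cc) < N_CC:
    cc.add((random.randint(1, N_BUGS), random.randint(1, N_USERS)))
```

With uniform draws like these, values such as who=86 should indeed be
fairly evenly distributed across the cc table.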

Now I want to search for all bugs which user #86 is ccd on:

bugs=> explain analyze SELECT bugs.bug_id FROM bugs LEFT JOIN cc ON
bugs.bug_id = cc.bug_id WHERE cc.who=86;
NOTICE: QUERY PLAN:

Merge Join  (cost=0.00..1817.34 rows=10030 width=12) (actual time=1516.75..2951.11 rows=1 loops=1)
  ->  Index Scan using bugs_pkey on bugs  (cost=0.00..201.30 rows=10030 width=4) (actual time=0.33..135.67 rows=10000 loops=1)
  ->  Index Scan using cc_bug_id_who_idx on cc  (cost=0.00..1065.97 rows=30000 width=8) (actual time=0.37..2522.77 rows=30000 loops=1)
Total runtime: 2951.37 msec

EXPLAIN
bugs=>

However, if I disable merge joins:

bugs=> set enable_mergejoin=0;
SET VARIABLE
bugs=> explain analyze SELECT bugs.bug_id FROM bugs LEFT JOIN cc ON
bugs.bug_id = cc.bug_id WHERE cc.who=86;
NOTICE: QUERY PLAN:

Hash Join  (cost=1212.86..3288.98 rows=10030 width=12) (actual time=1024.50..1237.59 rows=1 loops=1)
  ->  Seq Scan on bugs  (cost=0.00..159.30 rows=10030 width=4) (actual time=0.16..70.25 rows=10000 loops=1)
  ->  Hash  (cost=463.00..463.00 rows=30000 width=8) (actual time=284.51..284.51 rows=0 loops=1)
        ->  Seq Scan on cc  (cost=0.00..463.00 rows=30000 width=8) (actual time=0.13..159.54 rows=30000 loops=1)
Total runtime: 1237.78 msec

EXPLAIN
bugs=>

Then the time taken more than halves - from 2951 ms to 1237 ms. Is this a
bug in the optimiser?

Also, in this case it would be better to use an inner join rather than a
left join. Since there is a condition on cc.who which won't match NULL
values from the cc table, an inner join should give the same results,
shouldn't it? Using an inner join makes the query take 0.98 msec, so it
would be good if postgres could do that optimisation automatically,
assuming that it is valid. (This particular query is generated from a perl
script, and some cases do need the left join. It should probably be fixed
on that side, but it would be nice if pg could do it automatically.)
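To see why the two forms should agree here, a toy check (sqlite3 is used
purely for convenience - the thread is about PostgreSQL 7.2 - and the
sample rows are made up): a strict WHERE condition on the nullable side
can never be satisfied by a NULL-extended row, so the LEFT JOIN and the
INNER JOIN return identical rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY, assigned_to INT, reporter INT);
    CREATE TABLE cc (bug_id INT, who INT);
    INSERT INTO bugs VALUES (1, 5, 5), (2, 6, 6), (3, 7, 7);
    INSERT INTO cc VALUES (1, 86), (2, 99);
""")

# Bug 3 has no cc row; its NULL-extended row fails cc.who = 86,
# so the outer join adds nothing over the inner join.
left  = db.execute("""SELECT bugs.bug_id FROM bugs LEFT JOIN cc
                      ON bugs.bug_id = cc.bug_id WHERE cc.who = 86""").fetchall()
inner = db.execute("""SELECT bugs.bug_id FROM bugs INNER JOIN cc
                      ON bugs.bug_id = cc.bug_id WHERE cc.who = 86""").fetchall()
assert left == inner == [(1,)]
```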

Thanks,

Bradley


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bradley Baetz <bbaetz(at)student(dot)usyd(dot)edu(dot)au>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re:
Date: 2002-02-25 16:48:22
Message-ID: 21639.1014655702@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-sql

Bradley Baetz <bbaetz(at)student(dot)usyd(dot)edu(dot)au> writes:
> I'm having some problems with query optimisation, using postgresql 7.2.

It looks like on your hardware, seqscans are much cheaper relative to
indexscans than the optimizer is expecting. Note the ratios of cost
estimates to actual runtimes. You might care to experiment with the
optimizer parameters, such as random_page_cost, to see if you can get
closer to the actual behavior of your configuration.

> Also, in this case it would be better to use an inner join rather than a
> left join. Since there is a condition on cc.who which won't match NULL
> values from the cc table, an inner join should give the same results,
> shouldn't it?

It's not so much that you want an inner join as that you want to be able
to figure out that the cc.who=86 condition could be applied before
joining rather than after. I have not thought hard about how the
optimizer could determine whether this is a safe transformation. In
general it's obviously not safe when dealing with an outer join --- but
maybe in some cases we could allow it. Can anyone propose a rule?

regards, tom lane

PS: this discussion would be better suited for pgsql-hackers, I think.


From: Bradley Baetz <bbaetz(at)student(dot)usyd(dot)edu(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [SQL] query optimisation problems
Date: 2002-02-26 08:09:55
Message-ID: Pine.LNX.4.44.0202261818150.18500-100000@tomato.home
Lists: pgsql-hackers pgsql-sql

My original post with the schema is at
http://archives.postgresql.org/pgsql-sql/2002-02/msg00353.php, I'm moving
this to -hackers.

The query in question (run on postgres 7.2) is:

SELECT bugs.bug_id FROM bugs LEFT JOIN cc ON bugs.bug_id = cc.bug_id WHERE
cc.who=86;

All the relevant columns are indexed, the tables have been ANALYZEd, and
there are 10,000 bugs with 30,000 entries in the cc table. The full schema
is in my original mail.

On Mon, 25 Feb 2002, Tom Lane wrote:

> Bradley Baetz <bbaetz(at)student(dot)usyd(dot)edu(dot)au> writes:
> > I'm having some problems with query optimisation, using postgresql 7.2.
>
> It looks like on your hardware, seqscans are much cheaper relative to
> indexscans than the optimizer is expecting. Note the ratios of cost
> estimates to actual runtimes. You might care to experiment with the
> optimizer parameters, such as random_page_cost, to see if you can get
> closer to the actual behavior of your configuration.
>

Any idea what to set this to? Is there some way to measure this, or a
standard test which I can run several times and average the result? Won't
this be affected by the disk pages always being in cache if I run the same
test multiple times? I have a standard PIII-500 - I'm not running anything
fancy.

> > Also, in this case it would be better to use an inner join rather than a
> > left join. Since there is a condition on cc.who which won't match NULL
> > values from the cc table, an inner join should give the same results,
> > shouldn't it?
>
> It's not so much that you want an inner join as that you want to be able
> to figure out that the cc.who=86 condition could be applied before
> joining rather than after. I have not thought hard about how the
> optimizer could determine whether this is a safe transformation. In
> general it's obviously not safe when dealing with an outer join --- but
> maybe in some cases we could allow it. Can anyone propose a rule?

I don't know. It seems to me that an outer join is equivalent to an inner
join if at least one of the columns on the outer table has a condition
which restricts it to a non-null value (comparison to a constant, IS NOT
NULL, comparison to an inner join'd NOT NULL column, etc., possibly applied
recursively through other outer joins). I'm not an SQL guru though - is
that assumption always correct?

On a possibly related note, consider:

bugs=> explain analyze select bugs.bug_id from bugs,longdescs where
bugs.bug_id=longdescs.bug_id AND (bugs.reporter=500);
NOTICE: QUERY PLAN:

Nested Loop  (cost=0.00..51.78 rows=8 width=8) (actual time=0.06..0.06 rows=0 loops=1)
  ->  Index Scan using bugs_reporter_idx on bugs  (cost=0.00..9.82 rows=2 width=4) (actual time=0.05..0.05 rows=0 loops=1)
  ->  Index Scan using longdescs_bug_id_idx on longdescs  (cost=0.00..21.11 rows=5 width=4)
Total runtime: 0.27 msec

EXPLAIN
bugs=> explain analyze select bugs.bug_id from bugs,longdescs where
bugs.bug_id=longdescs.bug_id AND (longdescs.who=500);
NOTICE: QUERY PLAN:

Nested Loop  (cost=0.00..30.22 rows=4 width=8) (actual time=0.13..0.28 rows=3 loops=1)
  ->  Index Scan using longdescs_who_idx on longdescs  (cost=0.00..18.03 rows=4 width=4) (actual time=0.07..0.10 rows=3 loops=1)
  ->  Index Scan using bugs_pkey on bugs  (cost=0.00..3.01 rows=1 width=4) (actual time=0.04..0.04 rows=1 loops=3)
Total runtime: 0.49 msec

EXPLAIN
bugs=> explain analyze select bugs.bug_id from bugs,longdescs where
bugs.bug_id=longdescs.bug_id AND (bugs.reporter=500 OR longdescs.who=500);
NOTICE: QUERY PLAN:

Hash Join  (cost=184.38..2107.54 rows=12 width=16) (actual time=261.61..1527.52 rows=3 loops=1)
  ->  Seq Scan on longdescs  (cost=0.00..618.45 rows=40145 width=8) (actual time=0.13..245.82 rows=40000 loops=1)
  ->  Hash  (cost=159.30..159.30 rows=10030 width=8) (actual time=84.84..84.84 rows=0 loops=1)
        ->  Seq Scan on bugs  (cost=0.00..159.30 rows=10030 width=8) (actual time=0.14..50.74 rows=10000 loops=1)
Total runtime: 1527.74 msec

EXPLAIN
bugs=>

I'd run these queries just before this run, which is probably why the
first two times are so fast, but the OR query is still much, much slower.
It's not using an index at all - is this the same problem?

If I disable seqscan and/or hashjoin, I get:

NOTICE: QUERY PLAN:

Merge Join  (cost=0.00..2428.65 rows=12 width=16) (actual time=408.46..4262.60 rows=3 loops=1)
  ->  Index Scan using bugs_pkey on bugs  (cost=0.00..201.30 rows=10030 width=8) (actual time=0.42..147.50 rows=10000 loops=1)
  ->  Index Scan using longdescs_bug_id_idx on longdescs  (cost=0.00..1399.38 rows=40145 width=8) (actual time=0.36..3661.06 rows=40000 loops=1)
Total runtime: 4262.88 msec

which is worse. If I then drop longdescs_bug_id_idx, (still with seqscan
disabled), I get:

NOTICE: QUERY PLAN:

Hash Join  (cost=100000226.00..100002243.00 rows=12 width=16) (actual time=328.60..1565.86 rows=3 loops=1)
  ->  Seq Scan on longdescs  (cost=100000000.00..100000617.00 rows=40000 width=8) (actual time=0.13..253.66 rows=40000 loops=1)
  ->  Hash  (cost=201.00..201.00 rows=10000 width=8) (actual time=150.23..150.23 rows=0 loops=1)
        ->  Index Scan using bugs_pkey on bugs  (cost=0.00..201.00 rows=10000 width=8) (actual time=0.40..109.76 rows=10000 loops=1)
Total runtime: 1566.11 msec

The longdescs cost seems a bit too high, and rerunning VACUUM ANALYZE
didn't change the results. That cost is probably why removing the index
sped up the actual result.

A union select is roughly as fast as the individual queries combined.
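The union rewrite looks like this (checked with sqlite3 for convenience,
on made-up rows): each arm of the OR becomes its own query, so each can
use its own index. Note the rewrite is only exactly equivalent under
DISTINCT semantics, since UNION removes duplicates that the OR form could
produce for a bug with several matching longdescs rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY, reporter INT);
    CREATE TABLE longdescs (bug_id INT, who INT);
    CREATE INDEX bugs_reporter_idx ON bugs(reporter);
    CREATE INDEX longdescs_who_idx ON longdescs(who);
    INSERT INTO bugs VALUES (1, 500), (2, 7), (3, 8);
    INSERT INTO longdescs VALUES (2, 500), (3, 9);
""")

# Original form: the OR spans both tables, defeating a single index.
or_q = db.execute("""SELECT bugs.bug_id FROM bugs, longdescs
                     WHERE bugs.bug_id = longdescs.bug_id
                       AND (bugs.reporter = 500 OR longdescs.who = 500)
                     ORDER BY 1""").fetchall()

# Rewrite: one indexable condition per branch, combined with UNION.
union_q = db.execute("""SELECT bugs.bug_id FROM bugs, longdescs
                        WHERE bugs.bug_id = longdescs.bug_id AND bugs.reporter = 500
                        UNION
                        SELECT bugs.bug_id FROM bugs, longdescs
                        WHERE bugs.bug_id = longdescs.bug_id AND longdescs.who = 500
                        ORDER BY 1""").fetchall()
assert or_q == union_q
```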

> regards, tom lane
>
> PS: this discussion would be better suited for pgsql-hackers, I think.
>

OK, moved.

Bradley