Planner doesn't look at LIMIT?

From: Dawid Kuroczko <qnex42(at)gmail(dot)com>
To: Postgresql Performance <pgsql-performance(at)postgresql(dot)org>
Subject: Planner doesn't look at LIMIT?
Date: 2005-07-22 09:10:05
Message-ID: 758d5e7f0507220210bfa4978@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

Hello, I have PostgreSQL 8.0.3 running on a "workstation" with 768 MB
of RAM, under FreeBSD. And I have a 47-milion row table:

qnex=# explain select * from log;
QUERY PLAN
-----------------------------------------------------------------------
Seq Scan on log (cost=0.00..1741852.36 rows=47044336 width=180)
(1 row)

...which is joined with a few smaller ones, like:

qnex=# explain select * from useragents;
QUERY PLAN
-------------------------------------------------------------------
Seq Scan on useragents (cost=0.00..9475.96 rows=364896 width=96)
(1 row)

shared_buffers = 5000
random_page_cost = 3
work_mem = 102400
effective_cache_size = 60000

Now, if I do a SELECT:

qnex=# EXPLAIN SELECT * FROM log NATURAL JOIN useragents LIMIT 1;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=15912.20..15912.31 rows=1 width=272)
-> Hash Join (cost=15912.20..5328368.96 rows=47044336 width=272)
Hash Cond: ("outer".useragent_id = "inner".useragent_id)
-> Seq Scan on log (cost=0.00..1741852.36 rows=47044336 width=180)
-> Hash (cost=9475.96..9475.96 rows=364896 width=96)
-> Seq Scan on useragents (cost=0.00..9475.96
rows=364896 width=96)
(6 rows)

Or:

qnex=# EXPLAIN SELECT * FROM log NATURAL LEFT JOIN useragents LIMIT 1;
QUERY PLAN
-------------------------------------------------------------------------------------
Limit (cost=15912.20..15912.31 rows=1 width=272)
-> Hash Left Join (cost=15912.20..5328368.96 rows=47044336 width=272)
Hash Cond: ("outer".useragent_id = "inner".useragent_id)
-> Seq Scan on log (cost=0.00..1741852.36 rows=47044336 width=180)
-> Hash (cost=9475.96..9475.96 rows=364896 width=96)
-> Seq Scan on useragents (cost=0.00..9475.96
rows=364896 width=96)
(6 rows)

Time: 2.688 ms

...the query seems to last forever (its hashing 47 million rows!)

If I set enable_hashjoin=false:

qnex=# EXPLAIN ANALYZE SELECT * FROM log NATURAL LEFT JOIN useragents LIMIT 1;
QUERY
PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..3.07 rows=1 width=272) (actual time=74.214..74.216
rows=1 loops=1)
-> Nested Loop Left Join (cost=0.00..144295895.01 rows=47044336
width=272) (actual time=74.204..74.204 rows=1 loops=1)
-> Seq Scan on log (cost=0.00..1741852.36 rows=47044336
width=180) (actual time=23.270..23.270 rows=1 loops=1)
-> Index Scan using useragents_pkey on useragents
(cost=0.00..3.02 rows=1 width=96) (actual time=50.867..50.867 rows=1
loops=1)
Index Cond: ("outer".useragent_id = useragents.useragent_id)
Total runtime: 74.483 ms

...which is way faster. Of course if I did:

qnex=# EXPLAIN ANALYZE SELECT * FROM log NATURAL LEFT JOIN useragents
WHERE logid = (SELECT logid FROM log LIMIT 1);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
Nested Loop Left Join (cost=0.04..6.09 rows=1 width=272) (actual
time=61.403..61.419 rows=1 loops=1)
InitPlan
-> Limit (cost=0.00..0.04 rows=1 width=4) (actual
time=0.029..0.032 rows=1 loops=1)
-> Seq Scan on log (cost=0.00..1741852.36 rows=47044336
width=4) (actual time=0.023..0.023 rows=1 loops=1)
-> Index Scan using log_pkey on log (cost=0.00..3.02 rows=1
width=180) (actual time=61.316..61.319 rows=1 loops=1)
Index Cond: (logid = $0)
-> Index Scan using useragents_pkey on useragents
(cost=0.00..3.02 rows=1 width=96) (actual time=0.036..0.042 rows=1
loops=1)
Index Cond: ("outer".useragent_id = useragents.useragent_id)
Total runtime: 61.741 ms
(9 rows)

...I tried tweaking cpu_*, work_mem, effective_cache and so on, but without
any luck. 47 milion table is huge compared to useragents (I actually need
to join the log with 3 similar to useragents tables, and create a view out
of it). Also tried using LEFT/RIGHT JOINS insead of (inner) JOINs...
Of course the database is freshly vacuum analyzed, and statistics are
set at 50...

My view of the problem is that planner ignores the "LIMIT" part. It assumes
it _needs_ to return all 47 million rows joined with the useragents table, so
the hashjoin is the only sane approach. But chances are that unless I'll
use LIMIT 200000, the nested loop will be much faster.

Any ideas how to make it work (other than rewriting the query to use
subselects, use explicit id-rows, disabling hashjoin completely)?
Or is this a bug?

Regards,
Dawid

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message ohp 2005-07-22 09:28:52 regressin failure on latest CVS
Previous Message Simon Riggs 2005-07-22 07:51:26 Re: Writing Commit Status hint bits (was Re: [HACKERS] Constant

Browse pgsql-performance by date

  From Date Subject
Next Message PFC 2005-07-22 13:16:57 Re: Planner doesn't look at LIMIT?
Previous Message Andrew Dunstan 2005-07-22 05:27:41 Re: COPY FROM performance improvements