Lists: | pgsql-hackers |
---|
From: | jaba the mobzy <makaronaforna(at)yahoo(dot)com> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Bitmap Heap Scan anomaly |
Date: | 2007-05-03 21:33:23 |
Message-ID: | 930247.19794.qm@web63709.mail.re1.yahoo.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
I have done the following test and I am unable to understand the results. I have tried debugging the code and I have reached down to the Storage Layer. I am playing with the optimizer etc.. I no very little about the internals of the Executor.
If you could point out to me what possible explanation for such anomaly I would be very glad.
Thanks,
Makarona
My Test:
Setup:
-------
I have created two very similar tables mycorr_10 and mycorr_100, attribute names are {key,a,b} for both tables.
I added 16 M rows in both tables in the following fashion:
I gave a random value to each attribute key ( dont care )
Values in a,b take a random value from [1-16M]
In the case of mycorr_10 I set a random 10% of the a=b
In the case of mycorr_100 I set all a=b
I create index{a,b} on both tables
I VACUUM ANALYZE
p.s. I am trying to simulate an optimizer cardinality estimation error due to Independence assumption.
Query :
SELECT count(key)
FROM mycorr_10 -- (or mycorr_100)
WHERE a>15900000 and b>15900000;
Explain:
----------
As expected using the independence assumption the Planner chooses to use the index for both tables cases:
Aggregate([4130.82][4130.83][1][94083.95][94083.96][1] width=4)
-> Bitmap Heap Scan on mycorr_100([1997.92][4129.41][566][2021.57][93846.00][95177] width=4)
Recheck Cond: ((a > 15900000) AND (b > 15900000))
-> Bitmap Index Scan on ab_100([0.00][1997.77][566][0.00][1997.77][95177] width=0)
Index Cond: ((a > 15900000) AND (b > 15900000))
(5 rows)
p.s.
Explain output may seem weird as i have changes it a bit.
Explain Analyze
---------------------
restart postgres
echo 1 > /proc/sys/vm/drop_caches (drop file system caches)
explain analyze select count(key) from mycorr_10 where a>15900000 and b>15900000;
restart postgres
echo 1 > /proc/sys/vm/drop_caches
explain analyze select count(key) from mycorr_100 where a>15900000 and b>15900000;
Result for mycorr_100:
---------------------------
Aggregate([4130.82][4130.83][1][94083.95][94083.96][1] width=4) (actual time=11424.077..11424.078 rows=1 loops=1)
-> Bitmap Heap Scan on mycorr_100([1997.92][4129.41][566][2021.57][93846.00][95177] width=4) (actual time=167.979..11304.413 rows=100000 loops=1)
Recheck Cond: ((a > 15900000) AND (b > 15900000))
-> Bitmap Index Scan on ab_100([0.00][1997.77][566][0.00][1997.77][95177] width=0) (actual time=120.127..120.127 rows=100000 loops=1)
Index Cond: ((a > 15900000) AND (b > 15900000))
Total runtime: 11426.329 ms
(6 rows)
Result for mycorr_10:
---------------------------
Aggregate([4608.36][4608.37][1][94197.91][94197.92][1] width=4) (actual time=24393.058..24393.058 rows=1 loops=1)
-> Bitmap Heap Scan on mycorr_10([2249.51][4606.79][629][2272.83][93963.14][93908] width=4) (actual time=108.219..24374.050 rows=10563 loops=1)
Recheck Cond: ((a > 15900000) AND (b > 15900000))
-> Bitmap Index Scan on ab_10([0.00][2249.35][629][0.00][2249.35][93908] width=0) (actual time=89.432..89.432 rows=10563 loops=1)
Index Cond: ((a > 15900000) AND (b > 15900000))
Total runtime: 24393.555 ms
(6 rows)
-------------------------------------------------------------------------------------------------------------
Goodies:
-----------
pg_statio_all_tables ->
heap_blks_read=9931 (in case of mycorr_10)
heap_blks_read=118693 (in case of mycorr_100)
I have repeated the test more than 20 times up till now.
I have also made the same test with different table sizes and correlation level and the same anomaly persists.
Question:
------------
mycorr_100 took 11.4 s to run although it had to fetch 100000 row from the base table.
mycorr_10 took 24.4 s to run although it had to fetch 10563 row from the base table.
Any explanation for that?
Thank you for your patience.
-------------------------------------------------------------------------------------------------------------
__________________________________________________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.com
From: | Jeff Davis <pgsql(at)j-davis(dot)com> |
---|---|
To: | jaba the mobzy <makaronaforna(at)yahoo(dot)com> |
Cc: | pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Bitmap Heap Scan anomaly |
Date: | 2007-05-03 23:50:42 |
Message-ID: | 1178236242.28383.304.camel@dogma.v10.wvs |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
On Thu, 2007-05-03 at 14:33 -0700, jaba the mobzy wrote:
> mycorr_100 took 11.4 s to run although it had to fetch 100000 row from
> the base table.
> mycorr_10 took 24.4 s to run although it had to fetch 10563 row from
> the base table.
This is because the physical distribution of data is different. The
mycorr_10 table has tuples in which a and b are > 15.9M spread all
throughout. mycorr_100 has them all collected together at the end of the
physical file. Less disk seeking.
You can test this by doing a CLUSTER on both tables and run the same
queries again.
Regards,
Jeff Davis
From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Jeff Davis <pgsql(at)j-davis(dot)com> |
Cc: | jaba the mobzy <makaronaforna(at)yahoo(dot)com>, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Bitmap Heap Scan anomaly |
Date: | 2007-05-04 03:42:32 |
Message-ID: | 9867.1178250152@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Lists: | pgsql-hackers |
Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> On Thu, 2007-05-03 at 14:33 -0700, jaba the mobzy wrote:
>> mycorr_100 took 11.4 s to run although it had to fetch 100000 row from
>> the base table.
>> mycorr_10 took 24.4 s to run although it had to fetch 10563 row from
>> the base table.
> This is because the physical distribution of data is different. The
> mycorr_10 table has tuples in which a and b are > 15.9M spread all
> throughout. mycorr_100 has them all collected together at the end of the
> physical file. Less disk seeking.
If the OP had generated the data randomly, as claimed, the rows
shouldn't be particularly more clumped in one table than the other.
But I sure agree that it sounds like a nonrandom distribution in the
mycorr_100 table. FWIW I tried to duplicate the behavior, and could
not, using tables made up like this:
create table src as
select int4(16*1024*1024*random()) as key,
int4(16*1024*1024*random()) as a,
int4(16*1024*1024*random()) as b
from generate_series(1,16*1024*1024);
create table mycorr_10 as
select key, a,
case when random() < 0.1 then a else b end as b
from src;
create table mycorr_100 as
select key, a, a as b
from src;
create index mycorr_10i on mycorr_10(a,b);
create index mycorr_100i on mycorr_100(a,b);
vacuum analyze mycorr_10;
vacuum analyze mycorr_100;
regards, tom lane