Re: two index bitmap scan of a big table & hash_seq_search

From: "Sergey E(dot) Koposov" <math(at)sai(dot)msu(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: two index bitmap scan of a big table & hash_seq_search
Date: 2011-08-19 17:56:06
Message-ID: alpine.LRH.2.00.1108191956360.23450@lnfm1.sai.msu.ru
Lists: pgsql-hackers

Hello Hackers,

I've recently noticed particularly strange behaviour from one of my
simple queries:

SELECT ra, dec FROM sdssdr7.photoobjall WHERE ra BETWEEN 175 AND 190 AND
dec BETWEEN 17 AND 23;

The table is quite large (1.1 TB, 10^9 rows) and has B-tree indexes on
the "ra" and "dec" columns.
The plan is a combination of two bitmap scans:
wsdb=# explain SELECT ra,dec FROM sdssdr7.photoobjall WHERE ra BETWEEN 175 and 190 AND dec BETWEEN 17 And 23 ;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on photoobjall  (cost=1854918.62..13401835.50 rows=3266290 width=16)
   Recheck Cond: ((ra >= 175::double precision) AND (ra <= 190::double precision) AND ("dec" >= 17::double precision) AND ("dec" <= 23::double precision))
   ->  BitmapAnd  (cost=1854918.62..1854918.62 rows=3266290 width=0)
         ->  Bitmap Index Scan on sdssdr7_photoobjall_ra_idx  (cost=0.00..607217.21 rows=30530306 width=0)
               Index Cond: ((ra >= 175::double precision) AND (ra <= 190::double precision))
         ->  Bitmap Index Scan on sdssdr7_photoobjall_dec_idx  (cost=0.00..1246068.01 rows=62654186 width=0)
               Index Cond: (("dec" >= 17::double precision) AND ("dec" <= 23::double precision))

But the funny thing I noticed is that after running for a certain amount
of time doing I/O, the query starts to use 100% CPU and spends 99% of the
time in hash_seq_search. Here is the oprofile of PG during that period:
--------
CPU: Intel Core/i7, speed 2.268e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples % symbol name
303404 99.3562 hash_seq_search
1163 0.3808 tbm_lossify
639 0.2093 hash_search_with_hash_value
....
--------

During that phase it very rarely does any actual I/O. Every second or so
it issues a read of the main relation (not the index). Here is what
strace looks like during that time:
read(455, "\0\0\0\0\0\0\0\0\1\0\0\0t\6\200\6\360\37\4 \0\0\0\0\200\206 \0\340\237 \0\320"..., 8192) = 8192
... repeated 100 or more times(issued every second or so) ..
lseek(455, 603635712, SEEK_SET) = 603635712
read(455, "\0\0\0\0\0\0\0\0\1\0\0\0t\6\200\6\360\37\4 \0\0\0\0\200\206 \0\340\237 \0\320"..., 8192) = 8192
... repeated a lot of times...
...etc...
So it seems that it actually reads the same thing over and over again.

Another funny thing is that I have exactly the same table, but with a
significantly smaller number of columns (so it occupies 180 GB instead)
and the same indexes. All the rows in that table are also in exactly the
same order (both tables were clustered the same way). For that table the
query completes without problems in 20 seconds or so:
wsdb=> explain analyze SELECT ra,dec FROM sdssdr7.phototag WHERE ra BETWEEN 175 and 190 AND dec BETWEEN 17 And 23 ;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on phototag  (cost=1824731.39..10750473.21 rows=3204629 width=16) (actual time=21146.520..23136.512 rows=2973205 loops=1)
   Recheck Cond: ((ra >= 175::double precision) AND (ra <= 190::double precision) AND ("dec" >= 17::double precision) AND ("dec" <= 23::double precision))
   ->  BitmapAnd  (cost=1824731.39..1824731.39 rows=3204629 width=0) (actual time=21093.906..21093.906 rows=0 loops=1)
         ->  Bitmap Index Scan on sdssdr7_phototag2_ra_idx  (cost=0.00..613910.25 rows=30866810 width=0) (actual time=6502.589..6502.589 rows=30418322 loops=1)
               Index Cond: ((ra >= 175::double precision) AND (ra <= 190::double precision))
         ->  Bitmap Index Scan on sdssdr7_phototag2_dec_idx  (cost=0.00..1209218.57 rows=60801242 width=0) (actual time=14201.455..14201.455 rows=63031352 loops=1)
               Index Cond: (("dec" >= 17::double precision) AND ("dec" <= 23::double precision))
 Total runtime: 23295.384 ms
(8 rows)

Meanwhile, for the "wider" table the query has now been running for more
than an hour at 100% CPU and almost no I/O.

Additional info:
PG version: 8.4.8
OS: Debian 5.0
Hardware: 2x Xeon E5520, 12 GB RAM, RAID50 with BBU
The tables are completely static (i.e. no write activity on them), and
have been vacuum analyzed.
The system is not busy at all (few if any concurrent queries).
The tables have a large number of columns of real/int/bigint/double
precision type.

Some config parameters:
effective_cache_size | 6GB
work_mem | 1GB
effective_io_concurrency | 0
shared_buffers | 4GB

Any ideas about what could be wrong? Any other info I can provide?

Thanks in advance,
Sergey

*******************************************************************
Sergey E. Koposov, PhD
Institute for Astronomy, Cambridge UK /Sternberg Astronomical Institute
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math(at)sai(dot)msu(dot)ru


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Sergey E(dot) Koposov" <math(at)sai(dot)msu(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: two index bitmap scan of a big table & hash_seq_search
Date: 2011-08-20 03:03:40
Message-ID: 3266.1313809420@sss.pgh.pa.us
Lists: pgsql-hackers

"Sergey E. Koposov" <math(at)sai(dot)msu(dot)ru> writes:
> But the funny thing I noticed is that after running for a certain amount
> of time doing I/O, the query starts to use 100% CPU and spends 99% of the
> time in hash_seq_search. Here is the oprofile of PG during that period:
> --------
> CPU: Intel Core/i7, speed 2.268e+06 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
> samples % symbol name
> 303404 99.3562 hash_seq_search
> 1163 0.3808 tbm_lossify
> 639 0.2093 hash_search_with_hash_value

It seems like you've uncovered a scaling limitation in the tidbitmap
logic when it has to deal with very very large numbers of pages.

I might be reading too much into the mention of tbm_lossify, but
I wonder if the problem is repeated invocations of tbm_lossify()
as the bitmap gets larger. Maybe that function needs to be more
aggressive about how much information it deletes per call.
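
For illustration, a simplified sketch of the kind of loop in question might
look like the following (this is a reconstruction for discussion, not the
actual 8.4 tidbitmap.c source; it assumes the declarations and helpers in
tidbitmap.c, with tbm_mark_page_lossy() folding one exact page entry into a
lossy chunk and decrementing tbm->nentries):

	/*
	 * Simplified reconstruction of the lossify loop (not the actual source).
	 */
	static void
	tbm_lossify_sketch(TIDBitmap *tbm)
	{
		HASH_SEQ_STATUS status;
		PagetableEntry *page;

		hash_seq_init(&status, tbm->pagetable);
		while ((page = (PagetableEntry *) hash_seq_search(&status)) != NULL)
		{
			if (page->ischunk)
				continue;		/* already lossy, nothing to reclaim */

			/* fold this page into a lossy chunk, freeing its entry */
			tbm_mark_page_lossy(tbm, page->blockno);

			/*
			 * Stop as soon as we are back under the limit.  If we were over
			 * by only one entry, that happens immediately, and the very next
			 * tuple added to the bitmap re-triggers the whole scan.
			 */
			if (tbm->nentries <= tbm->maxentries)
			{
				hash_seq_term(&status);
				break;
			}
		}
	}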

regards, tom lane


From: "Sergey E(dot) Koposov" <math(at)sai(dot)msu(dot)ru>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: two index bitmap scan of a big table & hash_seq_search
Date: 2011-08-20 16:50:21
Message-ID: alpine.LRH.2.00.1108202038070.23450@lnfm1.sai.msu.ru
Lists: pgsql-hackers

On Fri, 19 Aug 2011, Tom Lane wrote:
> I might be reading too much into the mention of tbm_lossify, but
> I wonder if the problem is repeated invocations of tbm_lossify()
> as the bitmap gets larger. Maybe that function needs to be more
> aggressive about how much information it deletes per call.
Thanks for the idea, Tom.

Yes, it turns out that the problem was lossifying the bitmap too
intensely. I've put elog calls around the lossification in
tbm_add_tuples():

	if (tbm->nentries > tbm->maxentries)
	{
		elog(WARNING, "lossifying %d %d", tbm->nentries, tbm->maxentries);
		tbm_lossify(tbm);
		elog(WARNING, "lossified %d", tbm->nentries);
	}

And I saw in my log:
koposov:wsdb:2011-08-20 17:31:46 BST:21524 WARNING: lossifying 13421773 13421772
koposov:wsdb:2011-08-20 17:31:46 BST:21524 WARNING: lossified 13421772
issued at a rate of 20000 per second. I.e. it lossifies one page per
tbm_lossify() call (and does a lot of hash_seq_search operations in the
process) ...
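
As a side note, that maxentries figure is consistent with the work_mem
setting above if one assumes roughly 80 bytes per exact page entry once
hash-table overhead is included; the 80-byte figure is my own rough
assumption, not something measured:

	/* Back-of-the-envelope check of the maxentries value seen in the log.
	 * The ~80 bytes per entry (PagetableEntry plus dynahash overhead) is an
	 * assumed constant, not a measured one. */
	#include <stdio.h>

	int main(void)
	{
		long maxbytes   = 1024L * 1024L * 1024L;	/* work_mem = 1GB */
		long entry_size = 80;						/* assumed bytes per entry */

		printf("maxentries ~= %ld\n", maxbytes / entry_size);	/* 13421772 */
		return 0;
	}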

After that I changed the check in tbm_lossify()
from:
	if (tbm->nentries <= tbm->maxentries)
to:
	if (tbm->nentries <= (0.8 * tbm->maxentries))

which allowed the query to finish in 75 seconds (compared to 3 hours).

I'm not entirely sure that my change to tbm_lossify() is the proper fix,
but it seems to work.
Do you think that this should be fixed?

Sergey

*******************************************************************
Sergey E. Koposov, PhD
Institute for Astronomy, Cambridge/Sternberg Astronomical Institute
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math(at)sai(dot)msu(dot)ru


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Sergey E(dot) Koposov" <math(at)sai(dot)msu(dot)ru>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: two index bitmap scan of a big table & hash_seq_search
Date: 2011-08-20 17:03:57
Message-ID: 1359.1313859837@sss.pgh.pa.us
Lists: pgsql-hackers

"Sergey E. Koposov" <math(at)sai(dot)msu(dot)ru> writes:
> Yes, it turns out that the problem was lossifying the bitmap too
> intensely.

Yeah, I had just been coming to the same conclusion. Your table has
about 134M pages, and if the planner estimate of 62M rows was right
(and there's no reason it shouldn't be pretty close on that) then
we're talking about a bitmap that's going to contain about one bit
set in each of about half of the pages. The page structures are
50-some bytes apiece so a non-lossy representation would run to
3-plus GB, well beyond your work_mem limit. So it would fill up
to work_mem and then start lossifying pages ... one at a time.
I had suspected that there might be a performance issue there,
as per the comment at line 954, but we hadn't actually seen it
reported from the field before.
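
For concreteness, the arithmetic works out roughly as follows (my own
numbers plugged into the figures quoted above; the 54-byte entry size is
just a stand-in for "50-some bytes"):

	/* Rough arithmetic behind the "3-plus GB" estimate. */
	#include <stdio.h>

	int main(void)
	{
		double entries    = 62e6;		/* ~62M pages, one matching row each */
		double entry_size = 54.0;		/* "50-some" bytes per page entry */
		double work_mem   = 1024.0 * 1024.0 * 1024.0;	/* 1GB */
		double total      = entries * entry_size;

		printf("non-lossy bitmap: ~%.1f GB\n",
			   total / (1024.0 * 1024.0 * 1024.0));		/* ~3.1 GB */
		printf("vs work_mem:      ~%.1fx over budget\n", total / work_mem);
		return 0;
	}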

> After that I changed the check in tbm_lossify()
> from:
> 	if (tbm->nentries <= tbm->maxentries)
> to:
> 	if (tbm->nentries <= (0.8 * tbm->maxentries))
> which allowed the query to finish in 75 seconds (compared to 3 hours).

I was about to propose using tbm->maxentries/2, which is in the same
spirit but a tad cheaper to calculate.

I think that we also need to consider the possibility that tbm_lossify
finishes its loop without ever getting under maxentries --- that could
only happen with very large tables and very small work_mem, but it could
happen. If it did, then all subsequent operations would keep on calling
tbm_lossify, and it would keep scanning the entire hashtable and
probably not accomplishing much, and taking forever to do it. Unless
somebody has a better idea, what I think we should do then is just
artificially inflate maxentries --- that is, accept that we are not
going to fit in the originally requested work_mem, and we might as well
set a more realistic goal.
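
Putting those two ideas together, the changes might look roughly like this
(an untested sketch of the approach, not a patch; Min() is the usual
backend macro, and the clamp only guards the multiplication against
integer overflow). In the scan loop, stop only once we are well under the
limit:

	if (tbm->nentries <= tbm->maxentries / 2)
	{
		hash_seq_term(&status);
		break;
	}

And after the loop, if lossifying everything still could not get us under
maxentries, accept that we will exceed the original work_mem target and
raise the limit, so that later insertions do not rescan the whole
hashtable for nothing:

	if (tbm->nentries > tbm->maxentries / 2)
		tbm->maxentries = Min(tbm->nentries, (INT_MAX - 1) / 2) * 2;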

> Do you think that this should be fixed?

Yes, definitely.

regards, tom lane