Re: 27 second plan times

Lists: pgsql-hackers
From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: 27 second plan times
Date: 2007-04-20 20:17:54
Message-ID: 87647qajgt.fsf@oxford.xeocode.com


Following up on some complaints we've had about extremely large plan times for
large partitioned tables I've been doing some profiling. I've constructed a
situation where it takes upwards of 30 seconds to plan a simple query like:

postgres=# explain select * from a where i between 999 and 1001;
                              QUERY PLAN
----------------------------------------------------------------------
 Result  (cost=0.00..8.63 rows=58 width=108)
   ->  Append  (cost=0.00..8.63 rows=58 width=108)
         ->  Seq Scan on a  (cost=0.00..1.01 rows=1 width=108)
               Filter: ((i >= 999) AND (i <= 1001))
         ->  Seq Scan on a997 a  (cost=0.00..1.27 rows=1 width=108)
               Filter: ((i >= 999) AND (i <= 1001))
         ->  Seq Scan on a998 a  (cost=0.00..1.27 rows=1 width=108)
               Filter: ((i >= 999) AND (i <= 1001))
         ->  Seq Scan on a999 a  (cost=0.00..1.27 rows=18 width=108)
               Filter: ((i >= 999) AND (i <= 1001))
         ->  Seq Scan on a1000 a  (cost=0.00..1.27 rows=18 width=108)
               Filter: ((i >= 999) AND (i <= 1001))
         ->  Seq Scan on a1001 a  (cost=0.00..1.27 rows=18 width=108)
               Filter: ((i >= 999) AND (i <= 1001))
         ->  Seq Scan on a1002 a  (cost=0.00..1.27 rows=1 width=108)
               Filter: ((i >= 999) AND (i <= 1001))
(16 rows)

Time: 46324.627 ms

Table "a" is a parent table with 2,000 partitions, each of which has 102
columns, two of which are covered by check constraints of the form
"CHECK (i BETWEEN 90 AND 110)".
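
For reference, a setup along these lines can be produced with a short
generator script. This is a hypothetical reconstruction from the description
above (column layout, constraint width, and naming are assumptions), not the
script actually used:

```python
# Sketch: generate DDL resembling the test case -- a wide parent table
# and child partitions that each carry a CHECK constraint over a narrow
# range of i. Names and exact column layout are assumptions.

def make_parent(n_text_cols: int = 100) -> str:
    # i, j plus 100 filler columns = 102 columns total, per the description
    cols = ", ".join(f"c{j} text" for j in range(1, n_text_cols + 1))
    return f"create table a (i int, j int, {cols});"

def make_partition(k: int, halfwidth: int = 10) -> str:
    # e.g. make_partition(100) yields a constraint over 90..110
    return (f"create table a{k} "
            f"(check (i between {k - halfwidth} and {k + halfwidth})) "
            f"inherits (a);")

ddl = [make_parent()] + [make_partition(k) for k in range(1, 2001)]
```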

The gprof output is pretty damning:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls  s/call   s/call  name
 62.58    203.76   203.76  21474937    0.00     0.00  SearchCatCache
  9.01    233.08    29.32   1923789    0.00     0.00  list_nth_cell
  5.34    250.46    17.38   2386465    0.00     0.00  pgstat_initstats
  2.80    259.57     9.11 235691652    0.00     0.00  AllocSetAlloc
  1.95    265.91     6.34 219852840    0.00     0.00  nocachegetattr
  1.19    269.78     3.87 256569078    0.00     0.00  FunctionCall2
  0.74    272.18     2.40 107923848    0.00     0.00  MemoryContextAllocZeroAligned

The SearchCatCache here is the one in get_attavgwidth called to estimate the
relation width. There are 200k attributes being measured here but I'm not
clear why it's causing 21M calls.
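
Back-of-the-envelope arithmetic on the numbers above (a sketch of the puzzle,
nothing more): 2,000 partitions times 102 columns gives about 200k attributes,
so 21M SearchCatCache calls works out to on the order of a hundred cache
probes per attribute, far more than the single get_attavgwidth lookup each
one would seem to need:

```python
partitions = 2000
columns = 102                        # per the table description above
attributes = partitions * columns
searchcatcache_calls = 21_474_937    # from the gprof output

assert attributes == 204_000         # the "200k attributes" figure
calls_per_attribute = searchcatcache_calls / attributes
# Roughly 105 catalogue-cache probes per attribute -- the unexplained
# amplification that makes SearchCatCache dominate the profile.
assert round(calls_per_attribute) == 105
```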

The first thing that comes to mind is that we're doing the
constraint_exclusion code *after* estimating the width of the relations we're
going to exclude. If we push the constraint exclusion up a few lines, the
planning time goes down to 1.7s.

I think there's still a problem here with some kind of n^2 behaviour for
appends of very wide tables, but I haven't quite nailed it down yet. In any
case, is there any reason not to make the following small change to move the
constraint exclusion ahead of the size estimates and index checks and save
ourselves potentially a lot of work?

Index: allpaths.c
===================================================================
RCS file: /home/stark/src/REPOSITORY/pgsql/src/backend/optimizer/path/allpaths.c,v
retrieving revision 1.161
diff -c -r1.161 allpaths.c
*** allpaths.c 22 Feb 2007 22:00:23 -0000 1.161
--- allpaths.c 20 Apr 2007 18:12:40 -0000
***************
*** 196,215 ****
static void
set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
{
- /* Mark rel with estimated output rows, width, etc */
- set_baserel_size_estimates(root, rel);
-
- /* Test any partial indexes of rel for applicability */
- check_partial_indexes(root, rel);
-
- /*
- * Check to see if we can extract any restriction conditions from join
- * quals that are OR-of-AND structures. If so, add them to the rel's
- * restriction list, and recompute the size estimates.
- */
- if (create_or_index_quals(root, rel))
- set_baserel_size_estimates(root, rel);
-
/*
* If we can prove we don't need to scan the rel via constraint exclusion,
* set up a single dummy path for it. (Rather than inventing a special
--- 196,201 ----
***************
*** 228,233 ****
--- 214,233 ----
return;
}

+ /* Mark rel with estimated output rows, width, etc */
+ set_baserel_size_estimates(root, rel);
+
+ /* Test any partial indexes of rel for applicability */
+ check_partial_indexes(root, rel);
+
+ /*
+ * Check to see if we can extract any restriction conditions from join
+ * quals that are OR-of-AND structures. If so, add them to the rel's
+ * restriction list, and recompute the size estimates.
+ */
+ if (create_or_index_quals(root, rel))
+ set_baserel_size_estimates(root, rel);
+
/*
* Generate paths and add them to the rel's pathlist.
*

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 27 second plan times
Date: 2007-04-20 21:47:15
Message-ID: 22785.1177105635@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> The SearchCatCache here is the one in get_attavgwidth called to estimate the
> relation width.

Hmm, that information is supposed to be cached ... could you provide the
test case?

> The first thing that comes to mind is that we're doing the
> constraint_exclusion code *after* estimating the width of the relations we're
> going to exclude. If we push the constraint exclusion up a few lines the
> planning time goes down to 1.7s.

... and probably breaks a few things; are you sure there are no needed
side effects of the skipped code?

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 27 second plan times
Date: 2007-04-20 23:02:42
Message-ID: 87tzva8x9p.fsf@oxford.xeocode.com
Lists: pgsql-hackers


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>> The SearchCatCache here is the one in get_attavgwidth called to estimate the
>> relation width.
>
> Hmm, that information is supposed to be cached ... could you provide the
> test case?

Cached meaning it's stored in the cache once it's looked up to avoid future
lookups? In which case it will still mean 200k lookups since there really are
that many attributes.

Or cached meaning we prime the cache in one batch operation? It may well be
avoiding i/o, but there are suspiciously large numbers of calls to
SearchCatCache itself.

I'll send the pg_dump to pgsql-patches. But it's much larger than really
necessary to reproduce the effect. The planning time was already through the
roof before I added data and extra constraints. The key seems to be the wide
tables, though: planning 2,000 child tables is fast if they only have a
couple of columns.

>> The first thing that comes to mind is that we're doing the
>> constraint_exclusion code *after* estimating the width of the relations we're
>> going to exclude. If we push the constraint exclusion up a few lines the
> planning time goes down to 1.7s.
>
> ... and probably breaks a few things; are you sure there are no needed
> side effects of the skipped code?

Well, that's why I asked whether there was a reason we had to do it this way.
I haven't checked yet what those other pieces of code are doing exactly. But
the constraint exclusion doesn't depend on much else; I can't see anything
related to stats or indexes affecting it. The regression tests all pass, but
that doesn't prove anything.
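
For range-style check constraints like the ones in this test case, the heart
of the exclusion decision is just interval disjointness. A toy sketch of that
idea (illustrative only; the planner's real machinery proves refutation for
general predicates, not just ranges):

```python
def excluded(check_lo: int, check_hi: int, q_lo: int, q_hi: int) -> bool:
    """A child rel can be skipped when its CHECK range provably
    cannot overlap the query's restriction range."""
    return check_hi < q_lo or check_lo > q_hi

# Partition constrained to i BETWEEN 90 AND 110 versus the query
# "i between 999 and 1001" from the example: provably empty, excluded.
assert excluded(90, 110, 999, 1001)

# A partition whose range covers 999..1001 cannot be excluded.
assert not excluded(990, 1010, 999, 1001)
```

Nothing in that test depends on statistics or indexes, which is why moving it
ahead of the size estimates looks safe.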

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 27 second plan times
Date: 2007-04-21 00:50:21
Message-ID: 87lkgm8saa.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>> The SearchCatCache here is the one in get_attavgwidth called to estimate the
>> relation width.
>
> Hmm, that information is supposed to be cached ... could you provide the
> test case?

This case caused my machine to thrash and swap terribly with the backend
requiring over 1.3G of RAM to handle the explain.

What's really strange is that I got warnings about checkpoints happening too
frequently. How could an explain of a select cause checkpoints?! There are no
log entries for autovacuum but that's the only thing I can think of.

I'll dive back into gprof tomorrow, but for now it's time for sleep.

postgres=# copy (select 'create table t (' union all select 'a'||generate_series(1,1599)||' bool,' union all select 'a1600 bool);') to '/tmp/t';
COPY 1601
postgres=# \i /tmp/t
CREATE TABLE
postgres=# copy (select 'create table t'||generate_series(1,1000)||' () inherits (t);') to '/tmp/u';
COPY 1000
Time: 5.002 ms
postgres=#
postgres=# \i /tmp/u
postgres=# explain select * from t;
                              QUERY PLAN
------------------------------------------------------------------------
 Result  (cost=0.00..10510.50 rows=50050 width=1600)
   ->  Append  (cost=0.00..10510.50 rows=50050 width=1600)
         ->  Seq Scan on t  (cost=0.00..10.50 rows=50 width=1600)
         ->  Seq Scan on t1 t  (cost=0.00..10.50 rows=50 width=1600)
         ->  Seq Scan on t2 t  (cost=0.00..10.50 rows=50 width=1600)
         ...
         ->  Seq Scan on t999 t  (cost=0.00..10.50 rows=50 width=1600)
         ->  Seq Scan on t1000 t  (cost=0.00..10.50 rows=50 width=1600)
(1003 rows)

Time: 889727.115 ms

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 27 second plan times
Date: 2007-04-21 01:20:37
Message-ID: 3643.1177118437@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> Following up on some complaints we've had about extremely large plan times for
> large partitioned tables I've been doing some profiling.

>   %   cumulative   self              self     total
>  time   seconds   seconds     calls  s/call   s/call  name
>  62.58    203.76   203.76  21474937    0.00     0.00  SearchCatCache
>   9.01    233.08    29.32   1923789    0.00     0.00  list_nth_cell
>   5.34    250.46    17.38   2386465    0.00     0.00  pgstat_initstats
>   2.80    259.57     9.11 235691652    0.00     0.00  AllocSetAlloc
>   1.95    265.91     6.34 219852840    0.00     0.00  nocachegetattr
>   1.19    269.78     3.87 256569078    0.00     0.00  FunctionCall2

Is this supposed to be for a single run of the query? Cause the numbers
I get are a lot different:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  s/call   s/call  name
 35.24     15.23    15.23   353301    0.00     0.00  SearchCatCache
 30.08     28.23    13.00   422469    0.00     0.00  list_nth_cell
 21.03     37.32     9.09   642488    0.00     0.00  pgstat_initstats
  1.39     37.92     0.60  4281770    0.00     0.00  hash_search_with_hash_value
  1.23     38.45     0.53  5130941    0.00     0.00  AllocSetAlloc
  1.16     38.95     0.50  4154885    0.00     0.00  _bt_compare

(This is CVS HEAD as of today, which might or might not make a difference.)

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: PostgreSQL-development Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 27 second plan times
Date: 2007-04-21 06:20:57
Message-ID: 19950.1177136457@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> I think there's still a problem here with some kind of n^2 behaviour for
> appends of very wide tables but I haven't quite nailed it yet. In any case is
> there any reason not to make the following small change to move the constraint
> exclusion ahead of the size estimates and index checks and save ourselves
> potentially a lot of work?

Applied along with some other hacking to reduce the costs of the
lower-level functions that this example shows to be inefficient.
They'd still be slow in large queries, whether CE applies or not.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 27 second plan times
Date: 2007-04-21 15:46:46
Message-ID: 87odlh20ih.fsf@oxford.xeocode.com
Lists: pgsql-hackers


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Applied along with some other hacking to reduce the costs of the
> lower-level functions that this example shows to be inefficient.
> They'd still be slow in large queries, whether CE applies or not.

BIG difference. The case that caused swapping and took almost 15m to plan is
now down to 2.5s. The profile still looks a bit odd but I can't argue with
the results.

stark(at)oxford:/var/tmp/db$ gprof /usr/local/pgsql/bin/postgres gmon.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  s/call   s/call  name
 24.36      2.46     2.46   418517    0.00     0.00  SearchCatCache
  7.33      3.20     0.74  2564235    0.00     0.00  hash_any
  6.34      3.84     0.64  4283964    0.00     0.00  hash_search_with_hash_value
  4.36      4.28     0.44   216316    0.00     0.00  list_nth_cell
  3.96      4.68     0.40  6535943    0.00     0.00  AllocSetAlloc
  3.37      5.02     0.34  4165664    0.00     0.00  _bt_compare
  2.67      5.29     0.27  2266696    0.00     0.00  MemoryContextAllocZeroAligned

...
0.01 0.03 2000/424529 get_namespace_name [164]
0.01 0.03 2001/424529 pg_class_aclmask [167]
0.01 0.03 2001/424529 get_rel_name [163]
0.01 0.03 2002/424529 has_subclass [165]
1.21 2.69 204102/424529 get_attavgwidth [37]
1.21 2.69 204308/424529 TupleDescInitEntry [36]
[632] 0.0 0.00 0.00 418517 SearchSysCache <cycle 9> [632]
418517 SearchCatCache <cycle 9> [15]

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 27 second plan times
Date: 2007-04-21 16:32:09
Message-ID: 87hcr91yeu.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Gregory Stark" <stark(at)enterprisedb(dot)com> writes:

> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
>> Applied along with some other hacking to reduce the costs of the
>> lower-level functions that this example shows to be inefficient.
>> They'd still be slow in large queries, whether CE applies or not.
>
> BIG difference. The case that caused swapping and took almost 15m to plan is
> now down to 2.5s. The profile still looks a bit odd but I can't argue with
> the results.
>
> stark(at)oxford:/var/tmp/db$ gprof /usr/local/pgsql/bin/postgres gmon.out
> Flat profile:
>
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total
>  time   seconds   seconds    calls  s/call   s/call  name
>  24.36      2.46     2.46   418517    0.00     0.00  SearchCatCache
>   7.33      3.20     0.74  2564235    0.00     0.00  hash_any
>   6.34      3.84     0.64  4283964    0.00     0.00  hash_search_with_hash_value
>   4.36      4.28     0.44   216316    0.00     0.00  list_nth_cell
>   3.96      4.68     0.40  6535943    0.00     0.00  AllocSetAlloc
>   3.37      5.02     0.34  4165664    0.00     0.00  _bt_compare
>   2.67      5.29     0.27  2266696    0.00     0.00  MemoryContextAllocZeroAligned

For what it's worth, if I defeat the "same column position" optimization,
SearchCatCache and list_nth_cell pop back to the top. I had a version that
kept the col_mappings and translated_vars lists in arrays to fix the n^2
list_nth_cell behaviour. But the SearchCatCache is something else.
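
The list_nth_cell cost is the classic quadratic pattern: indexing a linked
list by position walks from the head every time, so touching all n positions
once costs n(n+1)/2 cell visits, versus n accesses for an array. A toy model
of that trade-off (illustrative only; the actual fix keeps the lists as
arrays in C):

```python
def linked_list_visits(n: int) -> int:
    # Reaching position i in a singly linked list visits i+1 cells,
    # as list_nth_cell does; sum over one pass across all positions.
    return sum(i + 1 for i in range(n))

def array_accesses(n: int) -> int:
    return n    # one O(1) access per position

n = 1600   # e.g. the column count of one very wide child table
assert linked_list_visits(n) == n * (n + 1) // 2   # 1,280,800 visits
assert array_accesses(n) == n                      # 1,600 accesses
```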

 53.77     87.36    87.36  4813018    0.00     0.00  SearchCatCache
 11.29    105.71    18.35  1610810    0.00     0.00  list_nth_cell
  3.09    110.72     5.01 17738640    0.00     0.00  hash_any
  2.50    114.78     4.05 30612499    0.00     0.00  hash_search_with_hash_value
  2.16    118.29     3.51 54588745    0.00     0.00  AllocSetAlloc

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 27 second plan times
Date: 2007-04-21 16:45:15
Message-ID: 24820.1177173915@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> Applied along with some other hacking to reduce the costs of the
>> lower-level functions that this example shows to be inefficient.
>> They'd still be slow in large queries, whether CE applies or not.

> BIG difference. The case that caused swapping and took almost 15m to plan is
> now down to 2.5s. The profile still looks a bit odd but I can't argue with
> the results.

I'm still feeling a bit annoyed with the behavior of the stats machinery
(pgstat_initstats and related macros). Yesterday I fixed it so that
pgstat_initstats doesn't perform redundant searches of the tabstat
arrays, but there's still an issue, which is that any rel that's
heap_opened or index_opened within a transaction is going to get a
tabstat entry, whether any events are subsequently counted or not.
In typical scenarios I don't think this is a big deal, but in examples
such as yours we're going to be sending a whole lot of all-zero tabstat
messages; there'll be one for every heap or index that the planner even
momentarily considered. That means more UDP traffic and more work for
the stats collector. gprof won't show the resulting overhead since
it doesn't know anything about kernel-level overhead or activity in the
stats collector. (Hm, might be able to measure it in oprofile
though...)

We could fix this by not doing pgstat_initstats at heap_open time,
but postponing it until something more interesting happens. The trouble
is that that'd add at least a small amount of overhead at the places
where "something more interesting" is counted, since the pgstat macros
would have to check validity of the tabstat pointer. The added overhead
should be only about one extra comparison, but maybe that's enough to
make someone object?
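
The postponement described above amounts to lazy initialization: don't
allocate a per-relation stats entry at open time, only when the first event
is actually counted, at the price of one pointer check per counting macro. A
rough Python model of the trade-off (names hypothetical; the real code is C
macros over the tabstat array):

```python
class RelStats:
    """Models a relation's handle in the backend, with the tabstat
    entry created lazily rather than at heap_open time."""

    def __init__(self):
        self.tabstat = None          # nothing allocated at "open" time
        # no stats message will be sent unless something is counted

    def count_scan(self):
        if self.tabstat is None:     # the one extra comparison per event
            self.tabstat = {"scans": 0}
        self.tabstat["scans"] += 1

rel_opened_only = RelStats()         # planner looked at it, counted nothing
rel_scanned = RelStats()
rel_scanned.count_scan()

assert rel_opened_only.tabstat is None       # no all-zero message to send
assert rel_scanned.tabstat == {"scans": 1}   # real activity still counted
```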

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>, "PostgreSQL-development Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 27 second plan times
Date: 2007-04-22 02:34:57
Message-ID: 16495.1177209297@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> I'm still feeling a bit annoyed with the behavior of the stats machinery
> (pgstat_initstats and related macros).
> ... That means more UDP traffic and more work for
> the stats collector. gprof won't show the resulting overhead since
> it doesn't know anything about kernel-level overhead or activity in the
> stats collector. (Hm, might be able to measure it in oprofile
> though...)

I spent some time with oprofile and couldn't muster any evidence
suggesting that this was accounting for more than 1% or so of total
runtime. So for the moment I'll leave it alone. It might eventually
become worth worrying about, though.

The thing I saw as being more interesting than the tabstat overhead
is that the planner does RelationGetNumberOfBlocks (ie, an lseek kernel
call) on every child rel ... and would do it on every index of every
child rel, too, if the example had any. It would be nice if we could
postpone all of the work of get_relation_info() until after we've
checked for constraint exclusion. This looks like it'd require some
nontrivial refactoring though --- in particular, I'm not sure how we'd
handle the total_table_pages calculation.
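
The relation-size check boils down to seeking to the end of the relation's
file and dividing by the block size, so it really is one kernel call per rel
considered. A sketch of that underlying operation, assuming the default 8 kB
block size (this models the system call cost, not the actual C function):

```python
import os
import tempfile

BLCKSZ = 8192   # PostgreSQL's default block size

def number_of_blocks(path: str) -> int:
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.lseek(fd, 0, os.SEEK_END)   # one kernel call per relation
    finally:
        os.close(fd)
    return size // BLCKSZ

# A two-block stand-in for a relation's main data file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (2 * BLCKSZ))
    path = f.name

assert number_of_blocks(path) == 2
os.remove(path)
```

With thousands of child rels (and their indexes) that is thousands of lseek
calls per planning cycle, which is why postponing get_relation_info() until
after constraint exclusion would help.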

If you're satisfied with the performance as it now stands, let's leave
this for the maybe-do-someday list.

regards, tom lane