Re: [PERFORM] not using index for select min(...)

From: Don Bowman <don(at)sandvine(dot)com>
To: "'pgsql-performance(at)postgresql(dot)org'" <pgsql-performance(at)postgresql(dot)org>
Subject: not using index for select min(...)
Date: 2003-01-31 21:12:38
Message-ID: FE045D4D9F7AED4CBFF1B3B813C8533701023616@mail.sandvine.com
Lists: pgsql-hackers pgsql-performance

I have a table which is very large (~65K rows). I have
a column in it which is indexed and which I wish to use for
a join. I'm finding that a sequential scan is used for this
when selecting a MIN.

I've boiled this down to something like this:

=> create table X( value int primary key );
=> explain select min(value) from x;
Aggregate  (cost=22.50..22.50 rows=1 width=4)
  ->  Seq Scan on x  (cost=0.00..20.00 rows=1000 width=4)
=> \d x
Table "public.x"
Column | Type | Modifiers
--------+---------+-----------
value | integer | not null
Indexes: x_pkey primary key btree (value)

Why wouldn't I be doing an index scan on this table?

--don


From: Andrew Sullivan <andrew(at)libertyrms(dot)info>
To: "'pgsql-performance(at)postgresql(dot)org'" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: not using index for select min(...)
Date: 2003-01-31 23:31:04
Message-ID: 20030131183104.L24535@mail.libertyrms.com
Lists: pgsql-hackers pgsql-performance

On Fri, Jan 31, 2003 at 04:12:38PM -0500, Don Bowman wrote:
> Why wouldn't I be doing an index scan on this table?

Because you're using the aggregate function min(). See

<http://www.ca.postgresql.org/docs/faq-english.html#4.8>

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                             Toronto, Ontario Canada
<andrew(at)libertyrms(dot)info>          M2P 2A8
                                        +1 416 646 3304 x110


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Don Bowman <don(at)sandvine(dot)com>, "'pgsql-performance(at)postgresql(dot)org'" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: not using index for select min(...)
Date: 2003-01-31 23:31:12
Message-ID: 200301311531.12605.josh@agliodbs.com
Lists: pgsql-hackers pgsql-performance

Don,

> I have a table which is very large (~65K rows). I have
> a column in it which is indexed and which I wish to use for
> a join. I'm finding that a sequential scan is used for this
> when selecting a MIN.

Due to Postgres' system of extensible aggregates (i.e., you can write your
own), all aggregates trigger a Seq Scan in a query. It's a known
drawback that nobody has yet found a good way around.

--
-Josh Berkus
Aglio Database Solutions
San Francisco


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Don Bowman <don(at)sandvine(dot)com>
Cc: "'pgsql-performance(at)postgresql(dot)org'" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: not using index for select min(...)
Date: 2003-02-01 01:02:29
Message-ID: 20030201010229.GA14084@wolff.to
Lists: pgsql-hackers pgsql-performance

On Fri, Jan 31, 2003 at 16:12:38 -0500,
Don Bowman <don(at)sandvine(dot)com> wrote:
> I have a table which is very large (~65K rows). I have
> a column in it which is indexed and which I wish to use for
> a join. I'm finding that a sequential scan is used for this
> when selecting a MIN.
>
> I've boiled this down to something like this:
>
> => create table X( value int primary key );
> => explain select min(value) from x;

Use the following instead:
select value from x order by value limit 1;
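
The analogous workaround for max() just reverses the sort order. One
caveat worth noting: PostgreSQL sorts NULLs as though they were larger
than every non-null value, so on a nullable column the descending form
needs an IS NOT NULL filter (value is a primary key here, so it can't
be null):

select value from x order by value desc limit 1;
-- or, if the column were nullable:
select value from x where value is not null order by value desc limit 1;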


From: Sean Chittenden <sean(at)chittenden(dot)org>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Don Bowman <don(at)sandvine(dot)com>, "'pgsql-hackers(at)postgresql(dot)org'" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] not using index for select min(...)
Date: 2003-02-01 04:09:54
Message-ID: 20030201040954.GK15936@perrin.int.nxad.com
Lists: pgsql-hackers pgsql-performance

> > I have a table which is very large (~65K rows). I have
> > a column in it which is indexed and which I wish to use for
> > a join. I'm finding that a sequential scan is used for this
> > when selecting a MIN.
>
> Due to Postgres' system of extensible aggregates (i.e., you can write
> your own), all aggregates trigger a Seq Scan in a query. It's a
> known drawback that nobody has yet found a good way around.

I've spent some time in the past thinking about this, and here's the
best idea that I can come up with:

Part one: set up an ALTER TABLE directive that allows for the
addition/removal of cached aggregates. Ex:

ALTER TABLE tab1 ADD AGGREGATE CACHE ON count(*);
ALTER TABLE tab1 ADD AGGREGATE CACHE ON sum(col2);
ALTER TABLE tab1 ADD AGGREGATE CACHE ON sum(col2) WHERE col2 > 100;
ALTER TABLE tab1 ADD AGGREGATE CACHE ON sum(col2) WHERE col2 <= 100;

Which would translate into some kind of action on a pg_aggregate_cache
catalog:

aggregate_cache_oid  OID     -- OID for the aggregate cache
aggregate_table_oid  OID     -- table OID
ins_aggfn_oid        OID     -- aggregate function id for inserts
upd_aggfn_oid        OID     -- aggregate function id for updates
del_aggfn_oid        OID     -- aggregate function id for deletes
cache_value          INT     -- the value of the cache
private_data         INT[4]  -- temporary space for the data necessary
                             -- to calculate cache_value; four is just
                             -- a guesstimate for how much space would
                             -- be necessary to calculate the most
                             -- complex of aggregates
where_clause         ???     -- I haven't the faintest idea how to
                             -- express some kind of conditional like
                             -- this

Part two: set up a RULE or TRIGGER that runs on INSERT, UPDATE, or
DELETE. For the count(*) exercise, the ON UPDATE rule would be a no-op.
For ON INSERT, the count(*) rule would have to do something like:

UPDATE pg_catalog.pg_aggregate_cache SET cache_value = (cache_value + 1)
    WHERE aggregate_cache_oid = 1111111;

For an avg(col2) aggregate cache, the math is a little more complex,
but I think it's quite reasonable given that it obviates a full table
scan. For an insert, with private_data[0] holding the current row count
(which the same rule would also have to increment):

UPDATE pg_catalog.pg_aggregate_cache SET cache_value =
    ((cache_value * private_data[0] + NEW.col2) / (private_data[0] + 1))
    WHERE aggregate_cache_oid = 1111112;

Now, there are some obvious problems:

1) avg requires a floating point return value, therefore an INT may
not be an appropriate data type for cache_value or private_data.

2) aggregate caching wouldn't speed up anything but full table
aggregates or regions of a column that are frequently needed.

3) all of the existing aggregates would have to be updated to include
an insert, update, delete procedure (total of 60 aggregates, but
only 7 by name).

4) the planner would have to be taught how to use/return values from
the cache.

5) Each aggregate type makes use of the private_data column
differently. It's up to the cached aggregate function authors to
not jumble up their private data space.

6) I don't know of a way to handle mixing of floating point numbers
and integers. That said, there's some margin of error that could
creep into the floating point calculations such as avg.

And some benefits:

1) You only get caching for aggregates that you frequently use
(sum(col2), count(*), etc.).

2) Aggregate function authors can write their own caching routines.

3) For tens of millions of rows, it can be very time consuming to
sum() fifty million rows, but it's easy to amortize the cost of
updating the cache on insert, update, delete over the course of a
month.

4) If an aggregate cache definition isn't set up, it should be easy for
the planner to fall back to a full table scan, as it does now.
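
For comparison, benefit 1 can be approximated at the user level today.
Here's a minimal sketch of a count(*) cache for a hypothetical table
tab1, using an ordinary table in place of pg_aggregate_cache and a
trigger in place of the rule (assumes plpgsql is installed; all names
are made up):

create table tab1_count_cache (cache_value bigint not null);
insert into tab1_count_cache select count(*) from tab1;

create function tab1_count_maint() returns trigger as '
begin
    if TG_OP = ''INSERT'' then
        update tab1_count_cache set cache_value = cache_value + 1;
    elsif TG_OP = ''DELETE'' then
        update tab1_count_cache set cache_value = cache_value - 1;
    end if;
    return NULL;  -- the return value of an AFTER trigger is ignored
end;
' language plpgsql;

create trigger tab1_count_trig after insert or delete on tab1
    for each row execute procedure tab1_count_maint();

Reading the aggregate then becomes a one-row select from
tab1_count_cache rather than a full scan of tab1.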

This would definitely be a performance boost, and something that would
only be taken advantage of by DBAs who are intentionally performance
tuning their databases, but for those who do, it could be a massive
win. Thoughts? -sc

--
Sean Chittenden


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Sean Chittenden <sean(at)chittenden(dot)org>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Don Bowman <don(at)sandvine(dot)com>, "'pgsql-hackers(at)postgresql(dot)org'" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] not using index for select min(...)
Date: 2003-02-01 04:35:32
Message-ID: 4278.1044074132@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Sean Chittenden <sean(at)chittenden(dot)org> writes:
> Now, there are some obvious problems:

You missed the real reason why this will never happen: it completely
kills any prospect of concurrent updates. If transaction A has issued
an update on some row, and gone and modified the relevant aggregate
cache entries, what happens when transaction B wants to update another
row? It has to wait for A to commit or not, so it knows whether to
believe A's changes to the aggregate cache entries.

For some aggregates you could imagine an 'undo' operator to allow
A's updates to be retroactively removed even after B has applied its
changes. But that doesn't work very well in general. And in any case,
you'd have to provide serialization interlocks on physical access to
each of the aggregate cache entries. That bottleneck applied to every
update would be likely to negate any possible benefit from using the
cached values.

regards, tom lane


From: Sean Chittenden <sean(at)chittenden(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Don Bowman <don(at)sandvine(dot)com>, "'pgsql-hackers(at)postgresql(dot)org'" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] not using index for select min(...)
Date: 2003-02-01 05:09:35
Message-ID: 20030201050935.GM15936@perrin.int.nxad.com
Lists: pgsql-hackers pgsql-performance

> > Now, there are some obvious problems:
>
> You missed the real reason why this will never happen: it completely
> kills any prospect of concurrent updates. If transaction A has
> issued an update on some row, and gone and modified the relevant
> aggregate cache entries, what happens when transaction B wants to
> update another row? It has to wait for A to commit or not, so it
> knows whether to believe A's changes to the aggregate cache entries.

I never claimed it was perfect :) but it'd be no worse than a table
lock. For the types of applications that this would be of biggest use
to, there would likely be more reads than writes, and it wouldn't be as
bad as one could imagine. A few examples:

# No contention
Transaction A begins
Transaction A updates tab1
Transaction B begins
Transaction B updates tab1
Transaction B commits
Transaction A commits

# Contention
Transaction A begins
Transaction A updates tab1
Transaction B begins
Transaction B updates tab1
Transaction A commits
Transaction B commits

This is just about the only case that I can see where there would be
contention. In this case, transaction B would have to re-run its
trigger serially. In the worst-case scenario:

Transaction A begins
Transaction A updates tab1
Transaction B begins
Transaction B updates tab1
Transaction A commits
Transaction B selects
Transaction B updates tab1 again
Transaction B commits

In my journals or books I haven't found any examples of a
transaction-based cache that'd work any better than this. It ain't
perfect but, AFAICT, it's as good as it's going to get. The only thing
I could think of that would add some efficiency in this case would be
to have transaction B read through the committed changes from a log
file. Past a certain threshold, that could be more efficient than
having transaction B re-run its queries.

Like I said, it ain't perfect, but what would be a better solution?
::shrug:: Even OODBs with stats agents have this problem (though their
overhead for doing this kind of work is much, much lower). -sc

--
Sean Chittenden


From: Kevin Brown <kevin(at)sysexperts(dot)com>
To: "'pgsql-hackers(at)postgresql(dot)org'" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] not using index for select min(...)
Date: 2003-02-01 16:43:37
Message-ID: 20030201164336.GP12957@filer
Lists: pgsql-hackers pgsql-performance

Tom Lane wrote:
> Sean Chittenden <sean(at)chittenden(dot)org> writes:
> > Now, there are some obvious problems:
>
> You missed the real reason why this will never happen: it completely
> kills any prospect of concurrent updates. If transaction A has issued
> an update on some row, and gone and modified the relevant aggregate
> cache entries, what happens when transaction B wants to update another
> row? It has to wait for A to commit or not, so it knows whether to
> believe A's changes to the aggregate cache entries.
>
> For some aggregates you could imagine an 'undo' operator to allow
> A's updates to be retroactively removed even after B has applied its
> changes. But that doesn't work very well in general. And in any case,
> you'd have to provide serialization interlocks on physical access to
> each of the aggregate cache entries. That bottleneck applied to every
> update would be likely to negate any possible benefit from using the
> cached values.

Hmm...any chance, then, of giving aggregate functions a means of
asking which table(s) and column(s) the original query referred to so
that it could do proper optimization on its own? For instance, for a
"SELECT min(x) FROM mytable" query, the min() function would be told
upon asking that it's operating on column x of mytable, whereas it
would be told "undefined" for the column if the query were "SELECT
min(x+y) FROM mytable". In the former case, it would be able to do a
"SELECT x FROM mytable ORDER BY x LIMIT 1" on its own, whereas in the
latter it would have no choice but to fetch the data to do its
calculation via the normal means.

But that may be more trouble than it's worth, if aggregate functions
aren't responsible for retrieving the values they're supposed to base
their computations on, or if it's not possible to get the system to
refrain from prefetching data for the aggregate function until the
function asks for it.

--
Kevin Brown kevin(at)sysexperts(dot)com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kevin Brown <kevin(at)sysexperts(dot)com>
Cc: "'pgsql-hackers(at)postgresql(dot)org'" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] not using index for select min(...)
Date: 2003-02-01 17:03:32
Message-ID: 7647.1044119012@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-performance

Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> Hmm...any chance, then, of giving aggregate functions a means of
> asking which table(s) and column(s) the original query referred to so
> that it could do proper optimization on its own?

You can't usefully do that without altering the aggregate paradigm.
It won't help for min() to intuit the answer quickly if the query
plan is going to insist on feeding every row to it anyway.

> For instance, for a
> "SELECT min(x) FROM mytable" query, the min() function would be told
> upon asking that it's operating on column x of mytable, whereas it
> would be told "undefined" for the column if the query were "SELECT
> min(x+y) FROM mytable". In the former case, it would be able to do a
> "SELECT x FROM mytable ORDER BY x LIMIT 1" on its own,

Don't forget that it would also need to be aware of whether there were
any WHERE clauses, joins, GROUP BY, perhaps other things I'm not
thinking of.

In the end, the only reasonable way to handle this kind of thing is
to teach the query planner about it. Considering the small number
of cases that are usefully optimizable (basically only MIN and MAX
on a single table without any WHERE or GROUP clauses), and the ready
availability of a SQL-level workaround, it strikes me as a very
low-priority TODO item.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: not using index for select min(...)
Date: 2003-02-01 19:41:42
Message-ID: 200302011141.42245.josh@agliodbs.com
Lists: pgsql-hackers pgsql-performance

Sean,

> I've spent some time in the past thinking about this, and here's the
> best idea that I can come up with:
>
> Part one: setup an ALTER TABLE directive that allows for the
> addition/removal of cached aggregates. Ex:

Actually, Joe Conway and I may be working on something like this for a client.
Joe's idea is to use a hacked version of the statistics collector to cache
selected aggregate values in memory. These aggregates would be
non-persistent, but the main concern for us is having aggregate values that
are instantly accessible and that don't increase the cost of INSERTs and
UPDATEs by more than 10%.

This is to satisfy the needs of a particular client, so it may never make it
into the general PostgreSQL source. We'll post it somewhere if it works,
though.

We already implemented caching aggregates to tables, which is trivially easy
to do with triggers. The problem with this approach is the
UPDATE/INSERT/DELETE overhead; even with an SPI-optimized C trigger, it's
costing us up to 40% additional time under heavy write activity ...
which is exactly when we can't afford delays.

For a database which has a low level of UPDATE activity, though, you can
already implement cached aggregates as tables without inventing any new
Postgres extensions.

--
Josh Berkus
Aglio Database Solutions
San Francisco


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Brown <kevin(at)sysexperts(dot)com>, "'pgsql-hackers(at)postgresql(dot)org'" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] not using index for select min(...)
Date: 2003-02-01 20:21:24
Message-ID: 87el6rsqez.fsf@stark.dyndns.tv
Lists: pgsql-hackers pgsql-performance

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Kevin Brown <kevin(at)sysexperts(dot)com> writes:
> > Hmm...any chance, then, of giving aggregate functions a means of
> > asking which table(s) and column(s) the original query referred to so
> > that it could do proper optimization on its own?
>
> You can't usefully do that without altering the aggregate paradigm.
> It won't help for min() to intuit the answer quickly if the query
> plan is going to insist on feeding every row to it anyway.

That just means you need some way for aggregates to declare which
records they need. The only values that seem like they would be useful
are "first record", "last record", and "all records". Possibly
something like "all non-null records" for things like count(), but that
might be harder.

> Don't forget that it would also need to be aware of whether there were
> any WHERE clauses, joins, GROUP BY, perhaps other things I'm not
> thinking of.
>
> In the end, the only reasonable way to handle this kind of thing is
> to teach the query planner about it. Considering the small number
> of cases that are usefully optimizable (basically only MIN and MAX
> on a single table without any WHERE or GROUP clauses), and the ready
> availability of a SQL-level workaround, it strikes me as a very
> low-priority TODO item.

All true, but I wouldn't be so quick to dismiss it as low-priority. In
my experience I've seen the idiom "select min(foo) from bar" more times
than I can count. The frequency with which this question comes up here
is probably indicative of how much people expect it to work. And it's
probably used by a lot of multi-database applications and in a lot of
automatically generated code where it would be hard to hack in
special-purpose workarounds.

--
greg


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Brown <kevin(at)sysexperts(dot)com>, "'pgsql-hackers(at)postgresql(dot)org'" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PERFORM] not using index for select min(...)
Date: 2003-02-02 02:41:56
Message-ID: 20030202024156.GA3353@wolff.to
Lists: pgsql-hackers pgsql-performance

On Sat, Feb 01, 2003 at 15:21:24 -0500,
Greg Stark <gsstark(at)mit(dot)edu> wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
> That just means you need some way for aggregates to declare which
> records they need. The only values that seem like they would be useful
> are "first record", "last record", and "all records". Possibly
> something like "all non-null records" for things like count(), but
> that might be harder.

I don't see how this is going to be all that useful for aggregates in
general. min and max are special, and it is unlikely that you are going
to get much speedup for general aggregate functions. For the case where
you really only need to scan part of the data (say, skipping nulls when
nearly all of the entries are null), a DBA can add an appropriate
partial index and where clause. This will probably happen infrequently
enough that adding special checks for it isn't going to pay off.

For min and max, it seems to me that adding special code to detect
these functions and replace them with equivalent subselects, in the
case where an appropriate index exists (without one, the sort would be
worse than a linear scan), is a possible long-term solution that would
make porting easier.
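
In other words, the rewrite would turn the first query below into
something like the second (shown here as the hand-written equivalent):

select min(value) from x;
select (select value from x order by value limit 1) as min;

The scalar subselect preserves min()'s result shape: one row, and NULL
when the table is empty.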

In the short term, education is the answer: at the least, the
documentation of the min and max functions, the FAQ, and the
performance-tips section should recommend the alternative form when
there is an appropriate index.