Re: Seqscan in MAX(index_column)

Lists: pgsql-hackers
From: Joe Conway <mail(at)joeconway(dot)com>
To: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Cc: Lance Rushing <rushing(at)armornetworks(dot)com>
Subject: database corruption
Date: 2003-08-30 17:30:14
Message-ID: 3F50DF26.9050303@joeconway.com

I received this from a friend:

Lance Rushing wrote:
> Something bad happened in Postgres. I lost the ability to connect to the
> 'pegasus2' database.
>
> I was doing some updates when I got an error, saying something about the
> backend dying...
>
> When I restarted the database I could no longer connect to my database. I
> can connect to the other databases, just not 'pegasus2'.

This is a 7.3.3 installation on Red Hat (9 I think).

I don't have any more detail yet on exactly what he was doing at this
point, but I grabbed a copy of $PGDATA and looked at it on my own
machine (since he doesn't have debug and assert support). Logging into
any other database works fine, but the offending database produces this
backtrace:

...
#85355 0x0817c0b5 in RelationBuildDesc (buildinfo=
{infotype = 2, i = {info_id = 135924865, info_name = 0x81a0c81
"pg_trigger"}}, oldrelation=0x0) at relcache.c:890
#85356 0x0817cf54 in RelationSysNameGetRelation (relationName=0x81a0c81
"pg_trigger") at relcache.c:1591
#85357 0x0807c066 in relation_openr (sysRelationName=0x81a0c81
"pg_trigger", lockmode=1) at heapam.c:550
#85358 0x0807c20a in heap_openr (sysRelationName=0x81a0c81 "pg_trigger",
lockmode=1) at heapam.c:660
#85359 0x080d5872 in RelationBuildTriggers (relation=0x41ad740c) at
trigger.c:720
#85360 0x0817c0b5 in RelationBuildDesc (buildinfo=
{infotype = 2, i = {info_id = 135924865, info_name = 0x81a0c81
"pg_trigger"}}, oldrelation=0x0) at relcache.c:890
#85361 0x0817cf54 in RelationSysNameGetRelation (relationName=0x81a0c81
"pg_trigger") at relcache.c:1591
#85362 0x0807c066 in relation_openr (sysRelationName=0x81a0c81
"pg_trigger", lockmode=1) at heapam.c:550
#85363 0x0807c20a in heap_openr (sysRelationName=0x81a0c81 "pg_trigger",
lockmode=1) at heapam.c:660
#85364 0x080d5872 in RelationBuildTriggers (relation=0x41ad6b54) at
trigger.c:720
#85365 0x0817c0b5 in RelationBuildDesc (buildinfo=
{infotype = 2, i = {info_id = 135924865, info_name = 0x81a0c81
"pg_trigger"}}, oldrelation=0x0) at relcache.c:890
...

i.e. it recurses itself to a horrible death.

Any ideas? Is this an indication that pg_trigger is corrupted?

Thanks,

Joe


From: Joe Conway <mail(at)joeconway(dot)com>
To:
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>, Lance Rushing <rushing(at)armornetworks(dot)com>
Subject: Re: database corruption
Date: 2003-08-30 19:57:26
Message-ID: 3F5101A6.2070000@joeconway.com
Lists: pgsql-hackers

Joe Conway wrote:
> I don't have any more detail yet on exactly what he was doing at this
> point, but I grabbed a copy of $PGDATA and looked at it on my own
> machine (since he doesn't have debug and assert support). Logging into
> any other database works fine, but the offending database produces this
> backtrace:

It turns out the "corruption" was user error. He ran a statement that
inadvertently set reltriggers = 1 for every row in pg_class. This is
what led to the infinite recursion.

For the archives, I was able to hack my way into the database by
wrapping the body of RelationBuildTriggers() in "if (relation->rd_id >
17000) {}", thus excluding system tables. Once back in the database,
updating reltriggers in pg_class to appropriate values restored access
to the database with an unhacked backend.

Joe


From: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: database corruption
Date: 2003-08-30 20:04:40
Message-ID: 20030830200439.GA12214@dcc.uchile.cl
Lists: pgsql-hackers

On Sat, Aug 30, 2003 at 12:57:26PM -0700, Joe Conway wrote:
> Joe Conway wrote:
> >I don't have any more detail yet on exactly what he was doing at this
> >point, but I grabbed a copy of $PGDATA and looked at it on my own
> >machine (since he doesn't have debug and assert support). Logging into
> >any other database works fine, but the offending database produces this
> >backtrace:
>
> It turns out the "corruption" was user error. He ran a statement that
> inadvertently set reltriggers = 1 for every row in pg_class. This is
> what led to the infinite recursion.

For the record, how were you able to detect this?

--
Alvaro Herrera (<alvherre[a]dcc.uchile.cl>)
"Ellos andaban todos desnudos como su madre los parió, y también las mujeres,
aunque no vi más que una, harto moza, y todos los que yo vi eran todos
mancebos, que ninguno vi de edad de más de XXX años" (Cristóbal Colón)


From: Joe Conway <mail(at)joeconway(dot)com>
To: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: database corruption
Date: 2003-08-30 20:16:42
Message-ID: 3F51062A.70106@joeconway.com
Lists: pgsql-hackers

Alvaro Herrera wrote:
> On Sat, Aug 30, 2003 at 12:57:26PM -0700, Joe Conway wrote:
>
>>Joe Conway wrote:
>>
>>>I don't have any more detail yet on exactly what he was doing at this
>>>point, but I grabbed a copy of $PGDATA and looked at it on my own
>>>machine (since he doesn't have debug and assert support). Logging into
>>>any other database works fine, but the offending database produces this
>>>backtrace:
>>
>>It turns out the "corruption" was user error. He ran a statement that
>>inadvertently set reltriggers = 1 for every row in pg_class. This is
>>what led to the infinite recursion.
>
> For the record, how were you able to detect this?

The backtrace from the core file showed that the recursion was going on
between RelationSysNameGetRelation(), relation_openr(), heap_openr(),
RelationBuildTriggers() and RelationBuildDesc(). It seemed odd that
pg_trigger seemed to be getting a trigger applied to it, so I guessed
that bypassing RelationBuildTriggers() for system tables would allow me
in to the database.

About the time I got in and started looking around, I finally got in
touch with the user, who confirmed he had been trying to disable and
then re-enable triggers when the problem occurred. He ran a statement
like:

    update pg_class set reltriggers = 1;

thinking that reltriggers was a "flag" (0 == off, 1 == on). That's when
it all suddenly made sense to me ;-)

Fortunately this was his own development database he was messing with,
but it was an interesting exercise nonetheless. Maybe this reinforces
the need to have a command for enabling/disabling triggers. I vaguely
remember discussion about that -- did it make it into 7.4?

Joe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: Alvaro Herrera <alvherre(at)dcc(dot)uchile(dot)cl>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: database corruption
Date: 2003-08-30 22:00:31
Message-ID: 3112.1062280831@sss.pgh.pa.us
Lists: pgsql-hackers

Joe Conway <mail(at)joeconway(dot)com> writes:
> Fortunately this was his own development database he was messing with,
> but it was an interesting exercise nonetheless. Maybe this reinforces
> the need to have a command for enabling/disabling triggers.

No, it reinforces the cardinal rule about not experimenting with system
catalog alterations in databases you care about. If he'd done his
experimentation in a scratch database, he'd not have been in trouble.

Perhaps we need some documentation about What Not To Do As Superuser.

regards, tom lane


From: "Paulo Scardine" <paulos(at)cimed(dot)ind(dot)br>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Seqscan in MAX(index_column)
Date: 2003-09-04 14:32:15
Message-ID: 008a01c372f1$55f65e40$6301a8c0@Administrador
Lists: pgsql-hackers

(Perhaps a newbie question, but I tried to google this without success.)

Why does Postgres do an expensive seqscan to find the max(value) for an
indexed column? I think MAX() doesn't know or care whether a column is
indexed, but... shouldn't it? BTW, is there some smarter trick to do that?

I know I can just do a very fast (SELECT pk FROM foo ORDER BY pk DESC
LIMIT 1) instead, but my colleagues are arguing that MAX(indexed_column)
seems to be a lot smarter in MS SQL Server, and I end up without a good
response.

Thank you,
--
Paulo Scardine
Brazil


From: "Shridhar Daithankar" <shridhar_daithankar(at)persistent(dot)co(dot)in>
To: "Paulo Scardine" <paulos(at)cimed(dot)ind(dot)br>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-04 14:40:33
Message-ID: 3F579C39.22178.15C497F@localhost
Lists: pgsql-hackers

On 4 Sep 2003 at 11:32, Paulo Scardine wrote:

> (Perhaps a newbie question, but I tried to google this without success.)
>
> Why does Postgres do an expensive seqscan to find the max(value) for an
> indexed column? I think MAX() doesn't know or care whether a column is
> indexed, but... shouldn't it? BTW, is there some smarter trick to do that?

No. PostgreSQL uses MVCC, which means there can be multiple versions of
the same tuple active at the same time. There is no way to tell which is
the max value for a column, as the committed value can be a moving target.

It cannot be cached, at least not easily. That's the price to pay for
MVCC. The same goes for SELECT count(*) FROM table. That query has to
end up with a sequential scan.

>
> I know I can just do a very fast (SELECT pk FROM foo ORDER BY pk DESC
> LIMIT 1) instead, but my colleagues are arguing that MAX(indexed_column)
> seems to be a lot smarter in MS SQL Server, and I end up without a good
> response.

Well, PostgreSQL earns solid concurrency thanks to MVCC. Set up
PostgreSQL and MS SQL Server on the same machine and do a rudimentary
benchmark with 100 clients hitting the database hard. See where you get
more tps.

In PostgreSQL, readers and writers don't block each other. AFAIK, in MS
SQL Server rows are locked for update. So if you lock a row in a
transaction and don't commit for a long time, MS SQL Server will have
serious problems.

Transactions that stay open all night are no problem for PostgreSQL,
except for the fact that vacuum cannot clean up the tuples locked in
transactions.

HTH

Bye
Shridhar

--
Blutarsky's Axiom: Nothing is impossible for the man who will not listen to
reason.


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Shridhar Daithankar <shridhar_daithankar(at)persistent(dot)co(dot)in>
Cc: Paulo Scardine <paulos(at)cimed(dot)ind(dot)br>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-04 15:10:55
Message-ID: Pine.LNX.4.44.0309041703110.14280-100000@zigo.dhs.org
Lists: pgsql-hackers

On Thu, 4 Sep 2003, Shridhar Daithankar wrote:

> > column? I think MAX() doesn't know or care whether a column is indexed, but...
>
> No. PostgreSQL uses MVCC, which means there can be multiple versions of
> the same tuple active at the same time. There is no way to tell which is
> the max value for a column, as the committed value can be a moving target.
>
> It cannot be cached, at least not easily. That's the price to pay for
> MVCC. The same goes for SELECT count(*) FROM table. That query has to
> end up with a sequential scan.

It does not have to be like that. Even with an MVCC database the index
can be used for max/min, and in my opinion it should be.

As far as I know, the only reason it's not implemented in PostgreSQL is
that pg has a general aggregate model and max/min are implemented using
that. Still, max/min are special in that they are almost the only
aggregates that can use an index to deliver the result directly. Some day
someone should make max/min a special case in pg. Exactly how is the
question.

I don't know MS SQL much, but I guess you can't define your own aggregate
functions there? Then all aggregate functions are special anyway.

--
/Dennis


From: Neil Conway <neilc(at)samurai(dot)com>
To: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc: Shridhar Daithankar <shridhar_daithankar(at)persistent(dot)co(dot)in>, Paulo Scardine <paulos(at)cimed(dot)ind(dot)br>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-04 15:49:30
Message-ID: 1062690570.2233.5.camel@tokyo
Lists: pgsql-hackers

This is an FAQ, BTW -- try searching the archives again. It's also
mentioned in the documentation:

http://candle.pha.pa.us/main/writings/pgsql/sgml/functions-aggregate.html

On Thu, 2003-09-04 at 11:10, Dennis Bjorklund wrote:
> On Thu, 4 Sep 2003, Shridhar Daithankar wrote:
> > It cannot be cached, at least not easily. That's the price to pay for
> > MVCC. The same goes for SELECT count(*) FROM table. That query has to
> > end up with a sequential scan.
>
> It does not have to be like that. Even with an MVCC database the index
> can be used for max/min, and in my opinion it should be.

Right, AFAIK MVCC isn't relevant to MAX() (given a btree index, you can
just read the index in the right order and return the first valid
tuple), although it makes optimizing COUNT(*) trickier, I believe.

> As far as I know the only reason why it's not implemented in postgresql is
> because pg has a general aggregate model and max/min are implemented using
> that. Still, max/min are special in that they are almost the only
> aggregates that can use an index to deliver the result directly. Some day
> someone should make max/min a special case in pg. Exactly how is the
> question.

Well, it's an open question whether it's worth uglifying the backend to
support this optimization, given that there is a trivial workaround that
people can use. It would make it easier to port code to PostgreSQL from
other RDBMSs, though...

-Neil


From: Czuczy Gergely <phoemix(at)harmless(dot)hu>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-04 15:55:48
Message-ID: Pine.LNX.4.44.0309041753260.18882-100000@localhost
Lists: pgsql-hackers

Hello

In my opinion, this optimized MAX() aggregate function would be a very
small but significant improvement in 7.4. As one of the members on the
list said, it would make porting from/to other RDBMSes a lot easier,
while keeping the same optimization of the queries.

Bye,

Gergely Czuczy
mailto: phoemix(at)harmless(dot)hu
PGP: http://phoemix.harmless.hu/phoemix.pgp

The point is, that geeks are not necessarily the outcasts
society often believes they are. The fact is that society
isn't cool enough to be included in our activities.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-04 15:57:26
Message-ID: 873cfcbl7d.fsf@stark.dyndns.tv
Lists: pgsql-hackers

"Shridhar Daithankar" <shridhar_daithankar(at)persistent(dot)co(dot)in> writes:

> On 4 Sep 2003 at 11:32, Paulo Scardine wrote:
>
> > (Perhaps a newbie question, but I tried to google this without success.)
> >
> > Why does Postgres do an expensive seqscan to find the max(value) for an
> > indexed column? I think MAX() doesn't know or care whether a column is
> > indexed, but... shouldn't it? BTW, is there some smarter trick to do that?
>
> No. PostgreSQL uses MVCC, which means there can be multiple versions of
> the same tuple active at the same time. There is no way to tell which is
> the max value for a column, as the committed value can be a moving target.

It has nothing to do with MVCC. It has to do with the fact that
implementing this is hard in the general case.

Think of examples like:

select max(foo) group by bar;

or

select max(foo) where xyz = z;

To do it properly max/min have to be special-cased and tightly integrated with
other code to handle index scans and aggregates. As it currently stands
they're implemented the same way as any other aggregate, which means they get
to see all the records in the grouping.
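For the non-grouped and grouped cases there are workarounds today, assuming
suitable indexes exist (the table and column names below are just the ones
from the examples above):

```sql
-- WHERE case: with an index on (xyz, foo), the max can come from an
-- index scan instead of a seqscan. Note: if foo can be NULL, add
-- "AND foo IS NOT NULL", since max() ignores NULLs but a descending
-- btree scan sees them first.
SELECT foo FROM t WHERE xyz = z
 ORDER BY foo DESC LIMIT 1;

-- GROUP BY case: PostgreSQL's DISTINCT ON gives one row per group,
-- though the planner may still sort rather than walk an index:
SELECT DISTINCT ON (bar) bar, foo
  FROM t
 ORDER BY bar, foo DESC;
```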

This is a frequently asked question, I'm surprised you didn't find stuff
searching with google. There have been numerous long discussions on this topic
not long ago. People are still trying to think about how to handle this
better.

--
greg


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-04 17:19:58
Message-ID: 200309041719.h84HJwa28484@candle.pha.pa.us
Lists: pgsql-hackers

Greg Stark wrote:
> It has nothing to do with MVCC. It has to do with the fact that
> implementing this is hard in the general case.
>
> Think of examples like:
>
> select max(foo) group by bar;
>
> or
>
> select max(foo) where xyz = z;
>
> To do it properly max/min have to be special-cased and tightly integrated with
> other code to handle index scans and aggregates. As it currently stands
> they're implemented the same way as any other aggregate, which means they get
> to see all the records in the grouping.
>
> This is a frequently asked question, I'm surprised you didn't find stuff
> searching with google. There have been numerous long discussions on this topic
> not long ago. People are still trying to think about how to handle this
> better.

The FAQ does have the example of using ORDER BY LIMIT 1 for MAX(). What
we don't have a workaround for is COUNT(*). I think that will require
some cached value that obeys MVCC rules of visibility.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Andreas Pflug <pgadmin(at)pse-consulting(dot)de>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 00:16:21
Message-ID: 3F57D5D5.9000509@pse-consulting.de
Lists: pgsql-hackers

Bruce Momjian wrote:

>Greg Stark wrote:
>
>
>>It has nothing to do with MVCC. It has to do with the fact that
>>implementing this is hard in the general case.
>>
>>Think of examples like:
>>
>>select max(foo) group by bar;
>>
>>or
>>
>>select max(foo) where xyz = z;
>>
>>To do it properly max/min have to be special-cased and tightly integrated with
>>other code to handle index scans and aggregates. As it currently stands
>>they're implemented the same way as any other aggregate, which means they get
>>to see all the records in the grouping.
>>
>>This is a frequently asked question, I'm surprised you didn't find stuff
>>searching with google. There have been numerous long discussions on this topic
>>not long ago. People are still trying to think about how to handle this
>>better.
>>
>>
>
>The FAQ does have the example of using ORDER BY LIMIT 1 for MAX(). What
>we don't have a workaround for is COUNT(*). I think that will require
>some cached value that obeys MVCC rules of visibility.
>
>
IMHO portability is an important point. People are used to MAX() and
COUNT(*), and will be surprised that they need some special treatment.
While the reasons for this are perfectly explainable, speeding up these
aggregates with some extra effort would make porting a bit easier.

Regards,
Andreas


From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 01:35:06
Message-ID: m31xuw586t.fsf@chvatal.cbbrowne.com
Lists: pgsql-hackers

The world rejoiced as pgadmin(at)pse-consulting(dot)de (Andreas Pflug) wrote:
> Bruce Momjian wrote:
>
>>Greg Stark wrote:
>>
>>
>>>It has nothing to do with MVCC. It has to do with the fact that
>>>implementing this is hard in the general case.
>>>
>>>Think of examples like:
>>>
>>>select max(foo) group by bar;
>>>
>>>or
>>>
>>>select max(foo) where xyz = z;
>>>
>>>To do it properly max/min have to be special-cased and tightly integrated with
>>>other code to handle index scans and aggregates. As it currently stands
>>>they're implemented the same way as any other aggregate, which means they get
>>>to see all the records in the grouping.
>>>
>>>This is a frequently asked question, I'm surprised you didn't find stuff
>>>searching with google. There have been numerous long discussions on this topic
>>>not long ago. People are still trying to think about how to handle this
>>>better.
>>>
>>>
>>
>>The FAQ does have the example of using ORDER BY LIMIT 1 for MAX(). What
>>we don't have a workaround for is COUNT(*). I think that will require
>>some cached value that obeys MVCC rules of visibility.
>>
>>
> IMHO portability is an important point. People are used to MAX() and
> COUNT(*), and will be surprised that they need some special
> treatment. While the reasons for this are perfectly explainable,
> speeding up these aggregates with some extra effort would make porting
> a bit easier.

The availability of cleverness with MAX()/MIN() is no grand surprise;
it would be very nice to get some expansion of that to "SELECT VALUE
FROM TABLE WHERE (CRITERIA) ORDER BY VALUE DESC LIMIT 1;"

But I'm _very_ curious as to what the anticipated treatment to collect
COUNT() more efficiently would be. I would expect that it would only
be able to get tuned much more if there's NO "where" clause, so that
it could use some ("magically-kept-up-to-date") stats on table size.

I don't see any way to optimize COUNT when numbers of rows can
continually vary. Storing stats somewhere will just make updates more
expensive. And if those stats are for the table, that doesn't help me
if I want "COUNT(*) FROM TABLE WHERE UPDATED_ON BETWEEN NOW() - '1
day' and NOW()".
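For what it's worth, the qualified case is the one that already has help:
an ordinary index on the qualified column lets the scan visit only matching
rows (each match still has to be checked in the heap for visibility, but
that beats a full seqscan for selective predicates). A hypothetical example
matching the query above (table name invented):

```sql
-- assuming an index on updated_on, the range qual can drive an
-- index scan, so only the last day's rows are visited:
SELECT count(*)
  FROM some_table
 WHERE updated_on BETWEEN now() - interval '1 day' AND now();
```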
--
(format nil "~S(at)~S" "aa454" "freenet.carleton.ca")
http://cbbrowne.com/info/linuxdistributions.html
"Recursion is the root of computation since it trades description for time."
-- Alan Perlis


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Christopher Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 02:02:45
Message-ID: 200309050202.h8522jr06156@candle.pha.pa.us
Lists: pgsql-hackers

Christopher Browne wrote:
> > IMHO portability is an important point. People are used to MAX() and
> > COUNT(*), and will be surprised that they need some special
> > treatment. While the reasons for this are perfectly explainable,
> > speeding up these aggregates with some extra effort would make porting
> > a bit easier.
>
> The availability of cleverness with MAX()/MIN() is no grand surprise;
> it would be very nice to get some expansion of that to "SELECT VALUE
> FROM TABLE WHERE (CRITERIA) ORDER BY VALUE DESCENDING LIMIT 1;"
>
> But I'm _very_ curious as to what the anticipated treatment to collect
> COUNT() more efficiently would be. I would expect that it would only
> be able to get tuned much more if there's NO "where" clause, so that
> it could use some ("magically-kept-up-to-date") stats on table size.
>
> I don't see any way to optimize COUNT when numbers of rows can
> continually vary. Storing stats somewhere will just make updates more
> expensive. And if those stats are for the table, that doesn't help me
> if I want "COUNT(*) FROM TABLE WHERE UPDATED_ON BETWEEN NOW() - '1
> day' and NOW()".

Yes, count would only use the cached stats for non-WHERE clause
COUNT(*).

My idea is that if a transaction doing a COUNT(*) would first look to
see if there already was a visible cached value, and if not, it would do
the COUNT(*) and insert into the cache table. Any INSERT/DELETE would
remove the value from the cache. As I see it, the commit of the
INSERT/DELETE transaction would then auto-invalidate the cache at the
exact time the transaction commits. This would allow MVCC visibility of
the counts.

A trickier idea would be for INSERT/DELETE to UPDATE the cached value.
It might be possible to always have a valid cache value for COUNT(*).
(COPY would also need to update the cache.)
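The incremental variant can already be approximated in user space with
triggers -- a hypothetical sketch (the table, function, and trigger names
are invented, and as written it serializes all writers to foo on its
counter row):

```sql
-- Hypothetical user-space sketch of an incrementally maintained count.
CREATE TABLE row_counts (relname name PRIMARY KEY, n bigint NOT NULL);

CREATE FUNCTION maintain_count() RETURNS trigger AS '
BEGIN
    IF TG_OP = ''INSERT'' THEN
        UPDATE row_counts SET n = n + 1 WHERE relname = TG_RELNAME;
    ELSE  -- DELETE
        UPDATE row_counts SET n = n - 1 WHERE relname = TG_RELNAME;
    END IF;
    RETURN NULL;  -- result is ignored for AFTER triggers
END;
' LANGUAGE plpgsql;

CREATE TRIGGER count_foo AFTER INSERT OR DELETE ON foo
    FOR EACH ROW EXECUTE PROCEDURE maintain_count();
```

Every writer updates the same row_counts row, so this trades the seqscan
for exactly the contention a single built-in cached value would have.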

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Neil Conway <neilc(at)samurai(dot)com>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Christopher Browne <cbbrowne(at)acm(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 02:16:50
Message-ID: 1062728210.360.14.camel@tokyo
Lists: pgsql-hackers

On Thu, 2003-09-04 at 22:02, Bruce Momjian wrote:
> My idea is that if a transaction doing a COUNT(*) would first look to
> see if there already was a visible cached value, and if not, it would do
> the COUNT(*) and insert into the cache table. Any INSERT/DELETE would
> remove the value from the cache. As I see it, the commit of the
> INSERT/DELETE transaction would then auto-invalidate the cache at the
> exact time the transaction commits. This would allow MVCC visibility of
> the counts.

But this means that some of the time (indeed, *much* of the time),
COUNT(*) would require a seqscan of the entire table. Since at many
sites that will take an enormous amount of time (and disk I/O), that
makes this solution infeasible IMHO.

In general, I don't think this is worth doing.

-Neil


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Neil Conway <neilc(at)samurai(dot)com>
Cc: Christopher Browne <cbbrowne(at)acm(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 02:20:22
Message-ID: 200309050220.h852KMG17572@candle.pha.pa.us
Lists: pgsql-hackers

Neil Conway wrote:
> On Thu, 2003-09-04 at 22:02, Bruce Momjian wrote:
> > My idea is that if a transaction doing a COUNT(*) would first look to
> > see if there already was a visible cached value, and if not, it would do
> > the COUNT(*) and insert into the cache table. Any INSERT/DELETE would
> > remove the value from the cache. As I see it, the commit of the
> > INSERT/DELETE transaction would then auto-invalidate the cache at the
> > exact time the transaction commits. This would allow MVCC visibility of
> > the counts.
>
> But this means that some of the time (indeed, *much* of the time),
> COUNT(*) would require a seqscan of the entire table. Since at many
> sites that will take an enormous amount of time (and disk I/O), that
> makes this solution infeasible IMHO.
>
> In general, I don't think this is worth doing.

It is possible it isn't worth doing. Can the INSERT/DELETE
incrementing/decrementing of the cached count work reliably?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Neil Conway <neilc(at)samurai(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 03:08:17
Message-ID: 25643.1062731297@sss.pgh.pa.us
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Neil Conway wrote:
>> In general, I don't think this is worth doing.

> It is possible it isn't worth doing. Can the INSERT/DELETE
> incrementing/decrementing of the cached count work reliably?

I don't even see how the notion of a single cached value makes
theoretical sense, when in principle every transaction may have
a different idea of the correct answer.

You could doubtless maintain a fairly good approximate total this
way, and that would be highly useful for some applications ...
but it isn't COUNT(*).

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Neil Conway <neilc(at)samurai(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 03:25:10
Message-ID: 200309050325.h853PAa24389@candle.pha.pa.us
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Neil Conway wrote:
> >> In general, I don't think this is worth doing.
>
> > It is possible it isn't worth doing. Can the INSERT/DELETE
> > incrementing/decrementing of the cached count work reliably?
>
> I don't even see how the notion of a single cached value makes
> theoretical sense, when in principle every transaction may have
> a different idea of the correct answer.
>
> You could doubtless maintain a fairly good approximate total this
> way, and that would be highly useful for some applications ...
> but it isn't COUNT(*).

With MVCC allowing multiple rows with only one visible, I thought the
INSERT/DELETE system would work --- once the delete becomes visible, the
change becomes visible.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Neil Conway <neilc(at)samurai(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 03:32:24
Message-ID: 27375.1062732744@sss.pgh.pa.us
Lists: pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Tom Lane wrote:
>> You could doubtless maintain a fairly good approximate total this
>> way, and that would be highly useful for some applications ...
>> but it isn't COUNT(*).

> With MVCC allowing multiple rows with only one visible, I thought the
> INSERT/DELETE system would work --- once the delete becomes visible, the
> change becomes visible.

Oh, you're imagining the cache as being a row in an ordinary table?
I doubt that could work. Multiple transactions trying to update these
rows would suffer from contention and deadlock problems, wouldn't they?

regards, tom lane


From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 03:44:31
Message-ID: m3oexz3nmo.fsf@chvatal.cbbrowne.com
Lists: pgsql-hackers

Oops! pgman(at)candle(dot)pha(dot)pa(dot)us (Bruce Momjian) was seen spray-painting on a wall:
> Neil Conway wrote:
>> On Thu, 2003-09-04 at 22:02, Bruce Momjian wrote:
>> > My idea is that if a transaction doing a COUNT(*) would first look to
>> > see if there already was a visible cached value, and if not, it would do
>> > the COUNT(*) and insert into the cache table. Any INSERT/DELETE would
>> > remove the value from the cache. As I see it, the commit of the
>> > INSERT/DELETE transaction would then auto-invalidate the cache at the
>> > exact time the transaction commits. This would allow MVCC visibility of
>> > the counts.
>>
>> But this means that some of the time (indeed, *much* of the time),
>> COUNT(*) would require a seqscan of the entire table. Since at many
>> sites that will take an enormous amount of time (and disk I/O),
>> that makes this solution infeasible IMHO.
>>
>> In general, I don't think this is worth doing.
>
> It is possible it isn't worth doing. Can the INSERT/DELETE
> incrementing/decrementing the cached count work reliably?

Wouldn't this more or less be the same thing as having a trigger that
does, upon each insert/delete "update pg_counts set count = count + 1
where reltable = 45232;"? (... where 1 would be -1 for deletes, and where
45232 is the OID of the table...)
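
For concreteness, a sketch of such a trigger might look like this (all
names here are hypothetical, "foo" standing for the tracked table, and
error handling omitted):

	create table pg_counts (reltable oid primary key, count bigint);

	create function pg_counts_bump() returns trigger as '
	begin
		if TG_OP = ''INSERT'' then
			update pg_counts set count = count + 1
				where reltable = TG_RELID;
		else
			update pg_counts set count = count - 1
				where reltable = TG_RELID;
		end if;
		return NULL;
	end;
	' language 'plpgsql';

	create trigger foo_count after insert or delete on foo
		for each row execute procedure pg_counts_bump();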

Technically, it seems _feasible_, albeit with the problem that it
turns pg_counts into a pretty horrid bottleneck. If lots of backends
are updating that table, then row 45232 in pg_counts becomes
troublesome because all those processes have to serialize around
updating it.

And if I have tables where I insert lots of data, but couldn't care
less how many rows they have, this effort is wasted.

When I was curious as to how COUNT might be maintained, I was pretty
sure that this wouldn't be the preferred method...
--
If this was helpful, <http://svcs.affero.net/rm.php?r=cbbrowne> rate me
http://cbbrowne.com/info/emacs.html
:FATAL ERROR -- ERROR IN ERROR HANDLER


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Neil Conway <neilc(at)samurai(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 03:49:23
Message-ID: 200309050349.h853nN228200@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Tom Lane wrote:
> >> You could doubtless maintain a fairly good approximate total this
> >> way, and that would be highly useful for some applications ...
> >> but it isn't COUNT(*).
>
> > With MVCC allowing multiple rows with only one visible, I thought the
> > INSERT/DELETE system would work --- once the delete becomes visible, the
> > change becomes visible.
>
> Oh, you're imagining the cache as being a row in an ordinary table?
> I doubt that could work. Multiple transactions trying to update these
> rows would suffer from contention and deadlock problems, wouldn't they?

Oh, they would, wouldn't they. I was thinking of the counter UPDATE as a
DELETE and an INSERT. In fact, when we do UPDATE col SET col = col + 1,
we lock the row only so we know the count. Instead, could we insert
+/-1 records into the cache table that were visible only to the running
transaction, and on commit (ON COMMIT TRIGGER) adjust the cached
aggregate counts without requiring locks?

I know I am just shooting out ideas, but it might give someone else a
better idea.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Christopher Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 04:28:25
Message-ID: 200309050428.h854SPk03147@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Christopher Browne wrote:
> Wouldn't this more or less be the same thing as having a trigger that
> does, upon each insert/delete "update pg_counts set count = count + 1
> where reltable = 45232;"? (... where 1 would be -1 for deletes, and where
> 45232 is the OID of the table...)
>
> Technically, it seems _feasible_, albeit with the problem that it
> turns pg_counts into a pretty horrid bottleneck. If lots of backends
> are updating that table, then row 45232 in pg_counts becomes
> troublesome because all those processes have to serialize around
> updating it.
>
> And if I have tables where I insert lots of data, but couldn't care
> less how many rows they have, this effort is wasted.
>
> When I was curious as to how COUNT might be maintained, I was pretty
> sure that this wouldn't be the preferred method...

See my later idea of the trigger doing +/-1 rather than locking the
value during the transaction.

If we don't do it this way, I can't think of another way that would
honor MVCC visibility.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Christopher Browne <cbbrowne(at)acm(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 05:18:43
Message-ID: Pine.LNX.4.44.0309050712470.14280-100000@zigo.dhs.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 5 Sep 2003, Bruce Momjian wrote:

> > When I was curious as to how COUNT might be maintained, I was pretty
> > sure that this wouldn't be the preferred method...
>
> See my later idea of the trigger doing +/-1 rather than locking the
> value during the transaction.
>
> If we don't do it this way, I can't think of another way that would
> honor MVCC visibility.

A general query cache is something that is fairly clean and which might
help both with count(*) and other queries.

Many databases have a lot of tables that are more or less stable, where
this would work fine. From what I have heard, MySQL has something like
this and it works well. For tables that change a lot, the cached queries
will almost always be invalid, so one might want to let the user decide
which tables should never be cached.

It could at least be an interesting experiment.

--
/Dennis


From: "Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
To: "Dennis Bjorklund" <db(at)zigo(dot)dhs(dot)org>, "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: "Christopher Browne" <cbbrowne(at)acm(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 06:04:35
Message-ID: 043c01c37373$970b9da0$2800a8c0@mars
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> A general query cache is something that is fairly clean and which might
> help both with count(*) and other queries.
>
> Many databases have a lot of tables that are more or less stable, where
> this would work fine. From what I have heard, MySQL has something like
> this and it works well. For tables that change a lot, the cached queries
> will almost always be invalid, so one might want to let the user decide
> which tables should never be cached.

It works well because MySQL doesn't have MVCC...

Chris


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christopher Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 14:12:41
Message-ID: 11010.1062771161@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Christopher Browne <cbbrowne(at)acm(dot)org> writes:
> Wouldn't this more or less be the same thing as having a trigger that
> does, upon each insert/delete "update pg_counts set count = count + 1
> where reltable = 45232;"? (... where 1 would be -1 for deletes, and where
> 45232 is the OID of the table...)

I think that's exactly what Bruce was suggesting. A slightly more
efficient variant is for each transaction to save up its net deltas,
and issue a single UPDATE for each table it's touched just before it
commits. But that just reduces the volume of update traffic, it doesn't
fundamentally alter the concept.

> Technically, it seems _feasible_, albeit with the problem that it
> turns pg_counts into a pretty horrid bottleneck.

Not to mention a likely source of deadlocks. And it still doesn't solve
the fundamental objection that you can't get an MVCC-correct answer by
examining the table.

An idea I was toying with is to do something similar to what was just
suggested to David Skoll for his stats problem: instead of using
UPDATEs, use INSERTs of delta records. That is, every time a
transaction is about to commit, it INSERTs into the counts table a row
like "45232 +1" ("I inserted one row") or "45232 -10" ("I deleted ten
rows"). Assume that we somehow initialized the counts table with an
entry "45232 total-rows" for each table. Then, a COUNT(*) on table
45232 is equivalent to "SELECT SUM(deltas) FROM counts WHERE reltable =
45232". As long as the number of rows you have to look at to compute
this sum is smaller than the number of rows in the original table,
it's a win.

The cool thing about this approach is that it is actually MVCC-correct.
If some transaction has committed, but is uncommitted according to your
worldview, your SUM will automatically ignore its delta row. Another
cool thing is that the INSERTs don't conflict with each other, so
there's no contention or deadlock risk.
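
To spell out the shape of that (hypothetical DDL; the xid column records
the inserting transaction, which is what the GLOBALXMIN sweep filters on):

	create table counts (reltable oid, xid xid, deltas bigint);

	-- a transaction that inserted one row into table 45232 does,
	-- just before commit (my_xid standing for its own transaction id):
	insert into counts values (45232, my_xid, +1);

	-- and an MVCC-correct COUNT(*) on table 45232 is then:
	select sum(deltas) from counts where reltable = 45232;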

You would periodically (perhaps during VACUUM) update the counts table
with operations that are conceptually like

BEGIN;
INSERT INTO counts
SELECT reltable, SUM(deltas) FROM counts
WHERE xid < GLOBALXMIN
GROUP BY reltable;
DELETE FROM counts WHERE xid < GLOBALXMIN;
COMMIT;

to sweep together the past deltas from transactions that are so old no
one cares about their individual effects anymore (GLOBALXMIN is the same
cutoff used by VACUUM to decide it can remove a committed-dead tuple).
This prevents the number of delta rows from growing indefinitely large
over time.

> And if I have tables where I insert lots of data, but couldn't care
> less how many rows they have, this effort is wasted.

Yes, this mechanism would be hugely expensive in any case. I can't see
enabling it overall, it would have to be turned on only for specific
tables by user command. It'd be interesting to try to code it as a
contrib module that's fired by triggers on the tables you want to track.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 14:29:37
Message-ID: 87ad9j9ulq.fsf@stark.dyndns.tv
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:

> The FAQ does have the example of using ORDER BY LIMIT 1 for MAX(). What
> we don't have a workaround for is COUNT(*). I think that will require
> some cached value that obeys MVCC rules of visibility.

Note that that only handles min()/max() for the whole table. It doesn't handle
the GROUP BY case, for that you need DISTINCT ON with an ORDER BY clause.
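
For instance, assuming a table foo(bar, id) with the obvious meanings, the
per-group equivalent of the max() workaround would be something like:

	select distinct on (bar) bar, id
	from foo
	order by bar, id desc;

which returns, for each value of bar, the row with the largest id.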

I don't see anything special about count(*) that makes it especially amenable
to optimization. In fact I think you're headed toward a full updatable
materialized view implementation with this approach.

Materialized views are basically tables that are guaranteed to always contain
the results of a view. They're constructed by executing the specified query
(eg "select bar,count(*) n from foo group by bar"). Then updated every time
the underlying tables are modified (eg "insert into foo (bar) values (1)" does
an "update foo_count_view set n = n+1 where bar=1"). Then they're available
for the optimizer to substitute whenever it sees an expression they can
answer. (so if you do "select count(*) from foo where bar=1" it gets
transformed into "select n from foo_count_view where bar=1").

It's a big project.

I think the min/max optimization is one of those things that "has to happen
sometime". It's something people expect to work, and as long as it doesn't the
database just isn't using the data it already has as well as it could.

Materialized views would be nice; Oracle has them largely because they let
the Oracle techs make a *huge* increase in their spec numbers. They were at
the heart of that challenge a few years ago when Ellison said he would pay
a million dollars to anyone who showed that MSSQL could come within a
factor of 10 of Oracle. It was impossible only because Oracle wasn't really
doing the same order of work, thanks to materialized views.

But they're a "would be neat" kind of thing. Nobody comes to the database
expecting to find them. If postgres had them it would be really really cool.
But it seems like there are more important things to be working on.

--
greg


From: "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 15:56:15
Message-ID: Pine.LNX.4.33.0309050954340.30557-100000@css120.ihs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Would it be possible to catch an unconstrained max(id)/min(id) and rewrite
it as "select id from table order by id [desc] limit 1" on the fly in the
parser somewhere?

That would require fairly little code, and be transparent to the user.
I.e. low hanging fruit.

On 5 Sep 2003, Greg Stark wrote:

> Note that that only handles min()/max() for the whole table. It doesn't handle
> the GROUP BY case, for that you need DISTINCT ON with an ORDER BY clause.


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 16:06:17
Message-ID: 87vfs78bk6.fsf@stark.dyndns.tv
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"scott.marlowe" <scott(dot)marlowe(at)ihs(dot)com> writes:

> Would it be possible to catch an unconstrained max(id)/min(id) and rewrite
> it as "select id from table order by id [desc] limit 1" on the fly in the
> parser somewhere?
>
> That would require fairly little code, and be transparent to the user.
> I.e. low hanging fruit.

What if there's no index on id? Then it would actually be slower than the
straightforward approach. You would have to check both versions and take the
one with the lowest cost, or check before rewriting for possible paths on that
column.

The problem with low hanging fruit is sometimes it makes people stop looking
for real solutions. And I think the real solution is worthwhile here.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers(at)postgresql(dot)org, "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>
Subject: Re: Seqscan in MAX(index_column)
Date: 2003-09-05 16:58:13
Message-ID: 246.1062781093@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> "scott.marlowe" <scott(dot)marlowe(at)ihs(dot)com> writes:
>> Would it be possible to catch an unconstrained max(id)/min(id) and rewrite
>> it as "select id from table order by id [desc] limit 1" on the fly in the
>> parser somewhere?
>> That would require fairly little code, and be transparent to the user.
>> I.e. low hanging fruit.

> What if there's no index on id? Then it would actually be slower than the
> straightforward approach. You would have to check both versions and take the
> one with the lowest cost, or check before rewriting for possible paths on that
> column.

If the fruit were all that low-hanging, it would've been done before
now, as I think this is all that people coming from other DBs expect.
But as Greg points out, it's not really a trivial planner change.

There are also semantic issues: how shall the planner decide which
aggregates are candidates for this treatment (I don't much care for
hardwiring some behavior to the names "max" and "min") and how shall
it decide which indexes match a given aggregate? In the presence of
multiple operator classes for a datatype, it's not obvious whether a
btree index has the same sort order that max/min need.

If you dig in the pghackers archives you can find some speculation about
extending aggregate definitions to associate max/min with appropriate
sort operators, but no one's done the legwork to make a concrete
proposal, let alone actually code it up.

regards, tom lane