Re: Performance of count(*) on large tables vs SQL Server

From: Andrew Mayo <ajmayo(at)kohuconsulting(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Performance of count(*) on large tables vs SQL Server
Date: 2005-02-01 12:41:43
Message-ID: 20050201124143.43977.qmail@web206.biz.mail.re2.yahoo.com
Lists: pgsql-performance

Doing some rather crude comparative performance tests
between PG 8.0.1 on Windows XP and SQL Server 2000, PG
whips SQL Server's ass on

insert into junk (select * from junk)

on a one-column table defined as int.
If we start with a 1-row table and repeatedly execute
this command, PG can take the table from 500K rows to
1M rows in 20 seconds; SQL Server is at least twice as
slow.
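
For reference, the test setup was essentially this
(the column name is illustrative):

create table junk (n int);
insert into junk values (1);
-- repeat this to double the row count each time:
insert into junk (select * from junk);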

BUT...

SQL Server can do

select count(*) from junk

in almost no time at all, probably because this query
can be optimised to go back and use catalogue
statistics.

PG, on the other hand, appears to do a full table scan
to answer this question, taking nearly 4 seconds to
process the query.

Doing an ANALYZE on the table and also VACUUM did not
seem to affect this.
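
EXPLAIN seems to bear this out; even after VACUUM
ANALYZE, the plan is just a sequential scan feeding
an aggregate, along the lines of:

explain select count(*) from junk;
-- Aggregate
--   ->  Seq Scan on junk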

Can PG find a table's row count more efficiently?
This is not an unusual practice in commercial
applications, which assume that count(*) with no WHERE
clause will be a cheap query - and use it to test if
a table is empty, for instance (because for
Oracle/Sybase/SQL Server, count(*) is cheap).

(sure, I appreciate there are other ways of doing
this, but I am curious about the way PG works here).


From: Shridhar Daithankar <ghodechhap(at)ghodechhap(dot)net>
To: Andrew Mayo <ajmayo(at)kohuconsulting(dot)com>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance of count(*) on large tables vs SQL Server
Date: 2005-02-01 13:02:56
Message-ID: 200502011832.56333.ghodechhap@ghodechhap.net
Lists: pgsql-performance

On Tuesday 01 Feb 2005 6:11 pm, Andrew Mayo wrote:
> PG, on the other hand, appears to do a full table scan
> to answer this question, taking nearly 4 seconds to
> process the query.
>
> Doing an ANALYZE on the table and also VACUUM did not
> seem to affect this.
>
> Can PG find a table's row count more efficiently?
> This is not an unusual practice in commercial
> applications, which assume that count(*) with no WHERE
> clause will be a cheap query - and use it to test if
> a table is empty, for instance (because for
> Oracle/Sybase/SQL Server, count(*) is cheap).

First of all, such an assumption is not a good one: an unqualified count(*)
will hurt concurrency under heavy load. But I know people do use it.

For the specific question, after a vacuum analyze, you can use

select reltuples from pg_class where relname='Foo';

Remember, you will get different results between 'analyze' and 'vacuum
analyze', since the latter actually visits every page in the table and is
therefore expected to be more accurate.
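
So for the junk table from your test, the full sequence would be something
like this (bear in mind reltuples is only an estimate that gets refreshed by
vacuum/analyze, not a live count):

vacuum analyze junk;
select reltuples from pg_class where relname = 'junk';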

> (sure, I appreciate there are other ways of doing
> this, but I am curious about the way PG works here).

The answer is MVCC and PG's inability to answer the query from an index
alone. This has been a FAQ for a long time. Furthermore, PG's support for
custom aggregates complicates the matter.

Most PG developers/users think that an unqualified select count(*) is of
no use. You can search the archives for more details.

HTH

Shridhar


From: Stef <stef(at)ummon(dot)com>
To: Shridhar Daithankar <ghodechhap(at)ghodechhap(dot)net>
Cc: Andrew Mayo <ajmayo(at)kohuconsulting(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance of count(*) on large tables vs SQL Server
Date: 2005-02-01 13:22:50
Message-ID: 20050201132250.GF32625@survivor.hades
Lists: pgsql-performance


Hello Andrew,
Everything that Shridhar says makes perfect
sense, and, speaking from experience in dealing with
this type of 'problem', everything you say does as
well. Such is life really :)

I would not be at -all- surprised if Sybase
and Oracle did query re-writing behind the scenes
to redirect unqualified counts to a separate table
which holds the row count. For an example of this
done in PostgreSQL (using triggers and a custom
procedure), look at the 'General Bits' newsletter,
specifically:
http://www.varlena.com/varlena/GeneralBits/49.php

I know, giving a URL as an answer 'sucks', but,
well, it simply reflects my experience: triggers
and procedures.
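
The rough idea, if you don't want to click
through, is a side table holding the count, kept
up to date by triggers on the main table. A
simplified sketch (names made up; the newsletter
above has a proper worked example):

create table junk_rowcount (total bigint not null);
-- seed this with the table's current row count
insert into junk_rowcount values (0);

create function junk_count_trig() returns trigger as '
begin
    if TG_OP = ''INSERT'' then
        update junk_rowcount set total = total + 1;
    elsif TG_OP = ''DELETE'' then
        update junk_rowcount set total = total - 1;
    end if;
    return null;
end;
' language plpgsql;

create trigger junk_count after insert or delete on junk
    for each row execute procedure junk_count_trig();

select total from junk_rowcount;

Note that every writer then updates the same
single row, which is exactly the concurrency cost
Shridhar mentioned.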

Regards
Steph

On Tue, Feb 01, 2005 at 06:32:56PM +0530, Shridhar Daithankar wrote:
> On Tuesday 01 Feb 2005 6:11 pm, Andrew Mayo wrote:
> > PG, on the other hand, appears to do a full table scan
> > to answer this question, taking nearly 4 seconds to
> > process the query.
> >
> > Doing an ANALYZE on the table and also VACUUM did not
> > seem to affect this.
> >
> > Can PG find a table's row count more efficiently?
> > This is not an unusual practice in commercial
> > applications, which assume that count(*) with no WHERE
> > clause will be a cheap query - and use it to test if
> > a table is empty, for instance (because for
> > Oracle/Sybase/SQL Server, count(*) is cheap).
>
> First of all, such an assumption is not a good one: an unqualified
> count(*) will hurt concurrency under heavy load. But I know people do
> use it.
>
> For the specific question, after a vacuum analyze, you can use
>
> select reltuples from pg_class where relname='Foo';
>
> Remember, you will get different results between 'analyze' and 'vacuum
> analyze', since the latter actually visits every page in the table and is
> therefore expected to be more accurate.
>
> > (sure, I appreciate there are other ways of doing
> > this, but I am curious about the way PG works here).
>
> The answer is MVCC and PG's inability to answer the query from an index
> alone. This has been a FAQ for a long time. Furthermore, PG's support for
> custom aggregates complicates the matter.
>
> Most PG developers/users think that an unqualified select count(*) is of
> no use. You can search the archives for more details.
>
> HTH
>
> Shridhar
>


From: PFC <lists(at)boutiquenumerique(dot)com>
To: "Andrew Mayo" <ajmayo(at)kohuconsulting(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Performance of count(*) on large tables vs SQL Server
Date: 2005-02-01 13:36:30
Message-ID: opslite4tuth1vuj@musicbox
Lists: pgsql-performance

> clause will be a cheap query - and use it to test if
> a table is empty, for instance (because for
> Oracle/Sybase/SQL Server, count(*) is cheap).

To test if a table is empty, use a SELECT EXISTS or whatever SELECT with
a LIMIT 1...
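
For example (using the junk table from the original post), either of
these comes back as soon as one row is found, regardless of table size:

select exists (select 1 from junk);
select 1 from junk limit 1;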