Performance HOWTO - pseudo counter example

From: "S Grannis" <sjg(at)email(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Performance Issues with count()
Date: 2002-04-25 18:54:40
Message-ID: 20020425185441.10228.qmail@email.com
Lists: pgsql-admin pgsql-general

Thanks for all of the useful comments.

Per recommendation, I set enable_seqscan=0 and re-ran the query. The time to count() yb went from 2 hours 26 minutes to 2 hours 17 minutes. That variation of 9 minutes is likely related to the load on the machine at the time.

Others have suggested the "fix" is in the future.

Stephan Szabo wrote:
"I didn't see anything about your settings in postgresql.conf,
but increasing the sort_mem parameter may help that really
expensive sort step. I think the desired fix for this would
probably be the TODO entry on hash based aggregates but that's
still in the future..."

The "non-default" postgresql.conf settings are as follows:

shared_buffers = 240000 # uses ~2GB of shared mem
sort_mem = 512
wal_files = 64
enable_seqscan = 0 # per a recommendation
enable_indexscan = true
enable_tidscan = true
enable_sort = true
enable_nestloop = true
enable_mergejoin = true
enable_hashjoin = true

I think our work-around for now will be to SELECT the column we wish to analyze into a flat file and then run a Perl script to do the actual counting.
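
Roughly, the plan is something like this (just a sketch; the database name,
file names and script name here are made up, and the counting script is the
Perl one quoted below):

psql -t -A -c "SELECT yb FROM data_table" our_db > yb_dump.txt
perl count_yb.pl yb_dump.txt

The -t and -A flags give tuples-only, unaligned output, so the file contains
one bare value per line.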

Thanks again for the feedback,

Shaun Grannis

----- Original Message -----
From: Michael Loftis <mloftis(at)wgops(dot)com>
Date: Wed, 24 Apr 2002 10:14:04 -0700
To: asdf asdasfa <sjg(at)email(dot)com>
Subject: Re: [GENERAL] Performance Issues with count()

> Can you humour me and
> set enable_seqscan=0
> And retry the query?
>
> Thanks :)
>
> S Grannis wrote:
>
> > Hi,
> >
> > I've found some performance issues with Postgres that
> > I'm hoping people on this list can help resolve. We're
> > working with a 65 million record table that includes year
> > of birth (data type INT). Counting the frequency of dates
> > in the table takes 2 hours 26 minutes.
> > (There's an approximately 100-year range of dates in the
> > 65 million records).
> >
> > # EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;
> >
> > NOTICE: QUERY PLAN:
> > Aggregate  (cost=16397434.27..16723548.69 rows=6522288 width=4)
> >   ->  Group  (cost=16397434.27..16560491.48 rows=65222884 width=4)
> >         ->  Sort  (cost=16397434.27..16397434.27 rows=65222884 width=4)
> >               ->  Seq Scan on data_table  (cost=0.00..2368620.84 rows=65222884 width=4)
> >
> > I can count data from the flat text data file with this
> > Perl script:
> >
> > #!/usr/bin/perl
> > # script to count YB frequencies in flat data file
> > open (IN, "$ARGV[0]");
> > open (OUT, ">$ARGV[0]\_cnt");
> > while (<IN>) {
> >     chomp;
> >     $years{$_}++;
> > }
> > foreach $key (keys %years) {
> >     print OUT "$key,$years{$key}\n";
> > }
> >
> > The Perl script takes *1 minute*, 31 seconds to run.
> > Why is there such a discrepancy in times? I've noticed
> > that the Postgres count() function takes what seems to
> > be "longer than it should" in other cases as well. For
> > instance, counting the frequency of last names in the
> > same 65 million record table took *1 hour* and 31
> > minutes:
> >
> > # EXPLAIN SELECT ln, count(1) FROM data_table GROUP BY ln;
> > NOTICE: QUERY PLAN:
> > Aggregate  (cost=19538149.27..19864263.69 rows=6522288 width=19)
> >   ->  Group  (cost=19538149.27..19701206.48 rows=65222884 width=19)
> >         ->  Sort  (cost=19538149.27..19538149.27 rows=65222884 width=19)
> >               ->  Seq Scan on data_table  (cost=0.00..2324610.84 rows=65222884 width=19)
> >
> > The last name (ln) and the year of birth (yb) are
> > indexed, but that shouldn't matter because it's doing a
> > sequential scan, correct? Am I running into the
> > limitations of Postgres? We'd like to eventually get this
> > system into production, but if we can't get Postgres to
> > count() faster, we may not be able to use it.
> >
> > Here's the data_table schema:
> >
> > # \d data_table
> >
> > Table "data_table"
> > Column | Type | Modifiers
> > --------+---------------+-----------
> > ss | character(9) |
> > ln | character(15) |
> > fn | character(15) |
> > mi | character(1) |
> > ns | character(15) |
> > lny | character(15) |
> > fny | character(15) |
> > sny | character(15) |
> > g | character(1) |
> > mb | integer |
> > db | integer |
> > yb | integer |
> > md | integer |
> > dd | integer |
> > yd | integer |
> > Indexes: ssdi_ss_idx
> > ssdi_ln_idx
> >
> > We're working with Postgres v 7.2. The machine is a
> > dual-processor Athlon MP1900 (Tyan Tiger board) with
> > 3GB of PC2100 DDR RAM, and 3-80GB IBM 120GXP hard drives
> > configured in a software RAID 0 Array running under
> > RedHat Linux v. 7.2.
> >
> > We've VACUUM ANALYZE'd the tables after creating the
> > indices. Is there something I'm missing here?
> >
> > Thanks for your suggestions.
> >
> > Shaun Grannis


From: Neil Conway <nconway(at)klamath(dot)dyndns(dot)org>
To: "S Grannis" <sjg(at)email(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Performance Issues with count()
Date: 2002-04-25 19:03:53
Message-ID: 20020425150353.2373c2d7.nconway@klamath.dyndns.org
Lists: pgsql-admin pgsql-general

On Thu, 25 Apr 2002 13:54:40 -0500
"S Grannis" <sjg(at)email(dot)com> wrote:
> enable_seqscan = 0 # per a recommendation

This is a bad idea -- I'd recommend you leave this set to 'true',
particularly since it doesn't help your particular case.

Cheers,

Neil

--
Neil Conway <neilconway(at)rogers(dot)com>
PGP Key ID: DB3C29FC


From: Grant Johnson <grant(at)amadensor(dot)com>
To: S Grannis <sjg(at)email(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: Performance Issues with count()
Date: 2002-04-25 19:19:12
Message-ID: 3CC856B0.8060402@amadensor.com
Lists: pgsql-admin pgsql-general

Index the yb col. It might help the sort and group run better.

create index data_table_yb on data_table (yb);
vacuum analyze;

EXPLAIN SELECT yb, count(1) FROM data_table GROUP BY yb;

SELECT yb, count(1) FROM data_table GROUP BY yb;

Also try count(*) instead; it counts whole rows and doesn't have to examine any column values.
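
That is, the same query with count(*):

SELECT yb, count(*) FROM data_table GROUP BY yb;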

S Grannis wrote:

>[snip]

From: Stephan Szabo <sszabo(at)megazone23(dot)bigpanda(dot)com>
To: S Grannis <sjg(at)email(dot)com>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Performance Issues with count()
Date: 2002-04-25 19:35:00
Message-ID: 20020425123052.N71187-100000@megazone23.bigpanda.com
Lists: pgsql-admin pgsql-general


On Thu, 25 Apr 2002, S Grannis wrote:

> Others have suggested the "fix" is in the future.
>
> Stephan Szabo wrote:
> "I didn't see anything about your settings in postgresql.conf,
> but increasing the sort_mem parameter may help that really
> expensive sort step. I think the desired fix for this would
> probably be the TODO entry on hash based aggregates but that's
> still in the future..."
>
> The "non-default" postgresql.conf settings are as follows:
>
> shared_buffers = 240000 # uses ~2GB of shared mem
> sort_mem = 512

It might be interesting to try with sort_mem raised by an order of
magnitude, to see if that helps at all. I doubt it'll be
enough, but it's probably worth trying.
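
For example, just for the session (the value is only illustrative; sort_mem
is in kilobytes, so this is ten times the current 512):

SET sort_mem = 5120;
SELECT yb, count(1) FROM data_table GROUP BY yb;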

> I think our work-around for now will be to SELECT the column we wish
> to analyze into a flat file and then run a Perl script to do the
> actual counting.

Another option would be a C function that runs the query via SPI, does
the hashed counting, and sticks the result in a temporary table, but
that may be a bit excessive.


From: Jean-Michel POURE <jm(dot)poure(at)freesurf(dot)fr>
To: "S Grannis" <sjg(at)email(dot)com>, pgsql-general(at)postgresql(dot)org
Cc: <pgsql-admin(at)postgresql(dot)org>
Subject: Performance HOWTO - pseudo counter example
Date: 2002-04-29 15:45:08
Message-ID: 200204291745.08315.jm.poure@freesurf.fr
Lists: pgsql-admin pgsql-general

On Thursday 25 April 2002 20:54, S Grannis wrote:
> I think our work-around for now will be to SELECT the column we wish to
> analyze into a flat file and then run a Perl script to do the actual
> counting.

Dear all,

I wrote a small howto to address S Grannis's performance questions about the
count() function. The idea is to create and maintain a pseudo-count table
using triggers and PLpgSQL.

Unfortunately, for lack of disk space I could not test the PLpgSQL scripts on
16 million records (only on 1 million). Code is included to generate fake
test data. Could someone help me test the howto on 16 million records?

Thank you for your feedback,
Cheers,
Jean-Michel

*************************************************************************
Performance HOWTO - pseudo counter example
This document is released under PostgreSQL license
*************************************************************************

This tutorial demonstrates how to create fast pseudo-counters in PostgreSQL
using PLpgSQL and triggers.

1) Performance background
This small howto was inspired by an email on
pgsql-general(at)postgresql(dot)org complaining about PostgreSQL speed.

The user needs to run COUNT statements on a large table of 65,000,000
records. The table structure is basically as follows:

CREATE TABLE "data" (
    "data_oid" serial8,
    "data_yd" int4
);

In our example, data_yd is a year value between 1950 and 2050.

The user needs to run the following query:
SELECT COUNT(data_yd) FROM data WHERE data_yd = foo_year;
where foo_year is a year between 1950 and 2050.
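
For instance, with foo_year picked arbitrarily:

SELECT COUNT(data_yd) FROM data WHERE data_yd = 1975;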

The query takes more than two hours to execute on a dual-processor computer
running PostgreSQL and GNU/Linux. The proposed solution creates a
pseudo-counter in PostgreSQL using PLpgSQL and triggers. The aim is to return
a result in about 0.005 seconds. Initialisation of the pseudo-counter table
itself should take less than 30 minutes.

2) INSTALLATION
a) Database creation

Open a terminal window and connect as the 'postgres' user:
root(at)localhost> su postgres

Create an empty database:
postgresql(at)localhost> psql template1
template1=# CREATE DATABASE pseudo_counter;
template1=# \q

b) PLpgSQL declaration

PLpgSQL is built by default in PostgreSQL, but you must enable it on the
database itself:

postgresql(at)localhost> createlang plpgsql pseudo_counter

c) Data table

We first need to create the table structure:

CREATE TABLE "data" (
    "data_oid" serial8,
    "data_yd" int4,
    "data_counterenabled" bool DEFAULT 'f'
) WITH OIDS;
CREATE INDEX data_yd_idx ON data USING btree (data_yd);
CREATE INDEX data_counter_idx ON data USING btree (data_counterenabled);

And create a PLpgSQL function to add fake records:

CREATE FUNCTION "init_fakedata"("int8", "int4", "int4") RETURNS "bool" AS '
DECLARE
    iLoop int4;
    tStart timestamp;
BEGIN
    tStart = now();
    IF ($1 > 0) AND ($2 > 0) AND ($3 > 0) AND ($3 > $2) THEN
        FOR iLoop IN 1 .. $1 LOOP
            INSERT INTO data (data_yd)
            VALUES (int8(random() * ($3 - $2) + $2));
        END LOOP;
        RETURN ''t'';
    ELSE
        RETURN ''f'';
    END IF;
END;
' LANGUAGE 'plpgsql';

To insert 16 million records with years in the 1950-2050 range, enter:
SELECT init_fakedata(16000000, 1950, 2050);
This should take a while... and fill more than 3.2 GB on disk.

If you cannot wait that long, insert 126,953 records for testing:
SELECT init_fakedata(126953, 1950, 2050);
This takes 40 seconds on my server.

Then, repeat 9 times (each pass doubles the table, and 126,953 * 2^9 =
64,999,936):
INSERT INTO data (data_yd)
SELECT data_yd FROM data;

This produces the remaining fake records much more quickly, as no random
function is used.

Enter:
CHECKPOINT;
VACUUM ANALYSE;
to flush data to disk and update the planner statistics.

d) Pseudo-count table
Now, let's create a pseudo-count table.
CREATE TABLE "pseudo_count" (
    "count_oid" serial,
    "count_year" int4,
    "count_value" int4 DEFAULT 0
) WITH OIDS;
CREATE INDEX count_value_idx ON pseudo_count USING btree (count_value);
CREATE INDEX count_year_idx ON pseudo_count USING btree (count_year);

... and initialize it with the required data (values in the 1950-2050 range):
CREATE FUNCTION "init_pseudocount"("int4", "int4") RETURNS "bool" AS '
DECLARE
    iLoop int4;
BEGIN
    IF ($1 > 0) AND ($2 > 0) AND ($2 >= $1) THEN
        FOR iLoop IN $1 .. $2 LOOP
            INSERT INTO pseudo_count (count_year)
            VALUES (iLoop);
        END LOOP;
        RETURN ''t'';
    ELSE
        RETURN ''f'';
    END IF;
END;
' LANGUAGE 'plpgsql';

Example:
SELECT init_pseudocount(1950, 2050);
will create the required records for years 1950 to 2050.

e) PLpgSQL function and trigger

The pseudo-count is maintained by a single trigger 'tg_data' running the
'tg_data()' function. TG_OP is used to catch the trigger context ('UPDATE'
or 'DELETE').

CREATE FUNCTION "tg_data"() RETURNS "opaque" AS '
DECLARE
    rec record;
BEGIN
    IF (TG_OP = ''UPDATE'') THEN
        IF (new.data_counterenabled = ''t'') AND (old.data_counterenabled = ''f'') THEN
            UPDATE pseudo_count
            SET count_value = count_value + 1
            WHERE count_year = new.data_yd
            AND count_value >= 0;
        END IF;

        IF (new.data_counterenabled = ''f'') AND (old.data_counterenabled = ''t'') THEN
            UPDATE pseudo_count
            SET count_value = count_value - 1
            WHERE count_year = new.data_yd
            AND count_value > 0;
        END IF;

        IF (old.data_yd <> new.data_yd) THEN
            UPDATE pseudo_count
            SET count_value = count_value - 1
            WHERE count_year = old.data_yd
            AND count_value > 0;

            UPDATE pseudo_count
            SET count_value = count_value + 1
            WHERE count_year = new.data_yd
            AND count_value >= 0;
        END IF;
    END IF;

    IF (TG_OP = ''DELETE'') THEN
        UPDATE pseudo_count
        SET count_value = count_value - 1
        WHERE count_year = old.data_yd
        AND count_value >= 0;
    END IF;

    IF (TG_OP = ''UPDATE'') THEN
        RETURN new;
    ELSE
        RETURN old;
    END IF;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER "tg_data" BEFORE DELETE OR UPDATE ON "data"
    FOR EACH ROW EXECUTE PROCEDURE tg_data();
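
A newly inserted row enters the counter only when its flag is switched on;
for example (values illustrative):

INSERT INTO data (data_yd) VALUES (1975);
UPDATE data SET data_counterenabled = 't' WHERE data_counterenabled = 'f';
-- pseudo_count.count_value for year 1975 is now one higher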

3) USAGE

a) Initialisation
Set the "data_counterenabled" flag on every row:

UPDATE data
SET data_counterenabled = 't'
WHERE data_counterenabled = 'f';

b) Run pseudo-count queries

Instead of:
SELECT COUNT(data_yd)
FROM data
WHERE data_yd = foo_year;

you can now run:
SELECT count_value
FROM pseudo_count
WHERE count_year = foo_year;

The answer comes back in about 0.005 seconds.

c) Limits
Before loading large amounts of data, the trigger on table 'data' should be
dropped, then recreated afterwards.
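
A minimal sketch of that sequence (the bulk-load step stands in for whatever
COPY or INSERT statements you use):

DROP TRIGGER "tg_data" ON "data";
-- ... bulk load here ...
CREATE TRIGGER "tg_data" BEFORE DELETE OR UPDATE ON "data"
    FOR EACH ROW EXECUTE PROCEDURE tg_data();
-- switch the flag on for the new rows so they enter the pseudo-counter:
UPDATE data SET data_counterenabled = 't' WHERE data_counterenabled = 'f';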