Re: **SPAM** Faster count(*)?

Lists: pgsql-sql
From: "Owen Jacobson" <ojacobson(at)osl(dot)com>
To: <pgsql-sql(at)postgresql(dot)org>
Subject: Faster count(*)?
Date: 2005-08-09 22:39:33
Message-ID: 000101c59d33$36a4a560$9b00015a@osl.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Salve.

I understand from various web searches and so on that PostgreSQL's MVCC
mechanism makes it very hard to use indices or table metadata to optimise
count(*). Is there a better way to guess the "approximate size" of a table?

I'm trying to write a trigger that fires on insert and performs some
maintenance (collapsing overlapping boxes into a single large box,
specifically) as the table grows. My initial attempt involved count(*) and,
as the number of pages in the table grew, that trigger bogged down the
database.

Any thoughts?

-O


From: dracula007(at)atlas(dot)cz
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: **SPAM** Faster count(*)?
Date: 2005-08-09 23:29:58
Message-ID: 196456155.20050810012958@karneval.cz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

I believe running count(*) means fulltable scan, and there's no way
to do it without it. But what about some "intermediate" table, with
the necessary counts?

That means to create a table with values (counts) you need, and on
every insert/delete/update increment or decrement the appropriate
values. This way you won't need the count(*) query anymore, and the
performance should be much better.

t.v.

> Salve.

> I understand from various web searches and so on that PostgreSQL's MVCC
> mechanism makes it very hard to use indices or table metadata to optimise
> count(*). Is there a better way to guess the "approximate size" of a table?

> I'm trying to write a trigger that fires on insert and performs some
> maintenance (collapsing overlapping boxes into a single large box,
> specifically) as the table grows. My initial attempt involved count(*) and,
> as the number of pages in the table grew, that trigger bogged down the
> database.

> Any thoughts?


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: dracula007(at)atlas(dot)cz
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: **SPAM** Faster count(*)?
Date: 2005-08-10 02:49:14
Message-ID: 21626.1123642154@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

dracula007(at)atlas(dot)cz writes:
> I believe running count(*) means fulltable scan, and there's no way
> to do it without it. But what about some "intermediate" table, with
> the necessary counts?

There's a fairly complete discussion in the PG list archives of a
reasonably-efficient scheme for maintaining such counts via triggers.
It wasn't efficient enough that we were willing to impose the overhead
on every application ... but if you really NEED a fast count(*) you
could implement it. I'd like to see someone actually do it and put
up working code on pgfoundry; AFAIK it's only a paper design so far.

If you only want a very-approximate count, the best bet is to rely on
the planner's estimates, eg

regression=# explain select * from tenk1;
QUERY PLAN
-------------------------------------------------------------
Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
^^^^^

Current best practice is to run the explain and parse out the "rows"
figure using a perl (or axe-of-choice) regexp, though we could be
persuaded to supply a simpler API if there's enough demand for it.

regards, tom lane


From: Michael Fuhr <mike(at)fuhr(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: dracula007(at)atlas(dot)cz, pgsql-sql(at)postgresql(dot)org
Subject: Re: **SPAM** Faster count(*)?
Date: 2005-08-10 03:29:13
Message-ID: 20050810032913.GA40917@winnie.fuhr.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On Tue, Aug 09, 2005 at 10:49:14PM -0400, Tom Lane wrote:
> Current best practice is to run the explain and parse out the "rows"
> figure using a perl (or axe-of-choice) regexp, though we could be
> persuaded to supply a simpler API if there's enough demand for it.

Somebody else requested a row-count-estimate function a couple of
weeks ago:

http://archives.postgresql.org/pgsql-admin/2005-07/msg00256.php

--
Michael Fuhr


From: Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: **SPAM** Faster count(*)?
Date: 2005-08-10 10:50:59
Message-ID: 20050810105059.GA5448@phlogiston.dyndns.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On Tue, Aug 09, 2005 at 10:49:14PM -0400, Tom Lane wrote:
>
> Current best practice is to run the explain and parse out the "rows"
> figure using a perl (or axe-of-choice) regexp, though we could be
> persuaded to supply a simpler API if there's enough demand for it.

FWIW, this was another one of those things I must have heard a dozen
times at OSCON. I suspect the simpler API would be popular,
particularly since post-8.0 the estimates are more reliable than they
used to be.

A

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca
The whole tendency of modern prose is away from concreteness.
--George Orwell


From: Michael Fuhr <mike(at)fuhr(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: dracula007(at)atlas(dot)cz, pgsql-sql(at)postgresql(dot)org
Subject: Re: **SPAM** Faster count(*)?
Date: 2005-08-10 13:31:57
Message-ID: 20050810133157.GA46247@winnie.fuhr.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On Tue, Aug 09, 2005 at 09:29:13PM -0600, Michael Fuhr wrote:
> On Tue, Aug 09, 2005 at 10:49:14PM -0400, Tom Lane wrote:
> > Current best practice is to run the explain and parse out the "rows"
> > figure using a perl (or axe-of-choice) regexp, though we could be
> > persuaded to supply a simpler API if there's enough demand for it.
>
> Somebody else requested a row-count-estimate function a couple of
> weeks ago:
>
> http://archives.postgresql.org/pgsql-admin/2005-07/msg00256.php

Here's a simple example that parses EXPLAIN output. It should work
in 8.0.2 and later:

CREATE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
rec record;
rows integer;
BEGIN
FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
EXIT WHEN rows IS NOT NULL;
END LOOP;

RETURN rows;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;

CREATE TABLE foo (r double precision);
INSERT INTO foo SELECT random() FROM generate_series(1, 1000);
ANALYZE foo;

SELECT count_estimate('SELECT * FROM foo WHERE r < 0.1');
count_estimate
----------------
97
(1 row)

EXPLAIN SELECT * FROM foo WHERE r < 0.1;
QUERY PLAN
-----------------------------------------------------
Seq Scan on foo (cost=0.00..17.50 rows=97 width=8)
Filter: (r < 0.1::double precision)
(2 rows)

--
Michael Fuhr


From: Richard Huxton <dev(at)archonet(dot)com>
To: Owen Jacobson <ojacobson(at)osl(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: Faster count(*)?
Date: 2005-08-10 14:04:48
Message-ID: 42FA0980.1060104@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Owen Jacobson wrote:
> Salve.
>
> I understand from various web searches and so on that PostgreSQL's MVCC
> mechanism makes it very hard to use indices or table metadata to optimise
> count(*). Is there a better way to guess the "approximate size" of a table?

Plenty of good answers on how to estimate table-size, but it sounds like
you just want to run your maintenance function "every so often".

1. Create a sequence "my_table_tracker_seq"
2. On insert, call nextval(my_table_tracker_seq)
3. If value modulo 1000 = 0, run the maintenance routine

That's about as fast as you can get, and might meet your needs. Of
course you'll need to be more complex if you insert multiple rows at a time.

If you do a lot of deletion on the table and want to take that into
account, have a second sequence you increment on deletion and subtract
the one from the other.

Not always accurate enough, but it is quick.
--
Richard Huxton
Archonet Ltd


From: "Owen Jacobson" <ojacobson(at)osl(dot)com>
To: <pgsql-sql(at)postgresql(dot)org>
Subject: Re: **SPAM** Faster count(*)?
Date: 2005-08-10 16:27:23
Message-ID: 000001c59dc8$641169d0$9b00015a@osl.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Tom Lane wrote:

> dracula007(at)atlas(dot)cz writes:
> > I believe running count(*) means fulltable scan, and there's no way
> > to do it without it. But what about some "intermediate" table, with
> > the necessary counts?
>
> There's a fairly complete discussion in the PG list archives of a
> reasonably-efficient scheme for maintaining such counts via triggers.
> It wasn't efficient enough that we were willing to impose the overhead
> on every application ... but if you really NEED a fast count(*) you
> could implement it. I'd like to see someone actually do it and put
> up working code on pgfoundry; AFAIK it's only a paper design so far.
>
> If you only want a very-approximate count, the best bet is to rely on
> the planner's estimates, eg
>
> regression=# explain select * from tenk1;
> QUERY PLAN
> -------------------------------------------------------------
> Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244)
> ^^^^^
>
> Current best practice is to run the explain and parse out the "rows"
> figure using a perl (or axe-of-choice) regexp, though we could be
> persuaded to supply a simpler API if there's enough demand for it.

Yick. Ok, given all of that, I've rewritten the trigger in question to fire
on different, indexable criteria (difference between "earliest" and "latest"
rows in the table).

Thanks, everyone.

Owen


From: Keith Worthington <KeithW(at)NarrowPathInc(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, dracula007(at)atlas(dot)cz
Subject: Re: **SPAM** Faster count(*)?
Date: 2005-08-12 04:41:31
Message-ID: 42FC287B.7030609@NarrowPathInc.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Tom Lane wrote:
> dracula007(at)atlas(dot)cz writes:
>
>>I believe running count(*) means fulltable scan, and there's no way
>>to do it without it. But what about some "intermediate" table, with
>>the necessary counts?
>
>
> There's a fairly complete discussion in the PG list archives of a
> reasonably-efficient scheme for maintaining such counts via triggers.
> It wasn't efficient enough that we were willing to impose the overhead
> on every application ... but if you really NEED a fast count(*) you
> could implement it. I'd like to see someone actually do it and put
> up working code on pgfoundry; AFAIK it's only a paper design so far.

I was kicking this around and came up with the following. I have hit a
couple of snags.

In the function I attempt to count the number of rows in a table being
checked for the first time. I wanted to use 'FROM TG_RELNAME' but as
you can see I had to hard code my test table and comment out the trigger
parameter. FROM tbl_demo--TG_RELNAME Can someone tell me why that won't
work?

Also the function doesn't seem to be getting ROW_COUNT properly. The
end result is that for this test the table is properly inserted into the
monitoring table after it's forth insert but it is never updated after
that. Can someone help me see the forest through the trees?

-- Clean up the environment.
DROP TABLE tbl_row_count;
DROP TABLE tbl_demo;
DROP FUNCTION tf_update_row_count();

-- Build the table for holding the row counts.
CREATE TABLE tbl_row_count
(
relid oid NOT NULL,
row_count int8 NOT NULL DEFAULT 0
)
WITHOUT OIDS;
ALTER TABLE tbl_row_count OWNER TO postgres;
COMMENT ON COLUMN tbl_row_count.relid IS 'Contains relation id number.';
COMMENT ON COLUMN tbl_row_count.row_count IS 'Contains the number of
rows in a relation.';

-- Build a table to test the trigger on.
CREATE TABLE tbl_demo
(
first_name varchar(30) NOT NULL
)
WITHOUT OIDS;
ALTER TABLE tbl_demo OWNER TO postgres;
COMMENT ON TABLE tbl_demo IS 'Table used for demonstrating a trigger.';

-- Create the trigger function to maintain the row counts.
CREATE OR REPLACE FUNCTION public.tf_update_row_count()
RETURNS "trigger" AS
$BODY$
DECLARE
v_row_count int8;
BEGIN
-- Store the row count before it disappears.
GET DIAGNOSTICS v_row_count = ROW_COUNT;
-- Check if this is a new table.
PERFORM relid
FROM public.tbl_row_count
WHERE relid = TG_RELID;
IF FOUND THEN
-- Data for this table is already in the row count table.
IF TG_OP = 'INSERT' THEN
UPDATE public.tbl_row_count
SET row_count = row_count + v_row_count
WHERE relid = TG_RELID;
ELSIF TG_OP = 'DELETE' THEN
UPDATE public.tbl_row_count
SET row_count = row_count - v_row_count
WHERE relid = TG_RELID;
END IF;
ELSE
-- This is a new table so it needs to be counted.
SELECT count(*)
FROM tbl_demo--TG_RELNAME
INTO v_row_count;
INSERT INTO public.tbl_row_count ( relid, row_count )
VALUES ( TG_RELID,
v_row_count
);
END IF;
RETURN NULL;
END;
$BODY$
LANGUAGE 'plpgsql' VOLATILE;
ALTER FUNCTION public.tf_update_row_count() OWNER TO postgres;
GRANT EXECUTE ON FUNCTION public.tf_update_row_count() TO postgres;
GRANT EXECUTE ON FUNCTION public.tf_update_row_count() TO public;

-- Insert some initial data into the demo table.
INSERT INTO public.tbl_demo
( first_name )
VALUES ( 'Keith' );
INSERT INTO public.tbl_demo
( first_name )
VALUES ( 'Ed' );
INSERT INTO public.tbl_demo
( first_name )
VALUES ( 'Kryss' );

-- Create the trigger on the demo table.
CREATE TRIGGER tgr_update_row_count
AFTER INSERT OR DELETE
ON public.tbl_demo
FOR EACH STATEMENT
EXECUTE PROCEDURE public.tf_update_row_count();

-- Examine the starting state of the tables.
SELECT *
FROM public.tbl_demo;
SELECT *
FROM public.tbl_row_count;
SELECT relid,
relname,
row_count
FROM public.tbl_row_count
LEFT JOIN pg_class
ON ( tbl_row_count.relid = pg_class.oid );

-- Insert a row.
INSERT INTO public.tbl_demo
( first_name )
VALUES ( 'Jarus' );

-- Examine the new state of the tables.
SELECT *
FROM public.tbl_demo;
SELECT relid,
relname,
row_count
FROM public.tbl_row_count
LEFT JOIN pg_class
ON ( tbl_row_count.relid = pg_class.oid );

-- Insert two more rows.
INSERT INTO public.tbl_demo
( first_name )
VALUES ( 'Dani' );
INSERT INTO public.tbl_demo
( first_name )
VALUES ( 'Mary' );

-- Examine the final state of the tables.
SELECT *
FROM public.tbl_demo;
SELECT relid,
relname,
row_count
FROM public.tbl_row_count
LEFT JOIN pg_class
ON ( tbl_row_count.relid = pg_class.oid );

--
Kind Regards,
Keith