Re: Yet Another COUNT(*)...WHERE...question

Lists: pgsql-general
From: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
To: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-15 14:36:45
Message-ID: e373d31e0708150736u3b49fb1eq1fe79f40aff826f8@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

I'm grappling with a lot of reporting code for our app that relies on
queries such as:

SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...

And I still do not find, from the discussions on this thread, any
truly viable solution for this. The one suggestion is to have a
separate counts table, which is fine for total aggregates related to,
say, an ID. E.g., a table with:

trader_id, trade_count

But this is an overall count for the trader (in my example). What if I
need a count of all his trades in the last one week. Then I need a
timestamp condition in there as well. The number of such possibilities
for multiple WHERE conditions is infinite...how should we account for
all these avenues?

Would love to hear experiences of others and what compromises they
have made. From a reporting perspective, waiting for 10 minutes for a
simple count to return seems untenable.

TIA!


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-15 14:43:29
Message-ID: 87fy2krh7y.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

"Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com> writes:

> I'm grappling with a lot of reporting code for our app that relies on
> queries such as:
>
> SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
>...
> The number of such possibilities for multiple WHERE conditions is
> infinite...

Depends on the "conditions" bit. You can't solve all of the infinite
possibilities -- well you can, just run the query above -- but if you want to
do better it's all about understanding your data.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-15 14:48:55
Message-ID: e373d31e0708150748i7447338dlf3634882c18879a2@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 15/08/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com> writes:
>
> > I'm grappling with a lot of reporting code for our app that relies on
> > queries such as:
> >
> > SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
> >...
> > The number of such possibilities for multiple WHERE conditions is
> > infinite...
>
> Depends on the "conditions" bit. You can't solve all of the infinite
> possibilities -- well you can, just run the query above -- but if you want > to do better it's all about understanding your data.

I am not sure what the advice here is. The WHERE condition comes from
the indices. So if the query was not "COUNT(*)" but just a couple of
columns, the query executes in less than a second. Just that COUNT(*)
becomes horribly slow. And since the file system based query caching
feature of PG is unclear to me (I am just moving from MySQL where the
cache is quite powerful) I don't quite know what to do to speed up
these queries!


From: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
To: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-15 14:55:01
Message-ID: dcc563d10708150755u8015412s110a4259a5cebc3c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 8/15/07, Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com> wrote:
> I'm grappling with a lot of reporting code for our app that relies on
> queries such as:
>
> SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
>
> And I still do not find, from the discussions on this thread, any
> truly viable solution for this. The one suggestion is to have a
> separate counts table, which is fine for total aggregates related to,
> say, an ID. E.g., a table with:
>
> trader_id, trade_count
>
> But this is an overall count for the trader (in my example). What if I
> need a count of all his trades in the last one week. Then I need a
> timestamp condition in there as well. The number of such possibilities
> for multiple WHERE conditions is infinite...how should we account for
> all these avenues?
>
> Would love to hear experiences of others and what compromises they
> have made. From a reporting perspective, waiting for 10 minutes for a
> simple count to return seems untenable.

Generally, for these kinds of things it's often best to use
materialized views / rollup tables so that you aren't re-aggregating
the same data over and over.


From: Richard Huxton <dev(at)archonet(dot)com>
To: Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-15 14:56:18
Message-ID: 46C31412.7080305@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Phoenix Kiula wrote:
>>>
>>> SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
>
> I am not sure what the advice here is. The WHERE condition comes from
> the indices. So if the query was not "COUNT(*)" but just a couple of
> columns, the query executes in less than a second. Just that COUNT(*)
> becomes horribly slow.

The count(*) shouldn't slow things down compared to running the query to
fetch columns. It should be at least as fast, or faster if the columns
you fetch are large.
1. Do you have an example?
2. You're not running a query to get the columns, then a separate
count(*) to get a rowcount are you?

> And since the file system based query caching
> feature of PG is unclear to me (I am just moving from MySQL where the
> cache is quite powerful) I don't quite know what to do to speed up
> these queries!

There isn't a "file system based query caching" feature, there's your
operating-systems file-cache and PG's buffers. Neither of which cache
query-results, but cache disk pages instead.

--
Richard Huxton
Archonet Ltd


From: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
To: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-15 14:59:59
Message-ID: dcc563d10708150759r7e59c5bfp6c8207e685faa2c1@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 8/15/07, Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com> wrote:
> On 15/08/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> > "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com> writes:
> >
> > > I'm grappling with a lot of reporting code for our app that relies on
> > > queries such as:
> > >
> > > SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
> > >...
> > > The number of such possibilities for multiple WHERE conditions is
> > > infinite...
> >
> > Depends on the "conditions" bit. You can't solve all of the infinite
> > possibilities -- well you can, just run the query above -- but if you want > to do better it's all about understanding your data.
>
>
> I am not sure what the advice here is. The WHERE condition comes from
> the indices. So if the query was not "COUNT(*)" but just a couple of
> columns, the query executes in less than a second. Just that COUNT(*)
> becomes horribly slow.

Sorry, but I don't believe you. if you're doing a count(*) on the
same dataset that returns in < 1 second, then the count(*) with the
same where clause will run in < 1 second. I haven't seen pgsql do
anything else.

> And since the file system based query caching
> feature of PG is unclear to me

There is no "query caching" in pgsql. There is data caching. Each
query has to get planned and executed though (unless prepared, then
just executed)

> (I am just moving from MySQL where the
> cache is quite powerful)

As long as nothing is changing behind the query, and invalidating the
query cache. It is useful for reporting apps, but in a constantly
updating db pretty much useless.

> I don't quite know what to do to speed up
> these queries!

Post them with explain analyze output. i.e.

explain analyze yourqueryhere

cut and past the query and the output. as well as the table schema.


From: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
To: Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>, Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com>
Cc: Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-15 15:06:57
Message-ID: 745725.37493.qm@web31807.mail.mud.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

--- Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
> Generally, for these kinds of things it's often best to use
> materialized views / rollup tables so that you aren't re-aggregating
> the same data over and over.

I don't know if this was already mentioned, but here is one of the links that describe the method
of implementing a materialized view.
http://www.jonathangardner.net/PostgreSQL/materialized_views/matviews.html

other useful docs like this one can be found here:

http://www.postgresql.org/docs/techdocs.2

Regards,
Richard Broersma Jr.


From: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
To: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-15 15:12:14
Message-ID: e373d31e0708150812g65297504lf4cb1b05e748133c@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 15/08/07, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
> On 8/15/07, Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com> wrote:
> > On 15/08/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> > > "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com> writes:
> > >
> > > > I'm grappling with a lot of reporting code for our app that relies on
> > > > queries such as:
> > > >
> > > > SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
> > > >...
> > > > The number of such possibilities for multiple WHERE conditions is
> > > > infinite...
> > >
> > > Depends on the "conditions" bit. You can't solve all of the infinite
> > > possibilities -- well you can, just run the query above -- but if you want > to do better it's all about understanding your data.
> >
> >
> > I am not sure what the advice here is. The WHERE condition comes from
> > the indices. So if the query was not "COUNT(*)" but just a couple of
> > columns, the query executes in less than a second. Just that COUNT(*)
> > becomes horribly slow.
>
> Sorry, but I don't believe you. if you're doing a count(*) on the
> same dataset that returns in < 1 second, then the count(*) with the
> same where clause will run in < 1 second. I haven't seen pgsql do
> anything else.

Sorry I was not clear. Imagine an Amazon.com search results page. It
has about 15 results on Page 1, then it shows "Page 1 of 190".

To show each page, the query probably has a "LIMIT 15 OFFSET 0" for
Page 1. However, to calculate the total number of pages, they probably
do a separate counts query, because doing a "select *" and then
counting the number of rows returned would be even more inefficient
than a count(*).

So, in reporting, two queries are fairly common I would think, unless
I am missing something?


From: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
To: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-15 15:24:13
Message-ID: dcc563d10708150824q2ed10e27g6457dcec8cf14671@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 8/15/07, Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com> wrote:
> On 15/08/07, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
> > On 8/15/07, Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com> wrote:
> > > On 15/08/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> > > > "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com> writes:
> > > >
> > > > > I'm grappling with a lot of reporting code for our app that relies on
> > > > > queries such as:
> > > > >
> > > > > SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
> > > > >...
> > > > > The number of such possibilities for multiple WHERE conditions is
> > > > > infinite...
> > > >
> > > > Depends on the "conditions" bit. You can't solve all of the infinite
> > > > possibilities -- well you can, just run the query above -- but if you want > to do better it's all about understanding your data.
> > >
> > >
> > > I am not sure what the advice here is. The WHERE condition comes from
> > > the indices. So if the query was not "COUNT(*)" but just a couple of
> > > columns, the query executes in less than a second. Just that COUNT(*)
> > > becomes horribly slow.
> >
> > Sorry, but I don't believe you. if you're doing a count(*) on the
> > same dataset that returns in < 1 second, then the count(*) with the
> > same where clause will run in < 1 second. I haven't seen pgsql do
> > anything else.
>
>
>
> Sorry I was not clear. Imagine an Amazon.com search results page. It
> has about 15 results on Page 1, then it shows "Page 1 of 190".
>
> To show each page, the query probably has a "LIMIT 15 OFFSET 0" for
> Page 1. However, to calculate the total number of pages, they probably
> do a separate counts query, because doing a "select *" and then
> counting the number of rows returned would be even more inefficient
> than a count(*).

When I go to amazon.com I only ever get three pages of results. ever.
Because they know that returning 190 pages is not that useful, as
hardly anyone is going to wander through that many pages.

Google, you'll notice says "Results 1 - 10 of about 5,610,000 for
blacksmith" i.e. it's guesstimating as well. no reason for google to
look at every single row for blacksmith to know that there's about 5.6
million.

> So, in reporting, two queries are fairly common I would think, unless
> I am missing something?

Yes, optimization. :) You don't need an exact count to tell someone
that there's more data and they can go to it. Note that if you are
planning on doing things google sized, you'll need to do what they
did, invent your own specialized database.

For us mere mortals, it's quite likely that you can do something like:

explain select * from table where field like 'abc%';

and then parse the explain output for an approximate number.


From: Richard Broersma Jr <rabroersma(at)yahoo(dot)com>
To: Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com>, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, Postgres General <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-15 15:29:30
Message-ID: 764391.8517.qm@web31801.mail.mud.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general


--- Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com> wrote:
> Sorry I was not clear. Imagine an Amazon.com search results page. It
> has about 15 results on Page 1, then it shows "Page 1 of 190".

I don't think that amazon or google really need to give an accurate count in determining an
estimated number of pages...

Could you determine the number of pages quickly from postgresql:

[ row count estimate ] / [ number of rows you want per page]

The estimated row count is updated every time you vacuum your tables. And getting the estimate
takes very little time.

> To show each page, the query probably has a "LIMIT 15 OFFSET 0" for
> Page 1.

The "LIMIT 15 OFFSET 1500" technique can be a performance killer since offset does not use an
index.

Is is better to use the last entry of each page in the query for the next page, so you can write
your query this way:

SELECT *
FROM your_table
WHERE item_nbr > [: last item on previous page :]
ORDER BY item_nbr
LIMIT 15;

This method was discuss on the list a couple of months ago.

Regards,
Richard Broersma Jr.


From: Erik Jones <erik(at)myemma(dot)com>
To: Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com>
Cc: "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-15 15:35:57
Message-ID: 573E6F06-2FD1-4413-B4AF-F696DDC369D8@myemma.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Aug 15, 2007, at 9:36 AM, Phoenix Kiula wrote:

> I'm grappling with a lot of reporting code for our app that relies on
> queries such as:
>
> SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
>
> And I still do not find, from the discussions on this thread, any
> truly viable solution for this. The one suggestion is to have a
> separate counts table, which is fine for total aggregates related to,
> say, an ID. E.g., a table with:
>
> trader_id, trade_count
>
> But this is an overall count for the trader (in my example). What if I
> need a count of all his trades in the last one week. Then I need a
> timestamp condition in there as well. The number of such possibilities
> for multiple WHERE conditions is infinite...how should we account for
> all these avenues?

There is no general solution. While theoretically the multiple WHERE
conditions are infinite, in reality their limited to your actual use
cases and the solutions are thereby dictated by those. Using a
separate cache table is often a viable option used in situations
where constantly up to date realtime values. Another common option
is smart usage of indexes, i.e remember that you can index on the
results of a function applied to row values as well as partial
indexes. Another is table partitioning. Asking how to optimize
"SELECT COUNT(*) FROM TABLE WHER... (conditions)" is not a good
question as the solution is dependent on those conditions. Pick your
most common conditions and optimize for those.

Also, in many cases for reporting apps, 10 minutes is not long at
all. If you have reports that you can't make happen faster, schedule
and automate them.

Erik Jones

Software Developer | Emma®
erik(at)myemma(dot)com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com


From: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
To: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-15 15:36:52
Message-ID: e373d31e0708150836s1115447aw4944c133f1b247b4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

> Yes, optimization. :) You don't need an exact count to tell someone
> that there's more data and they can go to it.

In general, I agree. But my example of Amazon was only to illustrate
the point about two queries and why they may be needed. I seem to see
many more pages than you do, but in any case, Google and Amazon can
afford to be less precise.

Thanks for the suggestion of using EXPLAIN and parsing an
approximation, but when you need to show a trader how many trades he
has made, for instance, then approximation is not a possibility at
all. Especially not if the numbers sway so wildly --

FIRSTDB=# explain select * from trades where t_id = 'kXjha';
QUERY PLAN
-----------------------------------------------------------------------------------
Bitmap Heap Scan on trades (cost=15.77..1447.12 rows=374 width=224)
Recheck Cond: ((t_id)::text = 'kXjha'::text)
-> Bitmap Index Scan on trades_tid_date (cost=0.00..15.67 rows=374 width=0)
Index Cond: ((t_id)::text = 'kXjha'::text)
(4 rows)

FIRSTDB=# select count(*) from trades where t_id = 'kXjha';
count
-------
3891
(1 row)

Could I do something so that the EXPLAIN showed up with slightly more
close-to-accurate stats? The above query is just after a "vacuum
analyze"!

Much appreciate the suggestions.


From: "Andrew Armstrong" <andrew(at)mammoth(dot)com(dot)au>
To: "'Scott Marlowe'" <scott(dot)marlowe(at)gmail(dot)com>, "'Phoenix Kiula'" <phoenix(dot)kiula(at)gmail(dot)com>
Cc: "'Gregory Stark'" <stark(at)enterprisedb(dot)com>, "'Postgres General'" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-15 15:40:18
Message-ID: 001401c7df52$95b30c00$b6f3a305@plasma
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

I don't know how PSQL does it, but MySQL has an SQL_CALC_FOUND_ROWS
extension which allows the query to also return how many rows exist without
the LIMIT clause. Perhaps there is similar for PSQL (check LIMIT docs?)

- Andrew

-----Original Message-----
From: pgsql-general-owner(at)postgresql(dot)org
[mailto:pgsql-general-owner(at)postgresql(dot)org] On Behalf Of Scott Marlowe
Sent: Thursday, 16 August 2007 1:24 AM
To: Phoenix Kiula
Cc: Gregory Stark; Postgres General
Subject: Re: [GENERAL] Yet Another COUNT(*)...WHERE...question

On 8/15/07, Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com> wrote:
> On 15/08/07, Scott Marlowe <scott(dot)marlowe(at)gmail(dot)com> wrote:
> > On 8/15/07, Phoenix Kiula <phoenix(dot)kiula(at)gmail(dot)com> wrote:
> > > On 15/08/07, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:
> > > > "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com> writes:
> > > >
> > > > > I'm grappling with a lot of reporting code for our app that relies
on
> > > > > queries such as:
> > > > >
> > > > > SELECT COUNT(*) FROM TABLE WHERE ....(conditions)...
> > > > >...
> > > > > The number of such possibilities for multiple WHERE conditions is
> > > > > infinite...
> > > >
> > > > Depends on the "conditions" bit. You can't solve all of the infinite
> > > > possibilities -- well you can, just run the query above -- but if
you want > to do better it's all about understanding your data.
> > >
> > >
> > > I am not sure what the advice here is. The WHERE condition comes from
> > > the indices. So if the query was not "COUNT(*)" but just a couple of
> > > columns, the query executes in less than a second. Just that COUNT(*)
> > > becomes horribly slow.
> >
> > Sorry, but I don't believe you. if you're doing a count(*) on the
> > same dataset that returns in < 1 second, then the count(*) with the
> > same where clause will run in < 1 second. I haven't seen pgsql do
> > anything else.
>
>
>
> Sorry I was not clear. Imagine an Amazon.com search results page. It
> has about 15 results on Page 1, then it shows "Page 1 of 190".
>
> To show each page, the query probably has a "LIMIT 15 OFFSET 0" for
> Page 1. However, to calculate the total number of pages, they probably
> do a separate counts query, because doing a "select *" and then
> counting the number of rows returned would be even more inefficient
> than a count(*).

When I go to amazon.com I only ever get three pages of results. ever.
Because they know that returning 190 pages is not that useful, as
hardly anyone is going to wander through that many pages.

Google, you'll notice says "Results 1 - 10 of about 5,610,000 for
blacksmith" i.e. it's guesstimating as well. no reason for google to
look at every single row for blacksmith to know that there's about 5.6
million.

> So, in reporting, two queries are fairly common I would think, unless
> I am missing something?

Yes, optimization. :) You don't need an exact count to tell someone
that there's more data and they can go to it. Note that if you are
planning on doing things google sized, you'll need to do what they
did, invent your own specialized database.

For us mere mortals, it's quite likely that you can do something like:

explain select * from table where field like 'abc%';

and then parse the explain output for an approximate number.

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

http://www.postgresql.org/docs/faq


From: Bill Moran <wmoran(at)potentialtech(dot)com>
To: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
Cc: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "Postgres General" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-15 15:57:38
Message-ID: 20070815115738.352012cb.wmoran@potentialtech.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

In response to "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>:

> > Yes, optimization. :) You don't need an exact count to tell someone
> > that there's more data and they can go to it.
>
>
> In general, I agree. But my example of Amazon was only to illustrate
> the point about two queries and why they may be needed. I seem to see
> many more pages than you do, but in any case, Google and Amazon can
> afford to be less precise.
>
> Thanks for the suggestion of using EXPLAIN and parsing an
> approximation, but when you need to show a trader how many trades he
> has made, for instance, then approximation is not a possibility at
> all. Especially not if the numbers sway so wildly --
>
>
> FIRSTDB=# explain select * from trades where t_id = 'kXjha';
> QUERY PLAN
> -----------------------------------------------------------------------------------
> Bitmap Heap Scan on trades (cost=15.77..1447.12 rows=374 width=224)
> Recheck Cond: ((t_id)::text = 'kXjha'::text)
> -> Bitmap Index Scan on trades_tid_date (cost=0.00..15.67 rows=374 width=0)
> Index Cond: ((t_id)::text = 'kXjha'::text)
> (4 rows)
>
> FIRSTDB=# select count(*) from trades where t_id = 'kXjha';
> count
> -------
> 3891
> (1 row)
>
>
>
> Could I do something so that the EXPLAIN showed up with slightly more
> close-to-accurate stats? The above query is just after a "vacuum
> analyze"!

In the above case, you could probably materialize the data with a trigger
that updates a counter in a separate table every time a new trade is
added. This will give you 100% accurate results with _very_ fast
response time.

Part of the problem is that there's no one answer to your question, there
are multiple approaches to solving it, depending on the details of the
problem and the acceptable time/accuracy of the answers. Some basic
approaches:

1) Materialize the data. MySQL actually does this automatically for you
with MyISAM tables, which is why count(*) is so fast. But if you
absolutely need fast, accurate counts, you can build your own
triggers in PG. This is unlikely to be practical with all queries.
2) Estimate. The accuracy of estimates can vary wildly by query and
how often the database is analyzed, etc. For something like,
"show results 1 - 10 of about 50,000", estimates are great and fast,
but for other cases, not acceptable. The good news is you can get
a fast estimate from any query with no up-front work.
3) Accept that sometimes to get accurate answers it's going to take
time. Around here, we call it the "Orbitz" technique, because when
we discuss it, everyone thinks of the "please wait while I process
your query" page you get from orbitz.com. You'd be surprised how
willing your users are to wait, as long as they know they have to
wait.
4) Throw more hardware at it. If you absolutely _must_have_ super-
accurate results faster, then you may need to buy more RAM, faster
disks and faster CPUs to accomplish it.
5) Come up with something revolutionary that nobody's every thought of
before. Good luck with this one.

Of course, all of these ideas are only practical if you've already
ensured that your system is properly tuned. Crappy values for
shared_buffers and other tuning will lead you to waste time trying
to redesign something that should work just fine, so verify all your
configuration first. You may be able to get more acceptable estimates
by increasing your statistics targets, for example.

--
Bill Moran
http://www.potentialtech.com


From: Rainer Bauer <usenet(at)munnin(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 10:12:03
Message-ID: k988c3do7eh60mocpmalu5pta5j7s93oh3@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

"Scott Marlowe" wrote:

>When I go to amazon.com I only ever get three pages of results. ever.
> Because they know that returning 190 pages is not that useful, as
>hardly anyone is going to wander through that many pages.
>
>Google, you'll notice says "Results 1 - 10 of about 5,610,000 for
>blacksmith" i.e. it's guesstimating as well. no reason for google to
>look at every single row for blacksmith to know that there's about 5.6
>million.

But if you go to eBay, they always give you an accurate count. Even if the no.
of items found is pretty large (example: <http://search.ebay.com/new>).

Rainer


From: Decibel! <decibel(at)decibel(dot)org>
To: Rainer Bauer <usenet(at)munnin(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 10:19:16
Message-ID: 20070816101916.GC54135@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Aug 16, 2007 at 12:12:03PM +0200, Rainer Bauer wrote:
> "Scott Marlowe" wrote:
>
> >When I go to amazon.com I only ever get three pages of results. ever.
> > Because they know that returning 190 pages is not that useful, as
> >hardly anyone is going to wander through that many pages.
> >
> >Google, you'll notice says "Results 1 - 10 of about 5,610,000 for
> >blacksmith" i.e. it's guesstimating as well. no reason for google to
> >look at every single row for blacksmith to know that there's about 5.6
> >million.
>
> But if you go to eBay, they always give you an accurate count. Even if the no.
> of items found is pretty large (example: <http://search.ebay.com/new>).

And I'd bet money that they're using a full text search of some kind to
get those results, which isn't remotely close to the same thing as a
generic SELECT count(*).
--
Decibel!, aka Jim Nasby decibel(at)decibel(dot)org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Rainer Bauer <usenet(at)munnin(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 11:09:32
Message-ID: 3ua8c399oncth948vmv0tfach6g86abprv@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Decibel! wrote:

>On Thu, Aug 16, 2007 at 12:12:03PM +0200, Rainer Bauer wrote:
>> "Scott Marlowe" wrote:
>>
>> >When I go to amazon.com I only ever get three pages of results. ever.
>> > Because they know that returning 190 pages is not that useful, as
>> >hardly anyone is going to wander through that many pages.
>> >
>> >Google, you'll notice says "Results 1 - 10 of about 5,610,000 for
>> >blacksmith" i.e. it's guesstimating as well. no reason for google to
>> >look at every single row for blacksmith to know that there's about 5.6
>> >million.
>>
>> But if you go to eBay, they always give you an accurate count. Even if the no.
>> of items found is pretty large (example: <http://search.ebay.com/new>).
>
>And I'd bet money that they're using a full text search of some kind to
>get those results, which isn't remotely close to the same thing as a
>generic SELECT count(*).

Without text search (but with a category restriction):
<http://collectibles.listings.ebay.com/_W0QQsacatZ1QQsocmdZListingItemList>

I only wanted to show a counter-example for a big site which uses pagination
to display result sets and still reports accurate counts.

Anyway, what Phoenix is trying to say is that 2 queries are required: One to
get the total count and one to get the tuples for the current page. I reckon
it would help, if the query returning the result set could also report the
total no. of tuples found. Somthing like
SELECT COUNT(*), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>

Or is there a way to do that?

Rainer


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Rainer Bauer" <usenet(at)munnin(dot)com>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 11:31:14
Message-ID: 87hcmzwwal.fsf@oxford.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

"Rainer Bauer" <usenet(at)munnin(dot)com> writes:

> Anyway, what Phoenix is trying to say is that 2 queries are required: One to
> get the total count and one to get the tuples for the current page. I reckon
> it would help, if the query returning the result set could also report the
> total no. of tuples found. Somthing like
> SELECT COUNT(*), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>
>
> Or is there a way to do that?

Well anything like the above would just report l as the count.

The only way to do it in Postgres currently is to create a temporary table.
Then you can populate it once, then select the count from the temporary table
in one query and the required page from it in the second query.

But temporary tables in Postgres are not really designed for this. In
particular they count as DDL so you have to grant privileges to create tables
to the application and it has to create and delete entries in pg_class for
every use.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: "Trevor Talbot" <quension(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 11:50:11
Message-ID: 90bce5730708160450l6039c733k169e0f101e2cdd32@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 8/16/07, Rainer Bauer <usenet(at)munnin(dot)com> wrote:

> >> But if you go to eBay, they always give you an accurate count. Even if the no.
> >> of items found is pretty large (example: <http://search.ebay.com/new>).
> >
> >And I'd bet money that they're using a full text search of some kind to
> >get those results, which isn't remotely close to the same thing as a
> >generic SELECT count(*).
>
> Without text search (but with a category restriction):
> <http://collectibles.listings.ebay.com/_W0QQsacatZ1QQsocmdZListingItemList>
>
> I only wanted to show a counter-example for a big site which uses pagination
> to display result sets and still reports accurate counts.

Categories are still finite state: you can simply store a count for
each category. Again it's just a case of knowing your data and
queries; it's not trying to solve a general infinite-possibilities
situation.

For instance, the OP mentioned wanting to get data on a particular
trader for the last week. Maintain a summary table that keeps counts
of each trader for each week, and ID bounds for the actual data table.
When you need to query the last 4 weeks, sum(). When you need to
query the last 30 days, sum() 4 weeks + a query on the master table
bounded by timestamp and ID range for the 5th week from the summary
table.

I'm sure there are sites out there that provide precise counts quickly
for extremely complex queries on gigantic datasets, but all the common
stuff is about specifics, not arbitrary queries. There are also
systems other than SQL RDBMS that can be used to drive such reporting.


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Rainer Bauer <usenet(at)munnin(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 12:02:58
Message-ID: 20070816120258.GA15940@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Thu, Aug 16, 2007 at 01:09:32PM +0200, Rainer Bauer wrote:
> Anyway, what Phoenix is trying to say is that 2 queries are required: One to
> get the total count and one to get the tuples for the current page. I reckon
> it would help, if the query returning the result set could also report the
> total no. of tuples found. Somthing like
> SELECT COUNT(*), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>

Well, thee is another possibility, use cursors:

DECLARE CURSOR ... AS <query>;
FETCH 30 -- or however many to want now
MOVE TO END -- or whatever the command is, this gives you the number of rows

Hope this helps,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: Rainer Bauer <usenet(at)munnin(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 15:20:01
Message-ID: 5gj8c3lstvsu9ga10721k7uq8e5pq0j5n8@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

"Trevor Talbot" wrote:

>On 8/16/07, Rainer Bauer <usenet(at)munnin(dot)com> wrote:
>
>> >> But if you go to eBay, they always give you an accurate count. Even if the no.
>> >> of items found is pretty large (example: <http://search.ebay.com/new>).
>> >
>> >And I'd bet money that they're using a full text search of some kind to
>> >get those results, which isn't remotely close to the same thing as a
>> >generic SELECT count(*).
>>
>> Without text search (but with a category restriction):
>> <http://collectibles.listings.ebay.com/_W0QQsacatZ1QQsocmdZListingItemList>
>>
>> I only wanted to show a counter-example for a big site which uses pagination
>> to display result sets and still reports accurate counts.
>
>Categories are still finite state: you can simply store a count for
>each category. Again it's just a case of knowing your data and
>queries; it's not trying to solve a general infinite-possibilities
>situation.

Consider this query with multiple WHERE conditions:
<http://search.ebay.com/ne-ol-an_W0QQfasiZ1QQfbdZ1QQfcdZ1QQfcidZ77QQfclZ3QQfmcZ1QQfrppZ50QQfsooZ1QQfsopZ1QQftidZ1QQpriceZ1QQsabdhiZ100QQsacurZ999QQsalicZQ2d15QQsaprchiZ50000QQsatitleZQ28neQ2aQ2colQ2aQ2canQ2aQ29QQsojsZ0>

My point is that whatever search criterias are involved and how many items are found eBay always returns the *accurate* number of items found.

Before this drifts off:
* I do know *why* count(*) is slow using Postgres.
* I *think* that count(*) is fast on eBay because count is cheaper using Oracle (which eBay does: <http://www.sun.com/customers/index.xml?c=ebay.xml>).
* I realize that pagination for multi-million tuple results does not make sense.

Rainer


From: Rainer Bauer <usenet(at)munnin(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 15:20:09
Message-ID: man8c3tdchp1umuf8vu5kvuujrr1vfvmkj@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Gregory Stark wrote:

>"Rainer Bauer" <usenet(at)munnin(dot)com> writes:
>
>> Anyway, what Phoenix is trying to say is that 2 queries are required: One to
>> get the total count and one to get the tuples for the current page. I reckon
>> it would help, if the query returning the result set could also report the
>> total no. of tuples found. Somthing like
>> SELECT COUNT(*), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>
>>
>> Or is there a way to do that?
>
>Well anything like the above would just report l as the count.

True, but what about this:

SELECT (SELECT COUNT(*) FROM <table> WHERE <cond>), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>

I just tested this on a query and its about 5-10% faster than issuing both commands separately (caching effects?). I wonder whether there would be any chance that Postgres could detect that the "count" select and the "data" select result in the same query plan?

In my example (which is included below) the hash join is executed twice.

>The only way to do it in Postgres currently is to create a temporary table.
>Then you can populate it once, then select the count from the temporary table
>in one query and the required page from it in the second query.
>
>But temporary tables in Postgres are not really designed for this. In
>particular they count as DDL so you have to grant privileges to create tables
>to the application and it has to create and delete entries in pg_class for
>every use.

Well I don't think popuplating a temporary table with possible millions of rows is faster than executing the query twice. Remember that a performance problem only occurs if there are a lot of tuples returned.

Rainer

======================================================================

This is the count query:
SELECT COUNT(*) FROM "tblItem" AS i INNER JOIN
(SELECT "intItemID" FROM "tblItem2Category" WHERE "intCategoryID"=88869805) AS vrtChild ON i."intItemIDCnt"=vrtChild."intItemID"
WHERE ("intTimeEnd" < 1187273177)

This is the select analyse output:
"Aggregate (cost=356098.46..356098.47 rows=1 width=0) (actual time=29411.570..29411.570 rows=1 loops=1)"
" -> Hash Join (cost=177545.23..350137.60 rows=2384343 width=0) (actual time=17382.286..28864.851 rows=2383740 loops=1)"
" Hash Cond: ("tblItem2Category"."intItemID" = i."intItemIDCnt")"
" -> Bitmap Heap Scan on "tblItem2Category" (cost=41560.03..134660.50 rows=2561397 width=4) (actual time=1984.006..11048.762 rows=2513204 loops=1)"
" Recheck Cond: ("intCategoryID" = 88869805)"
" -> Bitmap Index Scan on ccitem2categorycategoryidix (cost=0.00..40919.69 rows=2561397 width=0) (actual time=1980.614..1980.614 rows=2513204 loops=1)"
" Index Cond: ("intCategoryID" = 88869805)"
" -> Hash (cost=95316.41..95316.41 rows=2339583 width=4) (actual time=15024.827..15024.827 rows=2383832 loops=1)"
" -> Seq Scan on "tblItem" i (cost=0.00..95316.41 rows=2339583 width=4) (actual time=8.634..13763.878 rows=2383832 loops=1)"
" Filter: ("intTimeEnd" < 1187273177)"
"Total runtime: 29411.668 ms"

======================================================================

This is the data query:
SELECT i."intItemIDCnt" FROM "tblItem" AS i INNER JOIN
(SELECT "intItemID" FROM "tblItem2Category" WHERE "intCategoryID"=88869805) AS vrtChild ON i."intItemIDCnt"=vrtChild."intItemID"
WHERE ("intTimeEnd" < 1187273177)
ORDER BY "intFlagListingFeatured" DESC, "intTimeEnd", "intItemIDCnt" OFFSET 500 LIMIT 50

This is the select analyse output:
"Limit (cost=733011.30..733011.42 rows=50 width=12) (actual time=37852.007..37852.058 rows=50 loops=1)"
" -> Sort (cost=733010.05..738970.91 rows=2384343 width=12) (actual time=37851.581..37851.947 rows=550 loops=1)"
" Sort Key: i."intFlagListingFeatured", i."intTimeEnd", i."intItemIDCnt""
" -> Hash Join (cost=179830.23..354707.60 rows=2384343 width=12) (actual time=17091.753..29040.425 rows=2383740 loops=1)"
" Hash Cond: ("tblItem2Category"."intItemID" = i."intItemIDCnt")"
" -> Bitmap Heap Scan on "tblItem2Category" (cost=41560.03..134660.50 rows=2561397 width=4) (actual time=1976.599..10970.394 rows=2513204 loops=1)"
" Recheck Cond: ("intCategoryID" = 88869805)"
" -> Bitmap Index Scan on ccitem2categorycategoryidix (cost=0.00..40919.69 rows=2561397 width=0) (actual time=1973.160..1973.160 rows=2513204 loops=1)"
" Index Cond: ("intCategoryID" = 88869805)"
" -> Hash (cost=95316.41..95316.41 rows=2339583 width=12) (actual time=14758.256..14758.256 rows=2383832 loops=1)"
" -> Seq Scan on "tblItem" i (cost=0.00..95316.41 rows=2339583 width=12) (actual time=8.592..13373.179 rows=2383832 loops=1)"
" Filter: ("intTimeEnd" < 1187273177)"
"Total runtime: 38247.533 ms"

======================================================================

This is the combined count/data query:
SELECT (SELECT COUNT(*) FROM "tblItem" AS i INNER JOIN (SELECT "intItemID" FROM "tblItem2Category" WHERE "intCategoryID"=88869805) AS vrtChild ON i."intItemIDCnt"=vrtChild."intItemID" WHERE ("intTimeEnd" < 1187273177)), i."intItemIDCnt" FROM "tblItem" AS i INNER JOIN (SELECT "intItemID" FROM "tblItem2Category" WHERE "intCategoryID"=88869805) AS vrtChild ON i."intItemIDCnt"=vrtChild."intItemID" WHERE ("intTimeEnd" < 1187273177) ORDER BY "intFlagListingFeatured" DESC, "intTimeEnd", "intItemIDCnt" OFFSET 500 LIMIT 50

"Limit (cost=1089109.77..1089109.90 rows=50 width=12) (actual time=62547.673..62547.727 rows=50 loops=1)"
" InitPlan"
" -> Aggregate (cost=356098.46..356098.47 rows=1 width=0) (actual time=16385.927..16385.927 rows=1 loops=1)"
" -> Hash Join (cost=177545.23..350137.60 rows=2384343 width=0) (actual time=4320.749..15843.759 rows=2383740 loops=1)"
" Hash Cond: ("tblItem2Category"."intItemID" = i."intItemIDCnt")"
" -> Bitmap Heap Scan on "tblItem2Category" (cost=41560.03..134660.50 rows=2561397 width=4) (actual time=293.025..9211.673 rows=2513204 loops=1)"
" Recheck Cond: ("intCategoryID" = 88869805)"
" -> Bitmap Index Scan on ccitem2categorycategoryidix (cost=0.00..40919.69 rows=2561397 width=0) (actual time=289.602..289.602 rows=2513204 loops=1)"
" Index Cond: ("intCategoryID" = 88869805)"
" -> Hash (cost=95316.41..95316.41 rows=2339583 width=4) (actual time=3461.292..3461.292 rows=2383832 loops=1)"
" -> Seq Scan on "tblItem" i (cost=0.00..95316.41 rows=2339583 width=4) (actual time=3.658..1692.979 rows=2383832 loops=1)"
" Filter: ("intTimeEnd" < 1187273177)"
" -> Sort (cost=733010.05..738970.91 rows=2384343 width=12) (actual time=62547.242..62547.616 rows=550 loops=1)"
" Sort Key: i."intFlagListingFeatured", i."intTimeEnd", i."intItemIDCnt""
" -> Hash Join (cost=179830.23..354707.60 rows=2384343 width=12) (actual time=38625.024..51768.452 rows=2383740 loops=1)"
" Hash Cond: ("tblItem2Category"."intItemID" = i."intItemIDCnt")"
" -> Bitmap Heap Scan on "tblItem2Category" (cost=41560.03..134660.50 rows=2561397 width=4) (actual time=1968.052..2893.092 rows=2513204 loops=1)"
" Recheck Cond: ("intCategoryID" = 88869805)"
" -> Bitmap Index Scan on ccitem2categorycategoryidix (cost=0.00..40919.69 rows=2561397 width=0) (actual time=1964.642..1964.642 rows=2513204 loops=1)"
" Index Cond: ("intCategoryID" = 88869805)"
" -> Hash (cost=95316.41..95316.41 rows=2339583 width=12) (actual time=19915.284..19915.284 rows=2383832 loops=1)"
" -> Seq Scan on "tblItem" i (cost=0.00..95316.41 rows=2339583 width=12) (actual time=8.622..18369.696 rows=2383832 loops=1)"
" Filter: ("intTimeEnd" < 1187273177)"
"Total runtime: 63042.165 ms"

======================================================================


From: Rainer Bauer <usenet(at)munnin(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 15:20:13
Message-ID: 7cq8c3dar3u22at8i7siiqfigsm45dr52c@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Martijn van Oosterhout wrote:

>On Thu, Aug 16, 2007 at 01:09:32PM +0200, Rainer Bauer wrote:
>> Anyway, what Phoenix is trying to say is that 2 queries are required: One to
>> get the total count and one to get the tuples for the current page. I reckon
>> it would help, if the query returning the result set could also report the
>> total no. of tuples found. Somthing like
>> SELECT COUNT(*), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>
>
>Well, thee is another possibility, use cursors:
>
>DECLARE CURSOR ... AS <query>;
>FETCH 30 -- or however many to want now
>MOVE TO END -- or whatever the command is, this gives you the number of rows
>
>Hope this helps,

Thanks! I will give this a try.

Rainer


From: "Phoenix Kiula" <phoenix(dot)kiula(at)gmail(dot)com>
To: "Rainer Bauer" <usenet(at)munnin(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 15:35:05
Message-ID: e373d31e0708160835h5f3372at77192011915f8cec@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 16/08/07, Rainer Bauer <usenet(at)munnin(dot)com> wrote:
> Gregory Stark wrote:
>
> >"Rainer Bauer" <usenet(at)munnin(dot)com> writes:
> >
> >> Anyway, what Phoenix is trying to say is that 2 queries are required: One to
> >> get the total count and one to get the tuples for the current page. I reckon
> >> it would help, if the query returning the result set could also report the
> >> total no. of tuples found. Somthing like
> >> SELECT COUNT(*), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>
> >>
> >> Or is there a way to do that?
> >
> >Well anything like the above would just report l as the count.
>
> True, but what about this:
>
> SELECT (SELECT COUNT(*) FROM <table> WHERE <cond>), * FROM <table> WHERE <cond> OFFSET <o> LIMIT <l>
>

Whoa, this may not please SQL puritans but I love it! And yes, it is
cached. I find the idea of temporary tables and storing counts for
different 'slices' of my data untenable with all the complex mishmash
of triggers and such. The count(*) query seems to take a bit in the
beginning but works ok thereafter because it seems to be auto-cached.
Sweet. Thanks for sharing!!


From: Bill Moran <wmoran(at)potentialtech(dot)com>
To: Rainer Bauer <usenet(at)munnin(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 15:35:57
Message-ID: 20070816113557.80e4af1b.wmoran@potentialtech.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

In response to Rainer Bauer <usenet(at)munnin(dot)com>:

> "Trevor Talbot" wrote:
>
> >On 8/16/07, Rainer Bauer <usenet(at)munnin(dot)com> wrote:
> >
> >> >> But if you go to eBay, they always give you an accurate count. Even if the no.
> >> >> of items found is pretty large (example: <http://search.ebay.com/new>).
> >> >
> >> >And I'd bet money that they're using a full text search of some kind to
> >> >get those results, which isn't remotely close to the same thing as a
> >> >generic SELECT count(*).
> >>
> >> Without text search (but with a category restriction):
> >> <http://collectibles.listings.ebay.com/_W0QQsacatZ1QQsocmdZListingItemList>
> >>
> >> I only wanted to show a counter-example for a big site which uses pagination
> >> to display result sets and still reports accurate counts.
> >
> >Categories are still finite state: you can simply store a count for
> >each category. Again it's just a case of knowing your data and
> >queries; it's not trying to solve a general infinite-possibilities
> >situation.
>
> Consider this query with multiple WHERE conditions:
> <http://search.ebay.com/ne-ol-an_W0QQfasiZ1QQfbdZ1QQfcdZ1QQfcidZ77QQfclZ3QQfmcZ1QQfrppZ50QQfsooZ1QQfsopZ1QQftidZ1QQpriceZ1QQsabdhiZ100QQsacurZ999QQsalicZQ2d15QQsaprchiZ50000QQsatitleZQ28neQ2aQ2colQ2aQ2canQ2aQ29QQsojsZ0>
>
> My point is that whatever search criterias are involved and how many items are found eBay always returns the *accurate* number of items found.

While I don't _want_ to argue with you ... I can't seem to help myself.

How do you _know_ that's the exact number of items? There are 50 items on
that page, the paginator at the bottom shows 97,686 pages, but there's no
way (that I can find) to go to the _last_ page to ensure that said numbers
are correct. It could simply be estimating the number of items and
calculating the # of pages based on that. With 4mil items, a few 1000 off
isn't anything anyone would notice.

> Before this drifts off:
> * I do know *why* count(*) is slow using Postgres.
> * I *think* that count(*) is fast on eBay because count is cheaper using Oracle (which eBay does: <http://www.sun.com/customers/index.xml?c=ebay.xml>).

That could be possible, but it's still speculation at this point.
If someone with Oracle-fu could say for sure one way or the other,
that would be interesting ...

Unless there's data on that sun.com page that provides more detail. It
doesn't seem to be willing to load for me at this point ...

> * I realize that pagination for multi-million tuple results does not make sense.

Then what is the point to this thread? Are we just shooting the breeze at
this point?

--
Bill Moran
http://www.potentialtech.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Rainer Bauer <usenet(at)munnin(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 15:37:06
Message-ID: 4452.1187278626@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Rainer Bauer <usenet(at)munnin(dot)com> writes:
> My point is that whatever search criterias are involved and how many items are found eBay always returns the *accurate* number of items found.

And exactly how do you know that that's true?

regards, tom lane


From: Rodrigo De León <rdeleonp(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 15:51:33
Message-ID: 1187279493.226383.306630@d55g2000hsg.googlegroups.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On Aug 16, 5:19 am, deci(dot)(dot)(dot)(at)decibel(dot)org (Decibel!) wrote:
> On Thu, Aug 16, 2007 at 12:12:03PM +0200, Rainer Bauer wrote:
> > "Scott Marlowe" wrote:
> > But if you go to eBay, they always give you an accurate count. Even if the no.
> > of items found is pretty large (example: <http://search.ebay.com/new>).
>
> And I'd bet money that they're using a full text search of some kind to
> get those results, which isn't remotely close to the same thing as a
> generic SELECT count(*).

http://www.addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf

Search for the text "Scaling Search".

Interesting stuff.


From: Rainer Bauer <usenet(at)munnin(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 15:53:56
Message-ID: r6s8c39e76mic12ai56akdauj1rv0ahuti@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Bill Moran wrote:

>> Consider this query with multiple WHERE conditions:
>> <http://search.ebay.com/ne-ol-an_W0QQfasiZ1QQfbdZ1QQfcdZ1QQfcidZ77QQfclZ3QQfmcZ1QQfrppZ50QQfsooZ1QQfsopZ1QQftidZ1QQpriceZ1QQsabdhiZ100QQsacurZ999QQsalicZQ2d15QQsaprchiZ50000QQsatitleZQ28neQ2aQ2colQ2aQ2canQ2aQ29QQsojsZ0>
>>
>> My point is that whatever search criterias are involved and how many items are found eBay always returns the *accurate* number of items found.
>
>While I don't _want_ to argue with you ... I can't seem to help myself.
>
>How do you _know_ that's the exact number of items? There are 50 items on
>that page, the paginator at the bottom shows 97,686 pages, but there's no
>way (that I can find) to go to the _last_ page to ensure that said numbers
>are correct. It could simply be estimating the number of items and
>calculating the # of pages based on that. With 4mil items, a few 1000 off
>isn't anything anyone would notice.

No, those numbers are correct. You can go to the last page using the input box
at the end of the listing "Go to page". However, with a 4 million result set,
the count will have changed while the last page is retrieved (new items have
been listed and otheres have ended in the mean time). You will have to check
this out with a search yielding fewer results (couple of hundred thousands)
and load the first and last page simultaneously.

>> * I realize that pagination for multi-million tuple results does not make sense.
>
>Then what is the point to this thread? Are we just shooting the breeze at
>this point?

I just wanted to show an example where accurate counts are reported for large
search results.

Rainer


From: Rainer Bauer <usenet(at)munnin(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 15:54:01
Message-ID: als8c3pk9kqeuc62artoikr5g82o2gbk8e@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

Tom Lane wrote:

>Rainer Bauer <usenet(at)munnin(dot)com> writes:
>> My point is that whatever search criterias are involved and how many items are found eBay always returns the *accurate* number of items found.
>
>And exactly how do you know that that's true?

5 years experience with developing a browser for eBay?

Rainer


From: "Trevor Talbot" <quension(at)gmail(dot)com>
To: "PgSQL General ML" <pgsql-general(at)postgresql(dot)org>
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 16:01:03
Message-ID: 90bce5730708160901t5e0519cbt60b5769ab9b510af@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 8/16/07, Rainer Bauer <usenet(at)munnin(dot)com> wrote:

> My point is that whatever search criterias are involved and how many items are found eBay always returns the *accurate* number of items found.
>
> Before this drifts off:
> * I do know *why* count(*) is slow using Postgres.
> * I *think* that count(*) is fast on eBay because count is cheaper using Oracle (which eBay does: <http://www.sun.com/customers/index.xml?c=ebay.xml>).
> * I realize that pagination for multi-million tuple results does not make sense.

You got me curious, so I went hunting for more hints on what eBay
actually does, and found these slides from a presentation given by two
eBay engineers last year:
http://www.addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf

It's, er, a whole different ballgame there. Database behavior is
barely involved in their searching; they do joins and RI across
database clusters within the _application_. I knew eBay was big, but
wow...


From: Rainer Bauer <usenet(at)munnin(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 16:28:58
Message-ID: bdu8c3t15lgv49oertk10ld64i0qjoo2c8@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

"Trevor Talbot" wrote:

>On 8/16/07, Rainer Bauer <usenet(at)munnin(dot)com> wrote:
>
>> My point is that whatever search criterias are involved and how many items are found eBay always returns the *accurate* number of items found.
>>
>> Before this drifts off:
>> * I do know *why* count(*) is slow using Postgres.
>> * I *think* that count(*) is fast on eBay because count is cheaper using Oracle (which eBay does: <http://www.sun.com/customers/index.xml?c=ebay.xml>).
>> * I realize that pagination for multi-million tuple results does not make sense.
>
>You got me curious, so I went hunting for more hints on what eBay
>actually does, and found these slides from a presentation given by two
>eBay engineers last year:
>http://www.addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf

Quite interesting.

>It's, er, a whole different ballgame there. Database behavior is
>barely involved in their searching; they do joins and RI across
>database clusters within the _application_. I knew eBay was big, but
>wow...

Well then: forget the Oracle count(*) argument :-(

Rainer


From: "Scott Marlowe" <scott(dot)marlowe(at)gmail(dot)com>
To: "Rainer Bauer" <usenet(at)munnin(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 18:38:35
Message-ID: dcc563d10708161138w20c57be0gcec023c5a001295f@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

On 8/16/07, Rainer Bauer <usenet(at)munnin(dot)com> wrote:
> "Trevor Talbot" wrote:
>
> >On 8/16/07, Rainer Bauer <usenet(at)munnin(dot)com> wrote:
> >
> >> My point is that whatever search criterias are involved and how many items are found eBay always returns the *accurate* number of items found.
> >>
> >> Before this drifts off:
> >> * I do know *why* count(*) is slow using Postgres.
> >> * I *think* that count(*) is fast on eBay because count is cheaper using Oracle (which eBay does: <http://www.sun.com/customers/index.xml?c=ebay.xml>).
> >> * I realize that pagination for multi-million tuple results does not make sense.
> >
> >You got me curious, so I went hunting for more hints on what eBay
> >actually does, and found these slides from a presentation given by two
> >eBay engineers last year:
> >http://www.addsimplicity.com/downloads/eBaySDForum2006-11-29.pdf
>
> Quite interesting.
>
> >It's, er, a whole different ballgame there. Database behavior is
> >barely involved in their searching; they do joins and RI across
> >database clusters within the _application_. I knew eBay was big, but
> >wow...
>
> Well then: forget the Oracle count(*) argument :-(

FYI, I went to the ebay page you posted, which listed something like
98011 pages, and asked for page 96000. It searched for about a minute
and timed out with the error message

There was a problem executing your request. Please try again.

Tried it again, twice, about 5 minutes apart, and got the same error each time.

So I'm guessing that ebay is better at making your THINK it has the
exact count than actually having the exact count.


From: Rainer Bauer <usenet(at)munnin(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Yet Another COUNT(*)...WHERE...question
Date: 2007-08-16 19:32:19
Message-ID: q499c3p18grufr921991kh7s99rea7hvmj@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general

"Scott Marlowe" wrote:

>FYI, I went to the ebay page you posted, which listed something like
>98011 pages, and asked for page 96000. It searched for about a minute
>and timed out with the error message
>
>There was a problem executing your request. Please try again.
>
>Tried it again, twice, about 5 minutes apart, and got the same error each time.
>
>So I'm guessing that ebay is better at making your THINK it has the
>exact count than actually having the exact count.

Well that depends on the current traffic on eBay. It worked allright this
afternoon (GMT), but I get the same error message now (btw, a few years ago
you only got an error message if a search retruned more than 1M items).

Try it with this search: <http://search.ebay.com/new> which gives about 2.2M
items.

Rainer