Re: Why sequential scan when there's a supporting index?

Lists: pgsql-novice
From: Ron Johnson <ron(dot)l(dot)johnson(at)cox(dot)net>
To: PgSQL Novice ML <pgsql-novice(at)postgresql(dot)org>
Subject: Why sequential scan when there's a supporting index?
Date: 2002-05-24 13:14:40
Message-ID: 1022246084.631.26.camel@rebel
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice

Hi,

As you can see, I "VACUUM VERBOSE ANALYZE" my table, describe
the index on the table, then try to find the max() value of
the indexed field. However, EXPLAIN still shows that the
query wants to sequentially scan the table.

Why?

TIA,
Ron

test2=# vacuum verbose analyze t_lane_tx;
NOTICE: --Relation t_lane_tx--
NOTICE: Pages 1785858: Changed 0, Empty 0; Tup 33931294: Vac 0, Keep 0,
UnUsed 0.
Total CPU 77.03s/7.24u sec elapsed 494.00 sec.
NOTICE: Analyzing t_lane_tx
VACUUM
test2=# \d i_lane_tx_tmp
Index "i_lane_tx_tmp"
Column | Type
---------+------
tx_date | date
btree

test2=# explain select max(tx_date) from t_lane_tx;
NOTICE: QUERY PLAN:

Aggregate (cost=2209999.20..2209999.20 rows=1 width=4)
-> Seq Scan on t_lane_tx (cost=0.00..2125170.96 rows=33931296
width=4)

--
+---------------------------------------------------------+
| Ron Johnson, Jr. Home: ron(dot)l(dot)johnson(at)cox(dot)net |
| Jefferson, LA USA http://ronandheather.dhs.org:81 |
| |
| "I have created a government of whirled peas..." |
| Maharishi Mahesh Yogi, 12-May-2002, |
! CNN, Larry King Live |
+---------------------------------------------------------+


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Ron Johnson <ron(dot)l(dot)johnson(at)cox(dot)net>
Cc: PgSQL Novice ML <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Why sequential scan when there's a supporting index?
Date: 2002-05-24 13:54:06
Message-ID: 3024.1022248446@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice

Ron Johnson <ron(dot)l(dot)johnson(at)cox(dot)net> writes:
> test2=# explain select max(tx_date) from t_lane_tx;
> NOTICE: QUERY PLAN:

> Aggregate (cost=2209999.20..2209999.20 rows=1 width=4)
> -> Seq Scan on t_lane_tx (cost=0.00..2125170.96 rows=33931296
> width=4)

I'm beginning to think this should be in the FAQ ...

Try it like this:

regression=# explain select unique1 from tenk1 order by unique1 desc limit 1;

Limit (cost=0.00..0.11 rows=1 width=4)
-> Index Scan Backward using tenk1_unique1 on tenk1 (cost=0.00..1071.99 rows=10000 width=4)

Although the LIMIT clause isn't standard, this approach is attractive
compared to max() because you can fetch any or all values in the row
containing the maximal element, which is a very useful thing. Also,
the approach scales to situations where you want to sort by multiple
columns.

Improving the handling of max() has been on the TODO list for awhile,
but most of the hacker community considers it low priority because of
the availability of the above workaround. Also, Postgres has a very
generalized black-box approach to aggregate functions, so no one's been
able to think of a reasonably clean way to teach the planner that some
aggregates are connected to index sort ordering.

regards, tom lane


From: Ron Johnson <ron(dot)l(dot)johnson(at)cox(dot)net>
To: PgSQL Novice ML <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Why sequential scan when there's a supporting index?
Date: 2002-05-24 14:25:58
Message-ID: 1022250358.1977.52.camel@rebel
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice

On Fri, 2002-05-24 at 08:54, Tom Lane wrote:
> Ron Johnson <ron(dot)l(dot)johnson(at)cox(dot)net> writes:
[snip]
> Improving the handling of max() has been on the TODO list for awhile,
> but most of the hacker community considers it low priority because of
> the availability of the above workaround. Also, Postgres has a very
> generalized black-box approach to aggregate functions, so no one's been
> able to think of a reasonably clean way to teach the planner that some
> aggregates are connected to index sort ordering.

How do I vote on increasing the priority of "fixing max()"?

SELECT MAX(FOO) FROM BAR;
SELECT tx_date, COUNT(*) FROM t_lane_tx GROUP BY tx_date;

These are awfully common statements on proprietary RDBMSs,
and it confuses the _heck_ out of someone migrating to
Postgres.

Btw, "SELECT tx_date, COUNT(*) FROM t_lane_tx GROUP BY tx_date;"
also does a Seq Scan on t_lane_tx. What's the best work-around
for this query?

--
+---------------------------------------------------------+
| Ron Johnson, Jr. Home: ron(dot)l(dot)johnson(at)cox(dot)net |
| Jefferson, LA USA http://ronandheather.dhs.org:81 |
| |
| "I have created a government of whirled peas..." |
| Maharishi Mahesh Yogi, 12-May-2002, |
! CNN, Larry King Live |
+---------------------------------------------------------+


From: Andrew McMillan <andrew(at)catalyst(dot)net(dot)nz>
To: Ron Johnson <ron(dot)l(dot)johnson(at)cox(dot)net>
Cc: PgSQL Novice ML <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Why sequential scan when there's a supporting index?
Date: 2002-05-25 08:52:21
Message-ID: 1022316741.11106.547.camel@kant.mcmillan.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice

On Sat, 2002-05-25 at 02:25, Ron Johnson wrote:
>
> Btw, "SELECT tx_date, COUNT(*) FROM t_lane_tx GROUP BY tx_date;"
> also does a Seq Scan on t_lane_tx. What's the best work-around
> for this query?

There is no work around for this one. In some circumstances the indexes
in a PostgreSQL database will contain 'dirty' information, and so to get
the correct answer in these cases PostgreSQL has to go to the real
table.

For my personal view I'm OK with the current behaviour. It has
tradeoffs, and this is one of the negatives, but although I find myself
doing this interactively quite often I only very rarely find myself
doing it inside an application.

Regards,
Andrew.
--
--------------------------------------------------------------------
Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/ PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201 MOB: +64(21)635-694 OFFICE: +64(4)499-2267
Are you enrolled at http://schoolreunions.co.nz/ yet?


From: Ron Johnson <ron(dot)l(dot)johnson(at)cox(dot)net>
To: PgSQL Novice ML <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Why sequential scan when there's a supporting index?
Date: 2002-05-25 11:35:30
Message-ID: 1022326531.6585.15.camel@rebel
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice

On Sat, 2002-05-25 at 03:52, Andrew McMillan wrote:
> On Sat, 2002-05-25 at 02:25, Ron Johnson wrote:
> >
> > Btw, "SELECT tx_date, COUNT(*) FROM t_lane_tx GROUP BY tx_date;"
> > also does a Seq Scan on t_lane_tx. What's the best work-around
> > for this query?
>
> There is no work around for this one. In some circumstances the indexes
> in a PostgreSQL database will contain 'dirty' information, and so to get
> the correct answer in these cases PostgreSQL has to go to the real
> table.

"Dirty information"? Is this a consequence of READ COMMITTED
transactions?

> For my personal view I'm OK with the current behaviour. It has
> tradeoffs, and this is one of the negatives, but although I find myself
> doing this interactively quite often I only very rarely find myself
> doing it inside an application.

IMO, when a "proprietary" DBA (like me) hears that that statement
does table scans, s/he will be stunned, and wonder what other
"gotchas" are lurking out there awaiting someone who wants to
query enterprise-sized tables. The main reason that I am
researching Postgres (a _real_ database) is to see whether we
can move historical data off the proprietary system, and on to
something less expensive that people can run ad-hoc queries
against...

--
+---------------------------------------------------------+
| Ron Johnson, Jr. Home: ron(dot)l(dot)johnson(at)cox(dot)net |
| Jefferson, LA USA http://ronandheather.dhs.org:81 |
| |
| "I have created a government of whirled peas..." |
| Maharishi Mahesh Yogi, 12-May-2002, |
! CNN, Larry King Live |
+---------------------------------------------------------+


From: "Joshua b(dot) Jore" <josh(at)greentechnologist(dot)org>
To:
Cc: PgSQL Novice ML <pgsql-novice(at)postgresql(dot)org>
Subject: Re: Why sequential scan when there's a supporting index?
Date: 2002-05-25 16:57:39
Message-ID: Pine.BSO.4.44.0205251147450.15363-100000@kitten.greentechnologist.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-novice

On 25 May 2002, Ron Johnson wrote:
> On Sat, 2002-05-25 at 03:52, Andrew McMillan wrote:
> > On Sat, 2002-05-25 at 02:25, Ron Johnson wrote:
> > For my personal view I'm OK with the current behaviour. It has
> > tradeoffs, and this is one of the negatives, but although I find myself
> > doing this interactively quite often I only very rarely find myself
> > doing it inside an application.
>
> IMO, when a "proprietary" DBA (like me) hears that that statement
> does table scans, s/he will be stunned, and wonder what other
> "gotchas" are lurking out there awaiting someone who wants to
> query enterprise-sized tables.

I was wondering the same thing. I read the html manual, O'Reilly's
Practical PostgreSQL, Sam's PostgreSQL Developer's Handbook and have
browsed http://techdocs.postgresql.org. I also lurk on this list and
ocasionally ask and answer questions. Where in all that information am I
supposed to pick up that 'SELECT max(foo) FROM quux' should be written as
'SELECT foo FROM quux ORDER BY foo LIMIT 1'. That's really unintuitive and
if this info isn't in the docs I've read so far, I'm not sure where it is.
Am I supposed to read the source for comments as well? So mostly I just
want to know where these suggested work arounds are documented. Is there a
particular set of keywords that work for searching the hacker's list? I'm
at a loss as to where to get this (obviously) valuable info.

Pointers would be appreciated,

Joshua b. Jore ; http://www.greentechnologist.org ; 10012 11010 11022
10202 1012 2122 11020 10202 10202 11002 1020 1012 11102 11102 11102 1201
11001 11002 10211 11020 10202 10202 11002 11021 1201 11010 11020 10211