JOIN not being calculated correctly

Lists: pgsql-sql
From: "Scott Pederick" <scott(at)pederick(dot)com>
To: <pgsql-sql(at)postgresql(dot)org>
Subject: JOIN not being calculated correctly
Date: 2004-10-27 11:02:35
Message-ID: 001e01c4bc14$78cbda90$1d7af0dc@boblaptop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Hi all!

I'm having a problem with Postgresql 8.0.0-beta4 on a Win2K machine and a
particular join.

I've got two tables - a list of customers and jobs they've had. A customer
can have multiple jobs.

The query always scans the entire jobs table for each customer - I need it
the other way around so I can get a list of the customers who have at least
one job.

The EXPLAIN shows the jobs table is being scanned for some reason:

Quarry=# EXPLAIN SELECT Customers.CustomerId, Jobs.JobId FROM Customers
INNER JOIN Jobs USING (CustomerId);
QUERY PLAN
-------------------------------------------------------------------------
Hash Join (cost=78.54..4908.71 rows=70727 width=8)
Hash Cond: ("outer".customerid = "inner".customerid)
-> Seq Scan on jobs (cost=0.00..3769.27 rows=70727 width=8)
-> Hash (cost=76.03..76.03 rows=1003 width=4)
-> Seq Scan on customers (cost=0.00..76.03 rows=1003 width=4)
(5 rows)

Even if I reverse the JOIN I get the exact same result:

Quarry=# EXPLAIN SELECT Customers.CustomerId, Jobs.JobId FROM Jobs INNER
JOIN Customers USING (CustomerId);
QUERY PLAN
-------------------------------------------------------------------------
Hash Join (cost=78.54..4908.71 rows=70727 width=8)
Hash Cond: ("outer".customerid = "inner".customerid)
-> Seq Scan on jobs (cost=0.00..3769.27 rows=70727 width=8)
-> Hash (cost=76.03..76.03 rows=1003 width=4)
-> Seq Scan on customers (cost=0.00..76.03 rows=1003 width=4)
(5 rows)

How can I force it to operate as I need it to? It seems the query engine is
a little smarter than it needs to be.

If anyone can shed some light on this problem, it would be greatly
appreciated. I've taken it as far as I can and don't really know where to
move from here.

Thanks in advance,

Scott Pederick


From: Andrew Hammond <ahammond(at)ca(dot)afilias(dot)info>
To: Scott Pederick <scott(at)pederick(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: JOIN not being calculated correctly
Date: 2004-11-02 18:46:41
Message-ID: 4187D611.2000109@ca.afilias.info
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Posting the EXPLAIN is a good, an EXPLAIN ANALYZE would be better
(assuming your dataset is small enough for it to complete in this
lifetime). You also need to include the following information:

1) The schema involved, including information about indexes being used.

2) Have you vacuumed / analyzed the tables involved recently?

3) Have you modified the stats on any of the tables / columns involve or
are you using defaults?

Drew

Scott Pederick wrote:
| Hi all!
|
| I'm having a problem with Postgresql 8.0.0-beta4 on a Win2K machine and a
| particular join.
|
| I've got two tables - a list of customers and jobs they've had. A customer
| can have multiple jobs.
|
| The query always scans the entire jobs table for each customer - I need it
| the other way around so I can get a list of the customers who have at
least
| one job.
|
| The EXPLAIN shows the jobs table is being scanned for some reason:
|
| Quarry=# EXPLAIN SELECT Customers.CustomerId, Jobs.JobId FROM Customers
| INNER JOIN Jobs USING (CustomerId);
| QUERY PLAN
| -------------------------------------------------------------------------
| Hash Join (cost=78.54..4908.71 rows=70727 width=8)
| Hash Cond: ("outer".customerid = "inner".customerid)
| -> Seq Scan on jobs (cost=0.00..3769.27 rows=70727 width=8)
| -> Hash (cost=76.03..76.03 rows=1003 width=4)
| -> Seq Scan on customers (cost=0.00..76.03 rows=1003 width=4)
| (5 rows)
|
|
|
| Even if I reverse the JOIN I get the exact same result:
|
| Quarry=# EXPLAIN SELECT Customers.CustomerId, Jobs.JobId FROM Jobs INNER
| JOIN Customers USING (CustomerId);
| QUERY PLAN
| -------------------------------------------------------------------------
| Hash Join (cost=78.54..4908.71 rows=70727 width=8)
| Hash Cond: ("outer".customerid = "inner".customerid)
| -> Seq Scan on jobs (cost=0.00..3769.27 rows=70727 width=8)
| -> Hash (cost=76.03..76.03 rows=1003 width=4)
| -> Seq Scan on customers (cost=0.00..76.03 rows=1003 width=4)
| (5 rows)
|
|
| How can I force it to operate as I need it to? It seems the query
engine is
| a little smarter than it needs to be.
|
| If anyone can shed some light on this problem, it would be greatly
| appreciated. I've taken it as far as I can and don't really know where to
| move from here.

- --
Andrew Hammond 416-673-4138 ahammond(at)ca(dot)afilias(dot)info
Database Administrator, Afilias Canada Corp.
CB83 2838 4B67 D40F D086 3568 81FC E7E5 27AF 4A9A
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (GNU/Linux)

iD8DBQFBh9YQgfzn5SevSpoRAg0LAKCg5K7IccFIOvdTc8DEl2YaUMcUCgCfRt2Q
CI1Vo6yxHkrWcoTQMQ/EvOw=
=m15B
-----END PGP SIGNATURE-----


From: "Scott Pederick" <sql(dot)postgresql(dot)org(at)pederick(dot)com>
To: <pgsql-sql(at)postgresql(dot)org>
Subject: Re: JOIN not being calculated correctly
Date: 2004-11-20 21:37:54
Message-ID: 000501c4cf49$31c346e0$e1f9f4dc@boblaptop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Hi Andrew,

Thanks for your response - sorry it's taken so long to reply, I've been out
of contact for a while.

Still can't fathom why it's not using an index scan on the jobs table..

: Posting the EXPLAIN is a good, an EXPLAIN ANALYZE would be better
: (assuming your dataset is small enough for it to complete in this
: lifetime). You also need to include the following information:
:
: 1) The schema involved, including information about indexes being used.

Here's the EXPLAIN ANALYZE, schemas and indexes:

Quarry=> EXPLAIN ANALYZE SELECT Customers.CustomerId, Jobs.JobId FROM
Customers INNER JOIN Jobs USING (CustomerId);
QUERY PLAN
----------------------------------------------------------------------------
------------------------------------------
Hash Join (cost=78.54..7924.71 rows=70727 width=8) (actual
time=50.000..21941.000 rows=70727 loops=1)
Hash Cond: ("outer".customerid = "inner".customerid)
-> Seq Scan on jobs (cost=0.00..6785.27 rows=70727 width=8) (actual
time=40.000..21040.000 rows=70727 loops=1)
-> Hash (cost=76.03..76.03 rows=1003 width=4) (actual
time=10.000..10.000 rows=0 loops=1)
-> Seq Scan on customers (cost=0.00..76.03 rows=1003 width=4)
(actual time=0.000..0.000 rows=1003 loops=1)
Total runtime: 22292.000 ms
(6 rows)

Customers
---
CREATE TABLE customers (
customerid serial NOT NULL,
account character varying(6) DEFAULT ''::character varying NOT NULL,
businessname character varying(40) DEFAULT ''::character varying NOT
NULL,
address character varying(30) DEFAULT ''::character varying NOT NULL,
suburb character varying(30) DEFAULT ''::character varying NOT NULL,
postcode character varying(4) DEFAULT ''::character varying NOT NULL,
state character(3) DEFAULT ''::bpchar NOT NULL,
mobile character varying(14) DEFAULT ''::character varying NOT NULL,
fax character varying(14) DEFAULT ''::character varying NOT NULL,
acctitle character varying(4) DEFAULT ''::character varying NOT NULL,
accfirst character varying(14) DEFAULT ''::character varying NOT NULL,
acclast character varying(14) DEFAULT ''::character varying NOT NULL,
accphone character varying(14) DEFAULT ''::character varying NOT NULL,
accfax character(1) DEFAULT ''::bpchar NOT NULL,
accmobile character varying(14) DEFAULT ''::character varying NOT NULL,
alttitle character varying(4) DEFAULT ''::character varying NOT NULL,
altfirst character varying(14) DEFAULT ''::character varying NOT NULL,
altlast character varying(14) DEFAULT ''::character varying NOT NULL,
current numeric(10,2) DEFAULT 0 NOT NULL,
aged30 numeric(10,2) DEFAULT 0 NOT NULL,
aged60 numeric(10,2) DEFAULT 0 NOT NULL,
aged90 numeric(10,2) DEFAULT 0 NOT NULL,
ytd numeric(10,2) DEFAULT 0 NOT NULL,
pg character(1) DEFAULT ''::bpchar NOT NULL,
acclimit character varying(14) DEFAULT ''::character varying NOT NULL,
accauto character(1) DEFAULT ''::bpchar NOT NULL,
text text DEFAULT ''::text NOT NULL,
fee character(1) DEFAULT ''::bpchar NOT NULL,
gst character(1) DEFAULT ''::bpchar NOT NULL,
stop character(1) DEFAULT ''::bpchar NOT NULL
);

Customers - Indexes
---
customers_pkey PRIMARY KEY (customerid);
UNIQUE INDEX customers_account ON customers USING btree (account);

Jobs
---
CREATE TABLE jobs (
jobid serial NOT NULL,
customerid integer,
quarryid integer,
producttypeid integer,
invoiceid integer,
salesid integer,
orderid character varying(15) DEFAULT ''::character varying NOT NULL,
jobdate date,
contactname character varying(40) DEFAULT ''::character varying NOT
NULL,
businessname character varying(40) DEFAULT ''::character varying NOT
NULL,
address character varying(30) DEFAULT ''::character varying NOT NULL,
suburb character varying(30) DEFAULT ''::character varying NOT NULL,
phone character varying(14) DEFAULT ''::character varying NOT NULL,
melway character varying(4) DEFAULT ''::character varying NOT NULL,
melxy character(3) DEFAULT ''::bpchar NOT NULL,
quarryorder numeric(10,2) DEFAULT 0 NOT NULL,
quarrypickup numeric(10,2) DEFAULT 0 NOT NULL,
quarryprice numeric(10,2) DEFAULT 0 NOT NULL,
transport numeric(10,2) DEFAULT 0 NOT NULL,
margin numeric(10,2) DEFAULT 0 NOT NULL,
unit character(1) DEFAULT ''::bpchar NOT NULL,
done character(1) DEFAULT ''::bpchar NOT NULL,
xno character varying(10) DEFAULT ''::character varying NOT NULL,
mzone character(1) DEFAULT ''::bpchar NOT NULL,
invoice numeric(10,2) DEFAULT 0 NOT NULL,
drivercomment character varying(12) DEFAULT ''::character varying NOT
NULL
text text DEFAULT ''::text NOT NULL,
gst numeric(10,2) DEFAULT 0 NOT NULL,
distance numeric(7,2) DEFAULT 0 NOT NULL,
productdefinitionid integer,
customerprice numeric(10,2) DEFAULT 0 NOT NULL
);

Jobs Keys/Indexes
---
jobs_pkey PRIMARY KEY (jobid);
INDEX jobs_customerid ON jobs USING btree (customerid);
INDEX jobs_jobdate ON jobs USING btree (jobdate);
INDEX jobs_quarryid ON jobs USING btree (quarryid);

CONSTRAINT jobs_customerid_fkey FOREIGN KEY (customerid) REFERENCES
customers(customerid) ON UPDATE RESTRICT ON DELETE RESTRICT;
CONSTRAINT jobs_invoiceid_fkey FOREIGN KEY (invoiceid) REFERENCES
invoices(invoiceid) ON UPDATE RESTRICT ON DELETE RESTRICT;
ADD CONSTRAINT jobs_productdefinitionid_fkey FOREIGN KEY
(productdefinitionid) REFERENCES producttypes(producttypeid) ON UPDATE
RESTRICT;
CONSTRAINT jobs_producttypeid_fkey FOREIGN KEY (producttypeid) REFERENCES
producttypes(producttypeid) ON UPDATE RESTRICT ON DELETE RESTRICT;
CONSTRAINT jobs_quarryid_fkey FOREIGN KEY (quarryid) REFERENCES
quarries(quarryid) ON UPDATE RESTRICT ON DELETE RESTRICT;
CONSTRAINT jobs_salesid_fkey FOREIGN KEY (salesid) REFERENCES sales(salesid)
ON UPDATE RESTRICT ON DELETE RESTRICT;

: 2) Have you vacuumed / analyzed the tables involved recently?

Yes, I ran "VACUUM ANALYZE" on both table before I started.

: 3) Have you modified the stats on any of the tables / columns involve or
: are you using defaults?

I'm using the defaults...

Thanks in advance to anyone who has taken the time to wade through this
one... :)

Scott

:
: Drew
:
:
: Scott Pederick wrote:
: | Hi all!
: |
: | I'm having a problem with Postgresql 8.0.0-beta4 on a Win2K machine and
a
: | particular join.
: |
: | I've got two tables - a list of customers and jobs they've had. A
customer
: | can have multiple jobs.
: |
: | The query always scans the entire jobs table for each customer - I need
it
: | the other way around so I can get a list of the customers who have at
: least
: | one job.
: |
: | The EXPLAIN shows the jobs table is being scanned for some reason:
: |
: | Quarry=# EXPLAIN SELECT Customers.CustomerId, Jobs.JobId FROM Customers
: | INNER JOIN Jobs USING (CustomerId);
: | QUERY PLAN
:
| -------------------------------------------------------------------------
: | Hash Join (cost=78.54..4908.71 rows=70727 width=8)
: | Hash Cond: ("outer".customerid = "inner".customerid)
: | -> Seq Scan on jobs (cost=0.00..3769.27 rows=70727 width=8)
: | -> Hash (cost=76.03..76.03 rows=1003 width=4)
: | -> Seq Scan on customers (cost=0.00..76.03 rows=1003 width=4)
: | (5 rows)
: |
: |
: |
: | Even if I reverse the JOIN I get the exact same result:
: |
: | Quarry=# EXPLAIN SELECT Customers.CustomerId, Jobs.JobId FROM Jobs INNER
: | JOIN Customers USING (CustomerId);
: | QUERY PLAN
:
| -------------------------------------------------------------------------
: | Hash Join (cost=78.54..4908.71 rows=70727 width=8)
: | Hash Cond: ("outer".customerid = "inner".customerid)
: | -> Seq Scan on jobs (cost=0.00..3769.27 rows=70727 width=8)
: | -> Hash (cost=76.03..76.03 rows=1003 width=4)
: | -> Seq Scan on customers (cost=0.00..76.03 rows=1003 width=4)
: | (5 rows)
: |
: |
: | How can I force it to operate as I need it to? It seems the query
: engine is
: | a little smarter than it needs to be.
: |
: | If anyone can shed some light on this problem, it would be greatly
: | appreciated. I've taken it as far as I can and don't really know where
to
: | move from here.
:
:
: - --
: Andrew Hammond 416-673-4138 ahammond(at)ca(dot)afilias(dot)info
: Database Administrator, Afilias Canada Corp.
: CB83 2838 4B67 D40F D086 3568 81FC E7E5 27AF 4A9A
: -----BEGIN PGP SIGNATURE-----
: Version: GnuPG v1.2.5 (GNU/Linux)
:
: iD8DBQFBh9YQgfzn5SevSpoRAg0LAKCg5K7IccFIOvdTc8DEl2YaUMcUCgCfRt2Q
: CI1Vo6yxHkrWcoTQMQ/EvOw=
: =m15B
: -----END PGP SIGNATURE-----
:
: ---------------------------(end of broadcast)---------------------------
: TIP 3: if posting/reading through Usenet, please send an appropriate
: subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
: message can get through to the mailing list cleanly
:


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Scott Pederick" <sql(dot)postgresql(dot)org(at)pederick(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: JOIN not being calculated correctly
Date: 2004-11-20 21:51:34
Message-ID: 626.1100987494@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

"Scott Pederick" <sql(dot)postgresql(dot)org(at)pederick(dot)com> writes:
> Still can't fathom why it's not using an index scan on the jobs table..

Why exactly do you think that would make it faster?

The query evidently requires visiting every single jobs row, so a
seqscan seems appropriate to me; indeed I'd say the planner picked
exactly the perfect plan. If you think not, try forcing other plan
choices and see what happens to the runtime.

regards, tom lane