Re: people who buy A, also buy C, D, E

From: "Dan Langille" <dan(at)langille(dot)org>
To: Christoph Haller <ch(at)rodos(dot)fzk(dot)de>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: people who buy A, also buy C, D, E
Date: 2005-04-26 12:35:20
Message-ID: 426DFD48.16933.D44C1F2@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

On 26 Apr 2005 at 14:24, Christoph Haller wrote:

> Dan Langille wrote:
> >
> > The goal of my query is: given a book, what did other people who
> > bought this book also buy? I plan the list the 5 most popular such
> > books. In reality, this isn't about books, but that makes it easier
> > to understand I think.
> >
> > We have a table of customer_id (watch_list_id) and book_id
> > (element_id).
> >
> > freshports.org=# \d watch_list_element
> > Table "public.watch_list_element"
> > Column | Type | Modifiers
> > ---------------+---------+-----------
> > watch_list_id | integer | not null
> > element_id | integer | not null
> > Indexes:
> > "watch_list_element_pkey" primary key, btree (watch_list_id,
> > element_id)
> > "watch_list_element_element_id" btree (element_id)
> > Foreign-key constraints:
> > "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON
> > UPDATE CASCADE ON DELETE CASCADE
> > "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE
> > CASCADE ON DELETE CASCADE
> >
> > freshports.org=#
> >
> > I have a query which returns the needed results:
> >
> > SELECT W.element_id
> > FROM watch_list_element W
> > WHERE w.watch_list_id in (select watch_list_id from
> > watch_list_element where element_id = 54968)
> > GROUP BY W.element_id
> > ORDER BY count(W.watch_list_id) DESC
> > LIMIT 5;
> >
> > But performance is an issue here. So I'm planning to calculate all
> > the possible values and cache them. That is, given each element_id
> > in a watch_list, what are the top 5 element_id values on all the
> > lists on which the original element_id appears?
> >
> > I'm having trouble constructing the query. I'm not even sure I can
> > do this in one select, but that would be nice. Examples and clues
> > are appreciated.
> >
> > Any ideas?
> >
> > Thank you.
> > --
>
> Just two ideas.
>
> 1) Older Postgres versions are notorious for being slow
> on "IN" clauses.
> Does this one (untested) perform better:
>
> SELECT W.element_id, count(W.watch_list_id)
> FROM watch_list_element W
> WHERE EXISTS
> (SELECT * FROM watch_list_element E
> WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
> GROUP BY W.element_id ORDER BY 2 DESC LIMIT 5;

I'm on PostgreSQL 7.4.7:

freshports.org=# explain analyse
freshports.org-# SELECT W.element_id, count(W.watch_list_id)
freshports.org-# FROM watch_list_element W
freshports.org-# WHERE EXISTS
freshports.org-# (SELECT * FROM watch_list_element E
freshports.org(# WHERE E.element_id = 54968 AND W.watch_list_id =
E.watch_list_id)
freshports.org-# GROUP BY W.element_id
freshports.org-# ORDER BY 2 DESC
freshports.org-# LIMIT 5;

QUERY PLAN
----------------------------------------------------------------------
----------------------------------------------------------------------
---------------------------------
Limit (cost=417905.49..417905.51 rows=5 width=8) (actual
time=3142.480..3142.528 rows=5 loops=1)
-> Sort (cost=417905.49..417908.08 rows=1033 width=8) (actual
time=3142.471..3142.486 rows=5 loops=1)
Sort Key: count(watch_list_id)
-> HashAggregate (cost=417851.20..417853.78 rows=1033
width=8) (actual time=3074.170..3112.294 rows=7338 loops=1)
-> Seq Scan on watch_list_element w
(cost=0.00..417506.76 rows=68888 width=8) (actual
time=0.129..2619.989 rows=94018 loops=1)
Filter: (subplan)
SubPlan
-> Index Scan using watch_list_element_pkey
on watch_list_element e (cost=0.00..3.02 rows=1 width=8) (actual
time=0.011..0.011 rows=1 loops=137776)
Index Cond: (($0 = watch_list_id) AND
(element_id = 54968))
Total runtime: 3143.304 ms
(10 rows)

freshports.org=#

Compare that to the original query:

freshports.org=# explain analyse
freshports.org-# SELECT W.element_id
freshports.org-# FROM watch_list_element W
freshports.org-# WHERE w.watch_list_id in (select watch_list_id
from
freshports.org(# watch_list_element where element_id = 54968)
freshports.org-# GROUP BY W.element_id
freshports.org-# ORDER BY count(W.watch_list_id) DESC
freshports.org-# LIMIT 5;

QUERY PLAN
----------------------------------------------------------------------
----------------------------------------------------------------------
--------------------------------------------
Limit (cost=1174.26..1174.28 rows=5 width=8) (actual
time=1786.628..1786.675 rows=5 loops=1)
-> Sort (cost=1174.26..1176.16 rows=758 width=8) (actual
time=1786.618..1786.634 rows=5 loops=1)
Sort Key: count(w.watch_list_id)
-> HashAggregate (cost=1136.11..1138.01 rows=758 width=8)
(actual time=1718.439..1756.290 rows=7338 loops=1)
-> Nested Loop (cost=465.98..1132.32 rows=758
width=8) (actual time=44.372..1292.270 rows=94018 loops=1)
-> HashAggregate (cost=465.98..465.98 rows=6
width=4) (actual time=44.332..47.136 rows=599 loops=1)
-> Index Scan using
watch_list_element_element_id on watch_list_element
(cost=0.00..464.15 rows=735 width=4) (actual time=37.957..41.789
rows=599 loops=1)
Index Cond: (element_id = 54968)
-> Index Scan using watch_list_element_pkey on
watch_list_element w (cost=0.00..109.48 rows=126 width=8) (actual
time=0.049..0.863 rows=157 loops=599)
Index Cond: (w.watch_list_id =
"outer".watch_list_id)
Total runtime: 1787.466 ms
(11 rows)

freshports.org=#

> 2) I suspect calculating all possible values would require time and an
> enormous cache buffer in size as well as re-calculating pretty often.
> So my approach would be trying to tune the query before introducing
> cached results.

Yes, it's pretty extensive: for each element (there are 9000), get me

all the watch lists it appears upon, then get me the top 5 elements
on
*those* watch lists.
--
Dan Langille : http://www.langille.org/
BSDCan - The Technical BSD Conference - http://www.bsdcan.org/
NEW brochure available at http://www.bsdcan.org/2005/advocacy/

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Christoph Haller 2005-04-26 13:27:44 Re: people who buy A, also buy C, D, E
Previous Message Achilleus Mantzios 2005-04-26 12:25:12 Re: people who buy A, also buy C, D, E