Re: people who buy A, also buy C, D, E

From: Christoph Haller <ch(at)rodos(dot)fzk(dot)de>
To: Dan Langille <dan(at)langille(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: people who buy A, also buy C, D, E
Date: 2005-04-26 12:24:37
Message-ID: 426E3305.7D0AF218@rodos.fzk.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Dan Langille wrote:
>
> The goal of my query is: given a book, what did other people who
> bought this book also buy? I plan the list the 5 most popular such
> books. In reality, this isn't about books, but that makes it easier
> to understand I think.
>
> We have a table of customer_id (watch_list_id) and book_id
> (element_id).
>
> freshports.org=# \d watch_list_element
> Table "public.watch_list_element"
> Column | Type | Modifiers
> ---------------+---------+-----------
> watch_list_id | integer | not null
> element_id | integer | not null
> Indexes:
> "watch_list_element_pkey" primary key, btree (watch_list_id,
> element_id)
> "watch_list_element_element_id" btree (element_id)
> Foreign-key constraints:
> "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON
> UPDATE CASCADE ON DELETE CASCADE
> "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE
> CASCADE ON DELETE CASCADE
>
> freshports.org=#
>
> I have a query which returns the needed results:
>
> SELECT W.element_id
> FROM watch_list_element W
> WHERE w.watch_list_id in (select watch_list_id from
> watch_list_element where element_id = 54968)
> GROUP BY W.element_id
> ORDER BY count(W.watch_list_id) DESC
> LIMIT 5;
>
> But performance is an issue here. So I'm planning to calculate all
> the possible values and cache them. That is, given each element_id in
> a watch_list, what are the top 5 element_id values on all the lists
> on which the original element_id appears?
>
> I'm having trouble constructing the query. I'm not even sure I can
> do this in one select, but that would be nice. Examples and clues
> are appreciated.
>
> Any ideas?
>
> Thank you.
> --

Just two ideas.

1) Older Postgres versions are notorious for being slow
on "IN" clauses.
Does this one (untested) perform better:

SELECT W.element_id, count(W.watch_list_id)
FROM watch_list_element W
WHERE EXISTS
(SELECT * FROM watch_list_element E
WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
GROUP BY W.element_id
ORDER BY 2 DESC
LIMIT 5;

2) I suspect calculating all possible values would require time and
an enormous cache buffer in size as well as re-calculating pretty often.
So my approach would be trying to tune the query before introducing
cached results.

HTH

Regards, Christoph

In response to

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Achilleus Mantzios 2005-04-26 12:25:12 Re: people who buy A, also buy C, D, E
Previous Message Christoph Haller 2005-04-26 09:08:13 Re: UPDATE WITH ORDER BY