Re: people who buy A, also buy C, D, E

Lists: pgsql-sql
From: "Dan Langille" <dan(at)langille(dot)org>
To: pgsql-sql(at)postgresql(dot)org
Cc: dan(at)langille(dot)org
Subject: people who buy A, also buy C, D, E
Date: 2005-04-26 02:21:35
Message-ID: 426D6D6F.7549.B12D9FC@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

The goal of my query is: given a book, what did other people who
bought this book also buy? I plan the list the 5 most popular such
books. In reality, this isn't about books, but that makes it easier
to understand I think.

We have a table of customer_id (watch_list_id) and book_id
(element_id).

freshports.org=# \d watch_list_element
Table "public.watch_list_element"
Column | Type | Modifiers
---------------+---------+-----------
watch_list_id | integer | not null
element_id | integer | not null
Indexes:
"watch_list_element_pkey" primary key, btree (watch_list_id,
element_id)
"watch_list_element_element_id" btree (element_id)
Foreign-key constraints:
"$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON
UPDATE CASCADE ON DELETE CASCADE
"$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE
CASCADE ON DELETE CASCADE

freshports.org=#

I have a query which returns the needed results:

SELECT W.element_id
FROM watch_list_element W
WHERE w.watch_list_id in (select watch_list_id from
watch_list_element where element_id = 54968)
GROUP BY W.element_id
ORDER BY count(W.watch_list_id) DESC
LIMIT 5;

But performance is an issue here. So I'm planning to calculate all
the possible values and cache them. That is, given each element_id in
a watch_list, what are the top 5 element_id values on all the lists
on which the original element_id appears?

I'm having trouble constructing the query. I'm not even sure I can
do this in one select, but that would be nice. Examples and clues
are appreciated.

Any ideas?

Thank you.
--
Dan Langille : http://www.langille.org/
BSDCan - The Technical BSD Conference - http://www.bsdcan.org/
NEW brochure available at http://www.bsdcan.org/2005/advocacy/


From: Christoph Haller <ch(at)rodos(dot)fzk(dot)de>
To: Dan Langille <dan(at)langille(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: people who buy A, also buy C, D, E
Date: 2005-04-26 12:24:37
Message-ID: 426E3305.7D0AF218@rodos.fzk.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Dan Langille wrote:
>
> The goal of my query is: given a book, what did other people who
> bought this book also buy? I plan the list the 5 most popular such
> books. In reality, this isn't about books, but that makes it easier
> to understand I think.
>
> We have a table of customer_id (watch_list_id) and book_id
> (element_id).
>
> freshports.org=# \d watch_list_element
> Table "public.watch_list_element"
> Column | Type | Modifiers
> ---------------+---------+-----------
> watch_list_id | integer | not null
> element_id | integer | not null
> Indexes:
> "watch_list_element_pkey" primary key, btree (watch_list_id,
> element_id)
> "watch_list_element_element_id" btree (element_id)
> Foreign-key constraints:
> "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON
> UPDATE CASCADE ON DELETE CASCADE
> "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE
> CASCADE ON DELETE CASCADE
>
> freshports.org=#
>
> I have a query which returns the needed results:
>
> SELECT W.element_id
> FROM watch_list_element W
> WHERE w.watch_list_id in (select watch_list_id from
> watch_list_element where element_id = 54968)
> GROUP BY W.element_id
> ORDER BY count(W.watch_list_id) DESC
> LIMIT 5;
>
> But performance is an issue here. So I'm planning to calculate all
> the possible values and cache them. That is, given each element_id in
> a watch_list, what are the top 5 element_id values on all the lists
> on which the original element_id appears?
>
> I'm having trouble constructing the query. I'm not even sure I can
> do this in one select, but that would be nice. Examples and clues
> are appreciated.
>
> Any ideas?
>
> Thank you.
> --

Just two ideas.

1) Older Postgres versions are notorious for being slow
on "IN" clauses.
Does this one (untested) perform better:

SELECT W.element_id, count(W.watch_list_id)
FROM watch_list_element W
WHERE EXISTS
(SELECT * FROM watch_list_element E
WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
GROUP BY W.element_id
ORDER BY 2 DESC
LIMIT 5;

2) I suspect calculating all possible values would require time and
an enormous cache buffer in size as well as re-calculating pretty often.
So my approach would be trying to tune the query before introducing
cached results.

HTH

Regards, Christoph


From: Achilleus Mantzios <achill(at)matrix(dot)gatewaynet(dot)com>
To: Christoph Haller <ch(at)rodos(dot)fzk(dot)de>
Cc: Dan Langille <dan(at)langille(dot)org>, <pgsql-sql(at)postgresql(dot)org>
Subject: Re: people who buy A, also buy C, D, E
Date: 2005-04-26 12:25:12
Message-ID: Pine.LNX.4.44.0504261523390.7746-100000@matrix.gatewaynet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

O Christoph Haller έγραψε στις Apr 26, 2005 :

> Dan Langille wrote:
> >
> > The goal of my query is: given a book, what did other people who
> > bought this book also buy? I plan the list the 5 most popular such
> > books. In reality, this isn't about books, but that makes it easier
> > to understand I think.
> >
> > We have a table of customer_id (watch_list_id) and book_id
> > (element_id).
> >
> > freshports.org=# \d watch_list_element
> > Table "public.watch_list_element"
> > Column | Type | Modifiers
> > ---------------+---------+-----------
> > watch_list_id | integer | not null
> > element_id | integer | not null
> > Indexes:
> > "watch_list_element_pkey" primary key, btree (watch_list_id,
> > element_id)
> > "watch_list_element_element_id" btree (element_id)
> > Foreign-key constraints:
> > "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON
> > UPDATE CASCADE ON DELETE CASCADE
> > "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE
> > CASCADE ON DELETE CASCADE
> >
> > freshports.org=#
> >
> > I have a query which returns the needed results:
> >
> > SELECT W.element_id
> > FROM watch_list_element W
> > WHERE w.watch_list_id in (select watch_list_id from
> > watch_list_element where element_id = 54968)
> > GROUP BY W.element_id
> > ORDER BY count(W.watch_list_id) DESC
> > LIMIT 5;
> >
> > But performance is an issue here. So I'm planning to calculate all
> > the possible values and cache them. That is, given each element_id in
> > a watch_list, what are the top 5 element_id values on all the lists
> > on which the original element_id appears?
> >
> > I'm having trouble constructing the query. I'm not even sure I can
> > do this in one select, but that would be nice. Examples and clues
> > are appreciated.
> >
> > Any ideas?
> >
> > Thank you.
> > --
>
> Just two ideas.
>
> 1) Older Postgres versions are notorious for being slow
> on "IN" clauses.
> Does this one (untested) perform better:
>
> SELECT W.element_id, count(W.watch_list_id)
> FROM watch_list_element W
> WHERE EXISTS
> (SELECT * FROM watch_list_element E
> WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
> GROUP BY W.element_id
> ORDER BY 2 DESC
> LIMIT 5;
>
> 2) I suspect calculating all possible values would require time and
> an enormous cache buffer in size as well as re-calculating pretty often.
> So my approach would be trying to tune the query before introducing
> cached results.

AFAIK, problems like this fall into the "Data Mining" field,
and often their solution go beyond some DB arrangments.
A little research wouldn't hurt, IMO.

>
> HTH
>
> Regards, Christoph
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>

--
-Achilleus


From: "Dan Langille" <dan(at)langille(dot)org>
To: Christoph Haller <ch(at)rodos(dot)fzk(dot)de>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: people who buy A, also buy C, D, E
Date: 2005-04-26 12:35:20
Message-ID: 426DFD48.16933.D44C1F2@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On 26 Apr 2005 at 14:24, Christoph Haller wrote:

> Dan Langille wrote:
> >
> > The goal of my query is: given a book, what did other people who
> > bought this book also buy? I plan the list the 5 most popular such
> > books. In reality, this isn't about books, but that makes it easier
> > to understand I think.
> >
> > We have a table of customer_id (watch_list_id) and book_id
> > (element_id).
> >
> > freshports.org=# \d watch_list_element
> > Table "public.watch_list_element"
> > Column | Type | Modifiers
> > ---------------+---------+-----------
> > watch_list_id | integer | not null
> > element_id | integer | not null
> > Indexes:
> > "watch_list_element_pkey" primary key, btree (watch_list_id,
> > element_id)
> > "watch_list_element_element_id" btree (element_id)
> > Foreign-key constraints:
> > "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON
> > UPDATE CASCADE ON DELETE CASCADE
> > "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE
> > CASCADE ON DELETE CASCADE
> >
> > freshports.org=#
> >
> > I have a query which returns the needed results:
> >
> > SELECT W.element_id
> > FROM watch_list_element W
> > WHERE w.watch_list_id in (select watch_list_id from
> > watch_list_element where element_id = 54968)
> > GROUP BY W.element_id
> > ORDER BY count(W.watch_list_id) DESC
> > LIMIT 5;
> >
> > But performance is an issue here. So I'm planning to calculate all
> > the possible values and cache them. That is, given each element_id
> > in a watch_list, what are the top 5 element_id values on all the
> > lists on which the original element_id appears?
> >
> > I'm having trouble constructing the query. I'm not even sure I can
> > do this in one select, but that would be nice. Examples and clues
> > are appreciated.
> >
> > Any ideas?
> >
> > Thank you.
> > --
>
> Just two ideas.
>
> 1) Older Postgres versions are notorious for being slow
> on "IN" clauses.
> Does this one (untested) perform better:
>
> SELECT W.element_id, count(W.watch_list_id)
> FROM watch_list_element W
> WHERE EXISTS
> (SELECT * FROM watch_list_element E
> WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
> GROUP BY W.element_id ORDER BY 2 DESC LIMIT 5;

I'm on PostgreSQL 7.4.7:

freshports.org=# explain analyse
freshports.org-# SELECT W.element_id, count(W.watch_list_id)
freshports.org-# FROM watch_list_element W
freshports.org-# WHERE EXISTS
freshports.org-# (SELECT * FROM watch_list_element E
freshports.org(# WHERE E.element_id = 54968 AND W.watch_list_id =
E.watch_list_id)
freshports.org-# GROUP BY W.element_id
freshports.org-# ORDER BY 2 DESC
freshports.org-# LIMIT 5;

QUERY PLAN
----------------------------------------------------------------------
----------------------------------------------------------------------
---------------------------------
Limit (cost=417905.49..417905.51 rows=5 width=8) (actual
time=3142.480..3142.528 rows=5 loops=1)
-> Sort (cost=417905.49..417908.08 rows=1033 width=8) (actual
time=3142.471..3142.486 rows=5 loops=1)
Sort Key: count(watch_list_id)
-> HashAggregate (cost=417851.20..417853.78 rows=1033
width=8) (actual time=3074.170..3112.294 rows=7338 loops=1)
-> Seq Scan on watch_list_element w
(cost=0.00..417506.76 rows=68888 width=8) (actual
time=0.129..2619.989 rows=94018 loops=1)
Filter: (subplan)
SubPlan
-> Index Scan using watch_list_element_pkey
on watch_list_element e (cost=0.00..3.02 rows=1 width=8) (actual
time=0.011..0.011 rows=1 loops=137776)
Index Cond: (($0 = watch_list_id) AND
(element_id = 54968))
Total runtime: 3143.304 ms
(10 rows)

freshports.org=#

Compare that to the original query:

freshports.org=# explain analyse
freshports.org-# SELECT W.element_id
freshports.org-# FROM watch_list_element W
freshports.org-# WHERE w.watch_list_id in (select watch_list_id
from
freshports.org(# watch_list_element where element_id = 54968)
freshports.org-# GROUP BY W.element_id
freshports.org-# ORDER BY count(W.watch_list_id) DESC
freshports.org-# LIMIT 5;

QUERY PLAN
----------------------------------------------------------------------
----------------------------------------------------------------------
--------------------------------------------
Limit (cost=1174.26..1174.28 rows=5 width=8) (actual
time=1786.628..1786.675 rows=5 loops=1)
-> Sort (cost=1174.26..1176.16 rows=758 width=8) (actual
time=1786.618..1786.634 rows=5 loops=1)
Sort Key: count(w.watch_list_id)
-> HashAggregate (cost=1136.11..1138.01 rows=758 width=8)
(actual time=1718.439..1756.290 rows=7338 loops=1)
-> Nested Loop (cost=465.98..1132.32 rows=758
width=8) (actual time=44.372..1292.270 rows=94018 loops=1)
-> HashAggregate (cost=465.98..465.98 rows=6
width=4) (actual time=44.332..47.136 rows=599 loops=1)
-> Index Scan using
watch_list_element_element_id on watch_list_element
(cost=0.00..464.15 rows=735 width=4) (actual time=37.957..41.789
rows=599 loops=1)
Index Cond: (element_id = 54968)
-> Index Scan using watch_list_element_pkey on
watch_list_element w (cost=0.00..109.48 rows=126 width=8) (actual
time=0.049..0.863 rows=157 loops=599)
Index Cond: (w.watch_list_id =
"outer".watch_list_id)
Total runtime: 1787.466 ms
(11 rows)

freshports.org=#

> 2) I suspect calculating all possible values would require time and an
> enormous cache buffer in size as well as re-calculating pretty often.
> So my approach would be trying to tune the query before introducing
> cached results.

Yes, it's pretty extensive: for each element (there are 9000), get me

all the watch lists it appears upon, then get me the top 5 elements
on
*those* watch lists.
--
Dan Langille : http://www.langille.org/
BSDCan - The Technical BSD Conference - http://www.bsdcan.org/
NEW brochure available at http://www.bsdcan.org/2005/advocacy/


From: Christoph Haller <ch(at)rodos(dot)fzk(dot)de>
To: Dan Langille <dan(at)langille(dot)org>
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: people who buy A, also buy C, D, E
Date: 2005-04-26 13:27:44
Message-ID: 426E41D0.D4103DEF@rodos.fzk.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

Dan Langille wrote:
>
> On 26 Apr 2005 at 14:24, Christoph Haller wrote:
>
> > Dan Langille wrote:
> > >
> > > The goal of my query is: given a book, what did other people who
> > > bought this book also buy? I plan the list the 5 most popular such
> > > books. In reality, this isn't about books, but that makes it easier
> > > to understand I think.
> > >
> > > We have a table of customer_id (watch_list_id) and book_id
> > > (element_id).
> > >
> > > freshports.org=# \d watch_list_element
> > > Table "public.watch_list_element"
> > > Column | Type | Modifiers
> > > ---------------+---------+-----------
> > > watch_list_id | integer | not null
> > > element_id | integer | not null
> > > Indexes:
> > > "watch_list_element_pkey" primary key, btree (watch_list_id,
> > > element_id)
> > > "watch_list_element_element_id" btree (element_id)
> > > Foreign-key constraints:
> > > "$2" FOREIGN KEY (watch_list_id) REFERENCES watch_list(id) ON
> > > UPDATE CASCADE ON DELETE CASCADE
> > > "$1" FOREIGN KEY (element_id) REFERENCES element(id) ON UPDATE
> > > CASCADE ON DELETE CASCADE
> > >
> > > freshports.org=#
> > >
> > > I have a query which returns the needed results:
> > >
> > > SELECT W.element_id
> > > FROM watch_list_element W
> > > WHERE w.watch_list_id in (select watch_list_id from
> > > watch_list_element where element_id = 54968)
> > > GROUP BY W.element_id
> > > ORDER BY count(W.watch_list_id) DESC
> > > LIMIT 5;
> > >
> > > But performance is an issue here. So I'm planning to calculate all
> > > the possible values and cache them. That is, given each element_id
> > > in a watch_list, what are the top 5 element_id values on all the
> > > lists on which the original element_id appears?
> > >
> > > I'm having trouble constructing the query. I'm not even sure I can
> > > do this in one select, but that would be nice. Examples and clues
> > > are appreciated.
> > >
> > > Any ideas?
> > >
> > > Thank you.
> > > --
> >
> > Just two ideas.
> >
> > 1) Older Postgres versions are notorious for being slow
> > on "IN" clauses.
> > Does this one (untested) perform better:
> >
> > SELECT W.element_id, count(W.watch_list_id)
> > FROM watch_list_element W
> > WHERE EXISTS
> > (SELECT * FROM watch_list_element E
> > WHERE E.element_id = 54968 AND W.watch_list_id = E.watch_list_id)
> > GROUP BY W.element_id ORDER BY 2 DESC LIMIT 5;
>
> I'm on PostgreSQL 7.4.7:
>
> freshports.org=# explain analyse
> freshports.org-# SELECT W.element_id, count(W.watch_list_id)
> freshports.org-# FROM watch_list_element W
> freshports.org-# WHERE EXISTS
> freshports.org-# (SELECT * FROM watch_list_element E
> freshports.org(# WHERE E.element_id = 54968 AND W.watch_list_id =
> E.watch_list_id)
> freshports.org-# GROUP BY W.element_id
> freshports.org-# ORDER BY 2 DESC
> freshports.org-# LIMIT 5;
>
> QUERY PLAN
> ----------------------------------------------------------------------
> ----------------------------------------------------------------------
> ---------------------------------
> Limit (cost=417905.49..417905.51 rows=5 width=8) (actual
> time=3142.480..3142.528 rows=5 loops=1)
> -> Sort (cost=417905.49..417908.08 rows=1033 width=8) (actual
> time=3142.471..3142.486 rows=5 loops=1)
> Sort Key: count(watch_list_id)
> -> HashAggregate (cost=417851.20..417853.78 rows=1033
> width=8) (actual time=3074.170..3112.294 rows=7338 loops=1)
> -> Seq Scan on watch_list_element w
> (cost=0.00..417506.76 rows=68888 width=8) (actual
> time=0.129..2619.989 rows=94018 loops=1)
> Filter: (subplan)
> SubPlan
> -> Index Scan using watch_list_element_pkey
> on watch_list_element e (cost=0.00..3.02 rows=1 width=8) (actual
> time=0.011..0.011 rows=1 loops=137776)
> Index Cond: (($0 = watch_list_id) AND
> (element_id = 54968))
> Total runtime: 3143.304 ms
> (10 rows)
>
> freshports.org=#
>
> Compare that to the original query:
>
> freshports.org=# explain analyse
> freshports.org-# SELECT W.element_id
> freshports.org-# FROM watch_list_element W
> freshports.org-# WHERE w.watch_list_id in (select watch_list_id
> from
> freshports.org(# watch_list_element where element_id = 54968)
> freshports.org-# GROUP BY W.element_id
> freshports.org-# ORDER BY count(W.watch_list_id) DESC
> freshports.org-# LIMIT 5;
>
> QUERY PLAN
> ----------------------------------------------------------------------
> ----------------------------------------------------------------------
> --------------------------------------------
> Limit (cost=1174.26..1174.28 rows=5 width=8) (actual
> time=1786.628..1786.675 rows=5 loops=1)
> -> Sort (cost=1174.26..1176.16 rows=758 width=8) (actual
> time=1786.618..1786.634 rows=5 loops=1)
> Sort Key: count(w.watch_list_id)
> -> HashAggregate (cost=1136.11..1138.01 rows=758 width=8)
> (actual time=1718.439..1756.290 rows=7338 loops=1)
> -> Nested Loop (cost=465.98..1132.32 rows=758
> width=8) (actual time=44.372..1292.270 rows=94018 loops=1)
> -> HashAggregate (cost=465.98..465.98 rows=6
> width=4) (actual time=44.332..47.136 rows=599 loops=1)
> -> Index Scan using
> watch_list_element_element_id on watch_list_element
> (cost=0.00..464.15 rows=735 width=4) (actual time=37.957..41.789
> rows=599 loops=1)
> Index Cond: (element_id = 54968)
> -> Index Scan using watch_list_element_pkey on
> watch_list_element w (cost=0.00..109.48 rows=126 width=8) (actual
> time=0.049..0.863 rows=157 loops=599)
> Index Cond: (w.watch_list_id =
> "outer".watch_list_id)
> Total runtime: 1787.466 ms
> (11 rows)
>
> freshports.org=#
>
> > 2) I suspect calculating all possible values would require time and an
> > enormous cache buffer in size as well as re-calculating pretty often.
> > So my approach would be trying to tune the query before introducing
> > cached results.
>
> Yes, it's pretty extensive: for each element (there are 9000), get me
> all the watch lists it appears upon, then get me the top 5 elements
> on *those* watch lists.
> --

So your original query outruns my "EXISTS" version
by a factor of 2. I did not realize the "IN" has
become so much faster.
I am running out of clues.
May be the [PERFORMANCE] list has more
qualified approaches.
Regards, Christoph


From: "Greg Sabino Mullane" <greg(at)turnstep(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Cc: dan(at)langille(dot)org
Subject: Re: people who buy A, also buy C, D, E
Date: 2005-06-25 03:35:40
Message-ID: 62b4e0aae1e53b542337e18ec2b6b4a8@biglumber.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> The goal of my query is: given a book, what did other people who
> bought this book also buy? I plan the list the 5 most popular such
> books.

I've been playing with this a little bit, and I don't think you are
going to get better than you already have. Certainly, the caching
won't work either as any insert into the watch_list_element has
the potential to change a very large number of pre-compiled lists.
However, there are some minor optimizations that can be made to
speed up the existing query quite a bit. One note first: the LIMIT
should be 6 not 5 if you really want the five other "books" and the
book itself will more than likely appear in the list. Picking it
out is something the client app can do.

* Make sure the tables are freshly analyzed. Might want to bump
up the default stats a bit too.

* Looks like you already have indexes on the watch_list_element
table. The watch_list_element_element_id index could be broken
into multiple conditional indexes, but your explain shows this
would not really gain us much:

actual time=37.957..41.789

* One big gain would be to cluster the table on watch_list_id:

CREATE INDEX watch_index ON watch_list_element (watch_list_id);
CLUSTER watch_index ON watch_list_element;

I got about a 25% speedup on my queries by doing this. YMMV, as I
don't know enough about your conditions to do more than make an
approximate test database. But it should help this query out.

* Finally, you should upgrade if at all possible. Going from
7.4.7 to 8.0.1 gave me a 10% speed increase, while going from
8.0.1 to 8.1.0 (e.g. the upcoming version) gave me an additional
25% speed boost, mostly due to the new bitmap stuff. So, making
the jump to 8.0.1 will be good practice for the 8.1.0 jump, right? :)

Overall, I was able to get the query to go about a third faster
than when I started. Hope this helps.

- --
Greg Sabino Mullane greg(at)turnstep(dot)com
PGP Key: 0x14964AC8 200506242328
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

-----BEGIN PGP SIGNATURE-----

iD8DBQFCvNCrvJuQZxSWSsgRAmkDAJ44z/Ei27HuEBqx/htmCRHJZWi8VQCfV2mm
upeE0p3z4h11NJzl5aOqCkc=
=LVqI
-----END PGP SIGNATURE-----


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Greg Sabino Mullane <greg(at)turnstep(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org, dan(at)langille(dot)org
Subject: Re: people who buy A, also buy C, D, E
Date: 2005-06-25 14:00:53
Message-ID: 42BD6395.8070301@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

On 6/24/2005 11:35 PM, Greg Sabino Mullane wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
>
>> The goal of my query is: given a book, what did other people who
>> bought this book also buy? I plan the list the 5 most popular such
>> books.

This sounds very much like one of the congestion problems given in the
TPC-W benchmark. You might want to take a look at some of the published
full disclosure reports or the PHP TPC-W implementation on pgfoundry to
get some hints.

Jan

>
> I've been playing with this a little bit, and I don't think you are
> going to get better than you already have. Certainly, the caching
> won't work either as any insert into the watch_list_element has
> the potential to change a very large number of pre-compiled lists.
> However, there are some minor optimizations that can be made to
> speed up the existing query quite a bit. One note first: the LIMIT
> should be 6 not 5 if you really want the five other "books" and the
> book itself will more than likely appear in the list. Picking it
> out is something the client app can do.
>
> * Make sure the tables are freshly analyzed. Might want to bump
> up the default stats a bit too.
>
> * Looks like you already have indexes on the watch_list_element
> table. The watch_list_element_element_id index could be broken
> into multiple conditional indexes, but your explain shows this
> would not really gain us much:
>
> actual time=37.957..41.789
>
> * One big gain would be to cluster the table on watch_list_id:
>
> CREATE INDEX watch_index ON watch_list_element (watch_list_id);
> CLUSTER watch_index ON watch_list_element;
>
> I got about a 25% speedup on my queries by doing this. YMMV, as I
> don't know enough about your conditions to do more than make an
> approximate test database. But it should help this query out.
>
> * Finally, you should upgrade if at all possible. Going from
> 7.4.7 to 8.0.1 gave me a 10% speed increase, while going from
> 8.0.1 to 8.1.0 (e.g. the upcoming version) gave me an additional
> 25% speed boost, mostly due to the new bitmap stuff. So, making
> the jump to 8.0.1 will be good practice for the 8.1.0 jump, right? :)
>
> Overall, I was able to get the query to go about a third faster
> than when I started. Hope this helps.
>
> - --
> Greg Sabino Mullane greg(at)turnstep(dot)com
> PGP Key: 0x14964AC8 200506242328
> http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
>
> -----BEGIN PGP SIGNATURE-----
>
> iD8DBQFCvNCrvJuQZxSWSsgRAmkDAJ44z/Ei27HuEBqx/htmCRHJZWi8VQCfV2mm
> upeE0p3z4h11NJzl5aOqCkc=
> =LVqI
> -----END PGP SIGNATURE-----
>
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo(at)postgresql(dot)org

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: PFC <lists(at)boutiquenumerique(dot)com>
To: "Jan Wieck" <JanWieck(at)yahoo(dot)com>, "Greg Sabino Mullane" <greg(at)turnstep(dot)com>
Cc: pgsql-sql(at)postgresql(dot)org, dan(at)langille(dot)org
Subject: Re: people who buy A, also buy C, D, E
Date: 2005-06-26 16:25:40
Message-ID: op.sszjw2thth1vuj@localhost
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql

>>> The goal of my query is: given a book, what did other people who
>>> bought this book also buy? I plan the list the 5 most popular such
>>> books.

You can use the table listing ordered products directly, for example :

table ordered_products: order_id, product_id, quantity

SELECT b.product_id, sum(quantity) as rank FROM ordered_products a,
ordered_products b WHERE a.product_id=(the product id) AND
b.order_id=a.order_id AND b.product_id != a.product_id GROUP BY
b.product_id ORDER BY rank DESC LIMIT 6;

This will need indexes on order_id and product_id that you probably
already have.
It will also be slow.

You can also have a cache table :

cache prod_id_a, prod_id_b, quantity
With a constraint that prod_id_a < prod_id_b

You add a trigger on insert, update or delete to ordered_products to
insert or update rows in this table, modifying the quantity according to
the purchase.

To select you do :

SELECT * FROM
(
(SELECT prod_id_b as pid, quantity FROM cache WHERE prod_id_a=(your id)
ORDER BY prod_id_a DESC, quantity DESC LIMIT 5)
UNION ALL
(SELECT prod_id_a as pid, quantity FROM cache WHERE prod_id_b=(your id)
ORDER BY prod_id_b DESC, quantity DESC LIMIT 5)
) as foo
ORDER BY quantity DESC
LIMIT 5;

It will be probably very fast but the table will grow huge and need
various indexes :
(prod_id_a, quantity)
(prod_id_b quantity)
(prod_id_a, prod_id_b) (the primary key)

You'll get 1/2 * N * (N-1) rows, N being the number of products on your
site. If you remove the constraint prod_id_a < prod_id_b
you'll get N^2 rows which is worse.

Another solution :

Table cache : product_id integer, also_purchased integer[]

After every order, update also_purchased with the results of the query
using the self join on ordered_products tables above.
This query should not be fast enough to use in a product webpage but it
shouldn't be slow enough to be used like thi, only when orders are made.

To get the "also purchased products" all you have to do is read a line in
this table.


From: "Greg Sabino Mullane" <greg(at)turnstep(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: people who buy A, also buy C, D, E
Date: 2005-06-28 23:53:39
Message-ID: c5c5417bcc537847c105ade5e360341c@biglumber.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-sql


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>>> The goal of my query is: given a book, what did other people who
>>> bought this book also buy? I plan the list the 5 most popular such
>>> books.
>
> SELECT b.product_id, sum(quantity) as rank FROM ordered_products a,
> ordered_products b WHERE a.product_id=(the product id) AND
> b.order_id=a.order_id AND b.product_id != a.product_id GROUP BY
> b.product_id ORDER BY rank DESC LIMIT 6;

I don't think this is exactly what the original poster had in mind:
we want a ranking of a dynamically generated subset of all possible
products (e.g. books). So if someone buys "Harry Potter and the Proprietary
Database", then only the books bought by people who also bought /that/
book are considered, ranked, and ordered. There's not a lot of caching that
can be effectively done, due to the high number of combinations and large
potential for change.

> table ordered_products: order_id, product_id, quantity

I'm not sure where you are getting "quantity" from: as near as I
can tell, this will always be a quantity of 1: one person ordering
one item.

- --
Greg Sabino Mullane greg(at)turnstep(dot)com
PGP Key: 0x14964AC8 200506281946
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

-----BEGIN PGP SIGNATURE-----

iD8DBQFCweKavJuQZxSWSsgRAkmHAJ9fQ+Degs6jSrGRozEoI35F8nlyBACfYm2u
QgawxHOij5FHVd0FopW25IU=
=r5eo
-----END PGP SIGNATURE-----