Query improvement

From: Mark <Marek(dot)Balgar(at)seznam(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Query improvement
Date: 2011-05-01 10:23:52
Message-ID: 1304245432864-4362578.post@n5.nabble.com
Lists: pgsql-performance

Hi, I have 3 tables:
page - revision - pagecontent

CREATE TABLE mediawiki.page
(
page_id serial NOT NULL,
page_namespace smallint NOT NULL,
page_title text NOT NULL,
page_restrictions text,
page_counter bigint NOT NULL DEFAULT 0,
page_is_redirect smallint NOT NULL DEFAULT 0,
page_is_new smallint NOT NULL DEFAULT 0,
page_random numeric(15,14) NOT NULL DEFAULT random(),
page_touched timestamp with time zone,
page_latest integer NOT NULL,
page_len integer NOT NULL,
titlevector tsvector,
page_type integer NOT NULL DEFAULT 0,
CONSTRAINT page_pkey PRIMARY KEY (page_id)
);

CREATE TABLE mediawiki.revision
(
rev_id serial NOT NULL,
rev_page integer,
rev_text_id integer,
rev_comment text,
rev_user integer NOT NULL,
rev_user_text text NOT NULL,
rev_timestamp timestamp with time zone NOT NULL,
rev_minor_edit smallint NOT NULL DEFAULT 0,
rev_deleted smallint NOT NULL DEFAULT 0,
rev_len integer,
rev_parent_id integer,
CONSTRAINT revision_rev_page_fkey FOREIGN KEY (rev_page)
REFERENCES mediawiki.page (page_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE CASCADE,
CONSTRAINT revision_rev_id_key UNIQUE (rev_id)
);

CREATE TABLE mediawiki.pagecontent
(
old_id integer NOT NULL DEFAULT
nextval('mediawiki.text_old_id_seq'::regclass),
old_text text,
old_flags text,
textvector tsvector,
CONSTRAINT pagecontent_pkey PRIMARY KEY (old_id)
);

where I have this query:
SELECT pa.page_id, pa.page_title,
ts_rank(pc.textvector,(to_tsquery('fotbal')))+ts_rank(pa.titlevector,(to_tsquery('fotbal')))*10
as totalrank
from mediawiki.page pa, mediawiki.revision re, mediawiki.pagecontent pc
WHERE pa.page_id in
(SELECT page_id FROM mediawiki.page WHERE page_id IN
(SELECT page_id FROM mediawiki.page
WHERE (titlevector @@ (to_tsquery('fotbal'))))
OR page_id IN
(SELECT p.page_id from mediawiki.page p,mediawiki.revision r,
(SELECT old_id FROM mediawiki.pagecontent
WHERE (textvector @@ (to_tsquery('fotbal')))) ss
WHERE (p.page_id=r.rev_page AND r.rev_id=ss.old_id)))
AND (pa.page_id=re.rev_page AND re.rev_id=pc.old_id)
ORDER BY totalrank LIMIT 100;

This query finds page titles in page and page contents in pagecontent by
full-text search (@@); afterwards, for the resulting ids, I compute the
relevance with ts_rank.
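
For reference, the plan below also uses a few indexes that are not shown in
the DDL above; presumably they are defined roughly like this (index names
taken from the plan, exact definitions inferred rather than confirmed):

-- approximate index definitions inferred from the query plan below
CREATE INDEX gin_index  ON mediawiki.page USING gin (titlevector);
CREATE INDEX gin_index2 ON mediawiki.pagecontent USING gin (textvector);
CREATE INDEX page_btree_id ON mediawiki.page (page_id);
CREATE INDEX page_btree_rev_page_id ON mediawiki.revision (rev_page);
CREATE INDEX page_btree_rev_content_id ON mediawiki.revision (rev_id);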

Now the problem.
When I run EXPLAIN it shows:
"Limit (cost=136568.00..136568.25 rows=100 width=185)"
" -> Sort (cost=136568.00..137152.26 rows=233703 width=185)"
" Sort Key: ((ts_rank(pc.textvector, to_tsquery('fotbal'::text)) +
(ts_rank(pa.titlevector, to_tsquery('fotbal'::text)) * 10::double
precision)))"
" -> Hash Join (cost=61707.99..127636.04 rows=233703 width=185)"
" Hash Cond: (re.rev_id = pc.old_id)"
" -> Merge Join (cost=24098.90..71107.48 rows=233703
width=66)"
" Merge Cond: (pa.page_id = re.rev_page)"
" -> Merge Semi Join (cost=24096.98..55665.69
rows=233703 width=66)"
" Merge Cond: (pa.page_id =
mediawiki.page.page_id)"
" -> Index Scan using page_btree_id on page pa
(cost=0.00..13155.20 rows=311604 width=62)"
" -> Index Scan using page_btree_id on page
(cost=24096.98..38810.19 rows=233703 width=4)"
" Filter: ((hashed SubPlan 1) OR (hashed
SubPlan 2))"
" SubPlan 1"
" -> Bitmap Heap Scan on page
(cost=10.41..900.33 rows=270 width=4)"
" Recheck Cond: (titlevector @@
to_tsquery('fotbal'::text))"
" -> Bitmap Index Scan on gin_index
(cost=0.00..10.34 rows=270 width=0)"
" Index Cond: (titlevector @@
to_tsquery('fotbal'::text))"
" SubPlan 2"
" -> Nested Loop (cost=1499.29..23192.08
rows=1558 width=4)"
" -> Nested Loop
(cost=1499.29..15967.11 rows=1558 width=4)"
" -> Bitmap Heap Scan on
pagecontent (cost=1499.29..6448.12 rows=1558 width=4)"
" Recheck Cond:
(textvector @@ to_tsquery('fotbal'::text))"
" -> Bitmap Index Scan
on gin_index2 (cost=0.00..1498.90 rows=1558 width=0)"
" Index Cond:
(textvector @@ to_tsquery('fotbal'::text))"
" -> Index Scan using
page_btree_rev_content_id on revision r (cost=0.00..6.10 rows=1 width=8)"
" Index Cond: (r.rev_id =
pagecontent.old_id)"
" -> Index Scan using page_btree_id
on page p (cost=0.00..4.62 rows=1 width=4)"
" Index Cond: (p.page_id =
r.rev_page)"
" -> Index Scan using page_btree_rev_page_id on revision
re (cost=0.00..11850.52 rows=311604 width=8)"
" -> Hash (cost=27932.04..27932.04 rows=311604 width=127)"
" -> Seq Scan on pagecontent pc (cost=0.00..27932.04
rows=311604 width=127)"

Is there some possibility to speed up the hash join, which takes a lot of time?
I have tried to find a solution, but without success.
Thanks a lot.


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Mark <Marek(dot)Balgar(at)seznam(dot)cz>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Query improvement
Date: 2011-05-02 07:58:40
Message-ID: BANLkTinqa_YBD+qhuNqta=M8rqaZNFtE5g@mail.gmail.com
Lists: pgsql-performance

On Sun, May 1, 2011 at 12:23 PM, Mark <Marek(dot)Balgar(at)seznam(dot)cz> wrote:
> Now the problem.
> When I try ANALYZE it shows:

That's a regular explain... can you post an EXPLAIN ANALYZE?

Hash joins are very inefficient if they require big temporary files.
I usually work around that by disabling hash joins for the problematic queries:

set enable_hashjoin = false;
<query>
set enable_hashjoin = true;

But an explain analyze would confirm or deny that theory.
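
If you only want it to affect that one query, SET LOCAL inside a transaction
also works and resets itself automatically at commit/rollback, e.g.:

BEGIN;
SET LOCAL enable_hashjoin = off;   -- scoped to this transaction only
<query>
COMMIT;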


From: Mark <Marek(dot)Balgar(at)seznam(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Query improvement
Date: 2011-05-02 20:54:06
Message-ID: 1304369646236-4365717.post@n5.nabble.com
Lists: pgsql-performance

Here is EXPLAIN ANALYZE:

"Limit (cost=136568.00..136568.25 rows=100 width=185) (actual
time=1952.174..1952.215 rows=100 loops=1)"
" -> Sort (cost=136568.00..137152.26 rows=233703 width=185) (actual
time=1952.172..1952.188 rows=100 loops=1)"
" Sort Key: ((ts_rank(pc.textvector, to_tsquery('fotbal'::text)) +
(ts_rank(pa.titlevector, to_tsquery('fotbal'::text)) * 10::double
precision)))"
" Sort Method: top-N heapsort Memory: 23kB"
" -> Hash Join (cost=61707.99..127636.04 rows=233703 width=185)
(actual time=1046.838..1947.815 rows=3278 loops=1)"
" Hash Cond: (re.rev_id = pc.old_id)"
" -> Merge Join (cost=24098.90..71107.48 rows=233703
width=66) (actual time=200.884..859.453 rows=3278 loops=1)"
" Merge Cond: (pa.page_id = re.rev_page)"
" -> Merge Semi Join (cost=24096.98..55665.69
rows=233703 width=66) (actual time=200.843..629.821 rows=3278 loops=1)"
" Merge Cond: (pa.page_id =
mediawiki.page.page_id)"
" -> Index Scan using page_btree_id on page pa
(cost=0.00..13155.20 rows=311604 width=62) (actual time=0.027..145.989
rows=311175 loops=1)"
" -> Index Scan using page_btree_id on page
(cost=24096.98..38810.19 rows=233703 width=4) (actual time=200.779..429.219
rows=3278 loops=1)"
" Filter: ((hashed SubPlan 1) OR (hashed
SubPlan 2))"
" SubPlan 1"
" -> Bitmap Heap Scan on page
(cost=10.41..900.33 rows=270 width=4) (actual time=0.748..9.845 rows=280
loops=1)"
" Recheck Cond: (titlevector @@
to_tsquery('fotbal'::text))"
" -> Bitmap Index Scan on gin_index
(cost=0.00..10.34 rows=270 width=0) (actual time=0.586..0.586 rows=280
loops=1)"
" Index Cond: (titlevector @@
to_tsquery('fotbal'::text))"
" SubPlan 2"
" -> Nested Loop (cost=1499.29..23192.08
rows=1558 width=4) (actual time=2.032..185.743 rows=3250 loops=1)"
" -> Nested Loop
(cost=1499.29..15967.11 rows=1558 width=4) (actual time=1.980..109.491
rows=3250 loops=1)"
" -> Bitmap Heap Scan on
pagecontent (cost=1499.29..6448.12 rows=1558 width=4) (actual
time=1.901..36.583 rows=3250 loops=1)"
" Recheck Cond:
(textvector @@ to_tsquery('fotbal'::text))"
" -> Bitmap Index Scan
on gin_index2 (cost=0.00..1498.90 rows=1558 width=0) (actual
time=1.405..1.405 rows=3250 loops=1)"
" Index Cond:
(textvector @@ to_tsquery('fotbal'::text))"
" -> Index Scan using
page_btree_rev_content_id on revision r (cost=0.00..6.10 rows=1 width=8)
(actual time=0.020..0.021 rows=1 loops=3250)"
" Index Cond: (r.rev_id =
pagecontent.old_id)"
" -> Index Scan using page_btree_id
on page p (cost=0.00..4.62 rows=1 width=4) (actual time=0.022..0.022 rows=1
loops=3250)"
" Index Cond: (p.page_id =
r.rev_page)"
" -> Index Scan using page_btree_rev_page_id on revision
re (cost=0.00..11850.52 rows=311604 width=8) (actual time=0.012..166.042
rows=311175 loops=1)"
" -> Hash (cost=27932.04..27932.04 rows=311604 width=127)
(actual time=801.000..801.000 rows=311604 loops=1)"
" Buckets: 1024 Batches: 64 Memory Usage: 744kB"
" -> Seq Scan on pagecontent pc (cost=0.00..27932.04
rows=311604 width=127) (actual time=0.018..465.686 rows=311604 loops=1)"
"Total runtime: 1952.962 ms"

I have tried
set enable_hashjoin = false;
<query>
set enable_hashjoin = true;

but the result was worse than before. By the way, is there a possibility to
create a better query with the same effect?
I have tried more queries, but this one has the best performance so far.



From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Mark <Marek(dot)Balgar(at)seznam(dot)cz>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: Query improvement
Date: 2011-05-03 07:21:56
Message-ID: BANLkTinUPSdPC1K_ziYaPM=XLtEkqcn-cA@mail.gmail.com
Lists: pgsql-performance

On Mon, May 2, 2011 at 10:54 PM, Mark <Marek(dot)Balgar(at)seznam(dot)cz> wrote:
> but the result have been worst than before. By the way is there a posibility
> to create beeter query with same effect?
> I have tried more queries, but this has got best performance yet.

Well, this seems to be the worst part:

(SELECT page_id FROM mediawiki.page WHERE page_id IN
(SELECT page_id FROM mediawiki.page
WHERE (titlevector @@ (to_tsquery('fotbal'))))
OR page_id IN
(SELECT p.page_id from mediawiki.page p,mediawiki.revision r,
(SELECT old_id FROM mediawiki.pagecontent
WHERE (textvector @@ (to_tsquery('fotbal')))) ss
WHERE (p.page_id=r.rev_page AND r.rev_id=ss.old_id)))

If you're running a new enough pg (8.4+), you could try using CTEs for that.

I haven't used CTEs much, but I think it goes something like:

WITH someids AS (

(SELECT page_id FROM mediawiki.page WHERE page_id IN
(SELECT page_id FROM mediawiki.page
WHERE (titlevector @@ (to_tsquery('fotbal'))))
OR page_id IN
(SELECT p.page_id from mediawiki.page p,mediawiki.revision r,
(SELECT old_id FROM mediawiki.pagecontent
WHERE (textvector @@ (to_tsquery('fotbal')))) ss
WHERE (p.page_id=r.rev_page AND r.rev_id=ss.old_id)))

)
SELECT pa.page_id, pa.page_title,
ts_rank(pc.textvector,(to_tsquery('fotbal')))+ts_rank(pa.titlevector,(to_tsquery('fotbal')))*10
as totalrank
from mediawiki.page pa, mediawiki.revision re, mediawiki.pagecontent pc
WHERE pa.page_id IN (SELECT page_id FROM someids)
AND (pa.page_id=re.rev_page AND re.rev_id=pc.old_id)
ORDER BY totalrank LIMIT 100;


From: "Marc Mamin" <M(dot)Mamin(at)intershop(dot)de>
To: "Claudio Freire" <klaussfreire(at)gmail(dot)com>, "Mark" <Marek(dot)Balgar(at)seznam(dot)cz>
Cc: <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Query improvement
Date: 2011-05-03 09:14:01
Message-ID: C4DAC901169B624F933534A26ED7DF310861AE8B@JENMAIL01.ad.intershop.net
Lists: pgsql-performance


> On Mon, May 2, 2011 at 10:54 PM, Mark <Marek(dot)Balgar(at)seznam(dot)cz> wrote:
> > but the result was worse than before. By the way, is there a possibility
> > to create a better query with the same effect?
> > I have tried more queries, but this one has the best performance so far.
>
> Well, this seems to be the worst part:
>
> (SELECT page_id FROM mediawiki.page WHERE page_id IN
> (SELECT page_id FROM mediawiki.page
> WHERE (titlevector @@ (to_tsquery('fotbal'))))
> OR page_id IN
> (SELECT p.page_id from mediawiki.page p,mediawiki.revision r,
> (SELECT old_id FROM mediawiki.pagecontent
> WHERE (textvector @@ (to_tsquery('fotbal')))) ss
> WHERE (p.page_id=r.rev_page AND r.rev_id=ss.old_id)))
>

'OR' conditions often generate complicated plans. You should try to rewrite
your query with a UNION clause.
Using explicit joins may also help the planner:

SELECT page_id
FROM mediawiki.page
WHERE (titlevector @@ (to_tsquery('fotbal')))

UNION

SELECT p.page_id
FROM mediawiki.page p
JOIN mediawiki.revision r on (p.page_id=r.rev_page)
JOIN mediawiki.pagecontent ss on (r.rev_id=ss.old_id)
WHERE (ss.textvector @@ (to_tsquery('fotbal')))
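
Folding that UNION back into your full ranking query could then look
something like this (untested sketch, reusing the rank expression and LIMIT
from your original query):

SELECT pa.page_id, pa.page_title,
       ts_rank(pc.textvector, to_tsquery('fotbal'))
         + ts_rank(pa.titlevector, to_tsquery('fotbal')) * 10 AS totalrank
FROM mediawiki.page pa
JOIN mediawiki.revision re ON (pa.page_id = re.rev_page)
JOIN mediawiki.pagecontent pc ON (re.rev_id = pc.old_id)
WHERE pa.page_id IN (
        -- candidate ids found by title OR by content, deduplicated by UNION
        SELECT page_id
        FROM mediawiki.page
        WHERE titlevector @@ to_tsquery('fotbal')
        UNION
        SELECT p.page_id
        FROM mediawiki.page p
        JOIN mediawiki.revision r ON (p.page_id = r.rev_page)
        JOIN mediawiki.pagecontent ss ON (r.rev_id = ss.old_id)
        WHERE ss.textvector @@ to_tsquery('fotbal')
      )
ORDER BY totalrank LIMIT 100;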

HTH,

Marc Mamin


From: Mark <Marek(dot)Balgar(at)seznam(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Query improvement
Date: 2011-05-07 10:37:02
Message-ID: 1304764622535-4378157.post@n5.nabble.com
Lists: pgsql-performance

Thanks for the reply; both the UNION and the JOINs helped. Mainly the UNION
helped a lot. Now the query takes 1 sec max. Thanks a lot.



From: Mark <Marek(dot)Balgar(at)seznam(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Query improvement
Date: 2011-05-07 10:38:09
Message-ID: 1304764689919-4378160.post@n5.nabble.com
Lists: pgsql-performance

Thanks a lot for the reply. Finally I used UNION, but thanks for your help.



From: Mark <Marek(dot)Balgar(at)seznam(dot)cz>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: Query improvement
Date: 2011-05-07 10:39:52
Message-ID: 1304764792758-4378163.post@n5.nabble.com
Lists: pgsql-performance

Thanks for the replies. Finally I used UNION and JOINs, which helped. Mainly
the UNION helped a lot. Now the query takes 1 sec max. Thanks a lot.



From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Claudio Freire <klaussfreire(at)gmail(dot)com>
Cc: Mark <Marek(dot)Balgar(at)seznam(dot)cz>, pgsql-performance(at)postgresql(dot)org
Subject: Re: Query improvement
Date: 2011-05-14 02:28:51
Message-ID: BANLkTimV2Y8uT6SjsvZhR1kZ5Pf8+HPtYg@mail.gmail.com
Lists: pgsql-performance

On Mon, May 2, 2011 at 3:58 AM, Claudio Freire <klaussfreire(at)gmail(dot)com> wrote:
> Hash joins are very inefficient if they require big temporary files.

Hmm, that's not been my experience. What have you seen?

I've seen a 64-batch hash join beat out a
nested-loop-with-inner-indexscan, which I never woulda believed,
but...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company