Re: combined indexes with Gist - planner issues?

From: Hans-Juergen Schoenig -- PostgreSQL <postgres(at)cybertec(dot)at>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Zoltan Boszormenyi <zb(at)cybertec(dot)at>
Subject: combined indexes with Gist - planner issues?
Date: 2009-08-31 12:08:25
Message-ID: 4A9BBD39.2010500@cybertec.at
Lists: pgsql-hackers

hello everybody,

we are seriously fighting with some planner issue which seems to be
slightly obscure to us.
we have a table which is nicely indexed (several GB in size).
i am using btree_gist operator classes to build a combined index including
an FTI expression along with a numeric column:

db=# \d product.t_product
                                   Table "product.t_product"
        Column         |     Type      |                           Modifiers
-----------------------+---------------+----------------------------------------------------------------
 id                    | bigint        | not null default nextval('product.t_product_id_seq'::regclass)
 shop_id               | integer       |
 art_number            | text          |
 title                 | text          |
 description           | text          |
 display_price         | numeric(10,4) |
Indexes:
    "t_product_pkey" PRIMARY KEY, btree (id)
    "idx_test" gist (display_price, to_tsvector('german'::regconfig, (title || ' '::text) || description))
    "idx_test2" gist (to_tsvector('german'::regconfig, (title || ' '::text) || description), display_price)
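
(roughly the DDL behind those two combined indexes, reconstructed from the \d
output above as a sketch - the gist opclass for the numeric column comes from
the btree_gist contrib module):

CREATE INDEX idx_test ON product.t_product
    USING gist (display_price, to_tsvector('german'::regconfig, (title || ' '::text) || description));

CREATE INDEX idx_test2 ON product.t_product
    USING gist (to_tsvector('german'::regconfig, (title || ' '::text) || description), display_price);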

what we basically expected here is that Postgres will scan the table
using the index to give us the cheapest products containing the words we
are looking for.
i am totally surprised to see that we have to fetch all products given
the words, sort and then do the limit.
this totally kills performance because some words simply show up
millions of times.

the plans look like this:

db=# explain analyze SELECT art_number, title
         FROM product.t_product
         WHERE to_tsvector('german'::regconfig, (title || ' '::text) || description) @@ plainto_tsquery('harddisk')
         ORDER BY display_price
         LIMIT 10;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=108340.08..108340.10 rows=10 width=54) (actual time=1328.900..1328.909 rows=10 loops=1)
   ->  Sort  (cost=108340.08..108422.48 rows=32961 width=54) (actual time=1328.899..1328.905 rows=10 loops=1)
         Sort Key: display_price
         Sort Method:  top-N heapsort  Memory: 18kB
         ->  Bitmap Heap Scan on t_product  (cost=2716.62..107627.80 rows=32961 width=54) (actual time=1052.706..1328.772 rows=55 loops=1)
               Recheck Cond: (to_tsvector('german'::regconfig, ((title || ' '::text) || description)) @@ plainto_tsquery('harddisk'::text))
               ->  Bitmap Index Scan on idx_test2  (cost=0.00..2708.38 rows=32961 width=0) (actual time=1052.576..1052.576 rows=55 loops=1)
                     Index Cond: (to_tsvector('german'::regconfig, ((title || ' '::text) || description)) @@ plainto_tsquery('harddisk'::text))
 Total runtime: 1328.942 ms
(9 rows)

runtime degrades badly as the search words become more common ...

db=# explain analyze SELECT art_number, title
         FROM product.t_product
         WHERE to_tsvector('german'::regconfig, (title || ' '::text) || description) @@ plainto_tsquery('spiel')
         ORDER BY display_price
         LIMIT 10;
                                                                   QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=108340.08..108340.10 rows=10 width=54) (actual time=33489.675..33489.682 rows=10 loops=1)
   ->  Sort  (cost=108340.08..108422.48 rows=32961 width=54) (actual time=33489.675..33489.675 rows=10 loops=1)
         Sort Key: display_price
         Sort Method:  top-N heapsort  Memory: 18kB
         ->  Bitmap Heap Scan on t_product  (cost=2716.62..107627.80 rows=32961 width=54) (actual time=774.923..33408.522 rows=56047 loops=1)
               Recheck Cond: (to_tsvector('german'::regconfig, ((title || ' '::text) || description)) @@ plainto_tsquery('spiel'::text))
               ->  Bitmap Index Scan on idx_test2  (cost=0.00..2708.38 rows=32961 width=0) (actual time=759.078..759.078 rows=56047 loops=1)
                     Index Cond: (to_tsvector('german'::regconfig, ((title || ' '::text) || description)) @@ plainto_tsquery('spiel'::text))
 Total runtime: 33489.906 ms
(9 rows)

i am wondering why postgres is not able to use a combined index here?
is this some obscure thing related to gist, a logical problem or a
planner deficiency?

ideas are welcome.

many thanks,

hans

--
Cybertec Schoenig & Schoenig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
Web: www.postgresql-support.de


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hans-Juergen Schoenig -- PostgreSQL <postgres(at)cybertec(dot)at>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Zoltan Boszormenyi <zb(at)cybertec(dot)at>
Subject: Re: combined indexes with Gist - planner issues?
Date: 2009-08-31 13:54:01
Message-ID: 543.1251726841@sss.pgh.pa.us
Lists: pgsql-hackers

Hans-Juergen Schoenig -- PostgreSQL <postgres(at)cybertec(dot)at> writes:
> what we basically expected here is that Postgres will scan the table
> using the index to give us the cheapest products containing the words we
> are looking for.
> i am totally surprised to see that we have to fetch all products given
> the words, sort and then do the limit.

I don't know why you'd find that surprising. GIST indexes have no
support for ordering.

regards, tom lane


From: Hans-Juergen Schoenig -- PostgreSQL <postgres(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Zoltan Boszormenyi <zb(at)cybertec(dot)at>
Subject: Re: combined indexes with Gist - planner issues?
Date: 2009-08-31 14:06:22
Message-ID: 4A9BD8DE.3060005@cybertec.at
Lists: pgsql-hackers

Tom Lane wrote:
> Hans-Juergen Schoenig -- PostgreSQL <postgres(at)cybertec(dot)at> writes:
>
>> what we basically expected here is that Postgres will scan the table
>> using the index to give us the cheapest products containing the words we
>> are looking for.
>> i am totally surprised to see that we have to fetch all products given
>> the words, sort and then do the limit.
>>
>
> I don't know why you'd find that surprising. GIST indexes have no
> support for ordering.
>
> regards, tom lane
>
>

ok, i thought it would be something gist specific i was not aware of.
the golden question now is: i am looking for the cheapest products given
a certain text in an insane amount of data.
how to do it? other quals which could narrow down the amount of data
would not help.

i cannot see an option with regular "weapons" ...
maybe you have an idea how to fix core to make it work? maybe there is a
mechanism we could need.
we really have to make this work - no matter what it takes.
we are willing to put effort into that.

many thanks,

hans

--
Cybertec Schoenig & Schoenig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
Web: www.postgresql-support.de


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Hans-Juergen Schoenig -- PostgreSQL <postgres(at)cybertec(dot)at>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Zoltan Boszormenyi <zb(at)cybertec(dot)at>
Subject: Re: combined indexes with Gist - planner issues?
Date: 2009-08-31 14:24:41
Message-ID: 20090831142441.GA30008@svana.org
Lists: pgsql-hackers

On Mon, Aug 31, 2009 at 04:06:22PM +0200, Hans-Juergen Schoenig -- PostgreSQL wrote:
> ok, i thought it would be something gist specific i was not aware of.
> the golden question now is: i am looking for the cheapest products given
> a certain text in an insane amount of data.
> how to do it? other quals which could narrow down the amount of data
> would not help.
>
> i cannot see an option with regular "weapons" ...
> maybe you have an idea how to fix core to make it work? maybe there is a
> mechanism we could need.
> we really have to make this work - no matter what it takes.
> we are willing to put effort into that.

The way I usually attack such a problem is to think of a data
structure+algorithm that could produce the output you want. Once you've
got that it's usually clear how you can make postgres do it and what
changes would need to be made.

At first glance I don't see any nice data structure specific for your
problem. But it occurs to me that maybe you could just have a (btree)
index on the price and just scan in ascending order until you have
enough records. Expensive if the first record is expensive.
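
A rough sketch of that variant (the index name is just an example; whether
the planner actually picks the ordered index scan depends on its cost
estimates):

CREATE INDEX idx_product_price ON product.t_product (display_price);

-- the planner can then walk this index in price order and stop after 10
-- matching rows; if the search word is rare it may have to skip over a lot
-- of non-matching rows first
SELECT art_number, title
FROM product.t_product
WHERE to_tsvector('german'::regconfig, (title || ' '::text) || description)
      @@ plainto_tsquery('harddisk')
ORDER BY display_price
LIMIT 10;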

Another possibility is to change your query to use the price in the
GiST index: execute multiple queries of the form:

... AND display_price >= 0.01 and display_price < 1;
... AND display_price >= 1 and display_price < 10;

Because you match fewer records the sort won't be so expensive and you
can stop once you have enough records.
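
Spelled out, one band of that rewrite could look like this (the band
boundaries are only an example); the application keeps issuing the query for
the next band until it has collected 10 rows:

SELECT art_number, title
FROM product.t_product
WHERE to_tsvector('german'::regconfig, (title || ' '::text) || description)
      @@ plainto_tsquery('harddisk')
  AND display_price >= 0.01 AND display_price < 1
ORDER BY display_price
LIMIT 10;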

Hope this helps,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.


From: Hans-Juergen Schoenig -- PostgreSQL <postgres(at)cybertec(dot)at>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Zoltan Boszormenyi <zb(at)cybertec(dot)at>
Subject: Re: combined indexes with Gist - planner issues?
Date: 2009-08-31 14:40:21
Message-ID: 4A9BE0D5.7060106@cybertec.at
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Mon, Aug 31, 2009 at 04:06:22PM +0200, Hans-Juergen Schoenig -- PostgreSQL wrote:
>
>> ok, i thought it would be something gist specific i was not aware of.
>> the golden question now is: i am looking for the cheapest products given
>> a certain text in an insane amount of data.
>> how to do it? other quals which could narrow down the amount of data
>> would not help.
>>
>> i cannot see an option with regular "weapons" ...
>> maybe you have an idea how to fix core to make it work? maybe there is a
>> mechanism we could need.
>> we really have to make this work - no matter what it takes.
>> we are willing to put effort into that.
>>
>
> The way I usually attack such a problem is to think of a data
> structure+algorithm that could produce the output you want. Once you've
> got that it's usually clear how you can make postgres do it and what
> changes would need to be made.
>
> At first glance I don't see any nice data structure specific for your
> problem. But it occurs to me that maybe you could just have a (btree)
> index on the price and just scan in ascending order until you have
> enough records. Expensive if the first record is expensive.
>
> Another possibility is to change your query to use the price in the
> GiST index: execute multiple queries of the form:
>
> ... AND display_price >= 0.01 and display_price < 1;
> ... AND display_price >= 1 and display_price < 10;
>
>

hello ...

i had a similar idea here but the problem is: prices pretty much depend on
the products.
to take a critical example: "book" is a horribly frequent word and you will
find far too many of them in a very narrow price range.
using a price index alone is not a good idea. how many products which cost
USD 9.95 do you know and how many of them are books? :(
i did some experiments with PL/proxy to scale out a little and i wrote some
C code to explicitly cache data from the start and so on.
this is all shit, however - it is too much data and i have too many requests.
i don't want to fall back to some java-based stuff such as solr. it would
totally ruin my credibility and the standing postgres has at this customer.
whatever it takes - a PG based solution has to be found and implemented.

my knowledge of how gist works internally is not too extensive. any
"kickstart" idea would be appreciated.

many thanks,

hans

--
Cybertec Schoenig & Schoenig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
Web: www.postgresql-support.de


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Hans-Juergen Schoenig -- PostgreSQL <postgres(at)cybertec(dot)at>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Zoltan Boszormenyi <zb(at)cybertec(dot)at>
Subject: Re: combined indexes with Gist - planner issues?
Date: 2009-08-31 15:27:18
Message-ID: 4A9BEBD6.60407@enterprisedb.com
Lists: pgsql-hackers

Hans-Juergen Schoenig -- PostgreSQL wrote:
> my knowledge of how gist works internally is not too extensive. any
> "kickstart" idea would be appreciated.

If there's not too many of those common words, you can create a simple
partial b-tree index for each, and handle the less common words with the
gist index you have (you can drop the display_price column from the index).
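
For one of the common words that could look roughly like this (the index
name is assumed; the two-argument form of plainto_tsquery is immutable, so
it can appear in the index predicate):

CREATE INDEX idx_price_spiel ON product.t_product (display_price)
    WHERE to_tsvector('german'::regconfig, (title || ' '::text) || description)
          @@ plainto_tsquery('german'::regconfig, 'spiel');

A query whose WHERE clause matches that predicate can then be answered by
scanning this index in price order and stopping after ten rows.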

Another idea:

Create a table containing one row for each word in each product:

CREATE TABLE t_product_word (id bigint, word text, display_price
numeric(10,4));

with triggers to keep it up-to-date. You can then create a regular two
column b-tree index on that:

CREATE INDEX idx_word_price ON t_product_word (word, display_price);

And query with:

SELECT p.art_number, p.title
FROM t_product p INNER JOIN t_product_word pw ON p.id = pw.id
WHERE pw.word = 'harddisk'
ORDER BY pw.display_price LIMIT 10;

The t_product_word table will be huge, but with a few gigabytes of data
it should still be manageable.
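
A minimal sketch of such a maintenance trigger (names and the naive word
splitting are just assumptions; it uses 8.4's regexp_split_to_table rather
than the tsearch parser):

CREATE OR REPLACE FUNCTION product.sync_product_words() RETURNS trigger AS $$
BEGIN
    -- throw away the old word rows for this product and rebuild them
    DELETE FROM t_product_word WHERE id = NEW.id;
    INSERT INTO t_product_word (id, word, display_price)
    SELECT DISTINCT NEW.id, w, NEW.display_price
    FROM regexp_split_to_table(
             lower(coalesce(NEW.title, '') || ' ' || coalesce(NEW.description, '')),
             E'\\W+') AS w
    WHERE w <> '';
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_sync_product_words
    AFTER INSERT OR UPDATE ON product.t_product
    FOR EACH ROW EXECUTE PROCEDURE product.sync_product_words();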

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Hans-Juergen Schoenig -- PostgreSQL <postgres(at)cybertec(dot)at>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Zoltan Boszormenyi <zb(at)cybertec(dot)at>
Subject: Re: combined indexes with Gist - planner issues?
Date: 2009-08-31 15:34:18
Message-ID: 4A9BED7A.5000202@cybertec.at
Lists: pgsql-hackers

hello ...

we did some experiments with such a table.
the problem is if you want to allow arbitrary combinations of words, which
FTI models perfectly.
you would instantly end up with a self join across 5 relations or so -
which is again bad.
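
for example, a (hypothetical) two-word search against such a word table
already looks like this, and every additional term adds another join:

SELECT p.art_number, p.title
FROM product.t_product p
    JOIN t_product_word w1 ON w1.id = p.id AND w1.word = 'spiel'
    JOIN t_product_word w2 ON w2.id = p.id AND w2.word = 'karten'
ORDER BY p.display_price
LIMIT 10;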

there are too many common words to consider handling them partly with gist
and partly with a btree.

is there any option to adapt gist in a way that a combined index would
make sense here?

many thanks,

hans

Heikki Linnakangas wrote:
> Hans-Juergen Schoenig -- PostgreSQL wrote:
>
>> my knowledge of how gist works internally is not too extensive. any
>> "kickstart" idea would be appreciated.
>>
>
> If there's not too many of those common words, you can create a simple
> partial b-tree index for each, and handle the less common words with the
> gist index you have (you can drop the display_price column from the index).
>
> Another idea:
>
> Create a table containing one row for each word in each product:
>
> CREATE TABLE t_product_word (id bigint, word text, display_price
> numeric(10,4));
>
> with triggers to keep it up-to-date. You can then create a regular two
> column b-tree index on that:
>
> CREATE INDEX idx_word_price ON t_product_word (word, display_price);
>
> And query with:
>
> SELECT p.art_number, p.title
> FROM t_product p INNER JOIN t_product_word pw ON p.id = pw.id
> WHERE pw.word = 'harddisk'
> ORDER BY pw.display_price LIMIT 10;
>
> The t_product_word table will be huge, but with a few gigabytes of data
> it should still be manageable.
>
>

--
Cybertec Schoenig & Schoenig GmbH
Reyergasse 9 / 2
A-2700 Wiener Neustadt
Web: www.postgresql-support.de


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Hans-Juergen Schoenig -- PostgreSQL <postgres(at)cybertec(dot)at>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Zoltan Boszormenyi <zb(at)cybertec(dot)at>
Subject: Re: combined indexes with Gist - planner issues?
Date: 2009-09-03 05:33:39
Message-ID: 4A9F5533.2010105@bluegap.ch
Lists: pgsql-hackers

Hi,

Hans-Juergen Schoenig -- PostgreSQL wrote:
> we did some experiments with such a table.
> the problem is if you want to allow arbitrary combinations of words, which
> FTI models perfectly.
> you would instantly end up with a self join across 5 relations or so -
> which is again bad.
>
> there are too many common words to consider handling them partly with gist
> and partly with a btree.

How about an inverted index, either via GIN or with a custom table, such
that you have the cheapest price per existing word? (That's pretty close
to how full text searching itself works.) Either reduce the number of
words with tsearch2's stemming algorithms, or go for trigrams right away:
split a word or search query into all its trigrams, then look up the
(cheapest) price(s) per trigram and return the n least expensive ones.

I've done something pretty similar for a customer, using the custom
table approach, as GIN integration had only just started back then. Even
now, you need to consider the downside of that index lacking visibility
information and having to recreate the index from time to time. OTOH a
custom table needs a lot more manual twiddling with triggers and bulk
"index" rebuilding.

I guess I'd still go for a custom table today, as it's simply more
flexible. Something like:

CREATE TABLE cheapest_product_by_word (
trgm TEXT,
cheapest_products bigint[]
);
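
A hypothetical lookup against that table could then look like this (one
trigram shown; a real query would intersect the candidate lists of all
trigrams of the search term):

SELECT p.art_number, p.title
FROM cheapest_product_by_word c
    JOIN product.t_product p ON p.id = ANY (c.cheapest_products)
WHERE c.trgm = 'spi'   -- one trigram of 'spiel'
ORDER BY p.display_price
LIMIT 10;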

However, all of this is assuming that data (i.e. prices, products)
change very rarely and it's beneficial to calculate such an intermediate
lookup-table in advance. Not sure how much that's the case for you.

Regards

Markus Wanner