From: Tomasz Ostrowski <tometzky(at)batory(dot)org(dot)pl>
To: pgsql-bugs(at)postgresql(dot)org
Subject: query very slow when enable_seqscan=on
Date: 2006-07-03 20:31:07
Message-ID: 20060703203105.GA2801@batory.org.pl
Lists: pgsql-bugs pgsql-performance

I have a very slow query when enable_seqscan=on and very fast when
enable_seqscan=off. My schema looks like this (relevant columns
only):

create table organizations (
organization_id serial primary key,
organization varchar(200) not null,
organization_location varchar(55) not null
-- and several irrelevant columns
); -- about 9000 records
create table persons (
person_id serial primary key,
surname varchar(50) not null,
forename varchar(35) not null,
organization_id int references organizations
-- and several irrelevant columns
); -- about 6500 records
create index persons_surname_forename_person_id on persons (
surname,
forename,
lpad(person_id,10,'0')
); -- I was hoping this would speed up array comparisons

The query looks for the position of a person with a given person_id in
a list sorted by surname, forename and person_id, filtered by some
criteria. In this example person_id=1, forename~*'to' (about 400
people) and organization_location~*'warszawa' (about 2000
organizations):

select count(*) as position from (select
person_id, surname, forename
from persons
natural left join organizations
where forename~*'to' and organization_location~*'warszawa'
) as person_filter
where array[surname, forename, lpad(person_id,10,'0')]
<
(select array[surname, forename, lpad(person_id,10,'0')]
from persons where person_id=1);

This query takes about 30 seconds when enable_seqscan=on and 65
milliseconds when it is off.

When enable_seqscan=on:
Aggregate (cost=785.72..785.73 rows=1 width=0) (actual time=27948.955..27948.956 rows=1 loops=1)
InitPlan
-> Index Scan using persons_pkey on persons (cost=0.00..3.11 rows=1 width=26) (actual time=0.019..0.019 rows=0 loops=1)
Index Cond: (person_id = 1)
-> Nested Loop (cost=0.00..782.60 rows=1 width=0) (actual time=27948.939..27948.939 rows=0 loops=1)
Join Filter: ("inner".organization_id = "outer".organization_id)
-> Seq Scan on organization (cost=0.00..480.95 rows=1 width=4) (actual time=0.071..69.702 rows=1892 loops=1)
Filter: ((organization_location)::text ~* 'warszawa'::text)
-> Seq Scan on persons (cost=0.00..296.10 rows=444 width=4) (actual time=14.720..14.720 rows=0 loops=1892)
Filter: (((forename)::text ~* 'to'::text) AND (ARRAY[surname, forename, (lpad((person_id)::text, 10, '0'::text))::character varying] < $0))
Total runtime: 27949.106 ms

When enable_seqscan=off:
Aggregate (cost=100001710.26..100001710.27 rows=1 width=0) (actual time=66.788..66.789 rows=1 loops=1)
InitPlan
-> Index Scan using persons_pkey on persons (cost=0.00..3.11 rows=1 width=26) (actual time=0.019..0.019 rows=0 loops=1)
Index Cond: (person_id = 1)
-> Hash Join (cost=100001408.81..100001707.14 rows=1 width=0) (actual time=66.756..66.756 rows=0 loops=1)
Hash Cond: ("outer".organization_id = "inner".organization_id)
-> Seq Scan on persons (cost=100000000.00..100000296.10 rows=444 width=4) (actual time=14.972..14.972 rows=0 loops=1)
Filter: (((forename)::text ~* 'to'::text) AND (ARRAY[surname, forename, (lpad((person_id)::text, 10, '0'::text))::character varying] < $0))
-> Hash (cost=1408.81..1408.81 rows=1 width=4) (actual time=51.763..51.763 rows=1892 loops=1)
-> Index Scan using organizations_pkey on organizations (cost=0.00..1408.81 rows=1 width=4) (actual time=0.049..48.233 rows=1892 loops=1)
Filter: ((organization_location)::text ~* 'warszawa'::text)
Total runtime: 66.933 ms

The database is properly analyzed. postgresql-8.1.4 on Fedora Core 4.

Regards
Tometzky

PS. Actual table and column names are different (they're in Polish)
but I've translated them for better readability for English speakers.

PS. I wonder if it makes sense to set enable_seqscan=off for every
client if a database is small enough to fit in the OS cache.
--
...although Eating Honey was a very good thing to do, there was a
moment just before you began to eat it which was better than when you
were...
Winnie the Pooh


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tomasz Ostrowski <tometzky(at)batory(dot)org(dot)pl>
Cc: pgsql-bugs(at)postgresql(dot)org, pgsql-performance(at)postgresql(dot)org
Subject: Re: query very slow when enable_seqscan=on
Date: 2006-07-03 23:05:42
Message-ID: 1151967943.2479.330.camel@localhost.localdomain
Lists: pgsql-bugs pgsql-performance

On Mon, 2006-07-03 at 22:31 +0200, Tomasz Ostrowski wrote:
> I have a very slow query when enable_seqscan=on and very fast when
> enable_seqscan=off. My schema looks like this (relevant columns
> only):
> PS. Actual table and column names are different (they're in Polish)
> but I've translated them for better readability for english-speaking.

Thanks

> PS. I wonder if it makes sense to "enable_seqscan=off" for every client
> if a database is small enough to fit in OS cache.

You can set this for individual statements if you choose to.
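
For example, a minimal sketch (SET LOCAL keeps the change scoped to a
single transaction; a plain SET plus RESET works for a whole session):

BEGIN;
SET LOCAL enable_seqscan = off;  -- affects only this transaction
-- run the expensive query here
COMMIT;

-- or, for the rest of the session:
SET enable_seqscan = off;
-- run the query
RESET enable_seqscan;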

> -> Seq Scan on organization (cost=0.00..480.95 rows=1
> width=4) (actual time=0.071..69.702 rows=1892 loops=1)
> Filter: ((organization_location)::text ~*
> 'warszawa'::text)

The issue is caused by the underestimation of the number of rows in
the table that match the regular expression comparison. As a result
the planner thinks it can choose a nested loop, but it ends up doing
1892 seq scans of persons when it thought it would do only one.

The underestimation is a known issue. Posting to -performance for the
record.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tomasz Ostrowski <tometzky(at)batory(dot)org(dot)pl>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: query very slow when enable_seqscan=on
Date: 2006-07-03 23:05:46
Message-ID: 7353.1151967946@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-performance

Tomasz Ostrowski <tometzky(at)batory(dot)org(dot)pl> writes:
> I have a very slow query when enable_seqscan=on and very fast when
> enable_seqscan=off.

Here's your problem:

> -> Seq Scan on organization (cost=0.00..480.95 rows=1 width=4) (actual time=0.071..69.702 rows=1892 loops=1)
> Filter: ((organization_location)::text ~* 'warszawa'::text)

If it were estimating something like the actual number of rows matching
that filter, it'd never have chosen a nestloop plan like that.

How many rows are there in the organization table?

This is probably the fault of the pattern-selectivity heuristic: it's
far too optimistic about long match strings eliminating a lot of rows.
I think there's been some discussion of modifying that logic but no
one's really stepped up with a better idea.
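
Just to illustrate, the misestimate shows up if you EXPLAIN ANALYZE the
regex filter on its own (a sketch; the row counts in the comment are
the ones from the plans above and will differ on other data):

EXPLAIN ANALYZE
SELECT count(*) FROM organizations
 WHERE organization_location ~* 'warszawa';
-- estimated rows=1 vs. actual rows=1892 for this filter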

regards, tom lane


From: Tomasz Ostrowski <tometzky(at)batory(dot)org(dot)pl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: query very slow when enable_seqscan=on
Date: 2006-07-04 08:37:33
Message-ID: 20060704083733.GA20102@batory.org.pl
Lists: pgsql-bugs pgsql-performance

On Mon, 03 Jul 2006, Tom Lane wrote:

> > -> Seq Scan on organization (cost=0.00..480.95 rows=1 width=4) (actual time=0.071..69.702 rows=1892 loops=1)
> > Filter: ((organization_location)::text ~* 'warszawa'::text)
>
> How many rows are there in the organization table?

About 9000. And about 6500 persons. "Warszawa" (Warsaw) is the biggest
city in Poland and its capital - many organizations are located there.

> This is probably the fault of the pattern-selectivity heuristic:
> it's far too optimistic about long match strings eliminating a lot
> of rows. I think there's been some discussion of modifying that
> logic but no one's really stepped up with a better idea.

I think that's because there is no good solution to this - no
statistical information is going to predict how much data will match
a regular expression. Maybe in this situation the algorithm should be
pessimistic - assume that it will return all rows, or all non-null
rows, or all rows no shorter than the match string (if it's a plain
string and not, for example, a regex like
[abcdefghijklmnopqrstuvwxyz], which is long but will match basically
everything). In my opinion it is better to overestimate most of the
time than to risk underestimation by a factor of 1000 or more.

For now I'm turning off seqscans. This is the second time I've got
terrible performance with seqscans turned on because of bad
estimation. And my database will probably fit in cache anyway.
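
A minimal sketch of doing that without touching every client ("mydb"
and "myrole" are placeholder names; the setting could also simply go
into postgresql.conf):

ALTER DATABASE mydb SET enable_seqscan = off;  -- per-database default
ALTER ROLE myrole SET enable_seqscan = off;    -- or per-role default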

Regards
Tometzky
--
...although Eating Honey was a very good thing to do, there was a
moment just before you began to eat it which was better than when you
were...
Winnie the Pooh


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tomasz Ostrowski <tometzky(at)batory(dot)org(dot)pl>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: query very slow when enable_seqscan=on
Date: 2006-07-04 13:56:29
Message-ID: 13853.1152021389@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-performance

Tomasz Ostrowski <tometzky(at)batory(dot)org(dot)pl> writes:
> I think because there is no good solution to this - no statistical
> information is going to predict how much data will match a regular
> expression.

Well, it's certainly hard to imagine simple stats that would let the
code guess that, say, "warsa" and "warsaw" match nearly the same
(large) number of rows while "warsawq" matches nothing.

I think the real problem here is that regex matching is the wrong tool
for the job. Have you looked into a full-text index (tsearch2)?
With something like that, the index operator has at least got the
correct conceptual model, ie, looking for indexed words. I'm not sure
if they have any decent statistical support for it :-( but in theory
that seems doable, whereas regex estimation will always be a crapshoot.
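
For illustration only, a rough sketch of a tsearch2-style setup from
the contrib module (the 'default' configuration name and the exact
function signatures follow tsearch2 and may differ between versions):

ALTER TABLE organizations ADD COLUMN location_tsv tsvector;
UPDATE organizations
   SET location_tsv = to_tsvector('default', organization_location);
CREATE INDEX organizations_location_tsv
    ON organizations USING gist (location_tsv);
-- keeping location_tsv current on inserts/updates needs a trigger,
-- omitted here

SELECT count(*) FROM organizations
 WHERE location_tsv @@ to_tsquery('default', 'warszawa');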

regards, tom lane


From: Tomasz Ostrowski <tometzky(at)batory(dot)org(dot)pl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: query very slow when enable_seqscan=on
Date: 2006-07-04 14:44:08
Message-ID: 20060704144406.GA21511@batory.org.pl
Lists: pgsql-bugs pgsql-performance

On Tue, 04 Jul 2006, Tom Lane wrote:

> I think the real problem here is that regex matching is the wrong
> tool for the job. Have you looked into a full-text index
> (tsearch2)?

So much to do with so little time...

I've briefly looked into it but:

- it's complicated;

- it is not needed - a basic scan is good enough for the amount of
data we have (if a sane query plan is chosen by the database);

- we have data in many languages (including ones based on the
Cyrillic alphabet) - languages which use different forms of the same
word depending on context, for example:
  Warszawa
  Warszawy
  Warszawie
  Warszawę
  Warszawą
  Warszawo
All of the above could be translated to "Warsaw". So we need to
support matching parts of words ("warszaw"), which I haven't seen
in tsearch2 (maybe I've overlooked it). We also have words whose
different forms look like this: "stół", "stole", "stołu" (Polish for
"table") - to find such a word we'd need to list every possible
form (about 10) or use a regex like 'st[oó][lł]'.

> With something like that, the index operator has at least got the
> correct conceptual model, ie, looking for indexed words. I'm not sure
> if they have any decent statistical support for it :-( but in theory
> that seems doable, whereas regex estimation will always be a crapshoot.

So why estimate regex expressions if no estimation is possible?
Let's make this estimate pessimistic (match everything, or everything
not null) and it will choose better plans - at least until somebody
figures out a better approach.

Regards
Tometzky
--
Best of prhn - the funniest texts of the Polish UseNet
http://prhn.dnsalias.org/
Chaos always defeats order, because it is better organized.
[ Terry Pratchett ]


From: tomas(at)tuxteam(dot)de
To: Tomasz Ostrowski <tometzky(at)batory(dot)org(dot)pl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: query very slow when enable_seqscan=on
Date: 2006-07-04 16:24:34
Message-ID: 20060704162434.GD20619@www.trapp.net
Lists: pgsql-bugs pgsql-performance

On Tue, Jul 04, 2006 at 04:44:08PM +0200, Tomasz Ostrowski wrote:
> On Tue, 04 Jul 2006, Tom Lane wrote:
>
> > I think the real problem here is that regex matching is the wrong
> > tool for the job. Have you looked into a full-text index
> > (tsearch2)?
>
> So much to do with so little time...

For what it's worth, I've got pretty good results (at least
considering the little amount of work I put into it) with trigram
indexes, courtesy of $PGCONTRIB/pg_trgm.sql, just another amazing
little piece by Bartunov and Sigaev.
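
A sketch, assuming pg_trgm is installed (% is its trigram similarity
operator and gist_trgm_ops the operator class it ships with):

CREATE INDEX organizations_location_trgm
    ON organizations USING gist (organization_location gist_trgm_ops);

SELECT * FROM organizations
 WHERE organization_location % 'warszawa';  -- trigram similarity match
-- the similarity threshold can be read with show_limit() and changed
-- with set_limit()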

Let me know if you'd like to hear more.

Regards
-- tomás


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tomasz Ostrowski <tometzky(at)batory(dot)org(dot)pl>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: query very slow when enable_seqscan=on
Date: 2006-07-04 18:15:54
Message-ID: 25494.1152036954@sss.pgh.pa.us
Lists: pgsql-bugs pgsql-performance

Tomasz Ostrowski <tometzky(at)batory(dot)org(dot)pl> writes:
> So why estimate regex expressions if there is no estimation possible?
> Let's set this estimate to be pessimistic (match everything or
> everything not null) and it will choose better plans.

Better plans for this specific example, worse plans for other cases.
Life is not that simple.

regards, tom lane


From: Tomasz Ostrowski <tometzky(at)batory(dot)org(dot)pl>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: query very slow when enable_seqscan=on
Date: 2006-07-05 08:33:59
Message-ID: 20060705083358.GA570@batory.org.pl
Lists: pgsql-bugs pgsql-performance

On Tue, 04 Jul 2006, Tom Lane wrote:

> Tomasz Ostrowski <tometzky(at)batory(dot)org(dot)pl> writes:
> > So why estimate regex expressions if there is no estimation possible?
> > Let's set this estimate to be pessimistic (match everything or
> > everything not null) and it will choose better plans.
>
> Better plans for this specific example, worse plans for other cases.
> Life is not that simple.

It isn't. These worse plans would be chosen only when pattern/regex
matching is used, and would be, say, 2 times worse.

What I'm trying to point out is that some people use regular
expressions for filtering rows. When the program is written, it is
often impossible to know what data will be put into it. And a program
that is unexpectedly 2000 times slower than normal is much worse than
one that is 2 times slower but predictable.

I know Postgres uses a probabilistic approach, so there's always a
probability that the planner chooses very wrong. Normally this
probability is so small that it can be ignored; with pattern/regex
matching it is not.

Regards
Tometzky
--
...although Eating Honey was a very good thing to do, there was a
moment just before you began to eat it which was better than when you
were...
Winnie the Pooh