Re: [Gmod-schema] Re: performace problem after VACUUM

Lists: pgsql-performance
From: SLetovsky(at)aol(dot)com
To: tgl(at)sss(dot)pgh(dot)pa(dot)us, cain(at)cshl(dot)org
Cc: pgsql-performance(at)postgresql(dot)org, gmod-schema(at)lists(dot)sourceforge(dot)net
Subject: Re: [Gmod-schema] Re: performace problem after VACUUM ANALYZE
Date: 2003-02-14 19:07:49
Message-ID: 27.394f051e.2b7eded5@aol.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Tom,
Thanks for your help in this. We have some flexibility in the schema
design; we are going through our first PostGres performance testing now.
Am I correct in interpreting your comments as saying you believe that
if we could lose the OR and the strand constraint PG would probably
use the index properly? There is an alternative representation on the table
that would do that.

Cheers, -Stan

In a message dated 2/14/2003 6:22:15 PM Eastern Standard Time,
tgl(at)sss(dot)pgh(dot)pa(dot)us writes:

> Subj: [Gmod-schema] Re: [PERFORM] performace problem after VACUUM ANALYZE
> Date: 2/14/2003 6:22:15 PM Eastern Standard Time
> From: <A HREF="mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us">tgl(at)sss(dot)pgh(dot)pa(dot)us</A>
> To: <A HREF="mailto:cain(at)cshl(dot)org">cain(at)cshl(dot)org</A>
> CC: <A HREF="mailto:pgsql-performance(at)postgresql(dot)org">pgsql-performance(at)postgresql(dot)org</A>, <A HREF="mailto:gmod-schema(at)lists(dot)sourceforge(dot)net">gmod-schema(at)lists(dot)sourceforge(dot)net</A>
> Sent from the Internet
>
>
>
> Scott Cain <cain(at)cshl(dot)org> writes:
> >Here is the query that is causing the problems:
>
> >select distinct f.name,fl.nbeg,fl.nend,fl.strand,f.type_id,f.feature_id
> > from feature f, featureloc fl
> > where
> > fl.srcfeature_id = 1 and
> > ((fl.strand=1 and fl.nbeg <= 393164 and fl.nend >= 390956) OR
> > (fl.strand=-1 and fl.nend <= 393164 and fl.nbeg >= 390956)) and
> > f.feature_id = fl.feature_id
>
> >[ and the index he'd like it to use is ]
>
> >Index "featureloc_src_strand_beg_end"
> > Column | Type
> >---------------+----------
> > srcfeature_id | integer
> > strand | smallint
> > nbeg | integer
> > nend | integer
> >btree
>
> After fooling with this I see a couple of problems. One is the
> same old cross-datatype-comparison issue that so frequently bites
> people: "1" and "-1" are integer constants, and comparing them to
> a smallint column isn't an indexable operation. You need casts.
> (Or, forget the "optimization" of storing strand as a smallint.
> Depending on your row layout, it's quite likely saving you no space
> anyway.)
>
> Problem two is that the optimizer isn't smart enough to recognize that a
> query condition laid out in this form should be processed as two
> indexscans --- it would possibly have gotten it right if the first index
> column had been inside the OR, but not this way. The upshot is that
> when you force it to use index featureloc_src_strand_beg_end, it's
> actually only using the srcfeature_id column of the index --- which is
> slow of course, and also explains why the optimizer doesn't find that
> option very attractive.
>
> I had some success in getting current sources to generate a desirable
> plan by doing this:
>
> regression=# explain select distinct *
> regression-# from feature f join featureloc fl on (f.feature_id =
> fl.feature_id) where
> regression-# ((fl.srcfeature_id = 1 and fl.strand=1::int2 and fl.nbeg <=
> 393164 and fl.nend >= 390956) OR
> regression(# (fl.srcfeature_id = 1 and fl.strand=-1::int2 and fl.nend <=
> 393164 and fl.nbeg >= 390956));
>
> Unique (cost=34.79..34.85 rows=5 width=50)
> -> Sort (cost=34.79..34.80 rows=5 width=50)
> Sort Key: f.name, fl.nbeg, fl.nend, fl.strand
> -> Hash Join (cost=9.68..34.73 rows=5 width=50)
> Hash Cond: ("outer".feature_id = "inner".feature_id)
> -> Seq Scan on feature f (cost=0.00..20.00 rows=1000 width=36)
> -> Hash (cost=9.68..9.68 rows=1 width=14)
> -> Index Scan using featureloc_src_strand_beg_end,
> featureloc_src_strand_beg_end on featureloc fl (cost=0.00..9.68 rows=1
> width=14)
> Index Cond: (((srcfeature_id = 1) AND (strand = 1::smallint)
> AND (nbeg <= 393164) AND (nend >= 390956)) OR ((srcfeature_id = 1) AND
> (strand = -1::smallint) AND (nbeg >= 390956) AND (nend <= 393164)))
> Filter: (((srcfeature_id = 1) AND (strand = 1::smallint) AND
> (nbeg <= 393164) AND (nend >= 390956)) OR ((srcfeature_id = 1) AND (strand
> = -1::smallint) AND (nend <= 393164) AND (nbeg >= 390956)))
> (10 rows)
>
> Shoving the join condition into an explicit JOIN clause is a hack, but
> it nicely does the job of keeping the WHERE clause as a pristine
> OR-of-ANDs structure, so that the optimizer can hardly fail to notice
> that that's the preferred canonical form.
>
> I would strongly recommend an upgrade to PG 7.3, both on general
> principles and because you can actually see what the indexscan condition
> is in EXPLAIN's output. Before 7.3 you had to grovel through EXPLAIN
> VERBOSE to be sure what was really happening with a multicolumn index.
>
> regards, tom lane
>
>
> -------------------------------------------------------
> This SF.NET email is sponsored by: FREE SSL Guide from Thawte
> are you planning your Web Server Security? Click here to get a FREE
> Thawte SSL guide and find the answers to all your SSL security issues.
> http://ads.sourceforge.net/cgi-bin/redirect.pl?thaw0026en
> _______________________________________________
> Gmod-schema mailing list
> Gmod-schema(at)lists(dot)sourceforge(dot)net
> https://lists.sourceforge.net/lists/listinfo/gmod-schema
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: SLetovsky(at)aol(dot)com
Cc: cain(at)cshl(dot)org, pgsql-performance(at)postgresql(dot)org, gmod-schema(at)lists(dot)sourceforge(dot)net
Subject: Re: [Gmod-schema] Re: performace problem after VACUUM ANALYZE
Date: 2003-02-15 00:11:52
Message-ID: 12378.1045267912@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

SLetovsky(at)aol(dot)com writes:
> Am I correct in interpreting your comments as saying you believe that
> if we could lose the OR and the strand constraint PG would probably
> use the index properly?

No, I said I thought it could do it without that ;-). But yes, you'd
have a much less fragile query if you could lose the OR condition.

Have you looked into using a UNION ALL instead of OR to merge the two
sets of results? It sounds grotty, but might be faster...

regards, tom lane


From: Scott Cain <cain(at)cshl(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: stan letovsky <SLetovsky(at)aol(dot)com>, pgsql-performance(at)postgresql(dot)org, gmod schema <gmod-schema(at)lists(dot)sourceforge(dot)net>
Subject: Re: [Gmod-schema] Re: performace problem after VACUUM
Date: 2003-02-15 20:36:31
Message-ID: 1045341390.3944.678.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

Hello Tom,

Here's the short answer: I've got it working much faster now (>100 msec
for the query by explain analyze).

Here's the long answer: I reworked the table, horribly denormalizing
it. I changed the coordinate system, so that start is always less than
end, regardless of strand. Here is the original query:

select distinct f.name,fl.nbeg,fl.nend,fl.strand,f.type_id,f.feature_id
from feature f, featureloc fl
where
fl.srcfeature_id = 1 and
((fl.strand=1 and fl.nbeg <= 393164 and fl.nend >= 390956) OR
(fl.strand=-1 and fl.nend <= 393164 and fl.nbeg >= 390956)) and
f.feature_id = fl.feature_id

and here is the equivalent query in the new coordinate system:

select distinct f.name,fl.nbeg,fl.nend,fl.strand,f.type_id,f.feature_id
from feature f, featureloc fl
where
fl.srcfeature_id = 1 and
f.feature_id = fl.feature_id and
fl.max >= 390956 and
fl.min <= 393164

Notice that it is MUCH simpler, and the query planner uses exactly the
indexes I want, and as noted above, runs much faster. Of course, this
also means that I have to rewrite my database adaptor, but it shouldn't
be too bad.

For those on the GMOD list, here is how I changed the table:

alter table featureloc add column min int;
alter table featureloc add column max int;
update featureloc set min=nbeg where strand=1;
update featureloc set max=nend where strand=1;
update featureloc set max=nbeg where strand=-1;
update featureloc set min=nend where strand=-1;
update featureloc set min=nbeg where (strand=0 or strand is null) and nbeg<nend;
update featureloc set max=nend where (strand=0 or strand is null) and nbeg<nend;
update featureloc set min=nend where (strand=0 or strand is null) and nbeg>nend;
update featureloc set max=nbeg where (strand=0 or strand is null) and nbeg>nend;
create index featureloc_src_min_max on featureloc (srcfeature_id,min,max);
select count(*) from featureloc where min is null and nbeg is not null;

The last select is just a test to make sure I didn't miss anything, and
it did return zero. Also, it doesn't appear that there are any features
that are strandless. I found that a little surprising, but included
those updates for completeness.

Tom, thank you much for your help. Hopefully, I will get the group to
buy into this schema change, and life will be good.

Scott

On Fri, 2003-02-14 at 19:11, Tom Lane wrote:
> SLetovsky(at)aol(dot)com writes:
> > Am I correct in interpreting your comments as saying you believe that
> > if we could lose the OR and the strand constraint PG would probably
> > use the index properly?
>
> No, I said I thought it could do it without that ;-). But yes, you'd
> have a much less fragile query if you could lose the OR condition.
>
> Have you looked into using a UNION ALL instead of OR to merge the two
> sets of results? It sounds grotty, but might be faster...
>
> regards, tom lane
--
------------------------------------------------------------------------
Scott Cain, Ph. D. cain(at)cshl(dot)org
GMOD Coordinator (http://www.gmod.org/) 216-392-3087
Cold Spring Harbor Laboratory