SQL command speed

From: Kate Collins <klcollins(at)wsicorp(dot)com>
To: postgresql news group <pgsql-sql(at)postgresql(dot)org>
Subject: SQL command speed
Date: 2000-05-18 19:11:27
Message-ID: 3924405E.15014E6B@wsicorp.com
Lists: pgsql-sql

I am running PostgreSQL 7.0 on Red Hat Linux 6.2. I am fairly new to
using PostgreSQL. I am in the process of comparing its performance with
an Oracle database, and have converted and populated several tables from
Oracle to PostgreSQL.

To test the speed of the databases, I wrote the following script using
the Perl DBI:

--- Start Script 1 ---
#!/bin/perl
use DBI;
$dbh = DBI->connect("dbi:Pg:dbname=pbi", "ntm", "",
                    { RaiseError => 1, AutoCommit => 0 })
    or die "Can't connect to PGSQL";

$sql = "SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
FROM notam_details
WHERE
item_a = 'EGKB' OR
item_a = 'EGDM' OR
item_a = 'EGHH' OR
item_a = 'EGGD' OR
item_a = 'EGVN' OR
item_a = 'EGFF' OR
item_a = 'EGDC' OR
item_a = 'EGTC' OR
item_a = 'EGDR' OR
item_a = 'EGTE' OR
item_a = 'EGLF' OR
item_a = 'EGTG' OR
item_a = 'EGBJ' OR
item_a = 'EGLC' OR
item_a = 'EGKK' OR
item_a = 'EGLL' OR
item_a = 'EGSS' OR
item_a = 'EGGW' OR
item_a = 'EGMD' OR
item_a = 'EGDL' OR
item_a = 'EGUM' OR
item_a = 'EGHD' OR
item_a = 'EGHE' OR
item_a = 'EGKA' OR
item_a = 'EGHI' OR
item_a = 'EGMC' OR
item_a = 'EGDG' OR
item_a = 'EGFH' OR
item_a = 'EGDY' OR
item_a = 'EGJA' OR
item_a = 'EGJB' OR
item_a = 'EGJJ'";

$sth = $dbh->prepare($sql);
$sth->execute();
$result = $sth->fetchall_arrayref();
foreach (@{$result})
{
    ($id, $date) = @$_;
    print "$id:$date\n";
}
--- End Script 1 ---

When I ran it, it took 12 seconds with PostgreSQL and 1 second with Oracle.

I then went through several steps using vacuum, building indexes, etc.,
but I did not see much of a performance improvement. Then I used
EXPLAIN and noticed that the query was not using the indexes I created.
I did some experimentation: if the WHERE clause had one or two items
it would use the index; with more, it would not.

So I rewrote the script to do multiple small select queries instead of
one big select query. The new script looked like this:

--- Start Script 2 ---
#!/bin/perl

use DBI;

$dbh = DBI->connect("dbi:Pg:dbname=pbi", "ntm", "",
                    { RaiseError => 1, AutoCommit => 0 })
    or die "Can't connect to PGSQL";

$sql = "SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
FROM notam_details
WHERE item_a = ?";

$sth = $dbh->prepare($sql);

# Quote the station IDs -- as bare words they only work without 'use strict'.
@stations = qw(EGKB EGDM EGHH EGGD EGVN EGFF EGDC EGTC EGDR EGTE
               EGLF EGTG EGBJ EGLC EGKK EGLL EGSS EGGW EGMD EGDL
               EGUM EGHD EGHE EGKA EGHI EGMC EGDG EGFH EGDY EGJA
               EGJB EGJJ);
foreach (@stations)
{
    $sth->bind_param(1, $_);
    $sth->execute();
    $result = $sth->fetchall_arrayref();
    foreach $s (@{$result})
    {
        ($id, $date) = @$s;
        print "$id:$date\n";
    }
}
--- End Script 2 ---

The result: the script's execution time dropped to 1 second with
PostgreSQL!

At first I thought it was a feature of the Perl DBI, but I ran the same
queries using psql and got similar results. I also ran some timing
checks of the Perl code, and 99% of the execution time of Script 1 is
spent in the "execute" statement.

I have looked through the PostgreSQL documentation and can find no
explanation for this. Is this a known issue, and is this the proper way
to construct SELECT statements in PostgreSQL?

Kate Collins

BTW, Script 2 takes a little longer to run on the Oracle system, about
1.3 seconds.

--
=================================================
Katherine (Kate) L. Collins
Senior Software Engineer/Meteorologist
Weather Services International (WSI Corporation)
4 Federal Street
Billerica, MA 01821
EMAIL: klcollins(at)wsicorp(dot)com
PHONE: (978) 670-5110
FAX: (978) 670-5100
http://www.intellicast.com


From: mig(at)utdt(dot)edu
To: klcollins(at)wsicorp(dot)com
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: SQL command speed
Date: 2000-05-18 19:38:29
Message-ID: 200005181938.QAA04835@ant.utdt
Lists: pgsql-sql

I believe the PostgreSQL optimizer is fooled by many ORs and switches
to sequential scans: it cannot properly estimate the number of rows
that will be returned.

Try it in one go, as

$sql = "SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
FROM notam_details
WHERE
item_a IN
('EGKB','EGDM','EGHH','EGGD','EGVN','EGFF',
'EGDC','EGTC','EGDR','EGTE','EGLF','EGTG',
'EGBJ','EGLC','EGKK','EGLL','EGSS','EGGW',
'EGMD','EGDL','EGUM','EGHD','EGHE','EGKA',
'EGHI','EGMC','EGDG','EGFH','EGDY','EGJA',
'EGJB','EGJJ')";

The optimizer should (I think) like this better and use the indices,
without requiring that you iterate the queries from the frontend.

As I am trying to learn these things too, I will appreciate knowing
about the result of your tests: please keep me posted.

Thanks

Miguel Sofer


From: Kate Collins <klcollins(at)wsicorp(dot)com>
To: mig(at)utdt(dot)edu
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: SQL command speed
Date: 2000-05-18 20:19:02
Message-ID: 39245036.71D4618F@wsicorp.com
Lists: pgsql-sql

Miguel,

Thank you for the reply.

I created a third script with the syntax you suggested, and the result
was similar to the first script, about 11.5 to 12 seconds.

The interesting thing is that in all three cases, when I use the unix
"time" command to time the execution, the "user" and "sys" portions are
about the same: user = 0.20, sys = 0.02.

This indicates to me that the bottleneck is not in the execution of the
Perl, but maybe in the postmaster daemon. The system I am testing with
is a PII, 400 MHz with 256 MB of RAM. It is not doing anything else at
this time but running these tests.

Interestingly enough under Oracle, the new script takes about the same
time as the other two, 1-1.5 seconds.

Kate Collins

mig(at)utdt(dot)edu wrote:

> I believe the PostgreSQL optimizer is fooled by many ORs and switches
> to sequential scans: it cannot estimate properly the quantity of
> results that will be returned.
>
> Try it in one go, as
>
> $sql = "SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY
> HH24:MI')
> FROM notam_details
> WHERE
> item_a IN
> ('EGKB','EGDM','EGHH','EGGD','EGVN','EGFF',
> 'EGDC','EGTC','EGDR','EGTE','EGLF','EGTG',
> 'EGBJ','EGLC','EGKK','EGLL','EGSS','EGGW',
> 'EGMD','EGDL','EGUM','EGHD','EGHE','EGKA',
> 'EGHI','EGMC','EGDG','EGFH','EGDY','EGJA',
> 'EGJB','EGJJ')";
>
> The optimizer should (I think) like this better and use the indices,
> without requiring that you iterate the queries from the frontend.
>
> As I am trying to learn these things too, I will appreciate knowing
> about the result of your tests: please keep me posted.
>
> Thanks
>
> Miguel Sofer

--
=================================================
Katherine (Kate) L. Collins
Senior Software Engineer/Meteorologist
Weather Services International (WSI Corporation)
4 Federal Street
Billerica, MA 01821
EMAIL: klcollins(at)wsicorp(dot)com
PHONE: (978) 670-5110
FAX: (978) 670-5100
http://www.intellicast.com


From: mig(at)utdt(dot)edu
To: klcollins(at)wsicorp(dot)com
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: SQL command speed
Date: 2000-05-18 20:27:42
Message-ID: 200005182027.RAA05089@ant.utdt
Lists: pgsql-sql

Thanks for your reply; I am disappointed that it did not work the way
I thought it would ...

I suppose that you did VACUUM ANALYZE your table recently, right? If
not, the stale table statistics may be playing a role too: do it, and
try again.

With this I have reached the outer boundaries of my "knowledge"; if it
doesn't work out, I hope somebody else may be able to help. In any
case, I would appreciate knowing more about your tests and the
eventual outcome ...

Miguel


From: "Ross J(dot) Reedstrom" <reedstrm(at)wallace(dot)ece(dot)rice(dot)edu>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: SQL command speed
Date: 2000-05-18 20:39:09
Message-ID: 20000518153909.B2549@rice.edu
Lists: pgsql-sql

Hey crew:
I've got a relatively simple SQL problem. In a db-backed web site
we're building, I'd like to fill a dropdown box with the contents of a
validation table, in this case research institutions. I want to sort them
alphabetically, but using "library rules", i.e. skipping initial articles,
since we've a few "The University of Foo" and "The Johns Hopkins
University", for example.

I thought I had it with this SQL:

SELECT InstName from Institutions ORDER BY ltrim (InstName, 'The');

Looked good, until I found 'Texas A&M University' sorting below York.

It seems ltrim() removes initial characters from the given set of
characters, not an initial string, so I was sorting on 'xas A&M University'.

Anyone have some magic solution for this?

Ross
--
Ross J. Reedstrom, Ph.D., <reedstrm(at)rice(dot)edu>
NSBRI Research Scientist/Programmer
Computer and Information Technology Institute
Rice University, 6100 S. Main St., Houston, TX 77005


From: Kate Collins <klcollins(at)wsicorp(dot)com>
To: mig(at)utdt(dot)edu
Cc: pgsql-sql(at)postgresql(dot)org
Subject: Re: SQL command speed
Date: 2000-05-18 20:46:57
Message-ID: 392456C1.2B5E90D6@wsicorp.com
Lists: pgsql-sql

Miguel,

I have re-run VACUUM ANALYZE. Unfortunately, it did not result in a
speedup.

Thank you for your suggestions and input. They have helped increase my
understanding of how it works.

I will wait and see if anybody else can provide any suggestions or
insight.

Kate

mig(at)utdt(dot)edu wrote:

> Thanks for your reply; I am disappointed that it did not work the way
> I thought it would ...
>
> I suppose that you did VACUUM ANALYZE your table recently, right? If
> not, the stale table statistics may be playing a role too: do it, and
> try again.
>
> With this I have reached the outer boundaries of my "knowledge"; if it
> doesn't work out, I hope somebody else may be able to help. In any
> case, I would appreciate knowing more about your tests and the
> eventual outcome ...
>
> Miguel

--
=================================================
Katherine (Kate) L. Collins
Senior Software Engineer/Meteorologist
Weather Services International (WSI Corporation)
4 Federal Street
Billerica, MA 01821
EMAIL: klcollins(at)wsicorp(dot)com
PHONE: (978) 670-5110
FAX: (978) 670-5100
http://www.intellicast.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kate Collins <klcollins(at)wsicorp(dot)com>
Cc: postgresql news group <pgsql-sql(at)postgresql(dot)org>
Subject: Re: SQL command speed
Date: 2000-05-18 21:47:56
Message-ID: 25357.958686476@sss.pgh.pa.us
Lists: pgsql-sql

Kate Collins <klcollins(at)wsicorp(dot)com> writes:
> I did some experimentation, and if the WHERE clause had one or two items
> it would use the index; more and it would not.

Kate, it is reasonable behavior for the planner to stop using
indexscans when there are enough OR clauses. Each OR clause requires
a separate indexscan and so eventually it'll be cheaper to just do one
sequential scan over the whole table. What we appear to have here is
a case of misestimation of the relative costs of index and sequential
scans, leading the planner to switch too soon. Next question is why
the misestimation. It's difficult to say anything without seeing
your EXPLAIN results for different numbers of OR clauses. Also, how
big is the table (how many rows) and how many rows do you actually
get from the query?

regards, tom lane


From: "Stephan Szabo" <sszabo(at)kick(dot)com>
To: "Ross J(dot) Reedstrom" <reedstrm(at)wallace(dot)ece(dot)rice(dot)edu>
Cc: <pgsql-sql(at)postgresql(dot)org>
Subject: Re: SQL command speed
Date: 2000-05-18 22:02:02
Message-ID: 024001bfc114$b25c18b0$0c64010a@kick.com
Lists: pgsql-sql

I didn't see a built-in function that would do what you want, but
you could do the trimming in a plpgsql function...
(Since I don't have a 7.0 system to test with, these were only tested
on an old 6.5.1 database)...

create function trimthe(text) returns text as '
begin
    if substr($1, 1, 4) = \'The \' then
        return substr($1, 5);
    end if;
    -- add other prefix checks here...
    return $1;
end;
' language 'plpgsql';

and then select ordering by that...

I guess a more general function could be:
create function ltrimstr(text, text) returns text as '
begin
    if position($2 in $1) = 1 then
        return substr($1, char_length($2)+1);
    end if;
    return $1;
end;
' language 'plpgsql';

----- Original Message -----
From: "Ross J. Reedstrom" <reedstrm(at)wallace(dot)ece(dot)rice(dot)edu>
To: <pgsql-sql(at)postgresql(dot)org>
Sent: Thursday, May 18, 2000 1:39 PM
Subject: Re: [SQL] SQL command speed

> Hey crew:
> I've got a relatively simple SQL problem. In a db backed web site
> we're building, I'd like to fill a dropdown box with the contents of a
> validation table, in this case research institutions. I want to sort them
> alphabetically, but using "library rules": i.e. skip inital articles,
> since we've a few 'The University of Foo" and "The Johns Hopkins
> University", for example.
>
> I thought I had it with this SQL:
>
> SELECT InstName from Institutions ORDER BY ltrim (InstName, 'The');
>
> Looked good, until I found 'Texas A&M University' sorting below York.
>
> Seems ltrim() removes inital charaters from the set of charaters, not
> inital strings, so I was sorting on 'xas A&M University'
>
> Anyone have some magic solution for this?


From: "Ross J(dot) Reedstrom" <reedstrm(at)wallace(dot)ece(dot)rice(dot)edu>
To: pgsql-sql(at)postgresql(dot)org
Subject: Library sort order (was Re: SQL command speed)
Date: 2000-05-18 22:03:18
Message-ID: 20000518170318.B2964@rice.edu
Lists: pgsql-sql

Sorry about the subject: I meant to change it on the last one.

As usual, 5 minutes after posting, I came up with a solution, at least
for my limited case of ignoring an initial 'The ':

SELECT InstName FROM Institutions
ORDER BY CASE WHEN strpos(InstName, 'The ') = 1
              THEN substr(InstName, 5)
              ELSE InstName END;

Note that this requires 7.0, since 6.X won't allow a CASE node in the
ORDER BY position. (substr(InstName, 5) strips exactly the first four
characters; ltrim(InstName, 'The ') would hit the same character-set
pitfall again when the next word starts with 'T', 'h', or 'e'.)

Ross

On Thu, May 18, 2000 at 03:39:09PM -0500, Ross J. Reedstrom wrote:
> Hey crew:
> I've got a relatively simple SQL problem. In a db backed web site
> we're building, I'd like to fill a dropdown box with the contents of a
> validation table, in this case research institutions. I want to sort them
> alphabetically, but using "library rules": i.e. skip inital articles,
> since we've a few 'The University of Foo" and "The Johns Hopkins
> University", for example.
>
> I thought I had it with this SQL:
>
> SELECT InstName from Institutions ORDER BY ltrim (InstName, 'The');
>
> Looked good, until I found 'Texas A&M University' sorting below York.
>
> Seems ltrim() removes inital charaters from the set of charaters, not
> inital strings, so I was sorting on 'xas A&M University'
>
> Anyone have some magic solution for this?
>
> Ross
> --
> Ross J. Reedstrom, Ph.D., <reedstrm(at)rice(dot)edu>
> NSBRI Research Scientist/Programmer
> Computer and Information Technology Institute
> Rice University, 6100 S. Main St., Houston, TX 77005
>


From: Kate Collins <klcollins(at)wsicorp(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: postgresql news group <pgsql-sql(at)postgresql(dot)org>
Subject: Re: SQL command speed
Date: 2000-05-19 13:20:55
Message-ID: 39253FB7.39516F9@wsicorp.com
Lists: pgsql-sql

Tom,

Thank you for your reply.

The table I am using has 114600 total rows. The full query returns 1129
rows. Right now the table is static, i.e. I am not modifying it while I am
running these tests.

Here are the results of the EXPLAIN with the different numbers of OR's.

---QUERY 1, returns 1129 rows---
pbi=> explain
pbi-> SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
pbi-> FROM notam_details
pbi-> WHERE
pbi-> item_a = 'EGKB' OR item_a = 'EGDM' OR item_a = 'EGHH' OR
pbi-> item_a = 'EGGD' OR item_a = 'EGVN' OR item_a = 'EGFF' OR
pbi-> item_a = 'EGDC' OR item_a = 'EGTC' OR item_a = 'EGDR' OR
pbi-> item_a = 'EGTE' OR item_a = 'EGLF' OR item_a = 'EGTG' OR
pbi-> item_a = 'EGBJ' OR item_a = 'EGLC' OR item_a = 'EGKK' OR
pbi-> item_a = 'EGLL' OR item_a = 'EGSS' OR item_a = 'EGGW' OR
pbi-> item_a = 'EGMD' OR item_a = 'EGDL' OR item_a = 'EGUM' OR
pbi-> item_a = 'EGHD' OR item_a = 'EGHE' OR item_a = 'EGKA' OR
pbi-> item_a = 'EGHI' OR item_a = 'EGMC' OR item_a = 'EGDG' OR
pbi-> item_a = 'EGFH' OR item_a = 'EGDY' OR item_a = 'EGJA' OR
pbi-> item_a = 'EGJB' OR item_a = 'EGJJ';
NOTICE: QUERY PLAN:

Seq Scan on notam_details (cost=0.00..13420.40 rows=26230 width=12)

EXPLAIN

--- QUERY 2, returns 11 rows ---
pbi=> explain
pbi-> SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
pbi-> FROM notam_details
pbi-> WHERE
pbi-> item_a = 'EGKB';
NOTICE: QUERY PLAN:

Index Scan using notam_details_item_a on notam_details (cost=0.00..2739.57
rows=927 width=12)

EXPLAIN

--- QUERY 3, returns 11 rows ---
pbi=> explain
pbi-> SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
pbi-> FROM notam_details
pbi-> WHERE
pbi-> item_a = 'EGKB' OR item_a = 'EGDM';
NOTICE: QUERY PLAN:

Seq Scan on notam_details (cost=0.00..4820.90 rows=1847 width=12)

EXPLAIN

---

I have run VACUUM ANALYZE NOTAM_DETAILS and created an index on the item_a
column.

I have only been using PostgreSQL for about a week, so all of this is
pretty new to me. I don't fully understand how it works yet, so any
insight you can provide will be appreciated.

Kate Collins

Tom Lane wrote:

> Kate Collins <klcollins(at)wsicorp(dot)com> writes:
> > I did some experimentation, and if the WHERE clause had one or two items
> > it would use the index; more and it would not.
>
> Kate, it is reasonable behavior for the planner to stop using
> indexscans when there are enough OR clauses. Each OR clause requires
> a separate indexscan and so eventually it'll be cheaper to just do one
> sequential scan over the whole table. What we appear to have here is
> a case of misestimation of the relative costs of index and sequential
> scans, leading the planner to switch too soon. Next question is why
> the misestimation. It's difficult to say anything without seeing
> your EXPLAIN results for different numbers of OR clauses. Also, how
> big is the table (how many rows) and how many rows do you actually
> get from the query?
>
> regards, tom lane

--
=================================================
Katherine (Kate) L. Collins
Senior Software Engineer/Meteorologist
Weather Services International (WSI Corporation)
4 Federal Street
Billerica, MA 01821
EMAIL: klcollins(at)wsicorp(dot)com
PHONE: (978) 670-5110
FAX: (978) 670-5100
http://www.intellicast.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kate Collins <klcollins(at)wsicorp(dot)com>
Cc: postgresql news group <pgsql-sql(at)postgresql(dot)org>
Subject: Re: SQL command speed
Date: 2000-05-19 16:06:16
Message-ID: 8221.958752376@sss.pgh.pa.us
Lists: pgsql-sql

Kate Collins <klcollins(at)wsicorp(dot)com> writes:
> The table I am using has 114600 total rows. The full query returns 1129
> rows. Right now the table is static, i.e. I am not modifying it while I am
> running these tests.

> Here are the results of the EXPLAIN with the different numbers of OR's.

> ---QUERY 1, returns 1129 rows---
> pbi=> [ 32-or variant ]
> Seq Scan on notam_details (cost=0.00..13420.40 rows=26230 width=12)

> --- QUERY 2, returns 11 rows ---
> pbi=> [ just one WHERE clause ]
> Index Scan using notam_details_item_a on notam_details (cost=0.00..2739.57
> rows=927 width=12)

> --- QUERY 3, returns 11 rows ---
> pbi=> [ 2 ORed clauses ]
> Seq Scan on notam_details (cost=0.00..4820.90 rows=1847 width=12)

OK, the problem here is the disconnect between the planner's estimate
of the number of returned rows (the "rows" number in EXPLAIN) and the
actual result row count. If the query actually did need to pull nearly
a thousand rows for each OR'd key, then using a seqscan for more than a
couple of ORs would make sense. But in reality you don't have nearly
that many rows per OR key, so an indexscan is needed.

That estimate is largely driven by a single statistic collected by
VACUUM ANALYZE, which is the frequency of the most common value in
the item_a column. If the MCV is pretty common then the planner assumes
that the column doesn't have very many distinct values, so you get a
high estimate of the number of returned rows. Evidently you have
a very common MCV for item_a, but the key values you are looking for
aren't nearly that common.
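One way to check this (a hypothetical diagnostic, using the table and
column names from this thread) is to look at the value distribution
directly:

```sql
-- Show the most common values of item_a and their frequencies.
-- The top row is the MCV whose frequency drives the planner's estimate.
SELECT item_a, count(*) AS freq
FROM notam_details
GROUP BY item_a
ORDER BY freq DESC
LIMIT 5;
```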

I have seen a number of cases where someone was using a dummy value
(like 'N/A', or an empty string, etc) to indicate unknown data, and
there were so many of these entries as to not merely be the MCV,
but drive the MCV's frequency statistic far above the frequency of
occurrence of any "real" value. This fools the planner into thinking
that *all* the values are like that, and so it generates plans
accordingly. The long-run solution is to gather more-detailed
statistics, and that's in the TODO list for a future version. In the
meantime, there is a workaround that you may be able to use: instead
of a dummy value, store NULL for unknown entries. The statistics
gatherer already accounts for NULLs separately, so a large fraction
of NULLs won't fool the planner about the frequency of non-null values.

Note: if you try this, be sure to re-run VACUUM ANALYZE after you
replace the dummies with NULLs. The plans won't change until there's
a more accurate statistical entry for your table...
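As a sketch of that workaround (assuming the dummy value is an empty
string; substitute whatever dummy value your data actually uses):

```sql
-- Replace dummy entries with NULL so the statistics gatherer
-- counts them separately from real values.
UPDATE notam_details SET item_a = NULL WHERE item_a = '';

-- Refresh the statistics so the planner sees the change.
VACUUM ANALYZE notam_details;
```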

regards, tom lane


From: Kate Collins <klcollins(at)wsicorp(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: postgresql news group <pgsql-sql(at)postgresql(dot)org>
Subject: Re: SQL command speed
Date: 2000-05-19 19:34:50
Message-ID: 3925975A.B8880CCD@wsicorp.com
Lists: pgsql-sql

Tom,

Thank you for the explanation.

You are correct in your assessment of the distribution of the elements
in the item_a column. Some values return no rows or only a few, while
other values return many (100 or more). It is not an even distribution.
I had never really considered what effect this would have on searching
before.

Kate

Tom Lane wrote:

> Kate Collins <klcollins(at)wsicorp(dot)com> writes:
> > The table I am using has 114600 total rows. The full query returns 1129
> > rows. Right now the table is static, i.e. I am not modifying it while I am
> > running these tests.
>
> > Here are the results of the EXPLAIN with the different numbers of OR's.
>
> > ---QUERY 1, returns 1129 rows---
> > pbi=> [ 32-or variant ]
> > Seq Scan on notam_details (cost=0.00..13420.40 rows=26230 width=12)
>
> > --- QUERY 2, returns 11 rows ---
> > pbi=> [ just one WHERE clause ]
> > Index Scan using notam_details_item_a on notam_details (cost=0.00..2739.57
> > rows=927 width=12)
>
> > --- QUERY 3, returns 11 rows ---
> > pbi=> [ 2 ORed clauses ]
> > Seq Scan on notam_details (cost=0.00..4820.90 rows=1847 width=12)
>
> OK, the problem here is the disconnect between the planner's estimate
> of the number of returned rows (the "rows" number in EXPLAIN) and the
> actual result row count. If the query actually did need to pull nearly
> a thousand rows for each OR'd key, then using a seqscan for more than a
> couple of ORs would make sense. But in reality you don't have nearly
> that many rows per OR key, so an indexscan is needed.
>
> That estimate is largely driven by a single statistic collected by
> VACUUM ANALYZE, which is the frequency of the most common value in
> the item_a column. If the MCV is pretty common then the planner assumes
> that the column doesn't have very many distinct values, so you get a
> high estimate of the number of returned rows. Evidently you have
> a very common MCV for item_a, but the key values you are looking for
> aren't nearly that common.
>
> I have seen a number of cases where someone was using a dummy value
> (like 'N/A', or an empty string, etc) to indicate unknown data, and
> there were so many of these entries as to not merely be the MCV,
> but drive the MCV's frequency statistic far above the frequency of
> occurrence of any "real" value. This fools the planner into thinking
> that *all* the values are like that, and so it generates plans
> accordingly. The long-run solution is to gather more-detailed
> statistics, and that's in the TODO list for a future version. In the
> meantime, there is a workaround that you may be able to use: instead
> of a dummy value, store NULL for unknown entries. The statistics
> gatherer already accounts for NULLs separately, so a large fraction
> of NULLs won't fool the planner about the frequency of non-null values.
>
> Note: if you try this, be sure to re-run VACUUM ANALYZE after you
> replace the dummies with NULLs. The plans won't change until there's
> a more accurate statistical entry for your table...
>
> regards, tom lane

--
=================================================
Katherine (Kate) L. Collins
Senior Software Engineer/Meteorologist
Weather Services International (WSI Corporation)
4 Federal Street
Billerica, MA 01821
EMAIL: klcollins(at)wsicorp(dot)com
PHONE: (978) 670-5110
FAX: (978) 670-5100
http://www.intellicast.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Kate Collins <klcollins(at)wsicorp(dot)com>
Cc: postgresql news group <pgsql-sql(at)postgresql(dot)org>
Subject: Re: SQL command speed
Date: 2000-05-19 21:39:01
Message-ID: 9901.958772341@sss.pgh.pa.us
Lists: pgsql-sql

Kate Collins <klcollins(at)wsicorp(dot)com> writes:
> You are correct in your assessment of the nature of the distribution
> of the elements in the item_a column. Some values return none or a
> few rows. While other values return many rows (100 or more). It is
> not an even distribution. I had never really considered what effect
> this would have on searching before.

Apparently you have at least one value with close to 2000 rows (IIRC,
the rows estimate is basically half the MCV's row count, except in
the special case where the planner can see that you are comparing
against the MCV itself).

If that's a real data value and not a dummy, then of course you can't
replace it by NULL. In that case I'm kind of stuck for a good answer
for 7.0; we can't do much to distinguish the few-distinct-values case
from the many-distinct-values-with-some-frequency-outliers case until
we have better statistics than 7.0 keeps. What you could do as a
stopgap is to force the system to use an indexscan against its
judgment, by doing this before issuing the problem query:

SET enable_seqscan = OFF;

Naturally I recommend keeping this ON most of the time, but sometimes
you just have to get out the blunt instruments ;-)
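For example (a sketch using an abbreviated version of the query from
earlier in the thread):

```sql
SET enable_seqscan = OFF;   -- steer the planner away from sequential scans

SELECT notam_id, TO_CHAR(full_issue_date, 'DD-MON-YYYY HH24:MI')
FROM notam_details
WHERE item_a IN ('EGKB', 'EGDM', 'EGHH');  -- abbreviated station list

SET enable_seqscan = ON;    -- restore the default afterwards
```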

regards, tom lane