Proposal for new SET variables for optimizer costs

Lists: pgsql-hackers
From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Proposal for new SET variables for optimizer costs
Date: 2000-02-05 21:29:26
Message-ID: 14601.949786166@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I am about to implement some changes to the planner/optimizer's cost
model, following up to the thread on pghackers beginning on 20 Jan.
The main conclusion of that thread was that we needed to charge more for
a page fetched nonsequentially than for a page fetched sequentially.
After further investigation I have concluded that it is also appropriate
to include explicit modeling of the cost of evaluation of WHERE clauses.
For example, using the regression database and a query like

select * from tenk1 where
(unique1 = 1 and unique2 = 101) or
(unique1 = 2 and unique2 = 102) or
(unique1 = 3 and unique2 = 103) or
... 100 OR clauses ...
(unique1 = 100 and unique2 = 200);

(which is not too implausible for certain automatic query generators),
I observe that a sequential scan takes about 6 seconds, vs. less than
a second for a similar query with only 10 clauses. That says that the
cost of evaluating a WHERE clause this large is far from negligible.
The optimizer needs to account for this because different query plans
can have a considerable impact on the number of tuples that the WHERE
clause is evaluated for --- in this example, if we use indexscans to
pull out just the tuples with the right values of 'unique1', then the
WHERE clause need only be checked at 100 tuples, not all 10000.

I believe it would be reasonable to charge a certain amount per operator
or function appearing in the WHERE clause in order to account for this
effect. (Currently I see no need to model the cost of evaluating the
targetlist expressions. The same expressions should get evaluated for
the same tuples no matter what query plan the optimizer picks, so we
might as well just leave that cost out of our comparisons.)

Also, as was previously mentioned on pghackers, I would like to add SET
variables to control enabling/disabling of particular query plan types,
so that different plans can be checked with less hassle than restarting
psql with a new PGOPTIONS setting.

This all leads to the following proposal for redoing the optimizer
plan cost SET variables. The variables proposed below would replace
COST_HEAP and COST_INDEX, which are poorly named IMHO and are definitely
very misleadingly documented at present.

(Note that all costs will still be referenced to the cost of a disk page
fetch. We will take 1.0 as the cost of a sequential page fetch.)

SET variable name Internal variable Proposed default

RANDOM_PAGE_COST random_page_cost 4.0

Cost of fetching a disk block nonsequentially (as a multiple of the cost
of a sequential block fetch).

CPU_TUPLE_COST cpu_tuple_cost 0.01

Cost of CPU time per tuple processed within a query (as a fraction of
the cost of a sequential disk block fetch). This renames the existing
SET variable COST_HEAP (cpu_page_weight); but the default value is
smaller than it used to be, since WHERE clause evaluation will now be
accounted for separately.

CPU_INDEX_TUPLE_COST cpu_index_tuple_cost 0.001

Cost of CPU time per index tuple processed within a query (as a fraction
of the cost of a sequential disk block fetch). This renames the
existing SET variable COST_INDEX (cpu_index_page_weight); but the
default value is much smaller than it used to be, since the operator
evaluation cost will account for the bulk of the cost of visiting an
index tuple.

CPU_OPERATOR_COST cpu_operator_cost 0.0025

Cost of CPU time per operator or function evaluated in a WHERE clause.
Note that this would apply to operators evaluated at index tuples as
well as those evaluated against heap tuples.
(The proposed default corresponds to a ratio of 5 microsec against 2
millisec for a sequential block fetch, which seems to be about right
on my workstation.)

ENABLE_SEQSCAN enable_seqscan ON

ENABLE_INDEXSCAN enable_indexscan ON

ENABLE_TIDSCAN enable_tidscan ON

ENABLE_SORT enable_sort ON

ENABLE_NESTLOOP enable_nestloop ON

ENABLE_MERGEJOIN enable_mergejoin ON

ENABLE_HASHJOIN enable_hashjoin ON

Provide access via SET to the already-existing internal optimizer
control flags.

Currently, it is possible to have COST_HEAP and COST_INDEX set
automatically during connection startup; libpq will do that if
the environment variables PGCOSTHEAP and/or PGCOSTINDEX are defined
on the client side. If we want to continue that behavior, the
environment variables for these variables would be named
PGRANDOMPAGECOST etc (remove underscores and prepend PG).
I'm not sure if we want to continue inventing client-side environment
variables, however.

Comments? Ideas for better names? Anyone object to renaming the
existing variables? (BTW, although it could be argued that this
might break existing scripts that set COST_HEAP or COST_INDEX,
I doubt that there are any ... and given the existing doco,
I doubt even more that anyone is setting appropriate values ...)

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Proposal for new SET variables for optimizer costs
Date: 2000-02-05 22:01:59
Message-ID: 200002052201.RAA07126@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Looks great. I wouldn't change a thing in your proposal.

> I am about to implement some changes to the planner/optimizer's cost
> model, following up to the thread on pghackers beginning on 20 Jan.
> The main conclusion of that thread was that we needed to charge more for
> a page fetched nonsequentially than for a page fetched sequentially.
> After further investigation I have concluded that it is also appropriate
> to include explicit modeling of the cost of evaluation of WHERE clauses.
> For example, using the regression database and a query like
>

--
Bruce Momjian | http://www.op.net/~candle
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026


From: Philip Warner <pjw(at)rhyme(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Proposal for new SET variables for optimizer costs
Date: 2000-02-05 23:27:28
Message-ID: 3.0.5.32.20000206102728.0346bc00@mail.rhyme.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At 16:29 5/02/00 -0500, Tom Lane wrote:
>
>SET variable name Internal variable Proposed default
>
>RANDOM_PAGE_COST random_page_cost 4.0
>
>Cost of fetching a disk block nonsequentially (as a multiple of the cost
>of a sequential block fetch).
>
>CPU_TUPLE_COST cpu_tuple_cost 0.01
>
>
>CPU_INDEX_TUPLE_COST cpu_index_tuple_cost 0.001
>
>CPU_OPERATOR_COST cpu_operator_cost 0.0025
>
>ENABLE_SEQSCAN enable_seqscan ON
>
>ENABLE_INDEXSCAN enable_indexscan ON
>
>ENABLE_TIDSCAN enable_tidscan ON
>
>ENABLE_SORT enable_sort ON
>
>ENABLE_NESTLOOP enable_nestloop ON
>
>ENABLE_MERGEJOIN enable_mergejoin ON
>
>ENABLE_HASHJOIN enable_hashjoin ON
>

Any chance of prefixing the 'set' variable names with 'PG_' or 'PG_OPT_' or
something similar? Or doing something else to differentiate them from
user-declared SQL variables? I have no idea if user-declared SQL variables
are an SQL92 thing, but these variables are 'system' things, and some kind
of differentiation seems like a good idea.

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: +61-03-5367 7422 | _________ \
Fax: +61-03-5367 7430 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Philip Warner <pjw(at)rhyme(dot)com(dot)au>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Proposal for new SET variables for optimizer costs
Date: 2000-02-05 23:31:23
Message-ID: 16041.949793483@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Philip Warner <pjw(at)rhyme(dot)com(dot)au> writes:
> Any chance of prefixing the 'set' variable names with 'PG_' or 'PG_OPT_' or
> something similar? Or doing something else to differentiate them from
> user-declared SQL variables?

I see no need to do that, since the *only* place these names exist is
in the SET command (and its friends SHOW and RESET), and SET exists only
to set system control variables. There are no user-declared SQL
variables.

The names are quite long and underscore-filled enough without adding
unnecessary prefixes, IMHO ;-)

regards, tom lane


From: Philip Warner <pjw(at)rhyme(dot)com(dot)au>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Proposal for new SET variables for optimizer costs
Date: 2000-02-06 00:33:51
Message-ID: 3.0.5.32.20000206113351.009c8670@mail.rhyme.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At 18:31 5/02/00 -0500, Tom Lane wrote:
>Philip Warner <pjw(at)rhyme(dot)com(dot)au> writes:
>> Any chance of prefixing the 'set' variable names with 'PG_' or 'PG_OPT_' or
>> something similar? Or doing something else to differentiate them from
>> user-declared SQL variables?
>
>I see no need to do that, since the *only* place these names exist is
>in the SET command (and its friends SHOW and RESET), and SET exists only
>to set system control variables. There are no user-declared SQL
>variables.
>
>The names are quite long and underscore-filled enough without adding
>unnecessary prefixes, IMHO ;-)

I agree, given their complexity, they are unlikely to conflict with future
SQL names, but the SET statment *is* part of the SQL standard, and I
thought it would be good to be cautious in the names you choose. This would
avoid any possible future conflict, as well as make it clear from the
outset that they are *not* standard SQL names.

Another option would be to add another command, eg. 'PG', which is used for
all non-SQLxx commands:

PG SET somename = somevalue
PG VACUUM

...etc. But this has the disctinct disadvantage of being more work, and
being cumbersome in comparison to changing names. The transition could be
managed by supporting old commands until version 8.0, with an appropriate
notice.

Just my 0.02c worth.

----------------------------------------------------------------
Philip Warner | __---_____
Albatross Consulting Pty. Ltd. |----/ - \
(A.C.N. 008 659 498) | /(@) ______---_
Tel: +61-03-5367 7422 | _________ \
Fax: +61-03-5367 7430 | ___________ |
Http://www.rhyme.com.au | / \|
| --________--
PGP key available upon request, | /
and from pgp5.ai.mit.edu:11371 |/


From: Sevo Stille <sevo(at)ip23(dot)net>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [HACKERS] Proposal for new SET variables for optimizercosts
Date: 2000-02-06 14:35:16
Message-ID: 389D86A4.962AF9A@ip23.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Philip Warner wrote:

> Another option would be to add another command, eg. 'PG', which is used for
> all non-SQLxx commands:
>
> PG SET somename = somevalue
> PG VACUUM
>
> ...etc. But this has the disctinct disadvantage of being more work, and
> being cumbersome in comparison to changing names.

This does not work out in terms of general SQL compatibility. Even if we
treat commands after PG specially, no other SQL database would, and it
would raise at least as many errors as the extension syntax. Nor is
there any significant advantage of it within Postgres if we ever get a
keyword clash with a future SQL revision - I'd rather not have a syntax
that alows for two interpretations for the same keyword depending on
whether it follows PG or not.

Sevo

--
Sevo Stille
sevo(at)ip23(dot)net