Correlation in cost_index()

From: Manfred Koizar <mkoi-pg(at)aon(dot)at>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Correlation in cost_index()
Date: 2002-10-02 19:52:46
Message-ID: n6cmpug13b9rk1srebjvhphg0lm8dou1kn@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

You all know this FAQ: "Why does Postgres not use my index?" Half of
the time this problem can easily be solved by casting a literal to the
type of the respective column; this is not my topic here.

In many other cases it turns out that the planner over-estimates the
cost of an index scan. Sometimes this can be worked around by
lowering random_page_cost. "Of course, that's a hack that is quite
unrelated to the real problem." I strongly agree ;-)

AFAICS (part of) the real problem is in costsize.c:cost_index() where
IO_cost is calculated from min_IO_cost, pages_fetched,
random_page_cost, and indexCorrelation. The current implementation
uses indexCorrelation^2 to interpolate between min_IO_cost and
max_IO_cost, which IMHO gives results that are too close to
max_IO_cost. This conjecture is supported by the fact, that often
actual run times are much lower than estimated, when seqscans are
disabled.

So we have to find a cost function, so that

. min_IO_cost <= cost <= max_IO_cost
for -1 <= indexCorrelation <= 1
. cost --> min_IO_cost for indexCorrelation --> +/- 1
. cost --> max_IO_cost for indexCorrelation --> 0
. cost tends more towards min_IO_cost than current implementation

After playing around a bit I propose three functions satisfying above
conditions. All proposals use absC = abs(indexCorrelation).

Proposal 1: Use absC for interpolation.

IO_cost = absC * min_IO_cost + (1 - absC) * max_IO_cost;

Proposal 2: First calculate estimates for numbers of pages and cost
per page, then multiply the results.

estPages = absC * minPages + (1 - absC) * maxPages;
estPCost = absC * 1 + (1 - absC) * random_page_cost;
/* ^
sequential_page_cost */
IO_cost = estPages * estPCost;

Proposal 3: Interpolate "geometrically", using absC.

IO-cost = exp( absC * ln(min_IO_Cost) +
(1 - absC) * ln(max_IO_cost));

Here are some numbers for
seq_page_cost = 1 (constant)
random_page_cost = 4 (GUC)
minPages = 61
maxPages = 1440

corr current p1 p2 p3
0 5760.00 5760.00 5760.00 5760.00
0.1 5703.01 5190.10 4817.77 3655.22
0.2 5532.04 4620.20 3958.28 2319.55
0.3 5247.09 4050.30 3181.53 1471.96
0.4 4848.16 3480.40 2487.52 934.08
0.5 4335.25 2910.50 1876.25 592.76
0.6 3708.36 2340.60 1347.72 376.16
0.7 2967.49 1770.70 901.93 238.70
0.8 2112.64 1200.80 538.88 151.48
0.9 1143.81 630.90 258.57 96.13
0.95 616.65 345.95 149.44 76.57
0.99 174.41 117.99 77.03 63.84
0.995 117.85 89.50 68.91 62.40
0.999 72.39 66.70 62.57 61.28
1 61.00 61.00 61.00 61.00

Another example for
seq_page_cost = 1 (constant)
random_page_cost = 10 (GUC)
minPages = 20
maxPages = 938.58

corr current p1 p2 p3
0 9385.79 9385.79 9385.79 9385.79
0.1 9292.14 8449.21 7705.17 5073.72
0.2 9011.16 7512.64 6189.88 2742.73
0.3 8542.87 6576.06 4839.94 1482.65
0.4 7887.27 5639.48 3655.34 801.48
0.5 7044.35 4702.90 2636.09 433.26
0.6 6014.11 3766.32 1782.19 234.21
0.7 4796.56 2829.74 1093.62 126.61
0.8 3391.69 1893.16 570.40 68.44
0.9 1799.50 956.58 212.53 37.00
0.95 933.16 488.29 95.60 27.20
0.99 206.38 113.66 31.81 21.27
0.995 113.42 66.83 25.70 20.62
0.999 38.72 29.37 21.11 20.12
1 20.00 20.00 20.00 20.00

(If you want to play around with your own numbers, I can send my OOo
spreadsheet privately or to the list.)

The second example shows that especially with proposal 3 we could
afford to set random_page_cost to a *higher* value, which in contrast
to previous recommendations seems to be appropriate, IIRC that
benchmark results posted here showed values of up to 60.

As nobody knows how each of these proposals performs in real life
under different conditions, I suggest to leave the current
implementation in, add all three algorithms, and supply a GUC variable
to select a cost function.

Comments? Ideas? Objections?

Servus
Manfred

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message scott.marlowe 2002-10-02 20:07:19 Re: Correlation in cost_index()
Previous Message Tom Lane 2002-10-02 17:19:11 Re: (Fwd) Re: Any Oracle 9 users? A test please...