Re: Re. [HACKERS] Some notes on optimizer cost estimates

From: Don Baccus <dhogaza(at)pacifier(dot)com>
To: xun(at)cs(dot)ucsb(dot)edu, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: Re. [HACKERS] Some notes on optimizer cost estimates
Date: 2000-01-21 16:10:44
Message-ID: 3.0.1.32.20000121081044.01036290@mail.pacifier.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At 06:19 PM 1/20/00 -0800, Xun Cheng wrote:
>I'm very glad you bring up this cost estimate issue.
>Recent work in database research have argued a more
>detailed disk access cost model should be used for
>large queries especially joins.
>Traditional cost estimate only considers the number of
>disk pages accessed. However a more detailed model
>would consider three parameters: avg. seek, avg. latency
>and avg. page transfer. For old disk, typical values are
>SEEK=9.5 milliseconds, LATENCY=8.3 ms, TRANSFER=2.6ms.
>A sequential continuous reading of a table (assuming
>1000 continuous pages) would cost
>(SEEK+LATENCY+1000*TRANFER=2617.8ms); while quasi-randomly
>reading 200 times with 2 continuous pages/time would
>cost (SEEK+200*LATENCY+400*TRANSFER=2700ms).
>Someone from IBM lab re-studied the traditional
>ad hoc join algorithms (nested, sort-merge, hash) using the detailed cost
model
>and found some interesting results.

One complication when doing an index scan is that you are
accessing two separate files (table and index), which can frequently
be expected to cause an considerable increase in average seek time.

Oracle and other commercial databases recommend spreading indices and
tables over several spindles if at all possible in order to minimize
this effect.

I suspect it also helps their optimizer make decisions that are
more consistently good for customers with the largest and most
complex databases and queries, by making cost estimates more predictably
reasonable.

Still...this doesn't help with the question about the effect of the
filesystem system cache. I wandered around the web for a little bit
last night, and found one summary of a paper by Osterhout on the
effect of the Solaris cache on a fileserver serving diskless workstations.
There was reference to the hierarchy involved (i.e. the local workstation
cache is faster than the fileserver's cache which has to be read via
the network which in turn is faster than reading from the fileserver's
disk). It appears the rule-of-thumb for the cache-hit ratio on reads,
presumably based on measuring some internal Sun systems, used in their
calculations was 80%.

Just a datapoint to think about.

There's also considerable operating system theory on paging systems
that might be useful for thinking about trying to estimate the
Postgres cache/hit ratio. Then again, maybe Postgres could just
keep count of how many pages of a given table are in the cache at
any given time? Or simply keep track of the current ratio of hits
and misses?

>>I have been spending some time measuring actual runtimes for various
>>sequential-scan and index-scan query plans, and have learned that the
>>current Postgres optimizer's cost estimation equations are not very
>>close to reality at all.

>One interesting question I'd like to ask is if this non-closeness
>really affects the optimal choice of postgresql's query optimizer.
>And to what degree the effects might be? My point is that
>if the optimizer estimated the cost for sequential-scan is 10 and
>the cost for index-scan is 20 while the actual costs are 10 vs. 40,
>it should be ok because the optimizer would still choose sequential-scan
>as it should.

This is crucial, of course - if there are only two types of scans
available, what ever heuristic is used only has to be accurate enough
to pick the right one. Once the choice is made, it doesn't really
matter (from the optimizer's POV) just how long it will actually take,
the time will be spent and presumably it will be shorter than the
alternative.

How frequently will the optimizer choose wrongly if:

1. All of the tables and indices were in PG buffer cache or filesystem
cache? (i.e. fixed access times for both types of scans)

or

2. The table's so big that only a small fraction can reside in RAM
during the scan and join, which means that the non-sequential
disk access pattern of the indexed scan is much more expensive.

Also, if you pick sequential scans more frequently based on a presumption
that index scans are expensive due to increased average seek time, how
often will this penalize the heavy-duty user that invests in extra
drives and lots of RAM?

...

>>The current cost estimation
>>method essentially assumes that the buffer cache plus OS disk cache will
>>be 100% efficient --- we will never have to read the same page of the
>>main table twice in a scan, due to having discarded it between
>>references. This of course is unreasonably optimistic. Worst case
>>is that we'd fetch a main-table page for each selected tuple, but in
>>most cases that'd be unreasonably pessimistic.
>
>This is actually the motivation that I asked before if postgresql
>has a raw disk facility. That way we have much control on this cache
>issue. Of course only if we can provide some algo. better than OS
>cache algo. (depending on the context, like large joins), a raw disk
>facility will be worthwhile (besides the recoverability).

Postgres does have control over its buffer cache. The one thing that
raw disk I/O would give you is control over where blocks are placed,
meaning you could more accurately model the cost of retrieving them.
So presumably the cache could be tuned to the allocation algorithm
used to place various structures on the disk.

I still wonder just how much gain you get by this approach. Compared,
to, say simply spending $2,000 on a gigabyte of RAM. Heck, PCs even
support a couple gigs of RAM now.

>Actually I have another question for you guys which is somehow related
>to this cost estimation issue. You know the difference between OLTP
>and OLAP. My question is how you target postgresql on both kinds
>of applications or just OLTP. From what I know OLTP and OLAP would
>have a big difference in query characteristics and thus
>optimization difference. If postgresql is only targeted on
>OLTP, the above cost estimation issue might not be that
>important. However for OLAP, large tables and large queries are
>common and optimization would be difficult.

- Don Baccus, Portland OR <dhogaza(at)pacifier(dot)com>
Nature photos, on-line guides, Pacific Northwest
Rare Bird Alert Service and other goodies at
http://donb.photo.net.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Thomas Lockhart 2000-01-21 16:19:36 Re: memory leak????
Previous Message Tom Lane 2000-01-21 15:44:02 Re: [HACKERS] pg_dump disaster