Re: Yet another abort-early plan disaster on 9.3

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-11-21 18:38:27
Message-ID: CAMkU=1yoiQt6+_XMACL=PTv-aHwc1E0BgiHGhKW3HL+ByN6zpw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

On Fri, Oct 10, 2014 at 10:53 AM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:

> On 10.10.2014 14:10, Tomas Vondra wrote:
> > Dne 10 Říjen 2014, 13:16, Greg Stark napsal(a):
> >> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> >>> Yes, it's only intractable if you're wedded to the idea of a tiny,
> >>> fixed-size sample. If we're allowed to sample, say, 1% of the table,
> we
> >>> can get a MUCH more accurate n_distinct estimate using multiple
> >>> algorithms, of which HLL is one. While n_distinct will still have some
> >>> variance, it'll be over a much smaller range.
> >>
> >> I've gone looking for papers on this topic but from what I read this
> >> isn't so. To get any noticeable improvement you need to read 10-50% of
> >> the table and that's effectively the same as reading the entire table
> >> -- and it still had pretty poor results. All the research I could find
> >> went into how to analyze the whole table while using a reasonable
> >> amount of scratch space and how to do it incrementally.
> >
> > I think it's really difficult to discuss the estimation without some
> basic
> > agreement on what are the goals. Naturally, we can't get a perfect
> > estimator with small samples (especially when the sample size is fixed
> and
> > not scaling with the table). But maybe we can improve the estimates
> > without scanning most of the table?
> >
> > FWIW I've been playing with the adaptive estimator described in [1] and
> > the results looks really interesting, IMHO. So far I was testing it on
> > synthetic datasets outside the database, but I plan to use it instead of
> > our estimator, and do some more tests.
>
> Attached is an experimental patch implementing the adaptive estimator.
>
> It was fairly simple (although it's a bit messy). It only computes the
> estimates for the "scalar" case (i.e. data types that we can sort).
> Implementing this for the "minimal" case is possible, but requires a bit
> more work.
>
> It only computes the estimate and prints a WARNING with both the current
> and new estimate, but the old estimate is stored.
>

When I run this patch on the regression database, I get a case where the
current method is exact but the adaptive one is off:

WARNING: ndistinct estimate current=676.00 adaptive=906.00

select count(distinct stringu1) from onek;
676

It should be seeing every single row, so I don't know why the adaptive
method is off. Seems like a bug.

Cheers,

Jeff

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Josh Berkus 2014-11-21 18:44:50 Re: pg_multixact not getting truncated
Previous Message Tom Lane 2014-11-21 18:16:25 Re: Automatic PG_PRINTF_ATTRIBUTE

Browse pgsql-performance by date

  From Date Subject
Next Message Tomas Vondra 2014-11-23 20:19:56 Re: Yet another abort-early plan disaster on 9.3
Previous Message Tom Lane 2014-11-21 05:12:05 Re: A pessimistic planner