Re: Yet another abort-early plan disaster on 9.3

From: Peter Geoghegan <peter(dot)geoghegan86(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Merlin Moncure <mmoncure(at)gmail(dot)com>, postgres performance list <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Yet another abort-early plan disaster on 9.3
Date: 2014-10-03 00:54:06
Message-ID: CAEYLb_UYNyG0g+2Q52GLFc9tTz+XTqEnPh5eETLcV9Z+z7sTGg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers pgsql-performance

On Thu, Oct 2, 2014 at 12:56 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> Yes, it's only intractable if you're wedded to the idea of a tiny,
> fixed-size sample. If we're allowed to sample, say, 1% of the table, we
> can get a MUCH more accurate n_distinct estimate using multiple
> algorithms, of which HLL is one. While n_distinct will still have some
> variance, it'll be over a much smaller range.

I think that HyperLogLog, as a streaming algorithm, will always
require that the entire set be streamed. This doesn't need to be a big
deal - in the case of my abbreviated key patch, it appears to
basically be free because of the fact that we were streaming
everything anyway. It's a very cool algorithm, with fixed overhead and
constant memory usage. It makes very useful guarantees around
accuracy.

I have this intuition that even though I'm more or less not paying
anything for a great cardinality estimate, it's kind of a shame that I
still throw it away after the sort, each and every time. I have a hard
time actually putting my finger on how it could be put to further use,
though. And besides, this only helps if you happen to need to do a
sort (or something that requires a sequential scan, since the cost
certainly isn't anywhere near "free" when you didn't need to do that
anyway).

Our current lack of block-based sampling probably implies that we are
almost as badly off as if we *did* a sequential scan. Not that I'm
suggesting that we give up on the idea of sampling (which would be
crazy).

Streaming algorithms like HyperLogLog are very recent ideas, as these
things go. I wouldn't be all that discouraged by the fact that it
might not have been put to use in this way (for database statistics)
by somebody before now.
--
Regards,
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Marti Raudsepp 2014-10-03 00:55:46 Re: CREATE IF NOT EXISTS INDEX
Previous Message Marti Raudsepp 2014-10-03 00:38:27 Re: Patch to add support of "IF NOT EXISTS" to others "CREATE" statements

Browse pgsql-performance by date

  From Date Subject
Next Message Marc Slemko 2014-10-03 02:17:48 performance of SELECT * much faster than SELECT <colname> with large offset
Previous Message George Neuner 2014-10-02 23:00:58 help: function failing