Re: using custom scan nodes to prototype parallel sequential scan

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: using custom scan nodes to prototype parallel sequential scan
Date: 2014-11-10 23:21:05
Message-ID: 20141110232105.GO28007@alap3.anarazel.de
Lists: pgsql-hackers

Hi Robert, All,

On 2014-11-10 10:57:16 -0500, Robert Haas wrote:
> On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > Something usable, with severe restrictions, is actually better than we
> > have now. I understand the journey this work represents, so don't be
> > embarrassed by submitting things with heuristics and good-enoughs in
> it. Our mentor, Mr. Lane, achieved much by spreading work over many
> > releases, leaving others to join in the task.
>
> It occurs to me that, now that the custom-scan stuff is committed, it
> wouldn't be that hard to use that, plus the other infrastructure we
> already have, to write a prototype of parallel sequential scan. Given
> where we are with the infrastructure, there would be a number of
> unhandled problems, such as deadlock detection (needs group locking or
> similar), assessment of quals as to parallel-safety (needs
> proisparallel or similar), general waterproofing to make sure that
> pushing down a qual we shouldn't doesn't do anything really dastardly
> like crash the server (another written but yet-to-be-published patch
> adds a bunch of relevant guards), and snapshot sharing (likewise).
> But if you don't do anything weird, it should basically work.
>
> I think this would be useful for a couple of reasons. First, it would
> be a demonstrable show of progress, illustrating how close we are to
> actually having something you can really deploy. Second, we could use
> it to demonstrate how the remaining infrastructure patches close up
> gaps in the initial prototype. Third, it would let us start doing
> real performance testing.

I think it might be a useful experiment - as long as it's clear that
that's all it is. Which, I think, is what you have in mind?
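
For concreteness, here's roughly where such a prototype would hang off
the planner - purely a hypothetical skeleton, not anything from your
patch. The CustomPath construction itself is only sketched as a comment,
since the just-committed API details are still settling:

/*
 * Hypothetical skeleton of an extension hooking set_rel_pathlist_hook
 * to offer a "parallel seqscan" custom path for plain base relations.
 */
#include "postgres.h"
#include "fmgr.h"
#include "nodes/parsenodes.h"
#include "nodes/relation.h"
#include "optimizer/paths.h"

PG_MODULE_MAGIC;

void _PG_init(void);

static set_rel_pathlist_hook_type prev_set_rel_pathlist_hook = NULL;

static void
parallel_seqscan_pathlist(PlannerInfo *root, RelOptInfo *rel,
                          Index rti, RangeTblEntry *rte)
{
    /* Chain to any previously installed hook. */
    if (prev_set_rel_pathlist_hook)
        prev_set_rel_pathlist_hook(root, rel, rti, rte);

    /* Only plain base relations are candidates for a seqscan prototype. */
    if (rte->rtekind != RTE_RELATION || rel->reloptkind != RELOPT_BASEREL)
        return;

    /*
     * Here the prototype would build a CustomPath costed for N workers
     * sharing the relation's blocks and hand it to add_path(), letting
     * the planner compare it against the ordinary SeqScan path.
     */
    elog(DEBUG1, "parallel seqscan prototype would consider rel %u",
         rte->relid);
}

void
_PG_init(void)
{
    prev_set_rel_pathlist_hook = set_rel_pathlist_hook;
    set_rel_pathlist_hook = parallel_seqscan_pathlist;
}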

> It seems pretty clear that a parallel sequential scan of data that's
> in memory (whether the page cache or the OS cache) can be accelerated
> by having multiple processes scan it in parallel.

Right.
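
As a toy illustration of that, completely outside of postgres: a few
threads each scanning their own slice of an in-memory array scale
nicely until memory bandwidth or the qual's CPU cost saturates. The
worker count, array size and fake qual below are obviously made up:

/*
 * Toy parallel "scan": each thread counts matching rows in its own
 * contiguous slice of an in-memory array.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NWORKERS 4
#define NROWS    (64 * 1024 * 1024)

static int32_t *table;

struct slice { size_t start, end; long matches; };

static void *
scan_worker(void *arg)
{
    struct slice *s = arg;
    long matches = 0;

    for (size_t i = s->start; i < s->end; i++)
        if (table[i] % 97 == 0)          /* stand-in for a cheap qual */
            matches++;
    s->matches = matches;
    return NULL;
}

int
main(void)
{
    pthread_t workers[NWORKERS];
    struct slice slices[NWORKERS];
    long total = 0;

    table = malloc(sizeof(int32_t) * NROWS);
    for (size_t i = 0; i < NROWS; i++)
        table[i] = (int32_t) i;

    /* Static range partitioning: worker k scans [k*chunk, (k+1)*chunk). */
    size_t chunk = NROWS / NWORKERS;
    for (int k = 0; k < NWORKERS; k++)
    {
        slices[k].start = k * chunk;
        slices[k].end = (k == NWORKERS - 1) ? NROWS : (k + 1) * chunk;
        pthread_create(&workers[k], NULL, scan_worker, &slices[k]);
    }
    for (int k = 0; k < NWORKERS; k++)
    {
        pthread_join(workers[k], NULL);
        total += slices[k].matches;
    }
    printf("matches: %ld\n", total);
    free(table);
    return 0;
}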

> But it's much less clear what will happen when the data is being read
> in from disk.

I think that *very* heavily depends on the IO subsystem.

> Does parallelism help at all?

I'm pretty damn sure. We can't even make mildly powerful storage fully
busy right now. Heck, I can't even make my workstation's storage, a
RAID 10 of four spinning disks, fully busy.

I think some of that benefit could also be reaped by being better at
hinting the OS...
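
Something like posix_fadvise() driven prefetching, i.e. declaring the
access pattern once and asking for blocks a window ahead of the scan
position. A rough standalone sketch, with the file name and window size
made up:

/*
 * Hint the kernel about a sequential scan and prefetch a window of
 * blocks ahead of the read position, to keep the I/O queue fuller.
 */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ           8192
#define PREFETCH_WINDOW  32      /* blocks to request ahead of the scan */

int
main(void)
{
    char buf[BLCKSZ];
    int fd = open("/tmp/some_relation_segment", O_RDONLY);

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* Declare the overall access pattern once... */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    for (off_t blkno = 0; ; blkno++)
    {
        /* ...and keep asking for blocks we'll need shortly. */
        posix_fadvise(fd, (blkno + PREFETCH_WINDOW) * BLCKSZ, BLCKSZ,
                      POSIX_FADV_WILLNEED);

        ssize_t n = pread(fd, buf, BLCKSZ, blkno * BLCKSZ);
        if (n <= 0)
            break;
        /* ... process the block ... */
    }
    close(fd);
    return 0;
}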

> What degree of parallelism helps?

That's quite a hard question. Generally, figuring out how much
parallelism is beneficial for which workloads will be one of the most
complicated areas once the plumbing is in.

> Do we break OS readahead so badly that performance actually regresses?

I don't think it's likely that we break OS readahead - it works on a
per-task basis, at least on Linux, afaik. But it's nonetheless very easy
to end up with too many streams, causing too many random reads.
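
One way around that would be to hand out reasonably large contiguous
chunks of blocks from a shared counter, instead of interleaving
individual blocks between workers, so each stream stays mostly
sequential. A sketch of the idea (nothing that exists today); the chunk
size trades load balance against how sequential each worker's reads
stay:

/*
 * Workers claim contiguous chunks of blocks from a shared counter, so
 * each one issues a long sequential run of reads rather than
 * interleaving single blocks with the other workers.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define CHUNK_BLOCKS 512            /* 512 * 8kB = 4MB per claim */

static atomic_uint_fast64_t next_block;   /* shared among workers */
static uint64_t total_blocks;             /* relation size in blocks */

/* Returns the first block of the claimed chunk, or UINT64_MAX when done. */
static uint64_t
claim_chunk(uint64_t *nblocks)
{
    uint64_t start = atomic_fetch_add(&next_block, CHUNK_BLOCKS);

    if (start >= total_blocks)
        return UINT64_MAX;
    *nblocks = (start + CHUNK_BLOCKS <= total_blocks)
        ? CHUNK_BLOCKS : total_blocks - start;
    return start;
}

int
main(void)
{
    uint64_t start, nblocks;

    total_blocks = 1000;                  /* pretend-size relation */
    atomic_init(&next_block, 0);

    /* A single "worker" draining the relation chunk by chunk. */
    while ((start = claim_chunk(&nblocks)) != UINT64_MAX)
        printf("scan blocks %lu..%lu\n",
               (unsigned long) start,
               (unsigned long) (start + nblocks - 1));
    return 0;
}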

> These are things that are likely to
> need a fair amount of tuning before this is ready for prime time, so
> being able to start experimenting with them in advance of all of the
> infrastructure being completely ready seems like it might help.

I'm not actually entirely sure how much that's going to help. I think
you could fairly quickly have a WIP patch that people apply directly,
without it solving the issues you mention above. For the kind of testing
we're talking about that seems likely to be sufficient - a git branch
somewhere is probably easier for people to compile than some contrib
module that needs to be loaded...
And I *do* think that you'll very quickly hit the limits of the custom
scan API. I'd much rather see you work on improving the parallelism
infrastructure than on the custom scan stuff just to be able to
prototype further ahead.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
