From: Noah Misch <noah(at)leadboat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Parallel Sort
Date: 2013-05-13 14:28:59
Message-ID: 20130513142859.GC171500@tornado.leadboat.com
Lists: pgsql-hackers

It would be great if one client session could take advantage of multiple CPU
cores. EnterpriseDB wishes to start the trek into this problem space for 9.4
by implementing parallel internal (i.e. not spilling to disk) sort. This
touches on a notable subset of the infrastructure components we'll need for
parallel general query. My intent is to map out the key design topics, hear
about critical topics I hadn't considered, and solicit feedback on the quality
of the high-level plan. Full designs for key pieces will come later.

* Worker Backends

A multi-process model, rather than a multi-threaded one, is best for introducing
parallelism to a PostgreSQL session. With threading, all in-memory state
would become shared by default; with processes, in-memory data structures
remain unshared until we explicitly select them for sharing. Some data
structures will require new sharing, but I expect unshared ones to remain the
typical case. I refer to these additional processes as worker backends. They
will have much in common with ordinary client backends, including a database
connection and a PGPROC. They will have no client socket; instead, they will
take direction from the client-connected backend, termed the master, via
shared memory.
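
To make that master/worker protocol concrete, here is a minimal sketch of a
worker's main loop.  All of the names (ParallelTask, shm_next_task(), and so
on) are illustrative assumptions, not proposed interfaces:

    /*
     * Hypothetical worker main loop: block until the master posts a
     * task descriptor in shared memory, run it, and post the result.
     */
    for (;;)
    {
        ParallelTask *task = shm_next_task(master);     /* blocks */

        if (task == NULL)               /* master requested shutdown */
            break;
        ExecuteParallelTask(task);      /* e.g., sort an assigned run */
        shm_post_result(master, task);
    }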

Each worker needs to make SnapshotNow visibility decisions coherent with the
master. For sorting, this allows us to look up comparison functions, even
when the current transaction created or modified those functions. This will
also be an essential building block for any parallelism project that consults
user tables. Implementing this means copying the subtransaction stack and the
combocid hash to each worker. For the sake of completeness, we should also
copy the global MVCC snapshot data (sorting probably won't care). It also
means forbidding, while a parallel task is in flight, operations that affect
the transaction state (see the guard sketch after this list):

- CommandCounterIncrement()
- GetCurrentCommandId(true)
- AssignTransactionId()
- subtransaction-management functions
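
One cheap way to enforce these restrictions is an error check at the top of
each listed function.  A minimal sketch, assuming a new predicate
ParallelTaskInFlight() that neither exists today nor is formally proposed
here:

    void
    CommandCounterIncrement(void)
    {
        /* Refuse to mutate transaction state under a parallel task. */
        if (ParallelTaskInFlight())
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_TRANSACTION_STATE),
                     errmsg("cannot advance the command counter during a parallel operation")));

        ... existing body ...
    }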

I don't intend to commit us to a design fundamentally antagonistic to, for
example, starting a subtransaction in a worker backend. That being said, I
think we could achieve a lot in the parallel query space before we would deem
it important to relax those restrictions.

Workers will copy all GUCs from the master. bttextcmp() needs lc_collate, and
we'll inevitably reach the point of needing broad GUC coverage as we make more
things run in parallel. Further GUC changes will be forbidden while a
parallel task is in flight. (More generally, anything that mutates
backend-local state without reference to shared state will either need to be
forbidden or converted to route through shared state.)
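
As a sketch of how that GUC copying might flow, with illustrative function
names on both sides of the boundary:

    /* Master, before launching workers: flatten all GUC settings into
     * the shared region. */
    size = EstimateGucStateSpace();
    gucstate = shm_alloc(seg, size);
    SerializeGucState(gucstate, size);

    /* Worker, during startup: adopt the master's settings wholesale. */
    RestoreGucState(gucstate);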

The heavyweight locking mechanism will need to be aware of the association
between the master and its workers. For sorting, workers will acquire
heavyweight locks on system catalogs only. It is almost enough for the
workers to behave as though they are independent of the master for locking
purposes. But an undetected deadlock would arise if the master holds an
AccessExclusiveLock on a system catalog a worker needs.
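
One possible shape for that awareness is to treat the master and its workers
as a single locking group; GetLockGroupLeader() and the surrounding structure
are assumptions for illustration only:

    /*
     * Hypothetical predicate for the lock manager: requests never
     * conflict within one master/worker group, so the master's
     * AccessExclusiveLock on a catalog cannot block its own workers.
     */
    static bool
    InSameLockGroup(PGPROC *holder, PGPROC *waiter)
    {
        return GetLockGroupLeader(holder) == GetLockGroupLeader(waiter);
    }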

We can allocate a small amount of permanent shared memory for coordination
among a group of processes, but sorting will benefit from a region as large as
maintenance_work_mem. Expect on-demand memory sharing.
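
A minimal POSIX sketch of creating such a region on demand; a real
implementation would need a portable wrapper, permission handling, and
cleanup on backend abort:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Create and map a shared segment sized for the sort, e.g.
     * maintenance_work_mem bytes.  Returns NULL on failure. */
    void *
    create_sort_segment(const char *name, size_t size)
    {
        void *ptr;
        int   fd = shm_open(name, O_CREAT | O_RDWR, 0600);

        if (fd < 0)
            return NULL;
        if (ftruncate(fd, (off_t) size) < 0)
        {
            close(fd);
            return NULL;
        }
        ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);              /* the mapping survives the close */
        return (ptr == MAP_FAILED) ? NULL : ptr;
    }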

* Identifying Parallel-Compatible Functions

Not all functions can reasonably run on a worker backend. We should not
presume that a VOLATILE function can tolerate the unstable execution order
imposed by parallelism, though a function like clock_timestamp() is perfectly
reasonable to run that way. STABLE does not have that problem, but neither
does it constitute a promise that the function implementation is compatible
with parallel execution. Consider xid_age(), which would need code changes to
operate correctly in parallel. IMMUTABLE almost guarantees enough; there may
come a day when all IMMUTABLE functions can be presumed parallel-safe. For
now, an IMMUTABLE function could cause trouble by starting a (read-only)
subtransaction. The bottom line is that parallel-compatibility needs to be
separate from volatility classes for the time being.

I'm not sure what the specific answer here should look like. Simply having a
CREATE FUNCTION ... PARALLEL_IS_FINE flag is not entirely satisfying, because
the rules are liable to loosen over time.
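
Whatever the catalog representation ends up being, the executor-side test
could be as simple as the following; "proparallel_ok" is an assumed new
pg_proc column, used purely for illustration:

    static bool
    function_parallel_ok(Oid funcid)
    {
        HeapTuple   tup = SearchSysCache1(PROCOID, ObjectIdGetDatum(funcid));
        bool        ok;

        if (!HeapTupleIsValid(tup))
            elog(ERROR, "cache lookup failed for function %u", funcid);
        ok = ((Form_pg_proc) GETSTRUCT(tup))->proparallel_ok;
        ReleaseSysCache(tup);
        return ok;
    }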

* Planner & Similar Issues

We should decide whether to actually sort in parallel based on the comparator
cost and the data size. The system currently has no information on comparator
cost: bt*cmp (and indeed almost all built-in functions) all have procost=1,
but bttextcmp is at least 1000x slower than btint4cmp. Let's improve the
procost estimates of all core B-tree and hash operators. This will have other
benefits, but we will need to be cognizant of the risk of upsetting setups
that have tuned cpu_operator_cost based on the present situation.
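
For context, cost_sort() today charges a constant multiple of
cpu_operator_cost per comparison; with honest procost values, the
comparator's own cost could feed the estimate, roughly as follows (a sketch,
not the current code):

    /* comparator_procost would come from the sort operator's pg_proc
     * entry rather than being assumed equal to 1. */
    Cost comparison_cost = comparator_procost * cpu_operator_cost;
    Cost sort_cpu_cost   = comparison_cost * tuples * LOG2(tuples);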

The choice of whether to parallelize can probably be made in a manner similar to
the choice to do an external sort: the planner guesses the outcome for costing
purposes, but the actual decision is made at execution time. The planner
would determine a tuple count cutoff at which parallelism becomes favorable,
and tuplesort would check that to establish its actual decision.
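
Sketched in tuplesort terms, with the threshold field and the parallel entry
point being assumptions rather than existing symbols:

    /*
     * Execution-time decision: the planner supplied a break-even tuple
     * count; once the actual input size is known, pick a strategy.
     */
    if (state->memtupcount >= state->parallel_threshold &&
        ParallelWorkersAvailable() > 0)
        tuplesort_performsort_parallel(state);  /* assumed new path */
    else
        tuplesort_performsort(state);           /* existing serial path */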

Use of parallelism by one session has a distributed effect on the system,
similar to the use of work_mem. Add a PGC_USERSET GUC representing the number
of worker processes the current session is willing to use. It would default
to a small number, say zero or two. On a throughput-oriented system with high
concurrent query counts, the DBA would tend to set/leave it at zero in
postgresql.conf; parallelism can only help if the system has free resources.
Separately, add a GUC limiting the total number of workers across the system
at any one time.
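
In guc.c terms, the pair of knobs might look roughly like this; the names,
defaults, and contexts are placeholders for discussion:

    /* Per-session cap, settable by any user. */
    {
        {"session_parallel_workers", PGC_USERSET, RESOURCES,
            gettext_noop("Number of worker backends one session may use."),
            NULL
        },
        &session_parallel_workers,
        0, 0, MAX_BACKENDS,
        NULL, NULL, NULL
    },

    /* Cluster-wide cap across all sessions. */
    {
        {"max_total_parallel_workers", PGC_SIGHUP, RESOURCES,
            gettext_noop("Cluster-wide limit on concurrent worker backends."),
            NULL
        },
        &max_total_parallel_workers,
        8, 0, MAX_BACKENDS,
        NULL, NULL, NULL
    },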

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
