From: Jon Nelson <jnelson+pgsql(at)jamponi(dot)net>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: PoC: Duplicate Tuple Elidation during External Sort for DISTINCT
Date: 2014-01-22 03:16:25
Message-ID: CAKuK5J2x-5kbALXHO1kX-njU0+JLDf4DSfjp5v4JoH3-MCxOOg@mail.gmail.com
Lists: pgsql-hackers

Greetings -hackers:

I have worked up a patch to PostgreSQL which elides duplicate tuples
during an external sort. The primary use case is when sorted input is
being used to feed a DISTINCT operation. The idea is to throw out
tuples that compare as identical whenever it's convenient, on the
assumption that even a single I/O is more expensive than some number
of (potentially extra) comparisons. Obviously, this is where a cost
model comes in; that has not been implemented yet. This patch is a
work in progress.
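
To make that trade-off concrete, here is a minimal sketch of what such
a cost test might look like. It is purely illustrative: the names and
structure are invented, and neither the patch nor PostgreSQL contains
anything like it.

    #include <stdbool.h>

    /*
     * Hypothetical cost test for duplicate elision (invented names; the
     * patch implements no cost model).  Each input tuple may incur one
     * extra comparison, while each duplicate elided saves writing -- and
     * later re-reading -- one tuple.
     */
    static bool
    elision_worthwhile(double dup_fraction,   /* estimated fraction of duplicates */
                       double cmp_cost,       /* cost of one extra comparison */
                       double tuple_io_cost)  /* cost to write and re-read one tuple */
    {
        /* Expected I/O saved per input tuple must beat the comparison we add. */
        return dup_fraction * tuple_io_cost > cmp_cost;
    }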

Being very new to hacking PostgreSQL, I've probably done a bunch of
things wrong, but I've tried to test it thoroughly.

A rough summary of the patch follows:

- a GUC variable enables or disables this capability
- in nodeAgg.c, eliding duplicate tuples is enabled if the number of
distinct columns is equal to the number of sort columns (and both are
greater than zero).
- in createplan.c, eliding duplicate tuples is enabled if we are
creating a unique plan that involves sorting first
- likewise in planner.c
- all of the remaining changes are in tuplesort.c, which consist of:
+ a new macro, DISCARDTUP, and a new structure member, discardtup, are
both defined and operate similarly to COMPARETUP, COPYTUP, etc.
+ in puttuple_common, when state is TSS_BUILDRUNS, we *may* simply
throw out the new tuple if it compares as identical to the tuple at
the top of the heap. Since we're already performing this comparison,
this is essentially free.
+ in mergeonerun, we may discard a tuple if it compares as identical
to the *last written tuple*. This is a comparison that did not take
place before, so it's not free, but it saves a write I/O.
+ the same check is performed in dumptuples (see the sketch below)
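
To illustrate the merge-phase check concretely, here is a small
self-contained C toy that merges two sorted runs of integers while
eliding values equal to the last one written. It sketches the
technique only; it is not code from the patch, and merge_elide is an
invented name.

    #include <stdio.h>
    #include <stdbool.h>

    /* Merge two sorted runs into out[], dropping duplicates against the
     * last value written; returns the number of values kept. */
    static int
    merge_elide(const int *a, int na, const int *b, int nb, int *out)
    {
        int ia = 0, ib = 0, no = 0;
        bool have_last = false;
        int last = 0;

        while (ia < na || ib < nb)
        {
            int next;

            /* Standard two-way merge step: take the smaller head. */
            if (ib >= nb || (ia < na && a[ia] <= b[ib]))
                next = a[ia++];
            else
                next = b[ib++];

            /* The extra comparison: skip the write if the value matches
             * the last one written; in tuplesort this saves writing (and
             * later re-reading) the tuple. */
            if (have_last && next == last)
                continue;

            out[no++] = next;
            last = next;
            have_last = true;
        }
        return no;
    }

    int
    main(void)
    {
        int a[] = {1, 1, 2, 4, 4};
        int b[] = {1, 3, 4, 4, 5};
        int out[10];
        int n = merge_elide(a, 5, b, 5, out);

        for (int i = 0; i < n; i++)
            printf("%d ", out[i]);    /* prints: 1 2 3 4 5 */
        printf("\n");
        return 0;
    }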

I have tested this patch with a bunch of different data, and the
results are summarized below.

Test 1:
I used a corpus of approximately 1.5TB of textual data, consisting of
7.4 billion input rows, of which 12.6% are unique. This is a
'real-world' test.

Before the patch: 44.25 hours using 274GB of temporary files
After the patch: 36.44 hours using 125GB of temporary files
Difference: -7.81 hours, 18% faster

Test 2:
Acquire the unique set of strings from a totally random set that
contains no duplicates. This is a 'worst case'.

Input: 700 million rows, 32GB.
Before: 18,796 seconds.
After: 19,358 seconds.
Difference: +562 seconds, *slower* by 3%.
In this particular case, performance varies from run to run, sometimes
faster and sometimes slower, and I'm not sure why. Statistical noise?

Test 3:
Acquire the unique set of integers from a set of 100 million rows, all
of which happen to have the value '1'. This is a 'best case'.

Input: 100 million rows, 3.4GB on-disk
Before: 26.1 seconds using 1.3GB of temporary files
After: 8.9 seconds using 8KB of disk
Difference: -17.2 seconds, 65% faster

Test 4:
Similar to test 3, but using 1 billion rows.
Before: 1450 seconds, 13 GB of temporary files
After: 468 seconds, 8 KB of temporary files
Difference: -982 seconds, 68% faster

See below for details on test 4.

Lastly, 'make check' passes without issues (modulo rangefuncs grumping about
changes in pg_settings).

Tests 1 through 3 were performed against PostgreSQL 9.4 HEAD as of
early September 2013. I have rebased the patch against
PostgreSQL 9.4 HEAD as of 19 January 2014, and re-run tests
2, 3, and 4 with similar results.

Regarding the patch: I've tried to conform to the indentation and
style rules, including using pg_bsd_indent and pgindent, but I've not
had 100% success.

The details on test 4:

A simple example, using 1 billion rows, all with the integer value
'1'. This is a fabricated *best case*.

postgres=# set enable_hashagg = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=468188.900..468188.901 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=468188.897..468188.897 rows=1 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 8kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=34.765..254595.990 rows=1000000000 loops=1)
               Output: i
 Total runtime: 468189.125 ms
(9 rows)

Time: 468189.755 ms
postgres=# set enable_opportunistic_tuple_elide = false;
SET
Time: 30.402 ms
postgres=# explain analyze verbose select distinct * from i ;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=245942793.27..250942793.27 rows=1 width=4) (actual time=1047226.451..1449371.632 rows=1 loops=1)
   Output: i
   ->  Sort  (cost=245942793.27..248442793.27 rows=1000000000 width=4) (actual time=1047226.447..1284922.616 rows=1000000000 loops=1)
         Output: i
         Sort Key: i.i
         Sort Method: external sort  Disk: 13685248kB
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=29.832..450978.694 rows=1000000000 loops=1)
               Output: i
 Total runtime: 1449644.717 ms
(9 rows)

Some issues remain. For example, the following two queries behave
very differently (they *are* different queries, of course).

Compare this:
postgres=# explain ( analyze true, verbose true, costs true, buffers true, timing true ) select count(distinct i.i) from i ;

QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=894138.114..894138.115 rows=1 loops=1)
   Output: count(DISTINCT i)
   Buffers: shared hit=102 read=4424683, temp read=1466277 written=1466277
   ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.012..349142.072 rows=1000000000 loops=1)
         Output: i
         Buffers: shared hit=96 read=4424683
 Total runtime: 894138.233 ms
(7 rows)

Time: 894190.392 ms
postgres=#

to this (with this code path enabled):

postgres=# explain ( analyze true, verbose true, costs true, buffers true, timing true ) select count(1) from (select distinct i.i from i) as sub ;

 Aggregate  (cost=182583418.28..182583418.29 rows=1 width=0) (actual time=451973.381..451973.417 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=192 read=4424587, temp read=1 written=1
   ->  Unique  (cost=177583418.27..182583418.27 rows=1 width=4) (actual time=451973.370..451973.372 rows=1 loops=1)
         Output: i.i
         Buffers: shared hit=192 read=4424587, temp read=1 written=1
         ->  Sort  (cost=177583418.27..180083418.27 rows=1000000000 width=4) (actual time=451973.366..451973.367 rows=1 loops=1)
               Output: i.i
               Sort Key: i.i
               Sort Method: external sort  Disk: 8kB
               Buffers: shared hit=192 read=4424587, temp read=1 written=1
               ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223587.173 rows=1000000000 loops=1)
                     Output: i.i
                     Buffers: shared hit=192 read=4424587
 Total runtime: 451994.583 ms

(which also logged the following information)
LOG: tuplesort_end: opportunistically elided 999999999 tuples (1864133 extra comparisons): 0 during initial accrual, 998135866 during build, 0 during merge, 1864133 during dump

For comparison purposes, here is the same query but with hash
aggregation enabled:

 Aggregate  (cost=16924779.02..16924779.03 rows=1 width=0) (actual time=559379.525..559379.525 rows=1 loops=1)
   Output: count(1)
   Buffers: shared hit=224 read=4424555
   ->  HashAggregate  (cost=16924779.00..16924779.01 rows=1 width=4) (actual time=559379.513..559379.513 rows=1 loops=1)
         Output: i.i
         Group Key: i.i
         Buffers: shared hit=224 read=4424555
         ->  Seq Scan on public.i  (cost=0.00..14424779.00 rows=1000000000 width=4) (actual time=0.016..223704.468 rows=1000000000 loops=1)
               Output: i.i
               Buffers: shared hit=224 read=4424555
 Total runtime: 559379.608 ms
(11 rows)

A note: this work was supported by my employer, Renesys Corporation (
http://www.renesys.com/ ).

The patch was formatted using 'unified' diff, per
http://wiki.postgresql.org/wiki/Submitting_a_Patch

--
Jon

Attachment: patch001.diff (text/plain, 26.5 KB)
