Using indices for UNION.

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Using indices for UNION.
Date: 2013-10-31 10:42:51
Message-ID: 20131031.194251.12577697.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello, This patch might be too complicated (and seems somewhat ad
hoc) for the gain, but more index usage for this kind of
operation should be worth doing.

Currently, PostgreSQL ignores from the very first the availablity
of indexes for UNION. Sorting and SeqScan is choosed as follows,

* EX.1
> uniontest=# create table c11 (a int, b int, c int, d text);
> uniontest=# create unique index cu11_pkey_idx on c11 (a, b)
> uniontest=# create table c12 (like cu11 including all);
> uniontest=# insert into cu11 (select a / 5, 4 - (a % 5), a, 'cu11'
> from generate_series(000000, 099999) a);
> uniontest=# insert into cu12 (select a / 5, 4 - (a % 5), a, 'cu12'
> from generate_series(100000, 199999) a);
> uniontest=# explain (analyze on, costs off)
> select a, b from cu11 union select a, b from cu12 order by a, b limit 100;
> QUERY PLAN
> ---------------------------------------------------------------------
> Limit (actual time=287.611..287.673 rows=100 loops=1)
> -> Unique (actual time=287.610..287.652 rows=100 loops=1)
> -> Sort (actual time=287.609..287.621 rows=100 loops=1)
> Sort Key: cu11.a, cu11.b
> Sort Method: external sort Disk: 3520kB
> -> Append (actual time=0.020..73.859 rows=200000 loops=1)
> -> Seq Scan on cu11 (actual time=0.019..28.780 rows=100000 loops=1)
> -> Seq Scan on cu12 (actual time=0.006..24.550 rows=100000 loops=1)
> Total runtime: 288.596 ms

Actually, index paths are sometimes worth considering on UNION
planning especially with LIMIT. For example,

* EX.2
> uniontest=# explain (analyze on, costs off)
> select a, b from cu11 union select a, b from cu12
> order by a desc, b desc limit 100;
> QUERY PLAN
> ------------------------------------------------------------------
> Limit (actual time=0.041..0.390 rows=100 loops=1)
> -> Unique (actual time=0.040..0.355 rows=100 loops=1)
> -> Merge Append (actual time=0.040..0.290 rows=100 loops=1)
> Sort Key: cu11.a, cu11.b
> -> Limit (actual time=0.025..0.025 rows=1 loops=1)
> -> Unique (actual time=0.025..0.025 rows=1 loops=1)
> -> Index Only Scan Backward on cu11
> (actual time=0.022..0.022 rows=1 loops=1)
> Heap Fetches: 1
> -> Limit (actual time=0.012..0.209 rows=100 loops=1)
> -> Unique (actual time=0.011..0.188 rows=100 loops=1)
> -> Index Only Scan Backward .. on cu12
> (actual time=0.010..0.115 rows=100 loops=1)
> Heap Fetches: 100
> Total runtime: 1.191 ms

This is archieved by this patch.

Rough explanaiton of what this patch does is following,

- Consider whether ORDER BY or grouping of setops can be pushed
down onto leaf subqueries. (recurse_set_operations)

- Merging two children of a setop counting the sort orders of
them. Use MergeAppend if they are in the same ordering.
(generate_union_plan, recurse_union_children)

- Grouping is determined consulting sorting order. This will
allow removing redundant sorting. (generate_setop_grouplist)

The effects of these items are shown in the Ex.2.

What do you think about this?

=======

Nevertheless, only this patch does not effective so far on the
query like following,

>uniontest=# explain (analyze on, costs off)
> select * from cu11 union select * from cu12
> order by a, b limit 100;
> QUERY PLAN
> -----------------------------------------------------------------------
> Limit (actual time=256.263..256.346 rows=100 loops=1)
> -> Unique (actual time=256.262..256.332 rows=100 loops=1)
> -> Sort (actual time=256.261..256.283 rows=100 loops=1)
> Sort Key: cu11.a, cu11.b, cu11.c, cu11.d
> Sort Method: external sort Disk: 5280kB
> -> Append (actual time=0.028..61.161 rows=200000 loops=1)
> -> Seq Scan on cu11 (actual time=0.027..21.067 rows=100000 loops=1)
> -> Seq Scan on cu12 (actual time=0.012..18.428 rows=100000 loops=1)
> Total runtime: 257.778 ms

The indexes made here is UNIQUE one so they should be used for
ordering in this query. This will be overcome by another patch
(will separately proposed).

> uniontest=# explain (analyze on, costs off)
> select * from cu11 union select * from cu12
> order by a, b limit 100;
> QUERY PLAN
> ---------------------------------------------------------------------------
> Limit (actual time=0.102..0.351 rows=100 loops=1)
> -> Unique (actual time=0.100..0.323 rows=100 loops=1)
> -> Merge Append (actual time=0.097..0.222 rows=100 loops=1)
> Sort Key: cu11.a, cu11.b, cu11.c, cu11.d
> -> Limit (actual time=0.051..0.135 rows=100 loops=1)
> -> Index Scan .. on cu11 (actual time=0.051..0.109 rows=100 loops=1)
> -> Limit (actual time=0.044..0.044 rows=1 loops=1)
> -> Index Scan .. on cu12 (actual time=0.044..0.044 rows=1 loops=1)
> Total runtime: 1.244 ms

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
union_uses_idx_v1_20131031.patch text/x-patch 22.8 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Kyotaro HORIGUCHI 2013-10-31 10:43:10 Get more from indices.
Previous Message Mitsumasa KONDO 2013-10-31 10:36:45 Add accurate option to pgbench