Re: performance of implicit join vs. explicit conditions on inet queries?

Lists: pgsql-performance
From: Robert Edmonds <edmonds42(at)bellsouth(dot)net>
To: pgsql-performance(at)postgresql(dot)org
Subject: performance of implicit join vs. explicit conditions on inet queries?
Date: 2005-10-24 03:58:09
Message-ID: 20051024035809.GA18261@edmonds.ath.cx
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

The preliminaries:

- PostgreSQL 8.1 beta 3, Debian experimental
- database has been VACUUMed FULL ANALYZE.
- a pg_dump -Fc exists at http://199.77.129.48/inet_test.db
- ia32 hardware with 2 GB physical memory and the following settings:

shared_buffers = 40960
temp_buffers = 16384
work_mem = 131072
maintenance_work_mem = 262144
effective_cache_size = 65536

I've populated a table evenly with about 2 million rows of RFC 1918
addresses:

Table "public.inet_addresses"
Column | Type | Modifiers
--------+------+-----------
addr | inet | not null
Indexes:
"inet_addresses_pkey" PRIMARY KEY, btree (addr)

The following query is very fast:

EXPLAIN ANALYZE
SELECT *
FROM inet_addresses
WHERE addr << inet('10.2.0.0/24')
OR addr << inet('10.4.0.0/24')
OR addr << inet('10.8.0.0/24');

Bitmap Heap Scan on inet_addresses (cost=6.51..324.48 rows=1792335 width=11) (actual time=0.350..1.104 rows=381 loops=1)
Recheck Cond: ((addr << '10.2.0.0/24'::inet) OR (addr << '10.4.0.0/24'::inet) OR (addr << '10.8.0.0/24'::inet))
Filter: ((addr << '10.2.0.0/24'::inet) OR (addr << '10.4.0.0/24'::inet) OR (addr << '10.8.0.0/24'::inet))
-> BitmapOr (cost=6.51..6.51 rows=85 width=0) (actual time=0.336..0.336 rows=0 loops=1)
-> Bitmap Index Scan on inet_addresses_pkey (cost=0.00..2.17 rows=28 width=0) (actual time=0.127..0.127 rows=127 loops=1)
Index Cond: ((addr > '10.2.0.0/24'::inet) AND (addr <= '10.2.0.255'::inet))
-> Bitmap Index Scan on inet_addresses_pkey (cost=0.00..2.17 rows=28 width=0) (actual time=0.109..0.109 rows=127 loops=1)
Index Cond: ((addr > '10.4.0.0/24'::inet) AND (addr <= '10.4.0.255'::inet))
-> Bitmap Index Scan on inet_addresses_pkey (cost=0.00..2.17 rows=28 width=0) (actual time=0.096..0.096 rows=127 loops=1)
Index Cond: ((addr > '10.8.0.0/24'::inet) AND (addr <= '10.8.0.255'::inet))
Total runtime: 1.613 ms

Instead of specifying explicit address ranges in the query, I'd like
to store the ranges in a table:

inet_test_db=# \d inet_ranges
Table "public.inet_ranges"
Column | Type | Modifiers
----------+---------+-----------
range | inet | not null
range_id | integer |
Indexes:
"inet_ranges_pkey" PRIMARY KEY, btree (range)
"inet_ranges_range_id_idx" btree (range_id)

inet_test_db=# SELECT * FROM inet_ranges;
range | range_id
--------------+----------
10.2.0.0/24 | 1
10.4.0.0/24 | 1
10.8.0.0/24 | 1
10.16.0.0/24 | 2
10.32.0.0/24 | 2
10.64.0.0/24 | 2
(6 rows)

This query is far slower, even though it generates the same result:

EXPLAIN ANALYZE
SELECT *
FROM inet_addresses as ia, inet_ranges as ir
WHERE ia.addr << ir.range
AND ir.range_id=1;

Nested Loop (cost=0.00..171485.93 rows=3072574 width=26) (actual time=1465.803..16922.979 rows=381 loops=1)
Join Filter: ("inner".addr << "outer".range)
-> Seq Scan on inet_ranges ir (cost=0.00..1.07 rows=3 width=15) (actual time=0.008..0.021 rows=3 loops=1)
Filter: (range_id = 1)
-> Seq Scan on inet_addresses ia (cost=0.00..31556.83 rows=2048383 width=11) (actual time=0.003..2919.405 rows=2048383 loops=3)
Total runtime: 16923.457 ms

Even when disabling sequential scans, the query planner is unable to
make use of the inet_addresses_pkey index:

Nested Loop (cost=100033605.21..100171874.11 rows=3072574 width=26) (actual time=2796.928..23453.585 rows=381 loops=1)
Join Filter: ("inner".addr << "outer".range)
-> Index Scan using inet_ranges_range_id_idx on inet_ranges ir (cost=0.00..3.04 rows=3 width=15) (actual time=0.069..0.095 rows=3 loops=1)
Index Cond: (range_id = 1)
-> Materialize (cost=100033605.21..100054089.04 rows=2048383 width=11) (actual time=0.016..5133.349 rows=2048383 loops=3)
-> Seq Scan on inet_addresses ia (cost=100000000.00..100031556.83 rows=2048383 width=11) (actual time=0.005..2938.012 rows=2048383 loops=1)
Total runtime: 23521.418 ms

Is it possible to attain the speed of the first query and the
flexibility of the second? Or will I have to resort to generating
queries of the first form with the range table in the application
layer?

--
Robert Edmonds
edmonds42(at)bellsouth(dot)net


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: performance of implicit join vs. explicit conditions on inet queries?
Date: 2005-10-31 09:48:41
Message-ID: dk4p9d$30q4$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"Robert Edmonds" <edmonds42(at)bellsouth(dot)net> wrote
>
> EXPLAIN ANALYZE
> SELECT *
> FROM inet_addresses
> WHERE addr << inet('10.2.0.0/24')
> OR addr << inet('10.4.0.0/24')
> OR addr << inet('10.8.0.0/24');
>
> Bitmap Heap Scan on inet_addresses (cost=6.51..324.48 rows=1792335
> width=11) (actual time=0.350..1.104 rows=381 loops=1)
> Recheck Cond: ((addr << '10.2.0.0/24'::inet) OR (addr <<
> '10.4.0.0/24'::inet) OR (addr << '10.8.0.0/24'::inet))
> Filter: ((addr << '10.2.0.0/24'::inet) OR (addr << '10.4.0.0/24'::inet)
> OR (addr << '10.8.0.0/24'::inet))
> -> BitmapOr (cost=6.51..6.51 rows=85 width=0) (actual
> time=0.336..0.336 rows=0 loops=1)
> -> Bitmap Index Scan on inet_addresses_pkey (cost=0.00..2.17
> rows=28 width=0) (actual time=0.127..0.127 rows=127 loops=1)
> Index Cond: ((addr > '10.2.0.0/24'::inet) AND (addr <=
> '10.2.0.255'::inet))
> -> Bitmap Index Scan on inet_addresses_pkey (cost=0.00..2.17
> rows=28 width=0) (actual time=0.109..0.109 rows=127 loops=1)
> Index Cond: ((addr > '10.4.0.0/24'::inet) AND (addr <=
> '10.4.0.255'::inet))
> -> Bitmap Index Scan on inet_addresses_pkey (cost=0.00..2.17
> rows=28 width=0) (actual time=0.096..0.096 rows=127 loops=1)
> Index Cond: ((addr > '10.8.0.0/24'::inet) AND (addr <=
> '10.8.0.255'::inet))
> Total runtime: 1.613 ms
>
>
> Instead of specifying explicit address ranges in the query, I'd like
> to store the ranges in a table:
>
>
> inet_test_db=# \d inet_ranges
> Table "public.inet_ranges"
> Column | Type | Modifiers
> ----------+---------+-----------
> range | inet | not null
> range_id | integer |
> Indexes:
> "inet_ranges_pkey" PRIMARY KEY, btree (range)
> "inet_ranges_range_id_idx" btree (range_id)
>
> inet_test_db=# SELECT * FROM inet_ranges;
> range | range_id
> --------------+----------
> 10.2.0.0/24 | 1
> 10.4.0.0/24 | 1
> 10.8.0.0/24 | 1
> 10.16.0.0/24 | 2
> 10.32.0.0/24 | 2
> 10.64.0.0/24 | 2
> (6 rows)
>
>
>
> EXPLAIN ANALYZE
> SELECT *
> FROM inet_addresses as ia, inet_ranges as ir
> WHERE ia.addr << ir.range
> AND ir.range_id=1;
>
> Nested Loop (cost=0.00..171485.93 rows=3072574 width=26) (actual
> time=1465.803..16922.979 rows=381 loops=1)
> Join Filter: ("inner".addr << "outer".range)
> -> Seq Scan on inet_ranges ir (cost=0.00..1.07 rows=3 width=15)
> (actual time=0.008..0.021 rows=3 loops=1)
> Filter: (range_id = 1)
> -> Seq Scan on inet_addresses ia (cost=0.00..31556.83 rows=2048383
> width=11) (actual time=0.003..2919.405 rows=2048383 loops=3)
> Total runtime: 16923.457 ms
>

Good illustration. I guess we have a problem of the historgram statistical
information. That is, the historgrams we used can effectively record the
linear space ranges(like ordinary <, >, =), but failed to do it for
nonlinear ranges like inet data type. So the Nested Loop node make an error
in estmating number of rows (est: 3072574, real: 381), thus a sequential
scan is obviously better under this estimation.

I am thinking the historgram problem is not easy to fix, but is there a way
to change Inet type a little bit to make it linear for your range operators?
(for example, align the length to 000.000.000.000/00?)

Regards,
Qingqing


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-performance(at)postgresql(dot)org, edmonds42(at)bellsouth(dot)net
Subject: Re: performance of implicit join vs. explicit conditions on inet queries?
Date: 2005-10-31 14:24:10
Message-ID: 29864.1130768650@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> "Robert Edmonds" <edmonds42(at)bellsouth(dot)net> wrote
>> Instead of specifying explicit address ranges in the query, I'd like
>> to store the ranges in a table:

> Good illustration. I guess we have a problem of the historgram statistical
> information.

No, that's completely irrelevant to his problem. The reason we can't do
this is that the transformation from "x << const" to a range check on x
is a plan-time transformation; there's no mechanism in place to do it
at runtime. This is not easy to fix, because the mechanism that's doing
it is primarily intended for LIKE/regex index optimization, and in that
case a runtime pattern might well not be optimizable at all.

regards, tom lane


From: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: performance of implicit join vs. explicit conditions on inet queries?
Date: 2005-11-02 01:19:37
Message-ID: dk946n$2m3p$1@news.hub.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
>
> No, that's completely irrelevant to his problem. The reason we can't do
> this is that the transformation from "x << const" to a range check on x
> is a plan-time transformation; there's no mechanism in place to do it
> at runtime. This is not easy to fix, because the mechanism that's doing
> it is primarily intended for LIKE/regex index optimization, and in that
> case a runtime pattern might well not be optimizable at all.
>

Not quite understand, sorry ...

(1) For this query (in an as-is PG syntax, which find out all rectangles lie
in a given rectangle) :

SELECT r FROM all_rectangles
WHERE r << rectangle('(1,9),(9,1)');

If there is a GiST/Rtree index associated with all_rectangles.r, how do
optimizer estimate the cost to decide that we should use this index or
not(then by a seqscan)?

(2) Does your above explaination mean that we can't use GiST for a spatial
join operation?

Regards,
Qingqing


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: performance of implicit join vs. explicit conditions on inet queries?
Date: 2005-11-02 04:20:52
Message-ID: 1207.1130905252@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-performance

"Qingqing Zhou" <zhouqq(at)cs(dot)toronto(dot)edu> writes:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote
>> No, that's completely irrelevant to his problem. The reason we can't do
>> this is that the transformation from "x << const" to a range check on x
>> is a plan-time transformation; there's no mechanism in place to do it
>> at runtime.

> Not quite understand, sorry ...

> (1) For this query (in an as-is PG syntax, which find out all rectangles lie
> in a given rectangle) :

> SELECT r FROM all_rectangles
> WHERE r << rectangle('(1,9),(9,1)');

No, you're thinking of the wrong << operator. The question was about
the inet network inclusion operator. We have a special case in
indxpath.c to transform "inetcol << inetconstant" into a range check
on the inet variable, much like we can transform a left-anchored LIKE
pattern into a range check on the text variable.

regards, tom lane