From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Range Types - representation and alignment
Date: 2011-02-09 08:11:50
Message-ID: 1297239110.27157.455.camel@jdavis
Lists: pgsql-hackers
After some significant prior discussion, here is what I've found:
Doing the simple thing is extremely wasteful. Let's take TSRANGE, for
instance:
4 bytes type oid
1 flag byte
8 bytes lower bound
8 bytes upper bound
But when constructing the value itself, it starts off with VARHDRSZ
bytes. It may later be compacted to a short (1 byte) header, but it
starts off as 4. So that means:
4 bytes VARHDRSZ
4 bytes type oid
1 flag byte
7 pad bytes to get back on a 'd' align boundary
8 bytes lower bound
8 bytes upper bound
Total: 32 bytes. When compacted into the tuple, it might be 29. We can't
skip those pad bytes, because we need to honor the subtype's alignment.
If we move the flag byte to the end, the representation works out much
better:
4 bytes VARHDRSZ
4 bytes type oid
8 bytes lower bound
8 bytes upper bound
1 flag byte
Total: 25 bytes, turns into about 22 bytes when compacted into the
tuple. It's a little awkward to read that way, but the savings are worth
it. The flag byte is necessary to know whether there are lower and/or
upper bounds, so we need to peek ahead to length - 1, and then continue
scanning forward through the attributes.
So, I'll implement this approach. 22 bytes represents 37.5% overhead
above the good ol' PERIOD data type (a lean 16 bytes), but we can make
up some of that if using unbounded ranges. For instance, a half-open
range like "[5, INF)" would only take 14 bytes.
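Once this lands, the on-disk footprint could be sanity-checked from psql with pg_column_size(), which reports the compacted (possibly short-header) size of a value as stored in a tuple. A sketch, assuming the TSRANGE type and literal syntax from the patch; the exact numbers depend on the final representation:

```sql
-- Compare a fully bounded range against a half-open one; the half-open
-- value should come out smaller because the absent bound is not stored:
SELECT pg_column_size('[2011-01-01, 2011-02-01)'::tsrange) AS bounded,
       pg_column_size('[2011-01-01,)'::tsrange)            AS half_open;
```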
Regards,
Jeff Davis
From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Range Types - efficiency
Date: 2011-02-09 21:20:03
Message-ID: 877hd8zz0c.fsf@cbbrowne.afilias-int.info
Lists: pgsql-hackers
One of the things I'd particularly like to use range types for is to
make it easier to construct range-related queries. Classic example is
that of reports that work on date ranges.
I create a table that will have transaction data:
CREATE TABLE some_data (
id serial,
whensit date
-- And it'll have other attributes, but those don't matter here...
);
CREATE INDEX some_when ON some_data USING btree (whensit);
I then populate it with a bunch of date-based data...
rangetest(at)localhost-> select count(*), min(whensit), max(whensit) from some_data;
count | min | max
-------+------------+------------
37440 | 2007-01-01 | 2014-12-27
(1 row)
Here's the traditional way of doing a range-based query on this data:
rangetest(at)localhost-> explain analyze select * from some_data where whensit >= '2010-01-01' and whensit < '2010-02-01';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on some_data (cost=12.30..184.23 rows=395 width=8) (actual time=0.064..0.150 rows=390 loops=1)
Recheck Cond: ((whensit >= '2010-01-01'::date) AND (whensit < '2010-02-01'::date))
-> Bitmap Index Scan on some_when (cost=0.00..12.21 rows=395 width=0) (actual time=0.054..0.054 rows=390 loops=1)
Index Cond: ((whensit >= '2010-01-01'::date) AND (whensit < '2010-02-01'::date))
Total runtime: 0.197 ms
(5 rows)
The RangeType-based equivalent is the following:
rangetest(at)localhost-> explain analyze select * from some_data where '[2010-01-01,2010-02-01)'::daterange @> whensit;
QUERY PLAN
---------------------------------------------------------------------------------------------------------
Seq Scan on some_data (cost=0.00..634.00 rows=1 width=8) (actual time=1.045..111.739 rows=390 loops=1)
Filter: ('[ 2010-01-01, 2010-02-01 )'::daterange @> whensit)
Total runtime: 111.780 ms
(3 rows)
This, alas, reverts to a seq scan on the table, rather than restricting
itself to the tuples of interest.
I realize that, after a fashion, I'm using this backwards. But when I'm
doing temporal stuff, that tends to be the pattern:
- There is a set of temporal configuration, indicating criteria that
are true for particular date ranges
- There is then event data, which has but a single date, but which
needs to be matched against the temporal configuration.
It sure would be nice to expand that filter into subqueries involving
the two criteria, in much the same fashion that is true today for
BETWEEN. I imagine that would allow many queries with this kind of
pattern to make use of indexes, making them visibly thousands of times
faster.
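To spell out the rewrite being asked for: the containment filter is logically equivalent to the pair of comparisons from the first EXPLAIN, which is the form the btree index can use. A hand-written sketch of what the planner would ideally expand the filter into:

```sql
-- Manual expansion of '[2010-01-01,2010-02-01)'::daterange @> whensit.
-- '[' means the lower bound is inclusive, ')' means the upper bound
-- is exclusive, hence >= and < respectively:
SELECT *
  FROM some_data
 WHERE whensit >= '2010-01-01'
   AND whensit <  '2010-02-01';
```

This is exactly what BETWEEN-style expansion does today, just derived from the range literal's bounds and inclusivity flags instead.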
--
"I have traveled the length and breadth of this country and talked
with the best people, and can assure you that data processing is a fad
that won't last out the year". -- Business books editor, Prentice
Hall 1957
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Chris Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Range Types - efficiency
Date: 2011-02-09 22:58:49
Message-ID: 1297292329.1747.6.camel@jdavis-ux.asterdata.local
Lists: pgsql-hackers
On Wed, 2011-02-09 at 16:20 -0500, Chris Browne wrote:
> rangetest(at)localhost-> explain analyze select * from some_data where '[2010-01-01,2010-02-01)'::daterange @> whensit;
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------
> Seq Scan on some_data (cost=0.00..634.00 rows=1 width=8) (actual time=1.045..111.739 rows=390 loops=1)
> Filter: ('[ 2010-01-01, 2010-02-01 )'::daterange @> whensit)
> Total runtime: 111.780 ms
> (3 rows)
>
> This, alas, reverts to a seq scan on the table, rather than restricting
> itself to the tuples of interest.
>
> I realize that, after a fashion, I'm using this backwards. But when I'm
> doing temporal stuff, that tends to be the pattern:
Yes. The index is a btree index on a normal column, so range types can't
exactly help with that directly -- except maybe as a rewrite like you
say.
One thing you might try is a functional index on (range(whensit)) and
then do: where '...' @> range(whensit).
Does that work for you?
Regards,
Jeff Davis
From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Range Types - efficiency
Date: 2011-02-09 23:07:14
Message-ID: 87y65oyfh9.fsf@cbbrowne.afilias-int.info
Lists: pgsql-hackers
pgsql(at)j-davis(dot)com (Jeff Davis) writes:
> On Wed, 2011-02-09 at 16:20 -0500, Chris Browne wrote:
>> rangetest(at)localhost-> explain analyze select * from some_data where '[2010-01-01,2010-02-01)'::daterange @> whensit;
>> QUERY PLAN
>> ---------------------------------------------------------------------------------------------------------
>> Seq Scan on some_data (cost=0.00..634.00 rows=1 width=8) (actual time=1.045..111.739 rows=390 loops=1)
>> Filter: ('[ 2010-01-01, 2010-02-01 )'::daterange @> whensit)
>> Total runtime: 111.780 ms
>> (3 rows)
>>
>> This, alas, reverts to a seq scan on the table, rather than restricting
>> itself to the tuples of interest.
>>
>> I realize that, after a fashion, I'm using this backwards. But when I'm
>> doing temporal stuff, that tends to be the pattern:
>
> Yes. The index is a btree index on a normal column, so range types can't
> exactly help with that directly -- except maybe as a rewrite like you
> say.
>
> One thing you might try is a functional index on (range(whensit)) and
> then do: where '...' @> range(whensit).
>
> Does that work for you?
That doesn't appear to actually help:
rangetest(at)localhost-> create index i2 on some_data (range(whensit));
CREATE INDEX
rangetest(at)localhost-> explain analyze select * from some_data where '[2010-01-01,2010-02-01)'::daterange @> range(whensit);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Seq Scan on some_data (cost=0.00..727.60 rows=12480 width=8) (actual time=1.030..110.542 rows=390 loops=1)
Filter: ('[ 2010-01-01, 2010-02-01 )'::daterange @> range(whensit))
Total runtime: 110.585 ms
(3 rows)
In any case, I suggest that as a "couple steps down the road" thing, it
would be desirable to have that query rewrite. Seems like a reasonable
ToDo item to consider for the future, if not in the first deployment.
Maybe that's something to add in 9.2 CommitFest #3! :-)
--
"There isn't any reason why Linux can't be implemented as an
enterprise computing solution. Find out what you've been missing
while you've been rebooting Windows NT." - Infoworld
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Chris Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Range Types - efficiency
Date: 2011-02-10 07:35:01
Message-ID: 1297323301.27157.493.camel@jdavis
Lists: pgsql-hackers
On Wed, 2011-02-09 at 18:07 -0500, Chris Browne wrote:
> rangetest(at)localhost-> create index i2 on some_data (range(whensit));
> CREATE INDEX
If you make this a GiST index, it should work.
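A sketch of that suggestion, assuming the range(whensit) constructor from the patch under discussion. The btree index from the previous message cannot serve @>, but a GiST expression index indexes the same expression with an operator class that does include containment:

```sql
-- Same indexed expression as before, but USING gist so that @>
-- qualifications on the expression can be answered from the index:
CREATE INDEX i2_gist ON some_data USING gist (range(whensit));

-- The query's right-hand side now matches the indexed expression:
SELECT * FROM some_data
 WHERE '[2010-01-01,2010-02-01)'::daterange @> range(whensit);
```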
The rewrites so that it can use a btree are an interesting idea though.
Regards,
Jeff Davis
From: Florian Weimer <fweimer(at)bfk(dot)de>
To: Chris Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Range Types - efficiency
Date: 2011-02-10 10:03:25
Message-ID: 82vd0suryq.fsf@mid.bfk.de
Lists: pgsql-hackers
* Chris Browne:
> The RangeType-based equivalent is the following:
>
> rangetest(at)localhost-> explain analyze select * from some_data where '[2010-01-01,2010-02-01)'::daterange @> whensit;
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------
> Seq Scan on some_data (cost=0.00..634.00 rows=1 width=8) (actual time=1.045..111.739 rows=390 loops=1)
> Filter: ('[ 2010-01-01, 2010-02-01 )'::daterange @> whensit)
> Total runtime: 111.780 ms
> (3 rows)
>
> This, alas, reverts to a seq scan on the table, rather than restricting
> itself to the tuples of interest.
This is quite similar to LIKE and regexp matches. The backend has a
kludge to use the index in such cases. It did not seem extensible to
me last time I looked, unfortunately.
--
Florian Weimer <fweimer(at)bfk(dot)de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99