Index-only scans for GIST

Lists: pgsql-hackers
From: Anastasia Lubennikova <lubennikovaav(at)gmail(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Index-only scans for GIST
Date: 2014-08-01 07:58:38
Message-ID: CAP4vRV4xiK7v4XQgdmtOco9WB2mYeTEgbncF-DrMRYGvDk8V3w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi, hackers!
I work on a GSoC project "Index-only scans for GIST"
https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014

Repository is
https://github.com/lubennikovaav/postgres/tree/indexonlygist2
Patch is in attachments.

It includes index-only scans for multicolumn GIST and new regression test.
Fetch() method is realized for box and point opclasses.

Documentation is not updated yet, but I'm going to do it till the end of
GSoC.

I've got one question about query with OR condition. It is the last query
in regression test "gist_indexonly". It doesn't fail but it doensn't use
index-only scans. Could someone explain to me how it works?
It seems to depend on build_paths_for_OR
<http://doxygen.postgresql.org/indxpath_8c.html#ae660d2e886355e53ed3b9ec693e4afd2>
function.
But I couldn't understand how.

--
Best regards,
Lubennikova Anastasia

Attachment Content-Type Size
indexonlyscan_gist.patch application/octet-stream 49.7 KB

From: Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com>
To: Anastasia Lubennikova <lubennikovaav(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index-only scans for GIST
Date: 2014-08-01 16:25:21
Message-ID: CAFcNs+oLeXuuYr6qPn7JCctgQyPzE1oYUcV_=pRqYP6w4ybTnw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Aug 1, 2014 at 4:58 AM, Anastasia Lubennikova <
lubennikovaav(at)gmail(dot)com> wrote:
>
> Hi, hackers!
> I work on a GSoC project "Index-only scans for GIST"
>
https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014
>
> Repository is
> https://github.com/lubennikovaav/postgres/tree/indexonlygist2
> Patch is in attachments.
>
> It includes index-only scans for multicolumn GIST and new regression test.
> Fetch() method is realized for box and point opclasses.
>
> Documentation is not updated yet, but I'm going to do it till the end of
GSoC.
>
> I've got one question about query with OR condition. It is the last query
in regression test "gist_indexonly". It doesn't fail but it doensn't use
index-only scans. Could someone explain to me how it works?
> It seems to depend on build_paths_for_OR function. But I couldn't
understand how.
>

Very nice... please add your patch to the next commit fest [1].

Regards,

[1] https://commitfest.postgresql.org/action/commitfest_view?id=23

--
Fabrízio de Royes Mello
Consultoria/Coaching PostgreSQL
>> Timbira: http://www.timbira.com.br
>> Blog sobre TI: http://fabriziomello.blogspot.com
>> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
>> Twitter: http://twitter.com/fabriziomello


From: Anastasia Lubennikova <lubennikovaav(at)gmail(dot)com>
To: Fabrízio Mello <fabriziomello(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index-only scans for GIST
Date: 2014-08-01 16:46:44
Message-ID: CAP4vRV7xpFq_YQnqXOgj=K5aJ5WY7=wczaXSWZ92zFMGUXy0aw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thank you for comment
Patch is already added in Performance topic.

2014-08-01 20:25 GMT+04:00 Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com>
:

>
> On Fri, Aug 1, 2014 at 4:58 AM, Anastasia Lubennikova <
> lubennikovaav(at)gmail(dot)com> wrote:
> >
> > Hi, hackers!
> > I work on a GSoC project "Index-only scans for GIST"
> >
> https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014
> >
> > Repository is
> > https://github.com/lubennikovaav/postgres/tree/indexonlygist2
> > Patch is in attachments.
> >
> > It includes index-only scans for multicolumn GIST and new regression
> test.
> > Fetch() method is realized for box and point opclasses.
> >
> > Documentation is not updated yet, but I'm going to do it till the end of
> GSoC.
> >
> > I've got one question about query with OR condition. It is the last
> query in regression test "gist_indexonly". It doesn't fail but it doensn't
> use index-only scans. Could someone explain to me how it works?
> > It seems to depend on build_paths_for_OR function. But I couldn't
> understand how.
> >
>
> Very nice... please add your patch to the next commit fest [1].
>
> Regards,
>
>
> [1] https://commitfest.postgresql.org/action/commitfest_view?id=23
>
> --
> Fabrízio de Royes Mello
> Consultoria/Coaching PostgreSQL
> >> Timbira: http://www.timbira.com.br
> >> Blog sobre TI: http://fabriziomello.blogspot.com
> >> Perfil Linkedin: http://br.linkedin.com/in/fabriziomello
> >> Twitter: http://twitter.com/fabriziomello
>

--
Best regards,
Lubennikova Anastasia


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Anastasia Lubennikova <lubennikovaav(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index-only scans for GIST
Date: 2014-08-06 20:30:38
Message-ID: 53E2906E.8080707@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/01/2014 10:58 AM, Anastasia Lubennikova wrote:
> Hi, hackers!
> I work on a GSoC project "Index-only scans for GIST"
> https://wiki.postgresql.org/wiki/Support_for_Index-only_scans_for_GIST_GSoC_2014
>
> Repository is
> https://github.com/lubennikovaav/postgres/tree/indexonlygist2
> Patch is in attachments.

Thanks!

Some comments:

* I got this compiler warning:

gistget.c:556:5: warning: ISO C90 forbids mixed declarations and code
[-Wdeclaration-after-statement]
ListCell *tmpPageData = so->curPageData;
^

* I'm getting two regression failures with this (opr_sanity and join).

* After merging with master, build fails because of duplicate OIDs.

* The regression test queries that use LIMIT are not guaranteed to
always return the same rows, hence they're not very good regression test
cases. I'd suggest using more restricting WHERE clauses, so that each
query only returns a handful of rows.

* What's the reason for turning GISTScanOpaqueData.pageData from an
array to a List?

* I think it's leaking memory, in GIST scan context. I tested this with
a variant of the regression tests:

insert into gist_tbl select box(point(0.05*i, 0.05*i), point(0.05*i,
0.05*i)),
point(0.05*i, 0.05*i) FROM generate_series(0,
10000000) as i;
CREATE INDEX gist_tbl_point_index ON gist_tbl USING gist (p);

set enable_seqscan=off;
set enable_bitmapscan=off;

explain analyze select p from gist_tbl where p <@ box(point(0,0),
point(9999999,9999999)) and length(p::text) < 10;

while the final query runs, 'top' shows constantly increasing memory usage.

> It includes index-only scans for multicolumn GIST and new regression test.
> Fetch() method is realized for box and point opclasses.

Can we have Fetch functions for all the datatypes in btree_gist contrib
module, please? Do other contrib modules contain GiST opclasses that
could have Fetch functions?

> Documentation is not updated yet, but I'm going to do it till the end of
> GSoC.
>
> I've got one question about query with OR condition. It is the last query
> in regression test "gist_indexonly". It doesn't fail but it doensn't use
> index-only scans. Could someone explain to me how it works?
> It seems to depend on build_paths_for_OR
> <http://doxygen.postgresql.org/indxpath_8c.html#ae660d2e886355e53ed3b9ec693e4afd2>
> function.
> But I couldn't understand how.

The query is:

select * from gist_tbl
where b <@ box(point(5,5), point(6,6))
or p <@ box(point(0,0), point(100,100)) limit 10;

It cannot use an index(-only) scan for this, because a single index scan
can only return rows based on one key. In this case, you need to do two
scans, and then return the rows returned by either scan, removing
duplicates. A bitmap scan is possible, because it can remove the
duplicates, but the planner can't produce a plain index scan plan that
would do the same.

A common trick when that happens in a real-world application is to
re-write the query using UNION:

select * from gist_tbl
where b <@ box(point(5,5), point(6,6))
UNION
select * from gist_tbl
where p <@ box(point(0,0), point(100,100))
limit 10;

Although that doesn't seem to actually work:

ERROR: could not identify an equality operator for type box
LINE 1: select * from gist_tbl
^

but that's not your patch's fault, the same happens with unpatched master.

IOW, you don't need to worry about that case.

- Heikki


From: Anastasia Lubennikova <lubennikovaav(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index-only scans for GIST
Date: 2014-08-17 16:15:38
Message-ID: CAP4vRV4DZAqSR_h_i9xurum45uPK3bAkMvR619_b0YVbLVff4Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2014-08-07 0:30 GMT+04:00 Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>:

* I'm getting two regression failures with this (opr_sanity and join).
>

opr_sanity failure is corrected.
But there is remain question with join.
I check the latest version of my github repo and there's no fail in join
regression test
All 145 tests passed.
To tell the truth, I don't understand which changes could led to this
failure.
Could you show me regression diffs?
I want to understand what's wrong with the patch.

* The regression test queries that use LIMIT are not guaranteed to always
> return the same rows, hence they're not very good regression test cases.
> I'd suggest using more restricting WHERE clauses, so that each query only
> returns a handful of rows.
>

Thank you for comment, I rewrote wrong queries. But could you explain why
LIMIT queries may return different results? Is it happens because of
different query optimization?

* I think it's leaking memory, in GIST scan context. I tested this with a
> variant of the regression tests:
>
> insert into gist_tbl select box(point(0.05*i, 0.05*i), point(0.05*i,
> 0.05*i)),
> point(0.05*i, 0.05*i) FROM generate_series(0,
> 10000000) as i;
> CREATE INDEX gist_tbl_point_index ON gist_tbl USING gist (p);
>
> set enable_seqscan=off;
> set enable_bitmapscan=off;
>
> explain analyze select p from gist_tbl where p <@ box(point(0,0),
> point(9999999,9999999)) and length(p::text) < 10;
>
> while the final query runs, 'top' shows constantly increasing memory usage.

I don't think it's memory leak. After some increasing, memory using remain
the same. It works similar without using indexonlyscan.

--
Best regards,
Lubennikova Anastasia


From: Anastasia Lubennikova <lubennikovaav(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index-only scans for GIST
Date: 2014-08-18 06:33:48
Message-ID: CAP4vRV4Wa2LKgBzvPynWbMfwsUx7-KOJWKiEJ5T=p_mBqhwBng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Updated patch
* Compiler, merge and regression fails checked
* Regression tests was impoved
* GiST and amcanreturn docs updated
--
Best regards,
Lubennikova Anastasia

Attachment Content-Type Size
indexonlyscan_gist2.patch application/octet-stream 49.4 KB
indexonlyscan_gist_docs.patch application/octet-stream 6.3 KB

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Anastasia Lubennikova <lubennikovaav(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index-only scans for GIST
Date: 2014-08-18 08:05:25
Message-ID: 53F1B3C5.3050806@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 08/17/2014 07:15 PM, Anastasia Lubennikova wrote:
> 2014-08-07 0:30 GMT+04:00 Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>:
>
> * I'm getting two regression failures with this (opr_sanity and join).
>>
>
> opr_sanity failure is corrected.
> But there is remain question with join.
> I check the latest version of my github repo and there's no fail in join
> regression test
> All 145 tests passed.
> To tell the truth, I don't understand which changes could led to this
> failure.
> Could you show me regression diffs?

Sure, here you go. It seems like a change in a plan. At a quick glance
it seems harmless: the new plan is identical except that the left and
right side of a join have been reversed. But I don't understand either
why this patch would change that, so it needs to be investigated.

> * The regression test queries that use LIMIT are not guaranteed to always
>> return the same rows, hence they're not very good regression test cases.
>> I'd suggest using more restricting WHERE clauses, so that each query only
>> returns a handful of rows.
>
> Thank you for comment, I rewrote wrong queries. But could you explain why
> LIMIT queries may return different results? Is it happens because of
> different query optimization?

Imagine that you have a table with two rows, A and B. If you run a query
like "SELECT * FROM table LIMIT 1", the system can legally return either
row A or B, because there's no ORDER BY.

Now, admittedly you have a similar problem even without the LIMIT - the
system can legally return the rows in either order - but it's less of an
issue because at least you can quickly see from the diff that the result
set is in fact the same, the rows are just in different order. You could
fix that by adding an ORDER BY to all test queries, but we haven't done
that in the regression suite because then we would not have any test
coverage for cases where you don't have an ORDER BY. As a compromise,
test cases are usually written without an ORDER BY, but if e.g. the
buildfarm starts failing because of differences in the result set order
across platforms, then we add an ORDER BY to make it stable.

> * I think it's leaking memory, in GIST scan context. I tested this with a
>> variant of the regression tests:
>>
>> insert into gist_tbl select box(point(0.05*i, 0.05*i), point(0.05*i,
>> 0.05*i)),
>> point(0.05*i, 0.05*i) FROM generate_series(0,
>> 10000000) as i;
>> CREATE INDEX gist_tbl_point_index ON gist_tbl USING gist (p);
>>
>> set enable_seqscan=off;
>> set enable_bitmapscan=off;
>>
>> explain analyze select p from gist_tbl where p <@ box(point(0,0),
>> point(9999999,9999999)) and length(p::text) < 10;
>>
>> while the final query runs, 'top' shows constantly increasing memory usage.
>
> I don't think it's memory leak. After some increasing, memory using remain
> the same. It works similar without using indexonlyscan.

No, it's definitely a leak caused by the patch. Test with the attached
patch, which prints out to the server log the amount of memory used by
the GiST scan memory context every 10000 rows. It clearly shows
increasing memory usage all the way to the end of the query.

It's cleaned up at the end of the query, but that's not good enough
because for a large query you might accumulate gigabytes of leaked
memory until the query has finished. If you (manually) apply the same
patch to git master, you'll see that the memory usage stays consistent
and small.

- Heikki

Attachment Content-Type Size
regression.diffs text/plain 14.8 KB
show-gistgettuple-leak-1.patch text/x-diff 541 bytes

From: Thom Brown <thom(at)linux(dot)com>
To: Anastasia Lubennikova <lubennikovaav(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Index-only scans for GIST
Date: 2014-10-27 11:33:50
Message-ID: CAA-aLv5cArNV6PtAfkqpro3prV0Y4_r1L47Yie2KMK_C=5K2Kg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 18 August 2014 09:05, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

> On 08/17/2014 07:15 PM, Anastasia Lubennikova wrote:
>
>> 2014-08-07 0:30 GMT+04:00 Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>:
>>
>> * I'm getting two regression failures with this (opr_sanity and join).
>>
>>>
>>>
>> opr_sanity failure is corrected.
>> But there is remain question with join.
>> I check the latest version of my github repo and there's no fail in join
>> regression test
>> All 145 tests passed.
>> To tell the truth, I don't understand which changes could led to this
>> failure.
>> Could you show me regression diffs?
>>
>
> Sure, here you go. It seems like a change in a plan. At a quick glance it
> seems harmless: the new plan is identical except that the left and right
> side of a join have been reversed. But I don't understand either why this
> patch would change that, so it needs to be investigated.
>
> * The regression test queries that use LIMIT are not guaranteed to always
>>
>>> return the same rows, hence they're not very good regression test cases.
>>> I'd suggest using more restricting WHERE clauses, so that each query only
>>> returns a handful of rows.
>>>
>>
>> Thank you for comment, I rewrote wrong queries. But could you explain why
>> LIMIT queries may return different results? Is it happens because of
>> different query optimization?
>>
>
> Imagine that you have a table with two rows, A and B. If you run a query
> like "SELECT * FROM table LIMIT 1", the system can legally return either
> row A or B, because there's no ORDER BY.
>
> Now, admittedly you have a similar problem even without the LIMIT - the
> system can legally return the rows in either order - but it's less of an
> issue because at least you can quickly see from the diff that the result
> set is in fact the same, the rows are just in different order. You could
> fix that by adding an ORDER BY to all test queries, but we haven't done
> that in the regression suite because then we would not have any test
> coverage for cases where you don't have an ORDER BY. As a compromise, test
> cases are usually written without an ORDER BY, but if e.g. the buildfarm
> starts failing because of differences in the result set order across
> platforms, then we add an ORDER BY to make it stable.
>
> * I think it's leaking memory, in GIST scan context. I tested this with a
>>
>>> variant of the regression tests:
>>>
>>> insert into gist_tbl select box(point(0.05*i, 0.05*i), point(0.05*i,
>>> 0.05*i)),
>>> point(0.05*i, 0.05*i) FROM generate_series(0,
>>> 10000000) as i;
>>> CREATE INDEX gist_tbl_point_index ON gist_tbl USING gist (p);
>>>
>>> set enable_seqscan=off;
>>> set enable_bitmapscan=off;
>>>
>>> explain analyze select p from gist_tbl where p <@ box(point(0,0),
>>> point(9999999,9999999)) and length(p::text) < 10;
>>>
>>> while the final query runs, 'top' shows constantly increasing memory
>>> usage.
>>>
>>
>> I don't think it's memory leak. After some increasing, memory using remain
>> the same. It works similar without using indexonlyscan.
>>
>
> No, it's definitely a leak caused by the patch. Test with the attached
> patch, which prints out to the server log the amount of memory used by the
> GiST scan memory context every 10000 rows. It clearly shows increasing
> memory usage all the way to the end of the query.
>
> It's cleaned up at the end of the query, but that's not good enough
> because for a large query you might accumulate gigabytes of leaked memory
> until the query has finished. If you (manually) apply the same patch to git
> master, you'll see that the memory usage stays consistent and small.
>

Hi Anastasia,

Do you have time to address the issues brought up in Heikki's review? It
would be good if we could your work into PostgreSQL 9.5.

Thanks

Thom