[PATCH]Tablesample Submission

Lists: pgsql-hackers
From: Qi Huang <huangqiyx(at)outlook(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: [PATCH]Tablesample Submission
Date: 2012-08-21 01:52:27
Message-ID: BAY002-W5461E1767E36C62C8FA47B0B80@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi, hackers
I made the final version tablesample patch. It is implementing SYSTEM and BERNOULLI sample method, which is basically "feature-complete". The regression test is also included in this patch.
There is an wiki documentation on https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation. The detail about this patch and this project is all included in this documentation.
The pgbench test result below:
Without patch:
transaction
type: SELECT only

scaling
factor: 10

query
mode: simple

duration:
30 s

number of
clients: 2

number of
threads: 2

number of
transactions actually processed: 304857

tps =
10161.803463 (including connections establishing)

tps =
10162.963554 (excluding connections establishing)

number of
clients: 4

number of
threads: 2

number of
transactions actually processed: 245072

tps =
8168.796552 (including connections establishing)

tps =
8173.947876 (excluding connections establishing)

number of
clients: 8

number of
threads: 2

number of
transactions actually processed: 218426

tps =
7280.624465 (including connections establishing)

tps =
7284.863386 (excluding connections establishing)

scaling
factor: 10

number of
clients: 16

number of
threads: 2

number of
transactions actually processed: 199204

tps =
6636.427331 (including connections establishing)

tps =
6650.783233 (excluding connections establishing)

scaling
factor: 10

number of
clients: 32

number of
threads: 2

number of
transactions actually processed: 186793

tps =
6221.025810 (including connections establishing)

tps =
6238.904071 (excluding connections establishing)

With Patch:
number of clients: 2
number of threads: 2
number of transactions actually processed: 329926
tps = 10997.232742 (including connections establishing)
tps = 10998.712762 (excluding connections establishing)
number of clients: 4
number of threads: 2
number of transactions actually processed: 261993
tps = 8732.875565 (including connections establishing)
tps = 8735.013550 (excluding connections establishing)
number of clients: 8
number of threads: 2
number of transactions actually processed: 203579
tps = 6785.601601 (including connections establishing)
tps = 6788.043016 (excluding connections establishing)
number of clients: 16
number of threads: 2
number of transactions actually processed: 190773
tps = 6354.824262 (including connections establishing)
tps = 6361.348206 (excluding connections establishing)
number of clients: 32
number of threads: 2
number of transactions actually processed: 190801
tps = 6353.821626 (including connections establishing)
tps = 6380.813409 (excluding connections establishing)

Thanks and Best RegardsHuang Qi VictorComputer Science of National University of Singapore

Attachment Content-Type Size
tablesample_context.patch application/octet-stream 99.7 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Qi Huang <huangqiyx(at)outlook(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH]Tablesample Submission
Date: 2012-08-21 14:46:20
Message-ID: CA+TgmoZ6ufs=caRPETferpMQaYWWNk_wvf_1DBAYPHrE+f=HrQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Aug 20, 2012 at 9:52 PM, Qi Huang <huangqiyx(at)outlook(dot)com> wrote:
> Hi, hackers
> I made the final version tablesample patch. It is implementing SYSTEM
> and BERNOULLI sample method, which is basically "feature-complete". The
> regression test is also included in this patch.
> There is an wiki documentation on
> https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation. The detail
> about this patch and this project is all included in this documentation.

Please add your patch here:

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Qi Huang <huangqiyx(at)outlook(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH]Tablesample Submission
Date: 2012-08-21 15:08:41
Message-ID: BAY002-W106374866E56CA9F6189CA8B0B80@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Please add your patch here:
>
> https://commitfest.postgresql.org/action/commitfest_view/open
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Hi, Robert
I added it under "Miscellaneous".
https://commitfest.postgresql.org/action/patch_view?id=918

Best RegardsHuang Qi VictorComputer Science of National University of Singapore


From: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
To: Qi Huang <huangqiyx(at)outlook(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH]Tablesample Submission
Date: 2012-09-18 09:32:50
Message-ID: CAP7QgmkMhjoUi3_RetR7nH4ZOm=B_M7BWdAkv5cf-rekYU5H0w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Aug 21, 2012 at 8:08 AM, Qi Huang <huangqiyx(at)outlook(dot)com> wrote:
>> Please add your patch here:
>>
>> https://commitfest.postgresql.org/action/commitfest_view/open
>>
>> --
>> Robert Haas
>> EnterpriseDB: http://www.enterprisedb.com
>> The Enterprise PostgreSQL Company
>
> Hi, Robert
> I added it under "Miscellaneous".
> https://commitfest.postgresql.org/action/patch_view?id=918
>
>

Patch does not apply cleanly against latest master. outfuncs.c,
allpath.c and cost.h have rejected parts. The make check failed in a
lot of cases up to 26 out of 133. I didn't look into each issue but I
suggest rebasing on the latest master and making sure the regression
test passes.

Some of the patch don't follow our coding standard. Please adjust
brace positions, for example. For the header include list and
Makefile, place a new files in alphabetical order.

The patch doesn't include any documentation. Consider add some doc
patch for such a big feature like this.

You should update kwlist.h for REPEATABLE but I'm not sure if
REPEATABLE should become a reserved keyword yet.

I don't see why you created T_TableSampleInfo. TableSampleInfo looks
fine with a simple struct rather than a Node.

If we want to disable a cursor over a sampling table, we should check
it in the parser not the planner. In the wiki page, one of the TODOs
says about cursor support, but how much difficult is it? How does the
standard say?

s/skiped/skipped/

Don't we need to reset seed on ExecReScanSampleScan? Should we add a
new executor node SampleScan? It seems everything about random
sampling is happening under heapam.

It looks like BERNOULLI allocates heap tuple array beforehand, and
copy all the tuples into it. This doesn't look acceptable when you
are dealing with a large number of rows in the table.

As wiki says, BERNOULLI relies on the statistics of the table, which
doesn't sound good to me. Of course we could say this is our
restriction and say good-bye to users who hadn't run ANALYZE first,
but it is too hard for a normal users to use it. We may need
quick-and-rough count(*) for this.

That is pretty much I have so far. I haven't read all the code nor
the standard, so I might be wrong somewhere.

Thanks,
--
Hitoshi Harada


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
Cc: Qi Huang <huangqiyx(at)outlook(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH]Tablesample Submission
Date: 2012-10-17 14:59:04
Message-ID: 20121017145904.GC5217@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hitoshi Harada escribió:

> Patch does not apply cleanly against latest master. outfuncs.c,
> allpath.c and cost.h have rejected parts. The make check failed in a
> lot of cases up to 26 out of 133. I didn't look into each issue but I
> suggest rebasing on the latest master and making sure the regression
> test passes.

We've been waiting for a rebase for long enough, so I've marked this
patch as Returned with Feedback (for which we thank Hitoshi Harada).
Since this is said to be useful functionality, please make sure you
update the patch and resubmit to the next commitfest. Thanks.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Qi Huang <huangqiyx(at)outlook(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, "umi(dot)tanuki(at)gmail(dot)com" <umi(dot)tanuki(at)gmail(dot)com>, "alvherre(at)2ndquadrant(dot)com" <alvherre(at)2ndquadrant(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH]Tablesample Submission
Date: 2012-11-05 03:22:12
Message-ID: BAY002-W250D8B20C2BCF2433FB644B0640@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dear hackers Sorry for not replying the patch review. I didn't see the review until recently as my mail box is full of Postgres mails and I didn't notice the one for me, my mail box configuration problem. I am still kind of busy with my university final year project. I shall not have time to work on updating the patch until this semester finishes which is December. I will work on then. Thanks.
Best RegardsHuang Qi VictorComputer Science of National University of Singapore

From: huangqiyx(at)outlook(dot)com
To: robertmhaas(at)gmail(dot)com
CC: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] [PATCH]Tablesample Submission
Date: Tue, 21 Aug 2012 23:08:41 +0800

> Please add your patch here:
>
> https://commitfest.postgresql.org/action/commitfest_view/open
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Hi, Robert
I added it under "Miscellaneous".
https://commitfest.postgresql.org/action/patch_view?id=918

Best RegardsHuang Qi VictorComputer Science of National University of Singapore


From: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
To: Qi Huang <huangqiyx(at)outlook(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "umi(dot)tanuki(at)gmail(dot)com" <umi(dot)tanuki(at)gmail(dot)com>, "alvherre(at)2ndquadrant(dot)com" <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH]Tablesample Submission
Date: 2012-12-10 04:13:27
Message-ID: CAJKUy5j=S9S0tvMpuE5+tu_7KCjdzWExD3pY=P4voUBYi0jeyg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 4, 2012 at 10:22 PM, Qi Huang <huangqiyx(at)outlook(dot)com> wrote:
> Dear hackers
> Sorry for not replying the patch review. I didn't see the review until
> recently as my mail box is full of Postgres mails and I didn't notice the
> one for me, my mail box configuration problem.
> I am still kind of busy with my university final year project. I shall
> not have time to work on updating the patch until this semester finishes
> which is December. I will work on then.

While we are still in the middle of a commitfest, i'm curious...
should we still wait for an update of this patch?

i know, any update on this should go to the next commitfest but i
wanted to ask before i forget about it

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH]Tablesample Submission
Date: 2013-01-17 18:32:42
Message-ID: 50F843CA.6070407@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/04/2012 07:22 PM, Qi Huang wrote:
> Dear hackers Sorry for not replying the patch review. I didn't see the review until recently as my mail box is full of Postgres mails and I didn't notice the one for me, my mail box configuration problem. I am still kind of busy with my university final year project. I shall not have time to work on updating the patch until this semester finishes which is December. I will work on then. Thanks.
> Best RegardsHuang Qi VictorComputer Science of National University of Singapore

Did you ever do the update of the patch?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH]Tablesample Submission
Date: 2013-01-17 19:04:47
Message-ID: CA+U5nMLTSkRomv7YHqvTS3E5iSqq+vS7MCDrr=ZqOg7Qx5k86g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 17 January 2013 18:32, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 11/04/2012 07:22 PM, Qi Huang wrote:
>> Dear hackers Sorry for not replying the patch review. I didn't see the review until recently as my mail box is full of Postgres mails and I didn't notice the one for me, my mail box configuration problem. I am still kind of busy with my university final year project. I shall not have time to work on updating the patch until this semester finishes which is December. I will work on then. Thanks.
>> Best RegardsHuang Qi VictorComputer Science of National University of Singapore
>
> Did you ever do the update of the patch?

We aren't just waiting for a rebase, we're waiting for Hitoshi's
comments to be addressed.

I would add to them by saying I am very uncomfortable with the idea of
allowing a TABLESAMPLE clause on an UPDATE or a DELETE. If you really
want that you can use a sub-select.

Plus the patch contains zero documentation.

So I can't see this going anywhere for 9.3. I've moved it to CF1 of
9.4 marked Waiting on Author

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH]Tablesample Submission
Date: 2013-01-17 20:02:06
Message-ID: 50F858BE.4010504@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> So I can't see this going anywhere for 9.3. I've moved it to CF1 of
> 9.4 marked Waiting on Author

Agreed. I wish I'd noticed that it got lost earlier.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH]Tablesample Submission
Date: 2013-01-19 05:38:30
Message-ID: CAJKUy5hAggbur4+bwRFaC-xJr5Xzm9yXE2DNodKn-AY48A7uWQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jan 17, 2013 at 2:04 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On 17 January 2013 18:32, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> On 11/04/2012 07:22 PM, Qi Huang wrote:
>>> Dear hackers Sorry for not replying the patch review. I didn't see the review until recently as my mail box is full of Postgres mails and I didn't notice the one for me, my mail box configuration problem. I am still kind of busy with my university final year project. I shall not have time to work on updating the patch until this semester finishes which is December. I will work on then. Thanks.
>>> Best RegardsHuang Qi VictorComputer Science of National University of Singapore
>>
>> Did you ever do the update of the patch?
>
> We aren't just waiting for a rebase, we're waiting for Hitoshi's
> comments to be addressed.
>
> I would add to them by saying I am very uncomfortable with the idea of
> allowing a TABLESAMPLE clause on an UPDATE or a DELETE. If you really
> want that you can use a sub-select.
>

also, i don't think that the REPEATABLE clause should be included in a
first revision.
and if we ever want to add more sample methods we can't just put
BERNOULLI nor SYSTEM in gram.y, a new catalog is probably needed
there.

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157


From: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
To: Qi Huang <huangqiyx(at)outlook(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "umi(dot)tanuki(at)gmail(dot)com" <umi(dot)tanuki(at)gmail(dot)com>, "alvherre(at)2ndquadrant(dot)com" <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH]Tablesample Submission
Date: 2013-05-17 20:46:05
Message-ID: CAJKUy5h9t-ZN54cX1MX32WquWA_xgQWQDXaTiu_es0oJ-zK68Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 4, 2012 at 10:22 PM, Qi Huang <huangqiyx(at)outlook(dot)com> wrote:
> Dear hackers
> Sorry for not replying the patch review. I didn't see the review until
> recently as my mail box is full of Postgres mails and I didn't notice the
> one for me, my mail box configuration problem.
> I am still kind of busy with my university final year project. I shall
> not have time to work on updating the patch until this semester finishes
> which is December. I will work on then.

Hi,

Should we expect an updated patch for next commitfest?

--
Jaime Casanova www.2ndQuadrant.com
Professional PostgreSQL: Soporte 24x7 y capacitación
Phone: +593 4 5107566 Cell: +593 987171157


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Jaime Casanova <jaime(at)2ndquadrant(dot)com>
Cc: Qi Huang <huangqiyx(at)outlook(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "umi(dot)tanuki(at)gmail(dot)com" <umi(dot)tanuki(at)gmail(dot)com>, "alvherre(at)2ndquadrant(dot)com" <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH]Tablesample Submission
Date: 2013-05-22 08:03:21
Message-ID: CA+U5nM+tdmhoSnYWhWtBjdnKO5TZ9cnaYh3Jg8oKFuWpDHQd=g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 17 May 2013 21:46, Jaime Casanova <jaime(at)2ndquadrant(dot)com> wrote:
> On Sun, Nov 4, 2012 at 10:22 PM, Qi Huang <huangqiyx(at)outlook(dot)com> wrote:
>> Dear hackers
>> Sorry for not replying the patch review. I didn't see the review until
>> recently as my mail box is full of Postgres mails and I didn't notice the
>> one for me, my mail box configuration problem.
>> I am still kind of busy with my university final year project. I shall
>> not have time to work on updating the patch until this semester finishes
>> which is December. I will work on then.

> Should we expect an updated patch for next commitfest?

This was added to CF1 of 9.4. The patch is nowhere near committable
and hasn't been worked on at all since last time it was submitted.

It's important that we have an efficient implementation of TABLESAMPLE
in Postgres.

I'm going to remove it from CF, for now, but I'll also take
responsibility for this for 9.4, barring objections.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com>
Cc: Qi Huang <huangqiyx(at)outlook(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [PATCH]Tablesample Submission
Date: 2013-06-06 19:08:52
Message-ID: CA+U5nMLRtfbA-Zv0q1UvZ9+aMEyZLMAS3=4uDSSZyBA4uo8rGw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 18 September 2012 10:32, Hitoshi Harada <umi(dot)tanuki(at)gmail(dot)com> wrote:

> As wiki says, BERNOULLI relies on the statistics of the table, which
> doesn't sound good to me. Of course we could say this is our
> restriction and say good-bye to users who hadn't run ANALYZE first,
> but it is too hard for a normal users to use it. We may need
> quick-and-rough count(*) for this.

For Bernoulli sampling, SQL Standard says "Further, whether a given
row of RT is included in result of TF is independent of whether other
rows of RT are included in result of TF."

Which means BERNOULLI sampling looks essentially identical to using

WHERE random() <= ($percent/100)

So my proposed implementation route for bernoulli sampling is to
literally add an AND-ed qual that does a random() test (and
repeatability also). That looks fairly simple and it is still
accurate, because it doesn't matter whether we do the indpendent test
to include the tuple before or after any other quals. I realise that
isn't a cool and hip approach, but it works and is exactly accurate.
Which would change the patch quite a bit.

Taking the random() approach would mean we don't rely on statistics either.

Thoughts?

SYSTEM sampling uses a completely different approach and is the really
interesting part of this feature.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services