Re: Gsoc2012 Idea --- Social Network database schema

Lists: pgsql-hackers
From: HuangQi <huangqiyx(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-17 15:50:57
Message-ID: CADf_PzLziVDuiNB1NmBiYc+DRsjC3WGSXZP8z_rxkAKq1SBn6g@mail.gmail.com

Hi, all
    I am a student of Computer Science and Applied Math at a university in
Singapore. I'm planning to join Google Summer of Code 2012 with PostgreSQL.
It's quite an honor to join the PostgreSQL hacker community.
    I have some PostgreSQL development experience from my school projects.
I'm doing my Final Year Project (FYP) this year, which focuses on databases
and will be implemented in PostgreSQL. My professor worked with some PhD
students on social network databases. They designed a database schema,
called RSN (relationship on social network). My FYP is about studying the
properties of RSN and trying to implement more efficient optimisation
algorithms for databases using the RSN schema. Currently I have some ideas
on RSN. The schema is acyclic, and it preserves some special join
properties. Acyclic database schemas have been well studied, and there are
many papers on them. I expect that three months of full-time work on this
topic would be enough to develop the idea as well as an implementation.
    Please help me check whether this could become a topic for the
Postgres community. It focuses on a particular social network schema, and
this schema, to tell the truth, is not quite mature yet. If this topic is
not suitable, I would like to look at other topics in query processing and
optimisation, and do the project together with my FYP this summer. If you
have a good idea for a suitable topic, please recommend it to me.
    I would be quite glad if you could offer me some advice. Thanks a lot
for your help!

--
Best Regards
Huang Qi Victor


From: Daniel Farina <daniel(at)heroku(dot)com>
To: HuangQi <huangqiyx(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-18 01:13:41
Message-ID: CAAZKuFYEbf_oYbcM1dxQuahV4FM4Nfw4U=M91fBPf0DQEZmOJA@mail.gmail.com

On Sat, Mar 17, 2012 at 8:50 AM, HuangQi <huangqiyx(at)gmail(dot)com> wrote:
>     I would be quite glad if you could offer me some advice. Thanks a lot
> for your help!

Thank you for your interest! However, I am a little confused precisely
what you are thinking about implementing. Are there particular access
methods or operators that you think are useful in this problem space,
or changes to the planner?

As long as you are soliciting for suggestions, I'll make one...

One that bites me (and my organization) all the time is the lack of
the access method skip scan (also called "loose index scan"). It's a
killer for append-mostly tables that track a much smaller number of
entities than the number of records in the table, and we have a
grotesque hack to do it right now. In the more "social" space the
problem reappears in the form of newsfeeds, so I think that work would
have good impact across a nice spectrum of users.

Another skip-related feature that would be very nice is the
SQL-standard TABLESAMPLE feature. I wonder if the notion of a
"SkipKind" could be taught to the executor that would provide cohesion
of implementation for most features that involve skipping a lot of the
rows in a table while continuing a scan.

--
fdr
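[The emulation Daniel alludes to is commonly written as a recursive CTE
that repeatedly jumps to the next distinct key via the index. A minimal
sketch, assuming a hypothetical append-mostly table events(user_id int,
payload text) with a btree index on user_id:]

```sql
-- Emulated loose index scan: equivalent to SELECT DISTINCT user_id
-- FROM events, but touching only one index entry per distinct user_id.
WITH RECURSIVE distinct_users AS (
    (SELECT user_id FROM events ORDER BY user_id LIMIT 1)
  UNION ALL
    SELECT (SELECT e.user_id
            FROM events e
            WHERE e.user_id > d.user_id
            ORDER BY e.user_id
            LIMIT 1)
    FROM distinct_users d
    WHERE d.user_id IS NOT NULL
)
SELECT user_id FROM distinct_users WHERE user_id IS NOT NULL;
```

[Each recursion step becomes a single index probe, so the cost scales with
the number of distinct entities rather than the number of rows in the
table.]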


From: HuangQi <huangqiyx(at)gmail(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-18 03:48:14
Message-ID: CADf_PzJXbaE_5TqRSU8rymhoAZ_pM_ac+Vz7G+FznbCaPe88bQ@mail.gmail.com

(Sorry, Daniel. Forgot to cc pgsql-hackers.)
Hi, Daniel
Thanks a lot for your response.
    As I see it for now, in my FYP, since the acyclic schema has the
property that it admits a join tree, I will check how many join trees it
has and investigate whether there is a best one for the RSN schema. If
there is, I will modify the planner to use this join tree directly when it
detects an RSN database, so there is no need to use the System R algorithm
to search through all possible join trees. The implementation is narrowed
to the RSN schema only, which might be too constrained for Postgres as a
generic database system.
    For 'loose index scan', I googled it and found these two sites useful
for understanding it:
http://dev.mysql.com/doc/refman/5.5/en/loose-index-scan.html
http://wiki.postgresql.org/wiki/Loose_indexscan
This is already implemented in MySQL, while Postgres can only emulate the
access method. Do you have any mail threads discussing the current design
and progress?
About the second topic, so currently TABLESAMPLE is not implemented
inside Postgres? I didn't see this query before, but I googled it just now
and the query seems very weird and interesting.
http://www.fotia.co.uk/fotia/DY.18.TheTableSampleClause.aspx
Still, do you have any mail thread talking about this?
Thanks.

--
Best Regards
Huang Qi Victor


From: Daniel Farina <daniel(at)heroku(dot)com>
To: HuangQi <huangqiyx(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-18 10:38:45
Message-ID: CAAZKuFY8ADdo8Zve96ONGrXFqCmAgJ1ikZWj7GixAJJ4d7gNqA@mail.gmail.com

On Sat, Mar 17, 2012 at 8:48 PM, HuangQi <huangqiyx(at)gmail(dot)com> wrote:
>     About the second topic, so currently TABLESAMPLE is not implemented
> inside Postgres? I didn't see this query before, but I googled it just now
> and the query seems very weird and
> interesting. http://www.fotia.co.uk/fotia/DY.18.TheTableSampleClause.aspx
>     Still, do you have any mail thread talking about this?

I think there may be a few, but there's a nice implementation plan
discussed by Neil Conway and written into slides from a few years ago:
http://www.pgcon.org/2007/schedule/attachments/9-Introduction_to_Hacking_PostgreSQL_Neil_Conway.pdf

He also had his implementation, although at this point some of the
bitrot will be intense:

http://www.neilconway.org/talks/hacking/

I also seem to remember writing this (to some degree) as a student as
part of a class project, so a full-blown production implementation in
a summer sounds reasonable, unless someone has thought more about this
and run into some icebergs. I'm not sure exactly what the blockers
were to this being committed back in 2007 (not to suggest there
weren't any).

I haven't thought enough about skipscan, but there are a number more
unknowns there for me...

--
fdr


From: HuangQi <huangqiyx(at)gmail(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-19 03:11:28
Message-ID: CADf_PzKFWpb3EKfOBOL0z6ikeP6EXfZAX0fF=RsdHfCDv=OcZQ@mail.gmail.com

The implementation seems to be done quite fully. There is even a patch
file. Why is the implementation not added into the release of Postgres? As
so much has already been done, what could I do in this case for the Gsoc?

On Sun, Mar 18, 2012 at 6:38 PM, Daniel Farina <daniel(at)heroku(dot)com> wrote:

> On Sat, Mar 17, 2012 at 8:48 PM, HuangQi <huangqiyx(at)gmail(dot)com> wrote:
> > About the second topic, so currently TABLESAMPLE is not implemented
> > inside Postgres? I didn't see this query before, but I googled it just
> now
> > and the query seems very weird and
> > interesting.
> http://www.fotia.co.uk/fotia/DY.18.TheTableSampleClause.aspx
> > Still, do you have any mail thread talking about this?
>
> I think there may be a few, but there's a nice implementation plan
> discussed by Neil Conway and written into slides from a few years ago:
>
> http://www.pgcon.org/2007/schedule/attachments/9-Introduction_to_Hacking_PostgreSQL_Neil_Conway.pdf
>
> He also had his implementation, although at this point some of the
> bitrot will be intense:
>
> http://www.neilconway.org/talks/hacking/
>
> I also seem to remember writing this (to some degree) as a student as
> part of a class project, so a full-blown production implementation in
> a summer sounds reasonable, unless someone has thought more about this
> and run into some icebergs. I'm not sure exactly what the blockers
> were to this being committed back in 2007 (not to suggest there
> weren't any).
>
> I haven't thought enough about skipscan, but there are a number more
> unknowns there for me...
>
> --
> fdr
>

--
Best Regards
Huang Qi Victor


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-20 01:17:59
Message-ID: 4F67DAC7.9080804@agliodbs.com

On 3/18/12 8:11 PM, HuangQi wrote:
> The implementation seems to be done quite fully. There is even a patch
> file. Why is the implementation not added into the release of Postgres? As
> so much has already been done, what could I do in this case for the Gsoc?

That would be good for you to research. archives.postgresql.org will
help you find the discussions around that.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-20 01:36:54
Message-ID: CAAZKuFYsQYZzR-Yh0PcL9bxx8J004S6gXEkRPmAOtofuEhGYEg@mail.gmail.com

On Mon, Mar 19, 2012 at 6:17 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 3/18/12 8:11 PM, HuangQi wrote:
>> The implementation seems to be done quite fully. There is even a patch
>> file. Why is the implementation not added into the release of Postgres? As
>> so much has already been done, what could I do in this case for the Gsoc?
>
> That would be good for you to research.  archives.postgresql.org will
> help you find the discussions around that.

I actually tried to find out, personally...not sure if I was searching
wrongly, but searching for TABLESAMPLE did not yield a cornucopia of
useful conversations at the right time in history (~2007), even when
the search is given a broad date horizon (all), so I, too, am
uninformed as to the specific objections.

http://www.postgresql.org/search/?m=1&q=TABLESAMPLE&l=&d=-1&s=d

--
fdr


From: Qi Huang <huangqiyx(at)hotmail(dot)com>
To: <daniel(at)heroku(dot)com>, <josh(at)agliodbs(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-20 03:47:52
Message-ID: BAY159-W477D2EA0481DCE3ABB9C25A3430@phx.gbl


> On Mon, Mar 19, 2012 at 6:17 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> > On 3/18/12 8:11 PM, HuangQi wrote:
> >> The implementation seems to be done quite fully. There is even a patch
> >> file. Why is the implementation not added into the release of Postgres? As
> >> so much has already been done, what could I do in this case for the Gsoc?
> >
> > That would be good for you to research. archives.postgresql.org will
> > help you find the discussions around that.
>
> I actually tried to find out, personally...not sure if I was searching
> wrongly, but searching for TABLESAMPLE did not yield a cornucopia of
> useful conversations at the right time in history (~2007), even when
> the search is given a broad date horizon (all), so I, too, am
> uninformed as to the specific objections.
>
> http://www.postgresql.org/search/?m=1&q=TABLESAMPLE&l=&d=-1&s=d
>

    I sent a mail to Neil Conway asking him about this. Hope he can give a
good answer. While waiting for the response, how about the skip scan?
Daniel mentioned there are still some unknowns there. I found this mail
thread suggesting adding the skip scan to the TODO list:
http://archives.postgresql.org/pgsql-bugs/2010-03/msg00144.php
and also this thread discussing it:
http://archives.postgresql.org/pgsql-hackers/2010-03/msg00328.php
    Not sure whether this is feasible for Gsoc....

Best Regards
Huang Qi Victor


From: Neil Conway <neil(dot)conway(at)gmail(dot)com>
To: Qi Huang <huangqiyx(at)hotmail(dot)com>
Cc: daniel(at)heroku(dot)com, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-20 21:12:45
Message-ID: CAOW5sYbYMVf80r5rd4uh=XbKpt5iKYHgADhVsaqrzZr65Lb06w@mail.gmail.com

2012/3/19 Qi Huang <huangqiyx(at)hotmail(dot)com>:
>> I actually tried to find out, personally...not sure if I was searching
>> wrongly, but searching for TABLESAMPLE did not yield a cornucopia of
>> useful conversations at the right time in history (~2007), even when
>> the search is given a broad date horizon (all), so I, too, am
>> uninformed as to the specific objections.
>>
>> http://www.postgresql.org/search/?m=1&q=TABLESAMPLE&l=&d=-1&s=d
>
> I sent a mail to Neil Conway asking him about this. Hope he could give a
> good answer.

I never tried to get TABLESAMPLE support into the main PostgreSQL tree
-- I just developed the original code as an exercise for the purposes
of the talk. Implementing TABLESAMPLE would probably be a reasonable
GSoc project.

My memory of the details is fuzzy, but one thing to check is whether
the approach taken by my patch (randomly choose heap pages and then
return all the live tuples in a chosen page) actually meets the
standard's requirements -- obviously it is not true that each heap
page has the same number of live tuples, so you aren't getting a truly
random sample.

Neil


From: Qi Huang <huangqiyx(at)hotmail(dot)com>
To: <neil(dot)conway(at)gmail(dot)com>
Cc: <daniel(at)heroku(dot)com>, <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-21 02:22:49
Message-ID: BAY159-W61A45C1C6B5AC25ECE9B08A3400@phx.gbl

> Date: Tue, 20 Mar 2012 14:12:45 -0700
> Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
> From: neil(dot)conway(at)gmail(dot)com
> To: huangqiyx(at)hotmail(dot)com
> CC: daniel(at)heroku(dot)com; josh(at)agliodbs(dot)com; pgsql-hackers(at)postgresql(dot)org
>
> 2012/3/19 Qi Huang <huangqiyx(at)hotmail(dot)com>:
> >> I actually tried to find out, personally...not sure if I was searching
> >> wrongly, but searching for TABLESAMPLE did not yield a cornucopia of
> >> useful conversations at the right time in history (~2007), even when
> >> the search is given a broad date horizon (all), so I, too, am
> >> uninformed as to the specific objections.
> >>
> >> http://www.postgresql.org/search/?m=1&q=TABLESAMPLE&l=&d=-1&s=d
> >
> > I sent a mail to Neil Conway asking him about this. Hope he could give a
> > good answer.
>
> I never tried to get TABLESAMPLE support into the main PostgreSQL tree
> -- I just developed the original code as an exercise for the purposes
> of the talk. Implementing TABLESAMPLE would probably be a reasonable
> GSoc project.
>
> My memory of the details is fuzzy, but one thing to check is whether
> the approach taken by my patch (randomly choose heap pages and then
> return all the live tuples in a chosen page) actually meets the
> standard's requirements -- obviously it is not true that each heap
> page has the same number of live tuples, so you aren't getting a truly
> random sample.
>
> Neil
>

    Thanks so much, Neil. I think I kind of understand the situation now.
The implementation posted by Neil was for the purpose of the talk, thus
rushed, and may not be up to the standard of the Postgres community. Also,
Neil mentioned the PRNG state in the patch is buggy, and maybe some other
things are too. Thus, in the Gsoc project, I could come to understand the
details of Neil's implementation, fix the bugs, make the code fit the
community standard, and test it. Is there any comment on this?

Best Regards and Thanks
Huang Qi Victor
Computer Science, National University of Singapore


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Qi Huang <huangqiyx(at)hotmail(dot)com>
Cc: neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-21 14:13:57
Message-ID: CA+TgmobknWzbU=JeCBKsrRFiaCpD7JWgsYGsJapTJKbmoBsMzw@mail.gmail.com

On Tue, Mar 20, 2012 at 10:22 PM, Qi Huang <huangqiyx(at)hotmail(dot)com> wrote:
> Thanks so much, Neil.
> I think I kind of understand the situation for now. The implementation
> posted by Neil was for the purpose of the talk, thus rushed and may not be
> up to standard of Postgres Community. Also Neil mentioned the PRNG state in
> the patch is buggy, and maybe also some others. Thus, in the Gsoc project, I
> could understand the details of Neil's implementation, fix the bugs, make
> the code fit for the community standard, and test.
> Is there any comment on this?

In addition to that, you'll probably find that the old patch doesn't
apply any more, and you'll need to fix a lot of things to get it
working again. The code has changed a lot in the meantime.

One thing we should probably try to establish before you get started
working on this is whether people want the feature, which is basically
the ability to write something like this in the FROM clause of a
query:

table_name TABLESAMPLE { BERNOULLI | SYSTEM } ( sample_percent ) [
REPEATABLE ( repeat_seed ) ]

I have at present no position on whether we want that or not, but
maybe someone else does. The upside is that it would be a more efficient
replacement for the ORDER BY random() trick that is often used today;
the downside is that it requires dedicated syntax and a whole new
executor node for something that, realistically, isn't going to come
up very often.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
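[To make Robert's comparison concrete, here is a sketch of the two forms
side by side, assuming a hypothetical large table named events:]

```sql
-- The trick often used today: a uniform sample, but it reads the
-- whole table and sorts it by a random key.
SELECT * FROM events ORDER BY random() LIMIT 1000;

-- The proposed standard syntax: sample roughly 1% of the table,
-- optionally with a fixed seed so the sample is repeatable.
SELECT * FROM events TABLESAMPLE SYSTEM (1) REPEATABLE (42);
```

[SYSTEM sampling would pick whole pages and so could skip most of the
table; BERNOULLI would test each row individually, giving a better-quality
sample at a higher cost.]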


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Qi Huang <huangqiyx(at)hotmail(dot)com>, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-21 14:35:54
Message-ID: 375.1332340554@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> One thing we should probably try to establish before you get started
> working on this is whether people want the feature, which is basically
> the ability to write something like this in the FROM clause of a
> query:

> table_name TABLESAMPLE { BERNOULLI | SYSTEM } ( sample_percent ) [
> REPEATABLE ( repeat_seed ) ]

> I have at present no position on whether we want that or not, but
> maybe someone else does. The upside is that it would be a more efficient
> replacement for the ORDER BY random() trick that is often used today;
> the downside is that it requires dedicated syntax and a whole new
> executor node for something that, realistically, isn't going to come
> up very often.

Yeah --- you're talking about chunks of new code in both planner and
executor. A very rough estimate is that this might be about as
complicated to do properly as MergeAppend was (and we're still shaking
out the bugs in that :-().

Now that would all be fine if this were a widely-desired feature, but
AFAIR the user demand for it has been about nil. So I'm leaning to
the position that we don't want it.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Qi Huang <huangqiyx(at)hotmail(dot)com>, neil(dot)conway <neil(dot)conway(at)gmail(dot)com>, daniel <daniel(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-21 14:47:23
Message-ID: 1332341101-sup-5656@alvh.no-ip.org


Excerpts from Tom Lane's message of Wed Mar 21 11:35:54 -0300 2012:

> Now that would all be fine if this were a widely-desired feature, but
> AFAIR the user demand for it has been about nil. So I'm leaning to
> the position that we don't want it.

I disagree with there being zero interest ... the "order by random()"
stuff does come up occasionally.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Qi Huang <huangqiyx(at)hotmail(dot)com>, "neil(dot)conway" <neil(dot)conway(at)gmail(dot)com>, daniel <daniel(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-21 14:57:15
Message-ID: 201203211557.15789.andres@anarazel.de

On Wednesday, March 21, 2012 03:47:23 PM Alvaro Herrera wrote:
> Excerpts from Tom Lane's message of Wed Mar 21 11:35:54 -0300 2012:
> > Now that would all be fine if this were a widely-desired feature, but
> > AFAIR the user demand for it has been about nil. So I'm leaning to
> > the position that we don't want it.
>
> I disagree with there being zero interest ... the "order by random()"
> stuff does come up occasionally.
Yes.

I wonder if it could be hacked on top of a plain seqscan node instead of
building a completely separate infrastructure. The standard's syntax would
then simply be transformed into a select with some special ORDER BY.

Andres


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Qi Huang <huangqiyx(at)hotmail(dot)com>, "neil(dot)conway" <neil(dot)conway(at)gmail(dot)com>, daniel <daniel(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-21 15:00:59
Message-ID: 4F69ED2B.90702@dunslane.net

On 03/21/2012 10:47 AM, Alvaro Herrera wrote:
> Excerpts from Tom Lane's message of Wed Mar 21 11:35:54 -0300 2012:
>
>> Now that would all be fine if this were a widely-desired feature, but
>> AFAIR the user demand for it has been about nil. So I'm leaning to
>> the position that we don't want it.
> I disagree with there being zero interest ... the "order by random()"
> stuff does come up occasionally.
>

Presumably the reason that's not good enough is that it scans the whole
table (as well as being non-portable)? Maybe we could find some less
invasive way of avoiding that.

cheers

andrew


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Qi Huang <huangqiyx(at)hotmail(dot)com>, "neil(dot)conway" <neil(dot)conway(at)gmail(dot)com>, daniel <daniel(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-21 15:26:17
Message-ID: CA+TgmoZBi=KFP8Jjf9_=vi0wonBneQcBz9O2J9E_Ct-U0y2c1g@mail.gmail.com

On Wed, Mar 21, 2012 at 10:57 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Wednesday, March 21, 2012 03:47:23 PM Alvaro Herrera wrote:
>> Excerpts from Tom Lane's message of Wed Mar 21 11:35:54 -0300 2012:
>> > Now that would all be fine if this were a widely-desired feature, but
>> > AFAIR the user demand for it has been about nil.  So I'm leaning to
>> > the position that we don't want it.
>>
>> I disagree with there being zero interest ... the "order by random()"
>> stuff does come up occasionally.
> Yes.
>
> I wonder if it could be hacked on top of a plain seqscan node instead of
> building a completely separate infrastructure. The standard's syntax would
> then simply be transformed into a select with some special ORDER BY.

Well, the standard syntax apparently aims to reduce the number of
returned rows, which ORDER BY does not. Maybe you could do it with
ORDER BY .. LIMIT, but the idea here I think is that we'd like to
sample the table without reading all of it first, so that seems to
miss the point.

I have to admit I'm not very impressed by the argument that we
shouldn't do this because we'll need a new executor node. Creating a
new executor node is not really that big of a deal; and in any case I
don't think Tom will like hacking another bit of functionality into
seq-scan any better, since he refactored both the Merge Append and
Index-Only Scan patches to avoid doing exactly that, and those were
more similar than this probably would be.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Qi Huang <huangqiyx(at)hotmail(dot)com>, "neil(dot)conway" <neil(dot)conway(at)gmail(dot)com>, daniel <daniel(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-21 15:29:33
Message-ID: 1366.1332343773@sss.pgh.pa.us

Andrew Dunstan <andrew(at)dunslane(dot)net> writes:
> On 03/21/2012 10:47 AM, Alvaro Herrera wrote:
>> I disagree with there being zero interest ... the "order by random()"
>> stuff does come up occasionally.

> Presumably the reason that's not good enough is that it scans the whole
> table (as well as being non-portable)?

The reason I'm concerned about the implementation effort is precisely
that I'm afraid people will have high expectations for the intelligence
of the feature. If it's not materially better than you can get today
with "order by random()", it's not worth doing. That will mean for
example that it can't just be something we bolt onto seqscans and be
done with --- it'll need to interact with indexscans, maybe joins, etc
etc. And no shortcuts on the quality of the sampling, either.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Qi Huang <huangqiyx(at)hotmail(dot)com>, "neil(dot)conway" <neil(dot)conway(at)gmail(dot)com>, daniel <daniel(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-21 15:34:58
Message-ID: 1481.1332344098@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Well, the standard syntax apparently aims to reduce the number of
> returned rows, which ORDER BY does not. Maybe you could do it with
> ORDER BY .. LIMIT, but the idea here I think is that we'd like to
> sample the table without reading all of it first, so that seems to
> miss the point.

I think actually the traditional locution is more like
WHERE random() < constant
where the constant is the fraction of the table you want. And yeah,
the presumption is that you'd like it to not actually read every row.
(Though unless the sampling density is quite a bit less than 1 row
per page, it's not clear how much you're really going to win.)

regards, tom lane
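[Tom's "traditional locution" looks like this, assuming a hypothetical
table named events:]

```sql
-- Keep each row with probability 0.01 (about 1% of the table),
-- avoiding the sort that ORDER BY random() requires; every row is
-- still read so the filter can be evaluated against it.
SELECT * FROM events WHERE random() < 0.01;
```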


From: Qi Huang <huangqiyx(at)hotmail(dot)com>
To: <andrew(at)dunslane(dot)net>, <alvherre(at)commandprompt(dot)com>
Cc: <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <robertmhaas(at)gmail(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-21 15:48:51
Message-ID: BAY159-W23CA8E42EE8CE5B1975B6CA3400@phx.gbl

> Date: Wed, 21 Mar 2012 11:00:59 -0400
> From: andrew(at)dunslane(dot)net
> To: alvherre(at)commandprompt(dot)com
> CC: tgl(at)sss(dot)pgh(dot)pa(dot)us; robertmhaas(at)gmail(dot)com; huangqiyx(at)hotmail(dot)com; neil(dot)conway(at)gmail(dot)com; daniel(at)heroku(dot)com; josh(at)agliodbs(dot)com; pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
>
>
>
> On 03/21/2012 10:47 AM, Alvaro Herrera wrote:
> > Excerpts from Tom Lane's message of Wed Mar 21 11:35:54 -0300 2012:
> >
> >> Now that would all be fine if this were a widely-desired feature, but
> >> AFAIR the user demand for it has been about nil. So I'm leaning to
> >> the position that we don't want it.
> > I disagree with there being zero interest ... the "order by random()"
> > stuff does come up occasionally.
> >
>
> Presumably the reason that's not good enough is that it scans the whole
> table (as well as being non-portable)? Maybe we could find some less
> invasive way of avoiding that.
>
> cheers
>
> andrew

    Thanks for your discussion and ideas. As I checked, MS SQL Server and
DB2 have implemented TABLESAMPLE by now. At the least, it is useful for
quick sample retrieval from a large dataset, and I suppose the clause
itself will be much faster than using random().
    About the implementation: will the code change really be very large?
The general structure should still be about the same, right?

Best Regards and Thanks
Huang Qi Victor
Computer Science, National University of Singapore


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Qi Huang <huangqiyx(at)hotmail(dot)com>, "neil(dot)conway" <neil(dot)conway(at)gmail(dot)com>, daniel <daniel(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-21 15:49:55
Message-ID: CA+TgmoZf__+sPnjO8bFbsSR_YV8Tt7ZBZ-HQeP9bQdp2P6ogGQ@mail.gmail.com

On Wed, Mar 21, 2012 at 11:34 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Well, the standard syntax apparently aims to reduce the number of
>> returned rows, which ORDER BY does not.  Maybe you could do it with
>> ORDER BY .. LIMIT, but the idea here I think is that we'd like to
>> sample the table without reading all of it first, so that seems to
>> miss the point.
>
> I think actually the traditional locution is more like
>        WHERE random() < constant
> where the constant is the fraction of the table you want.  And yeah,
> the presumption is that you'd like it to not actually read every row.
> (Though unless the sampling density is quite a bit less than 1 row
> per page, it's not clear how much you're really going to win.)
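
The "WHERE random() < constant" locution can be sketched outside the database. This is only an illustration in Python, with a hypothetical row list standing in for the table; it shows why every row still has to be visited and why the sample size is itself random:

```python
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`,
    mimicking WHERE random() < fraction.  Every row is visited,
    and the number of kept rows is random, not fixed."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

# roughly 1% of 10000 hypothetical rows, order preserved
sample = bernoulli_sample(range(10000), 0.01, seed=42)
```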

Well, there's something mighty tempting about having a way to say
"just give me a random sample of the blocks and I'll worry about
whether that represents a random sample of the rows".

It's occurred to me a few times that it's pretty unfortunate you can't
do that with a TID condition.

rhaas=# explain select * from randomtext where ctid >= '(500,1)' and
ctid < '(501,1)';
QUERY PLAN
--------------------------------------------------------------------
Seq Scan on randomtext (cost=0.00..111764.90 rows=25000 width=31)
Filter: ((ctid >= '(500,1)'::tid) AND (ctid < '(501,1)'::tid))
(2 rows)

The last time this came up for me was when I was trying to find which
row in a large table was making the SELECT blow up; but it seems like
it could be used to implement a poor man's sampling method, too... it
would be nicer, in either case, to be able to specify the block
numbers you'd like to read, rather than bounding the CTID
from both ends as in the above example.
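
As a sketch (not the proposed implementation), choosing random pages and building the corresponding ctid-range predicates might look like this in Python; the page count and sample size are assumed inputs, not anything PostgreSQL exposes directly:

```python
import random

def random_block_predicates(n_pages, n_samples, seed=None):
    """Pick n_samples distinct page numbers and emit one
    ctid-range WHERE clause per page, in the style of the
    EXPLAIN example above (one whole page per predicate)."""
    rng = random.Random(seed)
    pages = sorted(rng.sample(range(n_pages), n_samples))
    return ["ctid >= '(%d,1)' AND ctid < '(%d,1)'" % (p, p + 1)
            for p in pages]

preds = random_block_predicates(1000, 3, seed=1)
```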

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Qi Huang <huangqiyx(at)hotmail(dot)com>, "neil(dot)conway" <neil(dot)conway(at)gmail(dot)com>, daniel <daniel(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-21 15:59:49
Message-ID: 4F69FAF5.8000809@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 03/21/2012 11:49 AM, Robert Haas wrote:
> On Wed, Mar 21, 2012 at 11:34 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> Well, the standard syntax apparently aims to reduce the number of
>>> returned rows, which ORDER BY does not. Maybe you could do it with
>>> ORDER BY .. LIMIT, but the idea here I think is that we'd like to
>>> sample the table without reading all of it first, so that seems to
>>> miss the point.
>> I think actually the traditional locution is more like
>> WHERE random() < constant
>> where the constant is the fraction of the table you want. And yeah,
>> the presumption is that you'd like it to not actually read every row.
>> (Though unless the sampling density is quite a bit less than 1 row
>> per page, it's not clear how much you're really going to win.)
> Well, there's something mighty tempting about having a way to say
> "just give me a random sample of the blocks and I'll worry about
> whether that represents a random sample of the rows".
>
> It's occurred to me a few times that it's pretty unfortunate you can't
> do that with a TID condition.
>
> rhaas=# explain select * from randomtext where ctid >= '(500,1)' and
> ctid < '(501,1)';
> QUERY PLAN
> --------------------------------------------------------------------
> Seq Scan on randomtext (cost=0.00..111764.90 rows=25000 width=31)
> Filter: ((ctid >= '(500,1)'::tid) AND (ctid < '(501,1)'::tid))
> (2 rows)
>
> The last time this came up for me was when I was trying to find which
> row in a large table was making the SELECT blow up; but it seems like
> it could be used to implement a poor man's sampling method, too... it
> would be nicer, in either case, to be able to specify the block
> numbers you'd like to read, rather than bounding the CTID
> from both ends as in the above example.

That would rapidly get unmanageable when you wanted lots of pages.

Maybe we could do something like a pagenum pseudovar, or a wildcard
match for ctid against '(123,*)'.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Qi Huang <huangqiyx(at)hotmail(dot)com>, "neil(dot)conway" <neil(dot)conway(at)gmail(dot)com>, daniel <daniel(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-21 16:00:40
Message-ID: 2016.1332345640@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Well, there's something mighty tempting about having a way to say
> "just give me a random sample of the blocks and I'll worry about
> whether that represents a random sample of the rows".

> It's occurred to me a few times that it's pretty unfortunate you can't
> do that with a TID condition.

> rhaas=# explain select * from randomtext where ctid >= '(500,1)' and
> ctid < '(501,1)';

Yeah, as you say that's come up more than once in data-recovery
situations. It seems like it'd be just a SMOP to extend the tidscan
stuff to handle ranges.

Another thing that people sometimes wish for is joins using TIDs.
I think the latter would actually be pretty trivial to do now given the
parameterized-plan infrastructure; I'd hoped to look into it for 9.2 but
ran out of time...

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Andres Freund" <andres(at)anarazel(dot)de>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "neil(dot)conway" <neil(dot)conway(at)gmail(dot)com>,"daniel" <daniel(at)heroku(dot)com>, "Qi Huang" <huangqiyx(at)hotmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-22 16:38:17
Message-ID: 4F6B0F2902000025000465DA@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Well, the standard syntax apparently aims to reduce the number of
>> returned rows, which ORDER BY does not. Maybe you could do it
>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
>> like to sample the table without reading all of it first, so that
>> seems to miss the point.
>
> I think actually the traditional locution is more like
> WHERE random() < constant
> where the constant is the fraction of the table you want. And
> yeah, the presumption is that you'd like it to not actually read
> every row. (Though unless the sampling density is quite a bit
> less than 1 row per page, it's not clear how much you're really
> going to win.)

It's all going to depend on the use cases, which I don't think I've
heard described very well yet.

I've had to pick random rows from, for example, a table of
disbursements to support a financial audit. In those cases it has
been the sample size that mattered, and order didn't. One
interesting twist there is that for some of these financial audits
they wanted the probability of a row being selected to be
proportional to the dollar amount of the disbursement. I don't
think you can do this without a first pass across the whole data
set.

I've also been involved in developing software to pick random people
for jury selection processes. In some of these cases, you don't
want a certain percentage, you want a particular number of people,
and you want the list to be ordered so that an initial set of
potential jurors can be seated from the top of the list and then as
the voir dire[1] process progresses you can replace excused jurors
from progressive positions on the randomized list.
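
A rough Python sketch of that kind of ordered random list, assuming the juror pool is already loaded; random.sample draws distinct people and returns the draws in random order, so seats fill from the top and excused jurors are replaced from successive positions:

```python
import random

def jury_list(pool, n, seed=None):
    """Draw n distinct people from `pool` and return them in a
    random order.  The caller seats jurors from the front of the
    list and works down it as jurors are excused."""
    rng = random.Random(seed)
    return rng.sample(pool, n)

pool = ["p%03d" % i for i in range(500)]   # hypothetical juror pool
panel = jury_list(pool, 12, seed=3)
```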

In both cases you need to be able to explain the technique used in
lay terms and show why it is very hard for anyone to predict or
control which rows are chosen, but also use statistical analysis to
prove that there is no significant correlation between the rows
chosen and identifiable characteristics of the rows. For example,
selecting all the rows from random blocks would be right out for
juror selection because a list from the state DOT of people with
driver's licenses and state ID cards would probably be in license
number order when loaded, and since the start of the driver's
license number is a soundex of the last name, there is a strong
correlation between blocks of the table and ethnicity.

One technique which might be suitably random without reading the
whole table would be to figure out a maximum block number and tuple
ID for the table, and generate a series of random ctid values to
read. If the tuple doesn't exist or is not visible to the snapshot,
you ignore it and continue, until you have read the requisite number
of rows. You could try to generate them in advance and sort them by
block number, but then you need to solve the problems of what to do
if that set of ctids yields too many rows or too few rows, both of
which have sticky issues.
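
That technique might be sketched as follows; `fetch` here is a hypothetical stand-in for a visibility-checked heap fetch (returning None for missing or invisible tuples), not a real PostgreSQL API, and the sketch ignores the duplicate-ctid and too-few/too-many issues mentioned above:

```python
import random

def sample_ctids(max_block, max_tuple, n_wanted, fetch, seed=None):
    """Generate random (block, tuple) ctids and probe them with
    `fetch`, skipping misses, until n_wanted visible rows have
    been collected.  Assumes enough visible rows exist."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n_wanted:
        ctid = (rng.randrange(max_block), rng.randrange(1, max_tuple + 1))
        row = fetch(ctid)
        if row is not None:
            rows.append(row)
    return rows

# toy heap: only even-numbered blocks hold a visible tuple
heap = lambda ctid: ctid if ctid[0] % 2 == 0 else None
picked = sample_ctids(100, 50, 5, heap, seed=7)
```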

-Kevin

[1] http://en.wikipedia.org/wiki/Voir_dire#Use_in_the_United_States


From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-22 17:17:01
Message-ID: CAFNqd5UaDOZFoa_bk6n0fo-LvVsWaQmWbzdYx+LaJikh6WykSw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> Well, the standard syntax apparently aims to reduce the number of
>>> returned rows, which ORDER BY does not.  Maybe you could do it
>>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
>>> like to sample the table without reading all of it first, so that
>>> seems to miss the point.
>>
>> I think actually the traditional locution is more like
>>       WHERE random() < constant
>> where the constant is the fraction of the table you want.  And
>> yeah, the presumption is that you'd like it to not actually read
>> every row.  (Though unless the sampling density is quite a bit
>> less than 1 row per page, it's not clear how much you're really
>> going to win.)
>
> It's all going to depend on the use cases, which I don't think I've
> heard described very well yet.
>
> I've had to pick random rows from, for example, a table of
> disbursements to support a financial audit.  In those cases it has
> been the sample size that mattered, and order didn't.  One
> interesting twist there is that for some of these financial audits
> they wanted the probability of a row being selected to be
> proportional to the dollar amount of the disbursement.  I don't
> think you can do this without a first pass across the whole data
> set.

This one was commonly called "Dollar Unit Sampling," though the
terminology has gradually gotten internationalized.
http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html

What the article doesn't mention is that some particularly large items
might wind up covering multiple samples. In the example, they're
looking for a sample every $3125 down the list. If there was a single
transaction valued at $30000, that (roughly) covers 10 of the desired
samples.
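
The fixed-interval scheme described above can be sketched in Python; the amounts and $3125 interval follow the example, and the exact hit counts depend on where the sampling points fall in the cumulative total:

```python
def monetary_unit_sample(amounts, interval, start):
    """Walk the cumulative dollar total and record a hit every
    `interval` dollars, beginning at `start`.  A large item can
    cover several sampling points, as noted above."""
    hits = []          # (row index, number of sampling points covered)
    cum = 0
    next_point = start
    for i, amt in enumerate(amounts):
        cum += amt
        covered = 0
        while next_point <= cum:
            covered += 1
            next_point += interval
        if covered:
            hits.append((i, covered))
    return hits

# a $30000 item among smaller ones covers many sampling points
hits = monetary_unit_sample([1000, 30000, 500, 4000], 3125, 3125)
```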

It isn't possible to do this without scanning across the entire table.

If you want repeatability, you probably want to instantiate a copy of
enough information to indicate the ordering chosen. That's probably
something that needs to be captured as part of the work of the audit,
so not only does it need to involve a pass across the data, it
probably requires capturing a fair bit of data for posterity.
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


From: Qi Huang <huangqiyx(at)hotmail(dot)com>
To: <cbbrowne(at)gmail(dot)com>, <kevin(dot)grittner(at)wicourts(dot)gov>
Cc: <pgsql-hackers(at)postgresql(dot)org>, <andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <josh(at)agliodbs(dot)com>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-24 04:20:26
Message-ID: BAY159-W491A758F09F0D6D98966A3A3470@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Date: Thu, 22 Mar 2012 13:17:01 -0400
> Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
> From: cbbrowne(at)gmail(dot)com
> To: Kevin(dot)Grittner(at)wicourts(dot)gov
> CC: pgsql-hackers(at)postgresql(dot)org
>
> On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >>> Well, the standard syntax apparently aims to reduce the number of
> >>> returned rows, which ORDER BY does not. Maybe you could do it
> >>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
> >>> like to sample the table without reading all of it first, so that
> >>> seems to miss the point.
> >>
> >> I think actually the traditional locution is more like
> >> WHERE random() < constant
> >> where the constant is the fraction of the table you want. And
> >> yeah, the presumption is that you'd like it to not actually read
> >> every row. (Though unless the sampling density is quite a bit
> >> less than 1 row per page, it's not clear how much you're really
> >> going to win.)
> >
> > It's all going to depend on the use cases, which I don't think I've
> > heard described very well yet.
> >
> > I've had to pick random rows from, for example, a table of
> > disbursements to support a financial audit. In those cases it has
> > been the sample size that mattered, and order didn't. One
> > interesting twist there is that for some of these financial audits
> > they wanted the probability of a row being selected to be
> > proportional to the dollar amount of the disbursement. I don't
> > think you can do this without a first pass across the whole data
> > set.
>
> This one was commonly called "Dollar Unit Sampling," though the
> terminology has gradually gotten internationalized.
> http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html
>
> What the article doesn't mention is that some particularly large items
> might wind up covering multiple samples. In the example, they're
> looking for a sample every $3125 down the list. If there was a single
> transaction valued at $30000, that (roughly) covers 10 of the desired
> samples.
>
> It isn't possible to do this without scanning across the entire table.
>
> If you want repeatability, you probably want to instantiate a copy of
> enough information to indicate the ordering chosen. That's probably
> something that needs to be captured as part of the work of the audit,
> so not only does it need to involve a pass across the data, it
> probably requires capturing a fair bit of data for posterity.
> --
> When confronted by a difficult problem, solve it by reducing it to the
> question, "How would the Lone Ranger handle this?"

The discussion till now has gone far beyond my understanding... Could anyone explain briefly what the idea is for now? The design details are still unfamiliar to me; I can only take time to understand them after being selected, and put time into reading the relevant material.

For now, I'm still curious why Neil's implementation no longer works. Postgres has been patched a lot, but the general idea behind Neil's implementation should still work, shouldn't it?

Besides, whether this query is needed is still not decided. That seems to be another hard point to decide. Is this topic still not quite ready for GSoC? If really so, I think I still have time to switch to other topics. Any suggestion?

Thanks.

Best Regards and Thanks
Huang Qi Victor
Computer Science of National University of Singapore


From: Joshua Berkus <josh(at)agliodbs(dot)com>
To: Qi Huang <huangqiyx(at)hotmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil conway <neil(dot)conway(at)gmail(dot)com>, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin grittner <kevin(dot)grittner(at)wicourts(dot)gov>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-24 20:12:50
Message-ID: 612260444.406269.1332619970287.JavaMail.root@mail-1.01.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Qi,

Yeah, I can see that. That's a sign that you had a good idea for a project, actually: your idea is interesting enough that people want to debate it. Make a proposal on Monday and our potential mentors will help you refine the idea.

----- Original Message -----
>
>
>
>
> > Date: Thu, 22 Mar 2012 13:17:01 -0400
> > Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database
> > schema
> > From: cbbrowne(at)gmail(dot)com
> > To: Kevin(dot)Grittner(at)wicourts(dot)gov
> > CC: pgsql-hackers(at)postgresql(dot)org
> >
> > On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
> > <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> > > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > >> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > >>> Well, the standard syntax apparently aims to reduce the number
> > >>> of
> > >>> returned rows, which ORDER BY does not. Maybe you could do it
> > >>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
> > >>> like to sample the table without reading all of it first, so
> > >>> that
> > >>> seems to miss the point.
> > >>
> > >> I think actually the traditional locution is more like
> > >> WHERE random() < constant
> > >> where the constant is the fraction of the table you want. And
> > >> yeah, the presumption is that you'd like it to not actually read
> > >> every row. (Though unless the sampling density is quite a bit
> > >> less than 1 row per page, it's not clear how much you're really
> > >> going to win.)
> > >
> > > It's all going to depend on the use cases, which I don't think
> > > I've
> > > heard described very well yet.
> > >
> > > I've had to pick random rows from, for example, a table of
> > > disbursements to support a financial audit. In those cases it has
> > > been the sample size that mattered, and order didn't. One
> > > interesting twist there is that for some of these financial
> > > audits
> > > they wanted the probability of a row being selected to be
> > > proportional to the dollar amount of the disbursement. I don't
> > > think you can do this without a first pass across the whole data
> > > set.
> >
> > This one was commonly called "Dollar Unit Sampling," though the
> > terminology has gradually gotten internationalized.
> > http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html
> >
> > What the article doesn't mention is that some particularly large
> > items
> > might wind up covering multiple samples. In the example, they're
> > looking for a sample every $3125 down the list. If there was a
> > single
> > transaction valued at $30000, that (roughly) covers 10 of the
> > desired
> > samples.
> >
> > It isn't possible to do this without scanning across the entire
> > table.
> >
> > If you want repeatability, you probably want to instantiate a copy
> > of
> > enough information to indicate the ordering chosen. That's probably
> > something that needs to be captured as part of the work of the
> > audit,
> > so not only does it need to involve a pass across the data, it
> > probably requires capturing a fair bit of data for posterity.
> > --
> > When confronted by a difficult problem, solve it by reducing it to
> > the
> > question, "How would the Lone Ranger handle this?"
>
>
>
>
>
>
> The discussion till now has gone far beyond my understanding.....
> Could anyone explain briefly what is the idea for now?
> The designing detail for me is still unfamiliar. I can only take time
> to understand while possible after being selected and put time on it
> to read relevant material.
> For now, I'm still curious why Neil's implementation is no longer
> working? The Postgres has been patched a lot, but the general idea
> behind Neil's implementation should still work, isn't it?
> Besides, whether this query is needed is still not decided. Seems
> this is another hard to decide point. Is it that this topic is still
> not so prepared for the Gsoc yet? If really so, I think I still
> have time to switch to other topics. Any suggestion?
>
>
> Thanks.
>
> Best Regards and Thanks
> Huang Qi Victor
> Computer Science of National University of Singapore


From: "Marc Mamin" <M(dot)Mamin(at)intershop(dot)de>
To: "Qi Huang" <huangqiyx(at)hotmail(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-25 10:11:16
Message-ID: C4DAC901169B624F933534A26ED7DF310861B47C@JENMAIL01.ad.intershop.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

Here is something we'd like to have:

http://archives.postgresql.org/pgsql-hackers/2012-01/msg00650.php

As we are quite busy and this issue doesn't have a high priority, we haven't followed it until now :-(

I'm only a Postgres user, not a hacker, so I don't have the knowledge to help on this, nor to evaluate whether this might be a good GSoC project.

Just an idea in case you are looking for another topic.

best regards,

Marc Mamin

From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-owner(at)postgresql(dot)org] On Behalf Of Qi Huang
Sent: Samstag, 24. März 2012 05:20
To: cbbrowne(at)gmail(dot)com; kevin(dot)grittner(at)wicourts(dot)gov
Cc: pgsql-hackers(at)postgresql(dot)org; andres(at)anarazel(dot)de; alvherre(at)commandprompt(dot)com; neil(dot)conway(at)gmail(dot)com; daniel(at)heroku(dot)com; josh(at)agliodbs(dot)com
Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema

> Date: Thu, 22 Mar 2012 13:17:01 -0400
> Subject: Re: [HACKERS] Gsoc2012 Idea --- Social Network database schema
> From: cbbrowne(at)gmail(dot)com
> To: Kevin(dot)Grittner(at)wicourts(dot)gov
> CC: pgsql-hackers(at)postgresql(dot)org
>
> On Thu, Mar 22, 2012 at 12:38 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> > Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >>> Well, the standard syntax apparently aims to reduce the number of
> >>> returned rows, which ORDER BY does not. Maybe you could do it
> >>> with ORDER BY .. LIMIT, but the idea here I think is that we'd
> >>> like to sample the table without reading all of it first, so that
> >>> seems to miss the point.
> >>
> >> I think actually the traditional locution is more like
> >> WHERE random() < constant
> >> where the constant is the fraction of the table you want. And
> >> yeah, the presumption is that you'd like it to not actually read
> >> every row. (Though unless the sampling density is quite a bit
> >> less than 1 row per page, it's not clear how much you're really
> >> going to win.)
> >
> > It's all going to depend on the use cases, which I don't think I've
> > heard described very well yet.
> >
> > I've had to pick random rows from, for example, a table of
> > disbursements to support a financial audit. In those cases it has
> > been the sample size that mattered, and order didn't. One
> > interesting twist there is that for some of these financial audits
> > they wanted the probability of a row being selected to be
> > proportional to the dollar amount of the disbursement. I don't
> > think you can do this without a first pass across the whole data
> > set.
>
> This one was commonly called "Dollar Unit Sampling," though the
> terminology has gradually gotten internationalized.
> http://www.dummies.com/how-to/content/how-does-monetary-unit-sampling-work.html
>
> What the article doesn't mention is that some particularly large items
> might wind up covering multiple samples. In the example, they're
> looking for a sample every $3125 down the list. If there was a single
> transaction valued at $30000, that (roughly) covers 10 of the desired
> samples.
>
> It isn't possible to do this without scanning across the entire table.
>
> If you want repeatability, you probably want to instantiate a copy of
> enough information to indicate the ordering chosen. That's probably
> something that needs to be captured as part of the work of the audit,
> so not only does it need to involve a pass across the data, it
> probably requires capturing a fair bit of data for posterity.
> --
> When confronted by a difficult problem, solve it by reducing it to the
> question, "How would the Lone Ranger handle this?"

The discussion till now has gone far beyond my understanding.....

Could anyone explain briefly what is the idea for now?

The designing detail for me is still unfamiliar. I can only take time to understand while possible after being selected and put time on it to read relevant material.

For now, I'm still curious why Neil's implementation is no longer working? The Postgres has been patched a lot, but the general idea behind Neil's implementation should still work, isn't it?

Besides, whether this query is needed is still not decided. Seems this is another hard to decide point. Is it that this topic is still not so prepared for the Gsoc yet? If really so, I think I still have time to switch to other topics. Any suggestion?

Thanks.

Best Regards and Thanks

Huang Qi Victor

Computer Science of National University of Singapore


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Marc Mamin <M(dot)Mamin(at)intershop(dot)de>
Cc: Qi Huang <huangqiyx(at)hotmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 Idea --- Social Network database schema
Date: 2012-03-25 15:32:04
Message-ID: CA+TgmobRSD=3rR4pTWUi1yUoEB8h11YtDV2uedbLcP4sXvQ0cA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Mar 25, 2012 at 6:11 AM, Marc Mamin <M(dot)Mamin(at)intershop(dot)de> wrote:
> Hello,
>
> Here is something we'd like to have:
>
> http://archives.postgresql.org/pgsql-hackers/2012-01/msg00650.php
>
> As we are quite busy and this issue hasn't a high priority, we haven't
> followed it until now :-(
>
> I'm only a Postgres user, not a hacker, so I don't have the knowledge to
> help on this nor to evaluate if this is might be a good Gssoc project.
>
> Just an idea for the case you are looking for another topic.

Good idea. If anyone wants to pursue it, I'd strongly suggest building
it as a contrib module rather than dedicated syntax, because I'm not
sure there'd be any consensus on adding syntax for it to core.

Actually, though, I wonder how much faster it would be than CREATE
TABLE AS? Block-level copy should be faster than tuple-level copy,
but I'm not sure whether it would be a lot faster or only slightly
faster.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Joshua Berkus <josh(at)agliodbs(dot)com>
Cc: Qi Huang <huangqiyx(at)hotmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil conway <neil(dot)conway(at)gmail(dot)com>, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin grittner <kevin(dot)grittner(at)wicourts(dot)gov>
Subject: Gsoc2012 idea, tablesample
Date: 2012-04-17 06:16:29
Message-ID: 4F8D0ABD.4080403@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 24.03.2012 22:12, Joshua Berkus wrote:
> Qi,
>
> Yeah, I can see that. That's a sign that you had a good idea for a project, actually: your idea is interesting enough that people want to debate it. Make a proposal on Monday and our potential mentors will help you refine the idea.

Yep. The discussion withered, so let me try to summarize:

1. We probably don't want the SQL syntax to be added to the grammar.
This should be written as an extension, using custom functions as the
API, instead of extra SQL syntax.

2. It's not very useful if it's just a dummy replacement for "WHERE
random() < ?". It has to be more advanced than that. Quality of the
sample is important, as is performance. There was also an interesting
idea of implementing monetary unit sampling.

I think this would be a useful project if those two points are taken
care of.

Another idea that Robert Haas suggested was to add support for doing a TID
scan for a query like "WHERE ctid < '(501,1)'". That's not enough work
for a GSoC project on its own, but could certainly be a part of it.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Qi Huang <huangqiyx(at)hotmail(dot)com>
To: <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <josh(at)agliodbs(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>, <andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <cbbrowne(at)gmail(dot)com>, <kevin(dot)grittner(at)wicourts(dot)gov>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-17 11:55:36
Message-ID: BAY159-W3319FCEE2FBB7561A1CEE9A33F0@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi, Heikki

Thanks for your advice. I will change my plan accordingly, but I have a few questions.
> 1. We probably don't want the SQL syntax to be added to the grammar.
> This should be written as an extension, using custom functions as the
> API, instead of extra SQL syntax.
>

1. "This should be written as an extension, using custom functions as the API". Could you explain a bit more what this means?
> 2. It's not very useful if it's just a dummy replacement for "WHERE
> random() < ?". It has to be more advanced than that. Quality of the
> sample is important, as is performance. There was also an interesting
> idea of implementing monetary unit sampling.

2. In the plan, I mentioned using optimizer statistics to improve the quality of sampling; I may emphasize that point. I will read about monetary unit sampling and add to the plan the possibility of implementing this idea.
> Another idea that Robert Haas suggested was to add support for doing a TID
> scan for a query like "WHERE ctid < '(501,1)'". That's not enough work
> for a GSoC project on its own, but could certainly be a part of it.

3. I read the replies about using ctid, but I don't quite understand how that might help. ctid is just the physical location of a row version within the table. If I do "where ctid < '(501,1)'", what actually happens? Can I add this as an optional implementation? I think I can look into how to do this if I have enough time in the project.

Best Regards and Thanks
Huang Qi Victor
Computer Science Department - National University of Singapore

> Date: Tue, 17 Apr 2012 09:16:29 +0300
> From: heikki(dot)linnakangas(at)enterprisedb(dot)com
> To: josh(at)agliodbs(dot)com
> CC: huangqiyx(at)hotmail(dot)com; pgsql-hackers(at)postgresql(dot)org; andres(at)anarazel(dot)de; alvherre(at)commandprompt(dot)com; neil(dot)conway(at)gmail(dot)com; daniel(at)heroku(dot)com; cbbrowne(at)gmail(dot)com; kevin(dot)grittner(at)wicourts(dot)gov
> Subject: [HACKERS] Gsoc2012 idea, tablesample
>
> On 24.03.2012 22:12, Joshua Berkus wrote:
> > Qi,
> >
> > Yeah, I can see that. That's a sign that you had a good idea for a project, actually: your idea is interesting enough that people want to debate it. Make a proposal on Monday and our potential mentors will help you refine the idea.
>
> Yep. The discussion withered, so let me try to summarize:
>
> 1. We probably don't want the SQL syntax to be added to the grammar.
> This should be written as an extension, using custom functions as the
> API, instead of extra SQL syntax.
>
> 2. It's not very useful if it's just a dummy replacement for "WHERE
> random() < ?". It has to be more advanced than that. Quality of the
> sample is important, as is performance. There was also an interesting
> idea of implementing monetary unit sampling.
>
> I think this would be a useful project if those two points are taken
> care of.
>
> Another idea that Robert Haas suggested was to add support for doing a TID
> scan for a query like "WHERE ctid < '(501,1)'". That's not enough work
> for a GSoC project on its own, but could certainly be a part of it.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers


From: Qi Huang <huangqiyx(at)hotmail(dot)com>
To: <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <josh(at)agliodbs(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>, <andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <cbbrowne(at)gmail(dot)com>, <kevin(dot)grittner(at)wicourts(dot)gov>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-17 12:14:20
Message-ID: BAY159-W366C8A7525F7CF0A88533FA33F0@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Besides, I saw that editing on the GSoC site has been closed. Should I just submit through this mailing list as an attachment?

Best Regards and Thanks
Huang Qi Victor
Computer Science, National University of Singapore

> Date: Tue, 17 Apr 2012 09:16:29 +0300
> From: heikki(dot)linnakangas(at)enterprisedb(dot)com
> To: josh(at)agliodbs(dot)com
> CC: huangqiyx(at)hotmail(dot)com; pgsql-hackers(at)postgresql(dot)org; andres(at)anarazel(dot)de; alvherre(at)commandprompt(dot)com; neil(dot)conway(at)gmail(dot)com; daniel(at)heroku(dot)com; cbbrowne(at)gmail(dot)com; kevin(dot)grittner(at)wicourts(dot)gov
> Subject: [HACKERS] Gsoc2012 idea, tablesample
>
> On 24.03.2012 22:12, Joshua Berkus wrote:
> > Qi,
> >
> > Yeah, I can see that. That's a sign that you had a good idea for a project, actually: your idea is interesting enough that people want to debate it. Make a proposal on Monday and our potential mentors will help you refine the idea.
>
> Yep. The discussion withered, so let me try to summarize:
>
> 1. We probably don't want the SQL syntax to be added to the grammar.
> This should be written as an extension, using custom functions as the
> API, instead of extra SQL syntax.
>
> 2. It's not very useful if it's just a dummy replacement for "WHERE
> random() < ?". It has to be more advanced than that. Quality of the
> sample is important, as is performance. There was also an interesting
> idea of implementing monetary unit sampling.
>
> I think this would be a useful project if those two points are taken
> care of.
>
> Another idea that Robert Haas suggested was to add support for doing a TID
> scan for a query like "WHERE ctid < '(501,1)'". That's not enough work
> for a GSoC project on its own, but could certainly be a part of it.
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Qi Huang <huangqiyx(at)hotmail(dot)com>
Cc: josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin(dot)grittner(at)wicourts(dot)gov
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-17 13:49:30
Message-ID: 4F8D74EA.8060504@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 17.04.2012 14:55, Qi Huang wrote:
> Hi, Heikki
> Thanks for your advice. I will change my plan accordingly. But I have a few questions.
>> 1. We probably don't want the SQL syntax to be added to the grammar.
>> This should be written as an extension, using custom functions as the
>> API, instead of extra SQL syntax.
>
> 1. "This should be written as an extension, using custom functions as the API". Could you explain a bit more what this means?

I mean, it won't be integrated into the PostgreSQL server code. Rather,
it will be a standalone module that can be distributed as a separate
.tar.gz file, and installed on a server. PostgreSQL has some facilities
to help you package code as extensions that can be easily distributed
and installed.

>> 2. It's not very useful if it's just a dummy replacement for "WHERE
>> random() < ?". It has to be more advanced than that. Quality of the
>> sample is important, as is performance. There was also an interesting
>> idea of implementing monetary unit sampling.
>
> 2. In the plan, I mentioned using optimizer statistics to improve the quality of sampling.

Yeah, that's one approach. Would be nice to hear more about that, how
exactly you can use optimizer statistics to help the sampling.

> I will emphasize that point. I will also read about monetary unit sampling and add the possibility of implementing this idea to the plan.

Ok, sounds good.

>> Another idea that Robert Haas suggested was to add support for doing a TID
>> scan for a query like "WHERE ctid < '(501,1)'". That's not enough work
>> for a GSoC project on its own, but could certainly be a part of it.
>
> 3. I read the replies about using ctid, but I don't quite understand how that might help. ctid is just the physical location of a row version within the table. If I do "WHERE ctid < '(501,1)'", what actually happens?

At the moment, if you do "WHERE ctid = '(501,1)'", you get an access plan
with a TidScan, which quickly fetches the row from that exact physical
location. But if you do "WHERE ctid < '(501,1)'", you get a SeqScan,
which scans the whole table. That's clearly wasteful: you know the
physical range of pages you need to scan, everything up to page 501, but
the SeqScan will scan pages > 501, too. The idea is to improve that so
that you'd only scan the pages up to page 501.
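A rough Python simulation of the difference (the toy heap, page numbers, and function names here are made up for illustration; this is not PostgreSQL code): a TID range scan would touch only the heap pages below the bound, while today's SeqScan reads every page and filters afterwards.

```python
# Toy heap model: a list of pages, each a list of tuples.
# A "tid" is (page_number, item_number), like PostgreSQL's ctid.

def seq_scan(heap, bound_page):
    """What 'WHERE ctid < ...' does today: read every page, filter afterwards."""
    pages_read, result = 0, []
    for pageno, page in enumerate(heap):
        pages_read += 1                 # every page is read...
        if pageno < bound_page:         # ...but most rows are filtered out
            result.extend(page)
    return result, pages_read

def tid_range_scan(heap, bound_page):
    """The proposed improvement: read only the pages below the bound."""
    pages_read, result = 0, []
    for pageno in range(min(bound_page, len(heap))):
        pages_read += 1
        result.extend(heap[pageno])
    return result, pages_read

heap = [["row-%d-%d" % (p, i) for i in range(10)] for p in range(1000)]
rows_seq, pages_seq = seq_scan(heap, 501)
rows_tid, pages_tid = tid_range_scan(heap, 501)
assert rows_seq == rows_tid                    # same answer,
assert (pages_seq, pages_tid) == (1000, 501)   # far fewer pages read
```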

> Can I add this as an optional implementation? I can look into how to do this if I have enough time in this project.

Yeah, that sounds reasonable.

> Besides, I saw that editing on the GSoC site has been closed. Should I just submit through this mailing list as an attachment?

Just post the updated details to this mailing list. Preferably inline,
not as an attachment. You don't need to post the contact details,
biography, etc, just updated inch-stones and project details parts.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Joshua Berkus <josh(at)agliodbs(dot)com>, Qi Huang <huangqiyx(at)hotmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil conway <neil(dot)conway(at)gmail(dot)com>, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin grittner <kevin(dot)grittner(at)wicourts(dot)gov>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-17 13:49:49
Message-ID: 20120417134949.GR1267@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Heikki Linnakangas (heikki(dot)linnakangas(at)enterprisedb(dot)com) wrote:
> 1. We probably don't want the SQL syntax to be added to the grammar.
> This should be written as an extension, using custom functions as
> the API, instead of extra SQL syntax.

Err, I missed that, and don't particularly agree with it. Is there a
serious issue with the grammar defined in the SQL standard? The other
DBs which provide this: do they use the SQL grammar or something else?

I'm not sure that I particularly *like* the SQL grammar, but if we're
going to implement this, we should really do it 'right'.

> 2. It's not very useful if it's just a dummy replacement for "WHERE
> random() < ?". It has to be more advanced than that. Quality of the
> sample is important, as is performance. There was also an
> interesting idea of implementing monetary unit sampling.

In reviewing this, I got the impression (perhaps mistaken) that
different sampling methods are defined by the SQL standard and that it
would simply be up to us to implement them according to what the
standard requires.

> I think this would be a useful project if those two points are taken
> care of.

Doing it 'right' certainly isn't going to be simply taking what Neil did
and updating it, and I understand Tom's concerns about having this be
more than a hack on seqscan, so I'm a bit nervous that this would turn
into something bigger than a GSoC project.

> Another idea that Robert Haas suggested was to add support for doing a
> TID scan for a query like "WHERE ctid < '(501,1)'". That's not
> enough work for a GSoC project on its own, but could certainly be a
> part of it.

I don't think Robert's suggestion would be part of a 'tablesample'
patch. Perhaps a completely different project which was geared towards
allowing hidden columns to be used in various ways in a WHERE clause..
Of course, we'd need someone to actually define that; I don't think
someone relatively new to the project is going to know what experienced
hackers want to do with system columns.

Thanks,

Stephen


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Joshua Berkus <josh(at)agliodbs(dot)com>, Qi Huang <huangqiyx(at)hotmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil conway <neil(dot)conway(at)gmail(dot)com>, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin grittner <kevin(dot)grittner(at)wicourts(dot)gov>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-17 14:11:27
Message-ID: 7569.1334671887@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> * Heikki Linnakangas (heikki(dot)linnakangas(at)enterprisedb(dot)com) wrote:
>> Another idea that Robert Haas suggested was to add support for doing a
>> TID scan for a query like "WHERE ctid < '(501,1)'". That's not
>> enough work for a GSoC project on its own, but could certainly be a
>> part of it.

> I don't think Robert's suggestion would be part of a 'tablesample'
> patch.

Yeah, I don't see the connection either. It seems more like this
would be a localized hack in tidpath.c and nodeTidscan.c. I think
it'd be a neat beginning project for somebody, but it's not really
related to the GSoC project as proposed.

regards, tom lane


From: Greg Stark <stark(at)mit(dot)edu>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Joshua Berkus <josh(at)agliodbs(dot)com>, Qi Huang <huangqiyx(at)hotmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil conway <neil(dot)conway(at)gmail(dot)com>, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin grittner <kevin(dot)grittner(at)wicourts(dot)gov>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-17 15:16:15
Message-ID: CAM-w4HNW6ZyNvHNPcYNDfO=v5gy6ArJ4hFS3rnMBBFNOdBhmiw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 17, 2012 at 2:49 PM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> * Heikki Linnakangas (heikki(dot)linnakangas(at)enterprisedb(dot)com) wrote:
>> 1. We probably don't want the SQL syntax to be added to the grammar.
>> This should be written as an extension, using custom functions as
>> the API, instead of extra SQL syntax.
>
> Err, I missed that, and don't particularly agree with it..  Is there a
> serious issue with the grammar defined in the SQL standard?  The other
> DBs which provide this- do they use the SQL grammar or something else?
>
> I'm not sure that I particularly *like* the SQL grammar, but if we're
> going to implement this, we should really do it 'right'.

I think the danger is that fiddling with the grammar can be a rat's
nest of learning how to deal with bison grammars, learning how to
interpret the ANSI standard, and bikeshedding. These are all parts of
the open source world, so maybe an argument could be made that they
should be part of a GSoC project, but I fear they would swallow the
whole project.

But I think I agree that doing it as an external module would be
strange and not very useful. I picture it instead as a new scan type
which is basically a copy of heapscan or tidscan and uses various
algorithms to decide which tuples to return. For a first-cut proof of
concept I would leave out the grammar and just have a GUC that enabled
replacing the heap scan with a sample scan everywhere.

But that would have to be done as a patch to Postgres to add the new
scan type. I don't think it would be made much easier by having a hook
that replaced one scan type with another.

--
greg


From: Qi Huang <huangqiyx(at)hotmail(dot)com>
To: <sfrost(at)snowman(dot)net>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <cbbrowne(at)gmail(dot)com>, <kevin(dot)grittner(at)wicourts(dot)gov>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-17 15:21:24
Message-ID: BAY159-W2914D3B136B8733DCFD1A7A33F0@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> > 2. It's not very useful if it's just a dummy replacement for "WHERE
> > random() < ?". It has to be more advanced than that. Quality of the
> > sample is important, as is performance. There was also an
> > interesting idea of implementing monetary unit sampling.
>
> In reviewing this, I got the impression (perhaps mistaken..), that
> different sampling methods are defined by the SQL standard and that it
> would simply be us to implement them according to what the standard
> requires.
>
> > I think this would be a useful project if those two points are taken
> > care of.
>
> Doing it 'right' certainly isn't going to be simply taking what Neil did
> and updating it, and I understand Tom's concerns about having this be
> more than a hack on seqscan, so I'm a bit nervous that this would turn
> into something bigger than a GSoC project.
>

As Christopher Browne mentioned, this sampling method is not possible without scanning the whole data set. It improves the sampling quality but increases the sampling cost. I think it should also be used only for some special sampling types, not in general. The general sampling methods, as in the SQL standard, should include only the SYSTEM and BERNOULLI methods.

Best Regards and Thanks
Huang Qi Victor
Computer Science, National University of Singapore


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Qi Huang <huangqiyx(at)hotmail(dot)com>
Cc: heikki(dot)linnakangas(at)enterprisedb(dot)com, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin(dot)grittner(at)wicourts(dot)gov
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-17 15:27:16
Message-ID: 20120417152716.GS1267@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Qi,

* Qi Huang (huangqiyx(at)hotmail(dot)com) wrote:
> > Doing it 'right' certainly isn't going to be simply taking what Neil did
> > and updating it, and I understand Tom's concerns about having this be
> > more than a hack on seqscan, so I'm a bit nervous that this would turn
> > into something bigger than a GSoC project.
>
> As Christopher Browne mentioned, this sampling method is not possible without scanning the whole data set. It improves the sampling quality but increases the sampling cost. I think it should also be used only for some special sampling types, not in general. The general sampling methods, as in the SQL standard, should include only the SYSTEM and BERNOULLI methods.

I'm not sure what sampling method you're referring to here. I agree
that we need to be looking at implementing the specific sampling methods
listed in the SQL standard. How much information is provided in the
standard about the requirements placed on these sampling methods? Does
the SQL standard only define SYSTEM and BERNOULLI? What do the other
databases support? What does SQL say the requirements are for 'SYSTEM'?

Thanks,

Stephen


From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-17 16:33:04
Message-ID: CAFNqd5UaRgLgVCp9YnvoUD-t4sozSUo-=siW0xmkmhN4Qwv8hw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 17, 2012 at 11:27 AM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> Qi,
>
> * Qi Huang (huangqiyx(at)hotmail(dot)com) wrote:
>> > Doing it 'right' certainly isn't going to be simply taking what Neil did
>> > and updating it, and I understand Tom's concerns about having this be
>> > more than a hack on seqscan, so I'm a bit nervous that this would turn
>> > into something bigger than a GSoC project.
>>
>> As Christopher Browne mentioned, this sampling method is not possible without scanning the whole data set. It improves the sampling quality but increases the sampling cost. I think it should also be used only for some special sampling types, not in general. The general sampling methods, as in the SQL standard, should include only the SYSTEM and BERNOULLI methods.
>
> I'm not sure what sampling method you're referring to here.  I agree
> that we need to be looking at implementing the specific sampling methods
> listed in the SQL standard.  How much information is provided in the
> standard about the requirements placed on these sampling methods?  Does
> the SQL standard only define SYSTEM and BERNOULLI?  What do the other
> databases support?  What does SQL say the requirements are for 'SYSTEM'?

Well, there may be cases where the quality of the sample isn't
terribly important, it just needs to be "reasonable."

I browsed an article on the SYSTEM/BERNOULLI representations; they
both amount to simple picks of tuples.

- BERNOULLI implies picking tuples with a specified probability.

- SYSTEM implies picking pages with a specified probability. (I think
we'd mess with this in ways that'll be fairly biased, given that tuples
may not be of uniform size, particularly if Slightly Smaller strings
stay in the main pages, whilst Slightly Larger strings get TOASTed...)

I get the feeling that this is a somewhat-magical feature (in that
users haven't much hope of understanding in what ways the results are
deterministic) that is sufficiently "magical" that anyone serious
about their result sets is likely to be unhappy to use either SYSTEM
or BERNOULLI.

Possibly the forms of sampling that people *actually* need, most of
the time, are more like Dollar Unit Sampling, which are pretty
deterministic, in ways that mandate that they be rather expensive
(e.g. - guaranteeing Seq Scan).
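For what it's worth, the two standard methods can be sketched in a few lines of Python (a toy list-of-pages heap, not tied to any real implementation): BERNOULLI flips a coin per tuple, SYSTEM flips a coin per page.

```python
import random

def bernoulli_sample(heap, fraction, rng):
    """BERNOULLI: flip a coin for every individual tuple."""
    return [t for page in heap for t in page if rng.random() < fraction]

def system_sample(heap, fraction, rng):
    """SYSTEM: flip a coin per page; keep every tuple on a chosen page."""
    return [t for page in heap if rng.random() < fraction for t in page]

# Pages of deliberately non-uniform size (cf. the TOAST point above):
heap = [[(p, i) for i in range(p % 7 + 1)] for p in range(2000)]
total = sum(len(page) for page in heap)
rng = random.Random(42)
bern = bernoulli_sample(heap, 0.10, rng)
syst = system_sample(heap, 0.10, rng)
# Both yield roughly 10% of the tuples, but SYSTEM's picks arrive in
# whole-page clumps, which is where the bias/clumpiness concern comes in.
assert 0.05 * total < len(bern) < 0.15 * total
assert 0.05 * total < len(syst) < 0.15 * total
```

Each tuple's inclusion probability is the same in both cases; what differs is the independence of the picks (and the I/O pattern).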
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


From: Greg Stark <stark(at)mit(dot)edu>
To: Christopher Browne <cbbrowne(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-17 17:06:37
Message-ID: CAM-w4HObqcAgM2n=cqym5LtY9oXndOzkQtZJ+11NZbYxKSLEFw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 17, 2012 at 5:33 PM, Christopher Browne <cbbrowne(at)gmail(dot)com> wrote:
> I get the feeling that this is a somewhat-magical feature (in that
> users haven't much hope of understanding in what ways the results are
> deterministic) that is sufficiently "magical" that anyone serious
> about their result sets is likely to be unhappy to use either SYSTEM
> or BERNOULLI.

These both sound pretty useful. "BERNOULLI" is fine for cases where
you aren't worried about time dependency in your data; for example, if
you're looking for the average or total value of some column.

SYSTEM just means "I'm willing to trade some unspecified amount of
speed for some unspecified amount of accuracy" which presumably is
only good if you trust the database designers to make a reasonable
trade-off for cases where speed matters and the accuracy requirements
aren't very strict.

> Possibly the forms of sampling that people *actually* need, most of
> the time, are more like Dollar Unit Sampling, which are pretty
> deterministic, in ways that mandate that they be rather expensive
> (e.g. - guaranteeing Seq Scan).

I don't know about that but the cases I would expect to need other
distributions would be ones where you're looking at the tuples in a
non-linear way. Things like "what's the average gap between events" or
"what's the average number of instances per value". These might
require a full table scan but might still be useful if the data is
going to be subsequently aggregated or joined in ways that would be
too expensive on the full data set.

But we shouldn't let best be the enemy of the good here. Having SYSTEM
and BERNOULLI would solve most use cases and having those would make
it easier to add more later.

--
greg


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-17 18:41:04
Message-ID: 4F8DB940.9080203@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Qi, Hackers:

FWIW, the PostGIS folks would *really* love to have a TABLESAMPLE which
worked with geographic indexes. This would be tremendously useful for
constructing low-resolution "zoom out" tiles on maps and similar.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-17 20:29:52
Message-ID: 20120417202952.GU1267@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh,

* Josh Berkus (josh(at)agliodbs(dot)com) wrote:
> FWIW, the PostGIS folks would *really* love to have a TABLESAMPLE which
> worked with geographic indexes. This would be tremendously useful for
> constructing low-resolution "zoom out" tiles on maps and similar.

I'm familiar with the concept of 'zoom out' tiles and PostGIS, but I
don't actually see the connection between that and TABLESAMPLE. Perhaps
I'm missing something, but I've never seen a case where you create 'zoom
out' tiles by just grabbing some portion of the data at random, as that
would end up creating empty spots or missing pieces.

My experience has been with running an algorithm on each record in the
data set to reduce the number of points on a given record (which ends up
reducing the size of the record, etc). You could also filter out
records which wouldn't be visible due to the small 'size' of the record.
Again, neither of those would benefit from a TABLESAMPLE command.
Perhaps they're thinking it's going to use a GiST index to only pull out
records matching a certain condition? Except we have WHERE for that...

Thanks,

Stephen


From: Ants Aasma <ants(at)cybertec(dot)at>
To: Christopher Browne <cbbrowne(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-17 23:45:09
Message-ID: CA+CSw_tfz1=VYoY9jmb=Nt5SSP9b_hOdYsfj9RGG8BtxGkOX0Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne <cbbrowne(at)gmail(dot)com> wrote:
> Well, there may be cases where the quality of the sample isn't
> terribly important, it just needs to be "reasonable."
>
> I browsed an article on the SYSTEM/BERNOULLI representations; they
> both amount to simple picks of tuples.
>
> - BERNOULLI implies picking tuples with a specified probability.
>
> - SYSTEM implies picking pages with a specified probability.  (I think
> we'd mess with this in ways that'll be fairly biased, given that tuples
> may not be of uniform size, particularly if Slightly Smaller strings
> stay in the main pages, whilst Slightly Larger strings get TOASTed...)

Dealing with non uniform sizes isn't too hard. analyze.c already does
that. Given a table with B blocks it takes a uniform sample of b
blocks, and runs Vitter's reservoir sampling algorithm over the
resulting blocks to get s random tuples in a single pass. It's quite
easy to prove that this results in each tuple having an equal
probability to appear in the final table. However, it isn't fully
independent sampling - depending on the value of b compared to s and
B, there is a slightly higher probability to see multiple tuples
picked from one page. I'm too lazy to do the math, but for the analyze
case of b = s it probably isn't significant for most practical
purposes, even if B is really large. And it seems to me that when b >=
s the reservoir sampling thresholds could be tweaked at block
boundaries to even out the dependencies.

The ratio of b to s could be tweaked to get lower quality sampling
(samples are more spatially clumped) in exchange for less random I/O.
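A minimal sketch of that two-stage approach (plain Python; the block and table shapes are invented): stage one picks b blocks uniformly, stage two runs a reservoir sample (Vitter's "Algorithm R") over the tuples on those blocks.

```python
import random

def two_stage_sample(heap, b, s, rng):
    """Sample b blocks uniformly, then reservoir-sample s tuples from them."""
    chosen = rng.sample(range(len(heap)), b)      # stage 1: uniform block pick
    reservoir, seen = [], 0
    for blockno in chosen:                        # stage 2: Algorithm R
        for tup in heap[blockno]:
            seen += 1
            if len(reservoir) < s:
                reservoir.append(tup)
            else:
                j = rng.randrange(seen)           # replace with probability s/seen
                if j < s:
                    reservoir[j] = tup
    return reservoir

heap = [[(p, i) for i in range(4)] for p in range(1000)]   # B = 1000 blocks
sample = two_stage_sample(heap, b=100, s=50, rng=random.Random(0))
assert len(sample) == 50
assert len(set(sample)) == 50   # distinct tuples, one pass over b blocks only
```

Tuning b against s trades random I/O for spatial clumping exactly as described above.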

> Possibly the forms of sampling that people *actually* need, most of
> the time, are more like Dollar Unit Sampling, which are pretty
> deterministic, in ways that mandate that they be rather expensive
> (e.g. - guaranteeing Seq Scan).

I have a gut feeling that Dollar Unit Sampling and other weighted
samples can be done with a similar approach of uniformly sampling
blocks and then running weighted reservoir sampling [1] over the
result.

[1]: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
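For reference, the weighted scheme in [1] (Efraimidis and Spirakis' "A-Res") is roughly: give each item of weight w the key u**(1/w) for uniform u in (0,1), and keep the s largest keys, so heavier items are proportionally more likely to be picked. A hedged sketch (the item names and weights are arbitrary stand-ins for monetary units):

```python
import heapq
import random

def weighted_reservoir(items, s, rng):
    """A-Res: keep the s items with the largest u**(1/w) keys."""
    heap = []                                   # min-heap of (key, item)
    for item, weight in items:
        key = rng.random() ** (1.0 / weight)    # u**(1/w), u uniform in [0,1)
        if len(heap) < s:
            heapq.heappush(heap, (key, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, item))
    return [item for _, item in heap]

# Hypothetical population with weights 1..100 (mean ~50.5):
items = [("item-%d" % i, float(i % 100 + 1)) for i in range(10000)]
sample = weighted_reservoir(items, 100, random.Random(1))
weight_of = dict(items)
assert len(sample) == 100
# Heavier items dominate: the sample's mean weight lands well above the
# population mean (roughly E[w^2]/E[w] ~ 67 for these weights).
assert sum(weight_of[i] for i in sample) / 100 > 60
```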

Cheers,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From: Qi Huang <huangqiyx(at)hotmail(dot)com>
To: <ants(at)cybertec(dot)at>, <cbbrowne(at)gmail(dot)com>
Cc: <sfrost(at)snowman(dot)net>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-18 03:11:57
Message-ID: BAY159-W399366318A30988BCF81FDA33C0@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Date: Wed, 18 Apr 2012 02:45:09 +0300
> Subject: Re: [HACKERS] Gsoc2012 idea, tablesample
> From: ants(at)cybertec(dot)at
> To: cbbrowne(at)gmail(dot)com
> CC: sfrost(at)snowman(dot)net; pgsql-hackers(at)postgresql(dot)org
>
> On Tue, Apr 17, 2012 at 7:33 PM, Christopher Browne <cbbrowne(at)gmail(dot)com> wrote:
> > Well, there may be cases where the quality of the sample isn't
> > terribly important, it just needs to be "reasonable."
> >
> > I browsed an article on the SYSTEM/BERNOULLI representations; they
> > both amount to simple picks of tuples.
> >
> > - BERNOULLI implies picking tuples with a specified probability.
> >
> > - SYSTEM implies picking pages with a specified probability. (I think
> > we'd mess with this in ways that'll be fairly biased, given that tuples
> > may not be of uniform size, particularly if Slightly Smaller strings
> > stay in the main pages, whilst Slightly Larger strings get TOASTed...)
Looking at the definition of the BERNOULLI method, which means scanning all the tuples, I always have a question: what is the difference between using the BERNOULLI method and using "SELECT * .... WHERE random() < 0.1"? They will both go through all the tuples and cost a seq scan.

If the answer to the above question is "no difference", I have a proposal for another implementation of BERNOULLI. For a relation, we can have all its tuples assigned a unique and continuous ID (we may use ctid or something else). Then for each number in the set of IDs, we assign a random number and check whether it is smaller than the sampling percentage. If it is smaller, we retrieve the tuple corresponding to that ID. This method will not seq scan all the tuples, but it can sample by picking tuples.

Thanks
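One way to read that proposal, as a toy Python sketch (the fetch-by-ID function and the dense, continuous ID set are assumed for illustration; real ctids are neither dense nor continuous):

```python
import random

def sample_by_id(fetch, n_ids, fraction, rng):
    """Flip a coin per ID; fetch only the winners instead of reading every tuple."""
    return [fetch(tid) for tid in range(n_ids) if rng.random() < fraction]

table = {tid: "tuple-%d" % tid for tid in range(10000)}
fetched_ids = []

def fetch(tid):
    fetched_ids.append(tid)     # record which tuples we actually touched
    return table[tid]

sample = sample_by_id(fetch, len(table), 0.10, random.Random(7))
assert len(sample) == len(fetched_ids)      # only sampled tuples were fetched
assert len(fetched_ids) < len(table) // 2   # nowhere near a full scan
```

Note the loop still makes one random draw per ID, so CPU cost stays linear in table size; the saving is in tuple fetches (I/O), and the hard part in practice is obtaining such a dense ID to fetch by.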

Best Regards and Thanks
Huang Qi Victor
Computer Science, National University of Singapore


From: Sandro Santilli <strk(at)keybit(dot)net>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-19 07:56:26
Message-ID: 20120419075626.GC21317@gnash
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 17, 2012 at 04:29:52PM -0400, Stephen Frost wrote:
> Josh,
>
> * Josh Berkus (josh(at)agliodbs(dot)com) wrote:
> > FWIW, the PostGIS folks would *really* love to have a TABLESAMPLE which
> > worked with geographic indexes. This would be tremendously useful for
> > constructing low-resolution "zoom out" tiles on maps and similar.
>
> I'm familiar with the concept of 'zoom out' tiles and PostGIS, but I
> don't actually see the connection between that and TABLESAMPLE. Perhaps
> I'm missing something, but I've never seen a case where you create 'zoom
> out' tiles by just grabbing some portion of the data at random, as that
> would end up creating empty spots or missing pieces.

Actually a random sample would really be representative of the data
distribution. What the type analyzer gets is a sample and that sample
is what the estimator looks at to answer the question:

How many rows fall in this rectangle ?

You can see how well it works by passing your queries using && operator
to "EXPLAIN ANALYZE" and compare estimated/real.

I'm looking for a way to fetch random samples these days, so I can
confirm the need for a quick way to fetch, at the SQL level, the same
sample that the "analyze" command fetches.

--strk;

,------o-.
| __/ | Delivering high quality PostGIS 2.0 !
| / 2.0 | http://strk.keybit.net - http://vizzuality.com
`-o------'


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-19 12:47:51
Message-ID: 20120419124751.GX1267@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Sandro Santilli (strk(at)keybit(dot)net) wrote:
> Actually a random sample would really be representative of the data
> distribution. What the type analyzer gets is a sample and that sample
> is what the estimator looks at to answer the question:

That might work if all you have is point data, but lines, polygons,
etc., you're typically going to want to see, just not at the same
resolution. At least, when you're talking about 'zoom-out' tiles, which
is what this was about up thread.

> I'm looking for a way to fetch random samples these days so I confirm
> the need for a quick way to fetch the same sample that "analyze"
> command fetches but at SQL level.

I'm all for supporting that and implementing this feature, I just don't
think it's going to be all that useful for zoom-out tiles when complex
geometries are involved.

Thanks,

Stephen


From: Sandro Santilli <strk(at)keybit(dot)net>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-19 13:45:12
Message-ID: 20120419134512.GM25859@gnash
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Apr 19, 2012 at 08:47:51AM -0400, Stephen Frost wrote:
> * Sandro Santilli (strk(at)keybit(dot)net) wrote:
> > Actually a random sample would really be representative of the data
> > distribution. What the type analyzer gets is a sample and that sample
> > is what the estimator looks at to answer the question:
>
> That might work if all you have is point data, but lines, polygons, etc,
> you're typically going to want to see, just not at the same resolution..
> At least, when you're talking about 'zoom-out' tiles, which is what this
> was about up thread.
>
> > I'm looking for a way to fetch random samples these days so I confirm
> > the need for a quick way to fetch the same sample that "analyze"
> > command fetches but at SQL level.
>
> I'm all for supporting that and implementing this feature, I just don't
> think it's going to be all that useful for zoom-out tiles when complex
> geometries are involved.

Me neither. But for points it sounds very useful.
And we know it is useful for lines and polygons as well when it comes
to estimating overlaps... (since the estimator does a good job even for
lines and polygons)

I really hope Neil Conway's work from 2007 can make it into PostgreSQL.

Look, the same work was the topic of a homework assignment at Berkeley in
2005: http://inst.eecs.berkeley.edu/~cs186/fa05/hw/hw2/hw2.html

And the whole thing is in the SQL:2003 standard.

--strk;

,------o-.
| __/ | Delivering high quality PostGIS 2.0 !
| / 2.0 | http://strk.keybit.net - http://vizzuality.com
`-o------'


From: Qi Huang <huangqiyx(at)hotmail(dot)com>
To: <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <josh(at)agliodbs(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>, <andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <cbbrowne(at)gmail(dot)com>, <kevin(dot)grittner(at)wicourts(dot)gov>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-21 06:28:52
Message-ID: BAY159-W354843E3FF73CC06DD86F0A3230@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Hi, Heikki
> 1. We probably don't want the SQL syntax to be added to the grammar.
> This should be written as an extension, using custom functions as the
> API, instead of extra SQL syntax.
>
> 2. It's not very useful if it's just a dummy replacement for "WHERE
> random() < ?". It has to be more advanced than that. Quality of the
> sample is important, as is performance. There was also an interesting
> idea of implementing monetary unit sampling.
>
>
> Another idea that Robert Haas suggested was to add support for doing a TID
> scan for a query like "WHERE ctid < '(501,1)'". That's not enough work
> for a GSoC project on its own, but it could certainly be a part of it.
>

Based on the discussion these days and my understanding, I don't see much change to be made in my proposal. Of the three points you raised, the first and the last are still not clear. Especially on the first point, I see that some people think making the SQL syntax part of the grammar is the first choice.

For the second point, the SQL:2003 standard defines two methods for sampling, SYSTEM and BERNOULLI, and I think there may be possible quality refinements for them. As for the optimizer statistics, I have an idea of using them to assign different sampling percentages to different pages, but I'm not sure about the details yet; I need to look into and learn the optimizer statistics (this part is mentioned by Neil in his idea, so I think there should be a way of using it). There might also be enhancements for specific kinds of sampling, like monetary unit sampling or geographic index sampling. I could do this part (sampling quality improvement) as a research-based project. We can still discuss further to see whether these can be done and how to do them.
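For reference, the two SQL:2003 methods mentioned above differ in sampling granularity. A minimal sketch over an abstract list of pages (illustrative only, not the on-disk implementation; `pages` is an assumed list-of-lists of rows):

```python
import random

def system_sample(pages, pct, rng=None):
    """SYSTEM: keep or skip whole pages; cheap (few page reads) but the
    sample is clustered by page."""
    rng = rng or random.Random()
    return [row for page in pages if rng.random() < pct / 100.0
            for row in page]

def bernoulli_sample(pages, pct, rng=None):
    """BERNOULLI: an independent coin flip per row; statistically cleaner
    but must visit every page."""
    rng = rng or random.Random()
    return [row for page in pages for row in page
            if rng.random() < pct / 100.0]
```

The quality/cost trade-off discussed in this thread is visible here: SYSTEM touches roughly `pct`% of the pages, while BERNOULLI reads them all.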
I post my current amended proposal below. The changes are in red.
-----------------------------------------------------------------------------------------------------------------------

Project Details:

Neil Conway came up with an implementation in 2007 while giving a talk on introducing hacking in PostgreSQL. The code was for demo purposes and was incomplete; it was never integrated into PostgreSQL. PostgreSQL now is quite different from 2007. To implement this query, I need to understand Neil's implementation and use the general idea to implement it in the most up-to-date PostgreSQL release. In the end, I need to test the implementation until it can be released in the next version. I will also explore possible ways of further enhancing the sampling quality, like using optimizer statistics to produce a more accurate sample. There are also suggestions to integrate different sampling types into this query: for accounting data, for example, I could use "monetary unit sampling", or I could implement geographic index sampling. These specific types of sampling might require at least a sequential scan over the data, which slows the sampling down, but the sampling quality will be greatly enhanced and very useful in some specific fields.

List of features:

1. TABLESAMPLE using SELECT, DELETE, and UPDATE

2. SYSTEM method

3. REPEATABLE support

4. BERNOULLI method

5. Use optimizer statistics to produce a more accurate sample

6. Non-integer sample percentage and repeat seed

7. Sampling quality enhancement

4, 5, and 6 are not included in Neil's implementation.

For 5, we can use optimizer statistics to refine the algorithm for the random selection of pages or rows; the sample produced should be more accurate.

Inch-stones:

1. Implement the basic features: the TABLESAMPLE clause usable in SELECT with the SYSTEM method, in different combinations of SQL queries.

2. Implement the other basic features, REPEATABLE and BERNOULLI.

3. Implement the improvements: support for using optimizer statistics to produce a more accurate sample, non-integer sample percentage and repeat seed, and sampling quality improvement.

Project Schedule:

1. April 23rd - May 10th: learning and understanding.

2. Mid May - mid June: implement a simple TABLESAMPLE clause, with the SYSTEM method and no REPEATABLE support, and do testing.

3. Mid June - mid July: implement the other supports, like the REPEATABLE clause and the BERNOULLI method, and do testing. Improvements 5 and 6 are also implemented now.

4. Mid July - mid Aug: exploring ways of improving sampling quality should be done in periods 2 and 3; this period will be used to implement those ideas.
-----------------------------------------------------------------------------------------------------------------------

Best Regards and Thanks
Huang Qi Victor
Computer Science, National University of Singapore


From: Sandro Santilli <strk(at)keybit(dot)net>
To: Qi Huang <huangqiyx(at)hotmail(dot)com>
Cc: heikki(dot)linnakangas(at)enterprisedb(dot)com, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin(dot)grittner(at)wicourts(dot)gov
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-23 13:37:42
Message-ID: 20120423133742.GO26868@gnash
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Apr 21, 2012 at 02:28:52PM +0800, Qi Huang wrote:
>
> Hi, Heikki
...
> > Another idea that Robert Haas suggested was to add support for doing a TID
> > scan for a query like "WHERE ctid < '(501,1)'". That's not enough work
> > for a GSoC project on its own, but it could certainly be a part of it.
>
> the first one and the last one are still not clear.

The last one was the TID scan on filters like ctid < '(501,1)'.
TID "scans" are the fastest access method, as they directly access
explicitly referenced addresses. Starting from this observation, a sampling
function may select random pages and tuples within pages and directly
access them, optimizing accesses by grouping tuples within the same
page so as to fetch them all together.

This is what the ANALYZE command already does when providing samples
for the type analyzers.

Unfortunately it looks like at SQL level only the equality operator triggers
a TID scan, so things like "WHERE ctid < '(501,1)'" won't be as fast as
fetching all visible tuples in the first 501 pages.

I think that's what Heikki was referring to.

I'd love to see enhanced CTID operators, to fetch all visible tuples in a page
using a tidscan. Something like: WHERE ctid =~ '(501,*)' or a ctidrange.

--strk;

,------o-.
| __/ | Delivering high quality PostGIS 2.0 !
| / 2.0 | http://strk.keybit.net - http://vizzuality.com
`-o------'


From: Ants Aasma <ants(at)cybertec(dot)at>
To: Qi Huang <huangqiyx(at)hotmail(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin(dot)grittner(at)wicourts(dot)gov
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-23 17:34:44
Message-ID: CA+CSw_twJkJ0P45oGNXRJ4RFfknEuuR4nVd=GqXUGw4gGm-HUA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 23, 2012 at 4:37 PM, Sandro Santilli <strk(at)keybit(dot)net> wrote:
> I'd love to see enhanced CTID operators, to fetch all visible tuples in a page
> using a tidscan.  Something like: WHERE ctid =~ '(501,*)' or a ctidrange.

Among other things, this would enable user-space implementation of
tablesample. Given the operator =~(tid, int) that matches the page
number and planner/executor integration so that it results in a TID
scan, you would need the following functions:

random_pages(tbl regclass, samples int) returns int[]
aggregate function:
reservoir_sample(item anyelement, samples int) returns anyarray

Implementations for both of the functions could be adapted from analyze.c.

Then tablesample could be implemented with the following query:
SELECT (SELECT reservoir_sample(some_table, 50) AS samples
FROM some_table WHERE ctid =~ ANY (rnd_pgtids))
FROM random_pages('some_table', 50) AS rnd_pgtids;

Actually, now that I think about it, it could actually be implemented
without any modifications to core at some cost to efficiency.
random_pages would have to return tid[] that contains for each
generated pagenumber all possible tids on that page.

By making the building blocks available users get more flexibility.
The downside would be that we can't automatically make better sampling
methods available.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From: Sandro Santilli <strk(at)keybit(dot)net>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Qi Huang <huangqiyx(at)hotmail(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin(dot)grittner(at)wicourts(dot)gov
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-24 06:49:26
Message-ID: 20120424064926.GE7891@gnash
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote:
> On Mon, Apr 23, 2012 at 4:37 PM, Sandro Santilli <strk(at)keybit(dot)net> wrote:
> > I'd love to see enhanced CTID operators, to fetch all visible tuples in a page
> > using a tidscan.  Something like: WHERE ctid =~ '(501,*)' or a ctidrange.
>
> Among other things, this would enable user-space implementation of
> tablesample. Given the operator =~(tid, int) that matches the page
> number and planner/executor integration so that it results in a TID
> scan, you would need the following functions:
>
> random_pages(tbl regclass, samples int) returns int[]
> aggregate function:
> reservoir_sample(item anyelement, samples int) returns anyarray
>
> Implementations for both of the functions could be adapted from analyze.c.
>
> Then tablesample could be implemented with the following query:
> SELECT (SELECT reservoir_sample(some_table, 50) AS samples
> FROM some_table WHERE ctid =~ ANY (rnd_pgtids))
> FROM random_pages('some_table', 50) AS rnd_pgtids;
>
> Actually, now that I think about it, it could actually be implemented
> without any modifications to core at some cost to efficiency.
> random_pages would have to return tid[] that contains for each
> generated pagenumber all possible tids on that page.

This is exactly what I'm after.
I've actually started crafting such a TableSample function and I'm in the
process to refine the signature so your suggested interface above is
very useful, thanks !

But I don't understand the reservoir_sample call; what is it supposed to do?
And how flexible would the "anyarray" return be? Could you return
arbitrarily typed rowtypes from it?

> By making the building blocks available users get more flexibility.
> The downside would be that we can't automatically make better sampling
> methods available.

One approach doesn't preclude the other. TABLESAMPLE will still be useful,
also for SQL compliance.

--strk;

,------o-.
| __/ | Delivering high quality PostGIS 2.0 !
| / 2.0 | http://strk.keybit.net - http://vizzuality.com
`-o------'


From: Sandro Santilli <strk(at)keybit(dot)net>
To: Ants Aasma <ants(at)cybertec(dot)at>, Qi Huang <huangqiyx(at)hotmail(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin(dot)grittner(at)wicourts(dot)gov
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-24 07:31:36
Message-ID: 20120424073136.GF7891@gnash
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 24, 2012 at 08:49:26AM +0200, Sandro Santilli wrote:
> On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote:

> > SELECT (SELECT reservoir_sample(some_table, 50) AS samples
> > FROM some_table WHERE ctid =~ ANY (rnd_pgtids))
> > FROM random_pages('some_table', 50) AS rnd_pgtids;
>
> But I don't understand the reservoir_sample call, what is it supposed to do ?

Ok got it, that was probably to avoid:

ERROR: more than one row returned by a subquery used as an expression

But this also works nicely:

SELECT * FROM lots_of_points
WHERE ctid = ANY ( ARRAY[(SELECT random_tids('lots_of_points', 100000))] );

and still uses tidscan.

The advanced TID operator would be for random_tids to only return pages rather
than full tids...

--strk;

,------o-.
| __/ | Delivering high quality PostGIS 2.0 !
| / 2.0 | http://strk.keybit.net - http://vizzuality.com
`-o------'


From: Ants Aasma <ants(at)cybertec(dot)at>
To: Ants Aasma <ants(at)cybertec(dot)at>, Qi Huang <huangqiyx(at)hotmail(dot)com>, heikki(dot)linnakangas(at)enterprisedb(dot)com, josh(at)agliodbs(dot)com, pgsql-hackers(at)postgresql(dot)org, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, cbbrowne(at)gmail(dot)com, kevin(dot)grittner(at)wicourts(dot)gov
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-04-24 09:00:55
Message-ID: CA+CSw_up_aWm1LOp4A4RWqPPb56hUuWS8+OMSu3xU78KwXChhQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 24, 2012 at 10:31 AM, Sandro Santilli <strk(at)keybit(dot)net> wrote:
> On Tue, Apr 24, 2012 at 08:49:26AM +0200, Sandro Santilli wrote:
>> On Mon, Apr 23, 2012 at 08:34:44PM +0300, Ants Aasma wrote:
>
>> > SELECT (SELECT reservoir_sample(some_table, 50) AS samples
>> >    FROM some_table WHERE ctid =~ ANY (rnd_pgtids))
>> > FROM random_pages('some_table', 50) AS rnd_pgtids;
>>
>> But I don't understand the reservoir_sample call, what is it supposed to do ?
>
> Ok got it, that was probably to avoid:
>
>  ERROR:  more than one row returned by a subquery used as an expression

No, it's to avoid bias towards tuples on more sparsely populated
pages. See http://en.wikipedia.org/wiki/Reservoir_sampling or
http://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/commands/analyze.c;h=ff271644e0f93ee99bfe9c1f536f3dd48455d8d2;hb=HEAD#l1027
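As a rough illustration of what reservoir sampling does (a minimal unweighted sketch of Vitter's Algorithm R, not the analyze.c code):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Pick k items uniformly at random from an iterable of unknown
    length, visiting each item exactly once (Vitter's Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Running this over all rows fetched from the sampled pages gives every row the same inclusion probability, rather than favoring rows that sit on sparsely populated pages.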

> The advanced TID operator would be for random_tids to only return pages rather
> than full tids...

Exactly. But when mainly IO bound (ie. sampling from a large table on
spinning rust) the overhead of probing with TID scan as opposed to
sequentially scanning the pages should be small enough. When CPU bound
I suspect that the function call machinery overhead for
reservoir_sample is going to become a large issue, so a built in
tablesample also has an edge there.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From: Qi Huang <huangqiyx(at)hotmail(dot)com>
To: <ants(at)cybertec(dot)at>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <cbbrowne(at)gmail(dot)com>, <kevin(dot)grittner(at)wicourts(dot)gov>
Cc: <sfrost(at)snowman(dot)net>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-10 08:43:10
Message-ID: BAY159-W2300077743BFF39B90434BA3160@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


Hi, all

Thanks for your ideas on the implementation of TABLESAMPLE. Below is a summary of the high-level requirements from the -hackers thread so far. Please comment further, and if I missed any point, please feel free to add it.

1. Build a new type of node, as I should not use the SEQSCAN node, which scans the whole table. One of the aims of TABLESAMPLE is to avoid scanning every row and thus save time.

2. Use TIDSCAN to directly access tuples. The way of using ctid proposed by Kevin looks good:

"One technique which might be suitably random without reading the whole table would be to figure out a maximum block number and tuple ID for the table, and generate a series of random ctid values to read. If the tuple doesn't exist or is not visible to the snapshot, you ignore it and continue, until you have read the requisite number of rows. You could try to generate them in advance and sort them by block number, but then you need to solve the problems of what to do if that set of ctids yields too many rows or too few rows, both of which have sticky issues." --Kevin

I think this technique could be considered as an implementation algorithm for the BERNOULLI method. It looks like it could still reduce a lot of cost compared to assigning a random number to every tuple and then retrieving.

3. Add specific sampling methods, like dollar unit sampling or geographic index support, which meet specific requirements but cost more:

"I have a gut feeling that Dollar Unit Sampling and other weighted samples can be done with a similar approach of uniformly sampling blocks and then running weighted reservoir sampling [1] over the result.
[1]: http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf" --Ants Aasma

Ants proposed the above method, using weighted reservoir sampling, for these problems.

4. Implement the sampling the way ANALYZE does it:

"I'm looking for a way to fetch random samples these days so I confirm the need for a quick way to fetch the same sample that "analyze" command fetches but at SQL level." --strk

As proposed by Sandro Santilli, we can look at the way ANALYZE does its sampling and try to implement it in TABLESAMPLE.
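Kevin's random-ctid technique quoted in point 2 can be sketched in Python. This is a hypothetical illustration, not executor code; `max_block` and `max_tuple` are assumed inputs that would come from the relation's block count and a per-page tuple bound:

```python
import random

def random_ctids(max_block, max_tuple, n, rng=None):
    """Generate n random (block, tuple) candidates, sorted by block so
    page reads can be grouped. The caller skips candidates that point
    at nonexistent or invisible tuples and keeps drawing until it has
    read the requisite number of rows."""
    rng = rng or random.Random()
    return sorted((rng.randrange(max_block + 1),
                   rng.randrange(1, max_tuple + 1))
                  for _ in range(n))
```

As the quote notes, sorting a pre-generated batch is cheap, but it leaves the too-many-rows/too-few-rows problem to the caller.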

The above are the four points I summarized from the -hackers discussion. I put a brief introduction to this GSoC project on the Postgres wiki. You are welcome to check it and modify it to enhance it.
http://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation

I will add the ideas formalized in this mailing thread to the wiki page once our discussion has been completed.

Thanks.

Best Regards
Huang Qi Victor
Computer Science, National University of Singapore


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Qi Huang <huangqiyx(at)hotmail(dot)com>
Cc: <ants(at)cybertec(dot)at>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <cbbrowne(at)gmail(dot)com>, <kevin(dot)grittner(at)wicourts(dot)gov>, <sfrost(at)snowman(dot)net>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-10 11:43:01
Message-ID: 798891F2-2E56-41CB-89A3-5BF5F332F77C@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On May10, 2012, at 10:43 , Qi Huang wrote:
> 2. use TIDSCAN to directly access tuples. The below way of using ctid proposed by Kevin looks good.
>
> -One technique which might be suitably random without reading the
> -whole table would be to figure out a maximum block number and tuple
> -ID for the table, and generate a series of random ctid values to
> -read. If the tuple doesn't exist or is not visible to the snapshot,
> -you ignore it and continue, until you have read the requisite number
> -of rows. You could try to generate them in advance and sort them by
> -block number, but then you need to solve the problems of what to do
> -if that set of ctids yields too many rows or too few rows, both of
> -which have sticky issues.

> I think this technique could be considered as an implementation algo for BERNOULLI method. It looks that it could still reduce a lot of cost compared to just assign random number to every tuple and then retrieve.

One problem I see with this approach is that its efficiency depends on the average tuple length, at least with a naive approach to random ctid generator. The simplest way to generate those randomly without introducing bias is to generate a random page index between 0 and the relation's size in pages, and then generate random tuple index between 0 and MaxHeapTuplesPerPage, which is 291 on x86-64 assuming the standard page size of 8k.

The current toasting threshold (TOAST_TUPLE_THRESHOLD) is approximately 2k, so having tables with an average heap tuple size of a few hundred bytes doesn't seem unlikely. Now, assume the average tuple length is 128 bytes, i.e. on average you'll have ~ 8k/128 = 64 live tuples / page if the fill factor is 100% and all tuples are live. To account for lower fill factors and dead tuples, let's thus say there are 50 live tuples / page. Then, on average, only every 6th randomly generated ctid will point to a live tuple. But whether or not it does can only be decided after reading the page from disk, so you end up with a rate of 6 random-access reads per returned tuple.

IIRC, the cutoff point where an index scan loses compared to a sequential scan is somewhere around 10% of the table read, i.e. if a predicate selects more than 10% of the available rows, a sequential scan is more efficient than an index scan. Scaling that with the 1/6-th success rate from above means that Kevin's approach would only beat a sequential scan if the sampling percentage isn't much larger than 1%, assuming an average row size of 128 bytes.
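Florian's numbers can be checked directly; the constants below are the assumptions stated in the message (8k pages, 128-byte average tuples, ~50 live tuples per page, MaxHeapTuplesPerPage = 291, ~10% index-vs-seqscan cutoff):

```python
PAGE_SIZE = 8192                 # standard 8k page
MAX_HEAP_TUPLES_PER_PAGE = 291   # x86-64, 8k pages, per the message
avg_tuple_len = 128

# Live tuples per page at 100% fill factor with no dead tuples.
tuples_per_page = PAGE_SIZE // avg_tuple_len          # 64

# Assume ~50 live tuples/page after fill factor and dead tuples.
live_per_page = 50
hit_rate = live_per_page / MAX_HEAP_TUPLES_PER_PAGE   # ~0.17
reads_per_tuple = 1 / hit_rate                        # ~6 random reads per returned row

# Seqscan beats index scan above ~10% of the table; scale by the hit rate.
seq_scan_cutoff = 0.10
break_even_sample_frac = seq_scan_cutoff * hit_rate   # ~1.7% of the table
```

So under these assumptions the random-ctid approach only wins for sample fractions in the low single-digit percents, matching Florian's conclusion.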

The algorithm still seems like a good choice for very small sampling percentages, though.

best regards,
Florian Pflug


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Qi Huang" <huangqiyx(at)hotmail(dot)com>,"Florian Pflug" <fgp(at)phlo(dot)org>
Cc: <josh(at)agliodbs(dot)com>,<andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <ants(at)cybertec(dot)at>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <cbbrowne(at)gmail(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <sfrost(at)snowman(dot)net>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-10 14:28:17
Message-ID: 4FAB8A310200002500047ABA@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Florian Pflug <fgp(at)phlo(dot)org> wrote:

> One problem I see with this approach is that its efficiency
> depends on the average tuple length, at least with a naive
> approach to random ctid generator. The simplest way to generate
> those randomly without introducing bias is to generate a random
> page index between 0 and the relation's size in pages,

Right.

> and then generate random tuple index between 0 and
> MaxHeapTuplesPerPage, which is 291 on x86-64 assuming the standard
> page size of 8k.

I think we can do better than that without moving too far from "the
simplest way". It seems like we should be able to get a more
accurate calculation of a minimum base tuple size based on the table
definition, and calculate the maximum number of those which could
fit on a page. On the other hand, ctid uses a line pointer, doesn't
it? Do we need to worry about dead line pointers allowing higher
tuple indexes than the calculated maximum number of tuples per page?

> The current toasting threshold (TOAST_TUPLE_THRESHOLD) is
> approximately 2k, so having tables with an average heap tuple size
> of a few hundred bytes doesn't seem unlikely. Now, assume the
> average tuple length is 128 bytes, i.e. on average you'll have ~
> 8k/128 = 64 live tuples / page if the fill factor is 100% and all
> tuples are live. To account for lower fill factors and dead
> tuples, let's thus say there are 50 live tuples / page. Then, on
> average, only every 6th randomly generated ctid will point to a
> live tuple. But whether or not it does can only be decided after
> reading the page from disk, so you end up with a rate of 6
> random-access reads per returned tuple.
>
> IIRC, the cutoff point where an index scan loses compared to a
> sequential scan is somewhere around 10% of the table read, i.e. if
> a predicate selects more than 10% of the available rows, a
> sequential scan is more efficient than an index scan.

That ratio *might* be better for a ctid scan, since you don't have
the index access in the mix.

> Scaling that with the 1/6-th success rate from above means that
> Kevin's approach would only beat a sequential scan if the sampling
> percentage isn't much larger than 1%, assuming an average row size
> of 128 bytes.
>
> The algorithm still seems like a good choice for very small
> sampling percentages, though.

Yeah, even with a maximum tuple count calculated by page, there are
certainly going to be cases where another approach will be faster,
especially where the sample is a relatively high percentage of the
table. It would be good to have multiple plans compete on costs, if
possible. It would not surprise me at all if the typical break-even
point between the two techniques was somewhere on the order of a 1%
sample size, but one would hope we could get there on the basis of
estimated costs rather than using arbitrary rules or forcing the
user to guess and code a choice explicitly.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Qi Huang <huangqiyx(at)hotmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, josh(at)agliodbs(dot)com, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, ants(at)cybertec(dot)at, heikki(dot)linnakangas(at)enterprisedb(dot)com, cbbrowne(at)gmail(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, pgsql-hackers(at)postgresql(dot)org, sfrost(at)snowman(dot)net
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-10 15:33:26
Message-ID: CA+TgmoY1VLK=xJk7Pr++6Saw=OHpx-+ooUpEum6zjoPtzJDi2w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, May 10, 2012 at 10:28 AM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> One problem I see with this approach is that its efficiency
>> depends on the average tuple length, at least with a naive
>> approach to random ctid generator. The simplest way to generate
>> those randomly without introducing bias is to generate a random
>> page index between 0 and the relation's size in pages,
>
> Right.
>
>> and then generate random tuple index between 0 and
>> MaxHeapTuplesPerPage, which is 291 on x86-64 assuming the standard
>> page size of 8k.
>
> I think we can do better than that without moving too far from "the
> simplest way".  It seems like we should be able to get a more
> accurate calculation of a minimum base tuple size based on the table
> definition, and calculate the maximum number of those which could
> fit on a page.  On the other hand, ctid uses a line pointer, doesn't
> it?  Do we need to worry about dead line pointers allowing higher
> tuple indexes than the calculated maximum number of tuples per page?

I wonder if you could do this with something akin to the Bitmap Heap
Scan machinery. Populate a TID bitmap with a bunch of randomly chosen
TIDs, fetch them all in physical order, and if you don't get as many
rows as you need, rinse and repeat until you do.
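The rinse-and-repeat scheme can be sketched as follows. This is a hypothetical Python illustration, not bitmap-heap-scan code: `fetch(block, offset)` stands in for fetching one TID and returns None for a dead or missing tuple, and the batch size and round limit are arbitrary assumptions:

```python
import random

def sample_rows(fetch, n_pages, max_tuples_per_page, target, rng=None,
                max_rounds=20):
    """Collect `target` live rows: fill a batch of random TIDs, sort it so
    fetches happen in physical (block) order, and repeat with a fresh
    batch if too few TIDs pointed at live tuples."""
    rng = rng or random.Random()
    rows = []
    for _ in range(max_rounds):
        # A real implementation would populate a TID bitmap; a sorted list
        # of (block, offset) pairs approximates the same access pattern
        # (duplicates possible here, which a bitmap would collapse).
        batch = sorted((rng.randrange(n_pages),
                        rng.randrange(1, max_tuples_per_page + 1))
                       for _ in range(target * 2))
        for blk, off in batch:
            row = fetch(blk, off)
            if row is not None:
                rows.append(row)
                if len(rows) >= target:
                    return rows
    return rows  # the table may hold fewer live rows than requested
```

Kevin's follow-up concern applies here too: overshooting a batch is easy to trim, but repeated undershooting can degenerate into reading most of the table.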

I'm worried this project is getting so complicated that it will be
beyond the ability of a new hacker to get anything useful done. Can
we simplify the requirements here to something that is reasonable for
a beginner?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: <josh(at)agliodbs(dot)com>,<andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <ants(at)cybertec(dot)at>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <cbbrowne(at)gmail(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, "Qi Huang" <huangqiyx(at)hotmail(dot)com>, "Florian Pflug" <fgp(at)phlo(dot)org>, <pgsql-hackers(at)postgresql(dot)org>, <sfrost(at)snowman(dot)net>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-10 16:36:45
Message-ID: 4FABA84D0200002500047AE5@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I wonder if you could do this with something akin to the Bitmap
> Heap Scan machinery. Populate a TID bitmap with a bunch of
> randomly chosen TIDs, fetch them all in physical order

It would be pretty hard for any other plan to beat that by very
much, so it seems like a good approach which helps keep things
simple.

> and if you don't get as many rows as you need, rinse and repeat
> until you do.

Ay, there's the rub. If you get too many, it is important that you
read all the way to the end and then randomly omit some of them.
While a bit of a bother, that's pretty straightforward and should be
pretty fast, assuming you're not, like, an order of magnitude high.
But falling short is tougher; making up the difference could be an
iterative process, which could always wind up having you read all
tuples in the table without filling your sample. Still, this
approach seems like it would perform better than generating random
ctid values and randomly fetching until you've tried them all.
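The over-sample-then-trim step described above can be sketched as a
single pass over the fetched rows (a hypothetical Python illustration,
not PostgreSQL code; `fetched` stands for whatever the TID-bitmap
fetch returned, in physical order):

```python
import random

def trim_to_sample(fetched, n, rng=random):
    """Given more rows than requested, keep exactly n of them, each
    row equally likely to survive (Knuth's selection sampling).
    One pass, so the physical fetch order is preserved."""
    kept = []
    remaining = len(fetched)
    needed = n
    for row in fetched:
        # Keep this row with probability needed/remaining.
        if rng.random() * remaining < needed:
            kept.append(row)
            needed -= 1
        remaining -= 1
    return kept
```

Because the keep-probability rises to 1 as the remaining rows run out,
the result is guaranteed to contain exactly n rows whenever the input
has at least that many.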

> I'm worried this project is getting so complicated that it will be
> beyond the ability of a new hacker to get anything useful done.
> Can we simplify the requirements here to something that is
> reasonable for a beginner?

I would be inclined to omit monetary unit sampling from the first
commit. Do the parts specified in the standard first and get it
committed. Useful as unit sampling is, it seems like the hardest to
do, and should probably be done "if time permits" or left as a
future enhancement. It's probably enough to just remember that it's
there and make a "best effort" attempt not to paint ourselves into a
corner which precludes its development.

-Kevin


From: Ants Aasma <ants(at)cybertec(dot)at>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Qi Huang <huangqiyx(at)hotmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, josh(at)agliodbs(dot)com, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, heikki(dot)linnakangas(at)enterprisedb(dot)com, cbbrowne(at)gmail(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, pgsql-hackers(at)postgresql(dot)org, sfrost(at)snowman(dot)net
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-10 17:45:33
Message-ID: CA+CSw_tT64CuakJZWnMkyYBtCmOqdrKhCZReMRdJKt+c+2PMDg@mail.gmail.com
Lists: pgsql-hackers

On Thu, May 10, 2012 at 6:33 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I'm worried this project is getting so complicated that it will be
> beyond the ability of a new hacker to get anything useful done.  Can
> we simplify the requirements here to something that is reasonable for
> a beginner?

It seems to me that the simplest thing to do would be to lift the
sampling done in analyze.c (acquire_sample_rows) and use that to
implement the SYSTEM sampling method. The language in the standard
leads me to believe that Vitter's algorithm used in analyze.c is
exactly what was intended by the authors. The difference between
Vitter's algorithm and a pure Bernoulli process is precisely that
Vitter's method increases the chance of seeing multiple rows picked
from a single page.
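For reference, the row-level part of acquire_sample_rows is a
reservoir sample. Stripped of the block-level stage and of the
skip-distance optimization (Vitter's Algorithm Z), the basic
technique (Algorithm R) looks roughly like this hypothetical sketch:

```python
import random

def reservoir_sample(rows, k, rng=random):
    """Keep a uniform random sample of k items from a stream of
    unknown length in one pass (Vitter's Algorithm R)."""
    sample = []
    for t, row in enumerate(rows):
        if t < k:
            sample.append(row)        # fill the reservoir first
        else:
            j = rng.randrange(t + 1)  # random index in [0, t]
            if j < k:
                sample[j] = row       # replace with probability k/(t+1)
    return sample
```

The fixed reservoir size k is exactly the "fixed number of rows"
requirement discussed below.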

One tricky issue is that tablesample is defined in terms of a
percentage of the underlying table while Vitter's algorithm needs a
fixed number of rows. The standard does state that the result needs to
contain "approximately" the stated percentage of rows. I'm not sure if
calculating the number of rows to return from reltuples would fit that
definition of approximate. If not, would re-estimating reltuples after
sampling and taking appropriate corrective action make it better, or
would an accurate number be necessary? Getting an accurate number
efficiently would require solving the COUNT(*) issue.
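Concretely, the reltuples-based target would be computed roughly as
below (a hypothetical helper; the point of the question above is that
reltuples is only the planner's estimate, so the resulting count is
itself approximate):

```python
def target_sample_size(reltuples_estimate, percent):
    """Turn TABLESAMPLE's percentage into the fixed row count a
    reservoir-style sampler needs, using the planner's estimate
    of the table's row count."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("sample percentage must be between 0 and 100")
    return int(round(reltuples_estimate * percent / 100.0))
```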

For the Bernoulli case I can't think of anything simple that would do
better than just scanning the table or poking it with random TIDs (the
latter has the same problem of estimating the desired result set
size). It seems to me that Vitter's approach could be amended to
produce independent samples by selecting slightly more pages than
result tuples and tweaking the acceptance levels to cancel out the
bias. But that definitely isn't in the territory of simple and would
require rigorous statistical analysis.
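For comparison, BERNOULLI semantics amount to an independent coin flip
per row during the scan, which is why the result size is only
approximate (a hypothetical sketch, not the eventual executor code):

```python
import random

def bernoulli_sample(rows, percent, rng=random):
    """Include each row independently with probability percent/100.
    The output size is binomially distributed, so it only
    approximates the requested fraction of the input."""
    p = percent / 100.0
    return [row for row in rows if rng.random() < p]
```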

And as for the monetary unit sampling, I agree that this is better
left as an optional extra.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Ants Aasma" <ants(at)cybertec(dot)at>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: <josh(at)agliodbs(dot)com>,<andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <cbbrowne(at)gmail(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, "Qi Huang" <huangqiyx(at)hotmail(dot)com>, "Florian Pflug" <fgp(at)phlo(dot)org>, <pgsql-hackers(at)postgresql(dot)org>, <sfrost(at)snowman(dot)net>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-10 18:07:06
Message-ID: 4FABBD7A0200002500047AF4@gw.wicourts.gov
Lists: pgsql-hackers

Ants Aasma <ants(at)cybertec(dot)at> wrote:

> It seems to me that the simplest thing to do would be to lift the
> sampling done in analyze.c (acquire_sample_rows) and use that to
> implement the SYSTEM sampling method.

Definitely. I thought we had all agreed on that ages ago.

-Kevin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Ants Aasma <ants(at)cybertec(dot)at>, josh(at)agliodbs(dot)com, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, heikki(dot)linnakangas(at)enterprisedb(dot)com, cbbrowne(at)gmail(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, Qi Huang <huangqiyx(at)hotmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org, sfrost(at)snowman(dot)net
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-10 18:30:35
Message-ID: CA+TgmoY1P+WSpkZn-wi2P2Oz9S0p0KrKDqDsPfq5r0Y0F6M79w@mail.gmail.com
Lists: pgsql-hackers

On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Ants Aasma <ants(at)cybertec(dot)at> wrote:
>> It seems to me that the simplest thing to do would be to lift the
>> sampling done in analyze.c (acquire_sample_rows) and use that to
>> implement the SYSTEM sampling method.
>
> Definitely.  I thought we had all agreed on that ages ago.

Right, and I don't think we should be considering any of this other
stuff until that basic thing is implemented and working.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Florian Pflug <fgp(at)phlo(dot)org>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, <josh(at)agliodbs(dot)com>, <andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <ants(at)cybertec(dot)at>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <cbbrowne(at)gmail(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, "Qi Huang" <huangqiyx(at)hotmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>, <sfrost(at)snowman(dot)net>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-11 00:38:43
Message-ID: 73B6F18C-C210-4BC2-BEA2-AE9FF50B71B3@phlo.org
Lists: pgsql-hackers

On May10, 2012, at 18:36 , Kevin Grittner wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> I wonder if you could do this with something akin to the Bitmap
>> Heap Scan machinery. Populate a TID bitmap with a bunch of
>> randomly chosen TIDs, fetch them all in physical order
>> and if you don't get as many rows as you need, rinse and repeat
>> until you do.
>
> Ay, there's the rub. If you get too many, it is important that you
> read all the way to the end and then randomly omit some of them.

Why is that? From a statistical point of view it shouldn't matter
whether you pick N random samples, or pick M >= N random samples and
then randomly pick N from M. (random implying uniformly distributed
here).
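That equivalence is easy to check empirically. In this toy simulation
(hypothetical code), thinning M = 6 uniform draws down to N = 3 leaves
each element of a 10-element population with inclusion probability
3/10, just as a direct 3-element draw would:

```python
import random

def two_stage_sample(population, m, n, rng):
    """Draw M uniform samples without replacement, then uniformly
    thin them down to N -- statistically the same as drawing N."""
    first = rng.sample(population, m)
    return rng.sample(first, n)

rng = random.Random(1)
population = list(range(10))
trials = 20000
hits = sum(0 in two_stage_sample(population, 6, 3, rng)
           for _ in range(trials))
# The observed inclusion frequency of element 0 should be close
# to 3/10.
frequency = hits / trials
```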

> While a bit of a bother, that's pretty straightforward and should be
> pretty fast, assuming you're not, like, an order of magnitude high.
> But falling short is tougher; making up the difference could be an
> iterative process, which could always wind up with having you read
> all tuples in the table without filling your sample.

But the likelihood of that happening is extremely low, no? Unless the
sampling percentage is very high, that is, but that case isn't of much
practical importance anyway.

But something else comes to mind. Does the standard permit samples taken
with the BERNOULLI method to contain the same tuple multiple times? If
not, any kind of TID-based approach will have to track all previously
fetched TIDs, which seems doable but unfortunate...
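The duplicate avoidance would look roughly like this: keep probing
random TIDs but remember what has already been tried (a hypothetical
sketch; `fetch_tid` stands in for the actual heap fetch and is assumed
to return None for a dead or unused TID):

```python
import random

def sample_distinct_tids(n, fetch_tid, max_tid, rng=random):
    """Collect n rows by probing random TIDs, remembering every TID
    already tried so no tuple can enter the sample twice.  Gives up
    once every candidate TID has been probed."""
    tried = set()
    rows = []
    while len(rows) < n and len(tried) < max_tid:
        tid = rng.randrange(max_tid)
        if tid in tried:
            continue
        tried.add(tid)
        row = fetch_tid(tid)  # None means a dead or unused TID
        if row is not None:
            rows.append(row)
    return rows
```

The `tried` set is the unfortunate part: for a large sample of a large
table it grows in proportion to the number of probes.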

best regards,
Florian Pflug


From: Sandro Santilli <strk(at)keybit(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Ants Aasma <ants(at)cybertec(dot)at>, josh(at)agliodbs(dot)com, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, heikki(dot)linnakangas(at)enterprisedb(dot)com, cbbrowne(at)gmail(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, Qi Huang <huangqiyx(at)hotmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org, sfrost(at)snowman(dot)net
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-11 13:59:51
Message-ID: 20120511135951.GG7698@gnash
Lists: pgsql-hackers

On Thu, May 10, 2012 at 02:30:35PM -0400, Robert Haas wrote:
> On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> > Ants Aasma <ants(at)cybertec(dot)at> wrote:
> >> It seems to me that the simplest thing to do would be to lift the
> >> sampling done in analyze.c (acquire_sample_rows) and use that to
> >> implement the SYSTEM sampling method.
> >
> > Definitely.  I thought we had all agreed on that ages ago.
>
> Right, and I don't think we should be considering any of this other
> stuff until that basic thing is implemented and working.

Agreed. That's what I'd love to see as well, for the GIS part.

--strk;

,------o-.
| __/ | Delivering high quality PostGIS 2.0 !
| / 2.0 | http://strk.keybit.net - http://vizzuality.com
`-o------'


From: Qi Huang <huangqiyx(at)hotmail(dot)com>
To: <strk(at)keybit(dot)net>, <robertmhaas(at)gmail(dot)com>
Cc: <kevin(dot)grittner(at)wicourts(dot)gov>, <ants(at)cybertec(dot)at>, <josh(at)agliodbs(dot)com>, <andres(at)anarazel(dot)de>, <alvherre(at)commandprompt(dot)com>, <heikki(dot)linnakangas(at)enterprisedb(dot)com>, <cbbrowne(at)gmail(dot)com>, <neil(dot)conway(at)gmail(dot)com>, <daniel(at)heroku(dot)com>, <fgp(at)phlo(dot)org>, <pgsql-hackers(at)postgresql(dot)org>, <sfrost(at)snowman(dot)net>
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-14 02:42:19
Message-ID: BAY159-W5429F8EC1530626002847FA31A0@phx.gbl
Lists: pgsql-hackers


> Date: Fri, 11 May 2012 15:59:51 +0200
> From: strk(at)keybit(dot)net
> To: robertmhaas(at)gmail(dot)com
> CC: Kevin(dot)Grittner(at)wicourts(dot)gov; ants(at)cybertec(dot)at; josh(at)agliodbs(dot)com; andres(at)anarazel(dot)de; alvherre(at)commandprompt(dot)com; heikki(dot)linnakangas(at)enterprisedb(dot)com; cbbrowne(at)gmail(dot)com; neil(dot)conway(at)gmail(dot)com; daniel(at)heroku(dot)com; huangqiyx(at)hotmail(dot)com; fgp(at)phlo(dot)org; pgsql-hackers(at)postgresql(dot)org; sfrost(at)snowman(dot)net
> Subject: Re: [HACKERS] Gsoc2012 idea, tablesample
>
> On Thu, May 10, 2012 at 02:30:35PM -0400, Robert Haas wrote:
> > On Thu, May 10, 2012 at 2:07 PM, Kevin Grittner
> > <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> > > Ants Aasma <ants(at)cybertec(dot)at> wrote:
> > >> It seems to me that the simplest thing to do would be to lift the
> > >> sampling done in analyze.c (acquire_sample_rows) and use that to
> > >> implement the SYSTEM sampling method.
> > >
> > > Definitely. I thought we had all agreed on that ages ago.
> >
> > Right, and I don't think we should be considering any of this other
> > stuff until that basic thing is implemented and working.
>
> Agreed. That's what I'd love to see as well, for the GIS part.
>
> --strk;

Thanks, guys, for your hot discussion. I'll explore the ANALYZE command first and try to make SYSTEM sampling work. Other parts, I'll look at them later.

Best Regards and Thanks
Huang Qi Victor
Computer Science, National University of Singapore


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Qi Huang <huangqiyx(at)hotmail(dot)com>
Cc: strk(at)keybit(dot)net, robertmhaas(at)gmail(dot)com, kevin(dot)grittner(at)wicourts(dot)gov, ants(at)cybertec(dot)at, josh(at)agliodbs(dot)com, andres(at)anarazel(dot)de, alvherre(at)commandprompt(dot)com, heikki(dot)linnakangas(at)enterprisedb(dot)com, cbbrowne(at)gmail(dot)com, neil(dot)conway(at)gmail(dot)com, daniel(at)heroku(dot)com, fgp(at)phlo(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Gsoc2012 idea, tablesample
Date: 2012-05-14 11:57:15
Message-ID: 20120514115715.GJ1267@tamriel.snowman.net
Lists: pgsql-hackers

* Qi Huang (huangqiyx(at)hotmail(dot)com) wrote:
> Thanks, guys, for your hot discussion. I'll explore the ANALYZE command first and try to make SYSTEM sampling work. Other parts, I'll look at them later.

That sounds like the right first steps to me. Reviewing this
discussion, it was my thought to create a new node, a la seqscan, which
implemented analyze's algorithm for scanning the table. The eventual
idea being that analyze would actually use it in the future. There was
mention up-thread about just calling the analyze code. Personally, I'd
rather we make analyze more SQL-like (once we have this functionality
implemented) than make things which are supposed to be SQL call out into
analyze bits.

Thoughts? Other opinions?

Thanks,

Stephen