Patch Review: Collect frequency statistics and selectivity estimation for arrays

Lists: pgsql-hackers
From: Nathan Boley <npboley(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Patch Review: Collect frequency statistics and selectivity estimation for arrays
Date: 2011-07-14 22:13:10
Message-ID: CAHetpQSDG85YH41URD33+C34MnbSjkJgo0EG6cUhqYFb41HF6Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Review of patch: https://commitfest.postgresql.org/action/patch_view?id=539

I glanced through the code and played around with the feature and,
in general, it looks pretty good. But I have a some comments/questions.

Design:

First, it makes me uncomfortable that you are using the MCV and histogram slot
kinds in a way that is very different from other data types.

I realize that tsvector uses MCV in the same way that you do but:

1) I don't like that very much either.
2) TS vector is different in that equality ( in the btree sense )
doesn't make sense, whereas it does for arrays.

Using the histogram slot for the array lengths is also very surprising to me.

Why not just use a new STA_KIND? It's not like we are running out of
room, and this will be the second 'container' type that splits the container
and stores stats about the elements.

Second, I'm fairly unclear about what the actual underlying statistical
method is and what assumptions it makes. Before I can really review
the method, I will probably need to see some documentation, as I say below.
Do you have any tests/simulations that explore the estimation procedure
that I can look at? When will it fail? In what cases does it improve
estimation?

Documentation:

Seems a bit lacking - especially if you keep the non standard usage of
mcv/histograms. Also, it would be really nice to see some update to
the row-estimation-examples page ( in chapter 55 ).

The code comments are good in general, but there are not enough high level
comments . It would be really nice to have a comment that gives an overview
of what each of the following functions are supposed to do:

mcelem_array_selec
mcelem_array_contain_overlap_selec

and especially

mcelem_array_contained_selec

Also, in the compute_array_stats you reference compute_tsvector_stats for
the algorithm, but I think that it would be smarter to either copy the
relevant portions of the comment over or to reference a published document.
If compute_tsvector_stats changes, I doubt that anyone will remember to
update the comment.

Finally, it would be really nice to see, explicitly, and in
mathematical notation

1) The data that is being collected and summarized. ( the statistic )
2) The estimate being derived from the statistic ( ie the selectivity )
i) Any assumptions being made ( ie independence of elements within
an array )

For instance, the only note I could find that actually addressed the
estimator and
assumptions underneath it was

+ * mult * dist[i] / mcelem_dist[i] gives us probability probability
+ * of qual matching from assumption of independent element occurence
+ * with condition that length = i.

and that's pretty cryptic. That should be expanded upon and placed
more prominently.

A couple nit picks:

1) In calc_distr you go to some lengths to avoid round off errors. Since it is
certainly just the order of the estimate that matters, why not just
perform the calculation in log space?

2) compute_array_stats is really long. Could you break it up? ( I know that
the other analyze functions are the same way, but why continue in
that vein? )

Best,
Nathan


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Nathan Boley <npboley(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch Review: Collect frequency statistics and selectivity estimation for arrays
Date: 2011-07-15 07:40:12
Message-ID: CAPpHfdu_Hn3++doYxAh1sYKUwyH6Pk0jnis6r=RkeMXE5Wb20A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi!

Thank you for review. I've few questions about it.

On Fri, Jul 15, 2011 at 2:13 AM, Nathan Boley <npboley(at)gmail(dot)com> wrote:

> First, it makes me uncomfortable that you are using the MCV and histogram
> slot
> kinds in a way that is very different from other data types.
>
> I realize that tsvector uses MCV in the same way that you do but:
>
> 1) I don't like that very much either.
> 2) TS vector is different in that equality ( in the btree sense )
> doesn't make sense, whereas it does for arrays.
>
> Using the histogram slot for the array lengths is also very surprising to
> me.
>
> Why not just use a new STA_KIND? It's not like we are running out of
> room, and this will be the second 'container' type that splits the
> container
> and stores stats about the elements.
>
Thus, do you think we should collect both btree and frequency/length
statistics for arrays?

> 1) In calc_distr you go to some lengths to avoid round off errors. Since it
> is
> certainly just the order of the estimate that matters, why not just
> perform the calculation in log space?
>
It seems to me that I didn't anything to avoid round off errors there...

------
With best regards,
Alexander Korotkov.


From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Nathan Boley <npboley(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch Review: Collect frequency statistics and selectivity estimation for arrays
Date: 2011-07-15 09:09:42
Message-ID: CAPpHfdvM621F_hfLUUHrA+Kf7MnzH15BjjrznnGM0u1TCr421Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jul 15, 2011 at 2:13 AM, Nathan Boley <npboley(at)gmail(dot)com> wrote:

> First, it makes me uncomfortable that you are using the MCV and histogram
> slot
> kinds in a way that is very different from other data types.
>
> I realize that tsvector uses MCV in the same way that you do but:
>
> 1) I don't like that very much either.
> 2) TS vector is different in that equality ( in the btree sense )
> doesn't make sense, whereas it does for arrays.
>
Actually, not MCV slot is used but MCELEM. It is a different slot. ps_stats
view map both into same fileds. Surely, non-standard use of histogram slot
should be avoided.

------
With best regards,
Alexander Korotkov.


From: Nathan Boley <npboley(at)gmail(dot)com>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Patch Review: Collect frequency statistics and selectivity estimation for arrays
Date: 2011-07-15 22:03:41
Message-ID: CAHetpQSG1AjmkBWf3SN547URVDhFV=EoQf=OpZOP9_E==MB11g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Actually, not MCV slot is used but MCELEM. It is a different slot. ps_stats
> view map both into same fileds.

Yes, you're right. I am sorry about the error.

> Surely, non-standard use of histogram slot
> should be avoided.

Agreed.

Best,
Nathan