Stats for multi-column indexes

Lists: pgsql-hackers
From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Stats for multi-column indexes
Date: 2007-03-20 00:56:00
Message-ID: 1174352160.23455.453.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I know the idea has come up a few times to do cross-column statistics to
improve plans when the data distributions are dependent. I found a
couple references in the archives:

http://archives.postgresql.org/pgsql-hackers/2006-09/msg02118.php
http://archives.postgresql.org/pgsql-hackers/2006-08/msg00924.php

We can already keep stats for a functional index. Is there a reason we
can't keep stats for a multi-column index?

Out of the two situations outlined by Jim Nasby here:

http://archives.postgresql.org/pgsql-hackers/2006-08/msg00948.php

It would not help with joins, but it would help with columns that are
referred to as a group (e.g. a=1 AND b<20).

Regards,
Jeff Davis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Stats for multi-column indexes
Date: 2007-03-20 01:24:50
Message-ID: 2249.1174353890@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> We can already keep stats for a functional index. Is there a reason we
> can't keep stats for a multi-column index?

The questions that need to be answered are (1) what stats are you gonna
collect, and (2) exactly what are you going to do with them when you
have 'em?

All the previous discussions have stalled on the question of how to
avoid trying to collect stats about an exponentially large number of
column combinations; we've never even reached the question of what
stats we'd actually want given that a particular combination has been
determined to be interesting. Perhaps that's a trivial question,
but it's been a mighty long time since I took statistics ...

regards, tom lane


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Stats for multi-column indexes
Date: 2007-03-20 01:55:56
Message-ID: 1174355756.23455.470.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2007-03-19 at 21:24 -0400, Tom Lane wrote:
> Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> > We can already keep stats for a functional index. Is there a reason we
> > can't keep stats for a multi-column index?
>
> The questions that need to be answered are (1) what stats are you gonna
> collect, and (2) exactly what are you going to do with them when you
> have 'em?
>
> All the previous discussions have stalled on the question of how to
> avoid trying to collect stats about an exponentially large number of
> column combinations; we've never even reached the question of what
> stats we'd actually want given that a particular combination has been
> determined to be interesting. Perhaps that's a trivial question,
> but it's been a mighty long time since I took statistics ...
>

I know we can't keep stats on every combination of columns. My initial
idea would be to only keep stats about a multi-column index (and
probably optional for those, too).

My thinking was that we could keep a histogram (and MCVs, etc.) of the
non-scalar key in the multi-column index. That would provide the data
the planner needs to answer a query like "WHERE a = 1 and b < 1000" if a
and b are dependent and you have an index on (a,b).

It seemed within reach to me initially because I could use a functional
index (in which the function turns multiple values into a comparable
scalar) and postgresql would index that and keep stats. And when it has
those stats, it makes the correct plan. Of course, I have to litter the
SQL with unnecessary function calls (so that it can use the functional
index), which makes this undesirable.

AndrewSN pointed out on IRC that keeping a histogram of non-scalar
values is not as easy as I thought, because PostgreSQL doesn't allow
arrays of composite types, among other problems.

Is this a worthwhile area of exploration?

Regards,
Jeff Davis


From: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Stats for multi-column indexes
Date: 2007-03-20 02:14:29
Message-ID: 45FF4385.5010800@paradise.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Davis wrote:
>
> I know we can't keep stats on every combination of columns. My initial
> idea would be to only keep stats about a multi-column index (and
> probably optional for those, too).
>

Maybe you would want to keep single column indexes too, so that (more)
accurate estimates for bitmap-and type plans are possible.

Cheers

Mark


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Mark Kirkwood" <markir(at)paradise(dot)net(dot)nz>
Cc: "Jeff Davis" <pgsql(at)j-davis(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Stats for multi-column indexes
Date: 2007-03-20 09:03:58
Message-ID: 1174381438.4160.857.camel@silverbirch.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2007-03-20 at 14:14 +1200, Mark Kirkwood wrote:
> Jeff Davis wrote:
> >
> > I know we can't keep stats on every combination of columns. My initial
> > idea would be to only keep stats about a multi-column index (and
> > probably optional for those, too).
> >
>
> Maybe you would want to keep single column indexes too, so that (more)
> accurate estimates for bitmap-and type plans are possible.

We should allow the DBA to specify which groups of cols to keep
statistics on, if there is no index on that group.

That solves the combinatorial explosion problem.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Richard Huxton <dev(at)archonet(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Stats for multi-column indexes
Date: 2007-03-20 09:30:54
Message-ID: 45FFA9CE.8010305@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote:
> On Tue, 2007-03-20 at 14:14 +1200, Mark Kirkwood wrote:
>> Jeff Davis wrote:
>>> I know we can't keep stats on every combination of columns. My initial
>>> idea would be to only keep stats about a multi-column index (and
>>> probably optional for those, too).
>>>
>> Maybe you would want to keep single column indexes too, so that (more)
>> accurate estimates for bitmap-and type plans are possible.
>
> We should allow the DBA to specify which groups of cols to keep
> statistics on, if there is no index on that group.
>
> That solves the combinatorial explosion problem.

This is one hint I think everyone can agree on. Being able to say that
values in different columns are related just gives the planner more
information to work with.

--
Richard Huxton
Archonet Ltd


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Stats for multi-column indexes
Date: 2007-03-20 12:42:08
Message-ID: 20070320124208.GL24234@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Richard Huxton wrote:
> Simon Riggs wrote:
> >On Tue, 2007-03-20 at 14:14 +1200, Mark Kirkwood wrote:
> >>Jeff Davis wrote:
> >>>I know we can't keep stats on every combination of columns. My initial
> >>>idea would be to only keep stats about a multi-column index (and
> >>>probably optional for those, too).
> >>>
> >>Maybe you would want to keep single column indexes too, so that (more)
> >>accurate estimates for bitmap-and type plans are possible.
> >
> >We should allow the DBA to specify which groups of cols to keep
> >statistics on, if there is no index on that group.
> >
> >That solves the combinatorial explosion problem.
>
> This is one hint I think everyone can agree on. Being able to say that
> values in different columns are related just gives the planner more
> information to work with.

It was also suggested that column pairs in FK relationship could be
automatically enabled. So you don't need to specify those manually.

Now, the hard question is deciding what to keep track of. I don't think
MCV makes much sense, because what's the MCV of two columns? Some sort
of correlation index would seem to make certain sense.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Richard Huxton <dev(at)archonet(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Jeff Davis <pgsql(at)j-davis(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Stats for multi-column indexes
Date: 2007-03-20 14:37:21
Message-ID: 200703201537.23798.peter_e@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera wrote:
> Now, the hard question is deciding what to keep track of. I don't
> think MCV makes much sense, because what's the MCV of two columns?

The combination that occurs most often.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Richard Huxton <dev(at)archonet(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Stats for multi-column indexes
Date: 2007-03-20 15:28:52
Message-ID: 14054.1174404532@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> It was also suggested that column pairs in FK relationship could be
> automatically enabled. So you don't need to specify those manually.

Actually, I think you don't particularly need stats for that in most
cases --- if the planner simply took note that the FK relationship
exists, it would know that each row of the FK side joins to exactly
one row of the PK side, which in typical cases is sufficient.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Richard Huxton <dev(at)archonet(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Stats for multi-column indexes
Date: 2007-03-20 17:12:14
Message-ID: 460015EE.90900@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom,

> Actually, I think you don't particularly need stats for that in most
> cases --- if the planner simply took note that the FK relationship
> exists, it would know that each row of the FK side joins to exactly
> one row of the PK side, which in typical cases is sufficient.

Is it? What about the other direction? Currently, doesn't the planner
assume that the rowcount relationship is 1 to ( child total rows /
parent total rows) ? That's ok for tables with relatively even
distribution, but not for skewed ones.

--Josh Berkus


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Stats for multi-column indexes
Date: 2007-03-20 17:18:54
Message-ID: 1174411134.23455.506.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2007-03-20 at 09:03 +0000, Simon Riggs wrote:
> We should allow the DBA to specify which groups of cols to keep
> statistics on, if there is no index on that group.
>
> That solves the combinatorial explosion problem.
>

I think it would be a good first step if we could just keep stats on
multiple columns in the same table. If we can do more than that, great.

We could probably keep stats on multiple columns across different
tables, but I don't know how those statistics should be used. Using
statistics to estimate joins seems like a tricky problem. Maybe it's
already solved with known algorithms?

Regards,
Jeff Davis


From: Jeff Davis <pgsql(at)j-davis(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Richard Huxton <dev(at)archonet(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Stats for multi-column indexes
Date: 2007-03-20 17:46:56
Message-ID: 1174412816.23455.521.camel@dogma.v10.wvs
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2007-03-20 at 18:12 +0100, Josh Berkus wrote:
> Tom,
>
> > Actually, I think you don't particularly need stats for that in most
> > cases --- if the planner simply took note that the FK relationship
> > exists, it would know that each row of the FK side joins to exactly
> > one row of the PK side, which in typical cases is sufficient.
>
> Is it? What about the other direction? Currently, doesn't the planner
> assume that the rowcount relationship is 1 to ( child total rows /
> parent total rows) ? That's ok for tables with relatively even
> distribution, but not for skewed ones.
>

In theory, the PK constrains the available values of the FK, but doesn't
provide any additional information about the relationship between the
columns.

However, in practice there is limited space to store MCVs and limited
accuracy to n_distinct. So there may be a reason to store more
information, but I don't know what we'd store. Do we have reports of bad
estimates by the planner in this situation?

Regards,
Jeff Davis


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Richard Huxton <dev(at)archonet(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Stats for multi-column indexes
Date: 2007-03-20 17:49:06
Message-ID: 15589.1174412946@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> Actually, I think you don't particularly need stats for that in most
>> cases --- if the planner simply took note that the FK relationship
>> exists, it would know that each row of the FK side joins to exactly
>> one row of the PK side, which in typical cases is sufficient.

> Is it? What about the other direction?

I recall that we had decided at the Greenplum meeting last year that
we could use a better heuristic if we noted that a join was being done
on an FK-and-PK combination, but I don't recall the details right at
the moment. Did anyone take notes?

regards, tom lane


From: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Stats for multi-column indexes
Date: 2007-03-20 19:05:50
Message-ID: 20070320190549.GH83523@nasby.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Mar 19, 2007 at 06:55:56PM -0700, Jeff Davis wrote:
> On Mon, 2007-03-19 at 21:24 -0400, Tom Lane wrote:
> > Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> > > We can already keep stats for a functional index. Is there a reason we
> > > can't keep stats for a multi-column index?
> >
> > The questions that need to be answered are (1) what stats are you gonna
> > collect, and (2) exactly what are you going to do with them when you
> > have 'em?
> >
> > All the previous discussions have stalled on the question of how to
> > avoid trying to collect stats about an exponentially large number of
> > column combinations; we've never even reached the question of what
> > stats we'd actually want given that a particular combination has been
> > determined to be interesting. Perhaps that's a trivial question,
> > but it's been a mighty long time since I took statistics ...
> >
>
> I know we can't keep stats on every combination of columns. My initial
> idea would be to only keep stats about a multi-column index (and
> probably optional for those, too).
>
> My thinking was that we could keep a histogram (and MCVs, etc.) of the
> non-scalar key in the multi-column index. That would provide the data
> the planner needs to answer a query like "WHERE a = 1 and b < 1000" if a
> and b are dependent and you have an index on (a,b).
<snip>
> AndrewSN pointed out on IRC that keeping a histogram of non-scalar
> values is not as easy as I thought, because PostgreSQL doesn't allow
> arrays of composite types, among other problems.

I don't think the array problem is that big a deal, since PostgreSQL
doesn't enforce array dimensions at all. You can just make the arrays
for multi-column stats 2 dimensional, though handling indexes with
different data types among the columns would be a bit tricky... right
now the only choice I can think of would be to require that values could
be cast to and from text and just store text in the array. Though
obviously it'd be better to just allow arrays of composite types...

The other challenge is that you can't make all the same assumptions with
a multi-field histogram that you can with a single-field one. For
example, if this is our index:

a b
- -
1 1
1 2
...
1 1000
2 500
2 501
...
3 5000

The histogram would likely position the buckets such that 1,1000 and
2,500 would fall within one bucket, which means the planner has no idea
that b doesn't exceed 1000 when a is 1. I'm not sure how big of an issue
that is in reality, though, because the planner does know that the
bucket can only represent so many rows.

It might be worth coming up with a different means to store the
histogram for the multi-column case.

> Is this a worthwhile area of exploration?

ISTM it trips people up often enough to make it worth at least
exploring...
--
Jim Nasby jim(at)nasby(dot)net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <jim(at)nasby(dot)net>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Stats for multi-column indexes
Date: 2007-03-20 19:19:15
Message-ID: 16545.1174418355@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Jim C. Nasby" <jim(at)nasby(dot)net> writes:
> It might be worth coming up with a different means to store the
> histogram for the multi-column case.

A separate array for each column involved seems a whole lot less
fragile than pretending we can handle mixed-type arrays.

We probably need a different catalog anyway, or at least a reimagining
of pg_statistic, since it doesn't hold more than one value of staattnum
per row.

regards, tom lane


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Richard Huxton <dev(at)archonet(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Jeff Davis <pgsql(at)j-davis(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Stats for multi-column indexes
Date: 2007-03-21 09:27:19
Message-ID: 1174469239.10829.53.camel@coppola.muc.ecircle.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2007-03-20 at 18:12, Josh Berkus wrote:
> Tom,
>
> > Actually, I think you don't particularly need stats for that in most
> > cases --- if the planner simply took note that the FK relationship
> > exists, it would know that each row of the FK side joins to exactly
> > one row of the PK side, which in typical cases is sufficient.
>
> Is it? What about the other direction? Currently, doesn't the planner
> assume that the rowcount relationship is 1 to ( child total rows /
> parent total rows) ? That's ok for tables with relatively even
> distribution, but not for skewed ones.

Wouldn't that be improved if the MCVs/histogram of the FK column are
taken into account ? Considering that the FK part is unique, the
skewness in the relationship is completely determined by the FK parts
histogram. That would give at least a lower/upper bound and MCVs to the
relationship.

Cheers,
Csaba.


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Richard Huxton <dev(at)archonet(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Mark Kirkwood <markir(at)paradise(dot)net(dot)nz>, Jeff Davis <pgsql(at)j-davis(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Stats for multi-column indexes
Date: 2007-03-21 09:47:13
Message-ID: 1174470433.10829.55.camel@coppola.muc.ecircle.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

This should read:

> Considering that the FK part is unique, the
^^PK^^
> skewness in the relationship is completely determined by the FK part's
> histogram. That would give at least a lower/upper bound and MCVs to the
> relationship.

Cheers,
Csaba.