Re: BUG #8598: Row count estimates of partial indexes

Lists: pgsql-bugs
From: marko(at)joh(dot)to
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #8598: Row count estimates of partial indexes
Date: 2013-11-17 12:58:18
Message-ID: E1Vi1ve-0005cP-Ax@wrigleys.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 8598
Logged by: Marko Tiikkaja
Email address: marko(at)joh(dot)to
PostgreSQL version: 9.1.9
Operating system: Linux
Description:

Hi,

We have the following partial index on a small subset of a larger table:

"index_transactions_transaction_balance_details" btree (transactionid)
WHERE NOT processed AND accountbalancesdailyid IS NOT NULL

However, querying with the WHERE clause completely ignores the
pg_class.reltuples value for the index:

=# EXPLAIN ANALYZE SELECT * FROM transactions WHERE NOT processed AND
accountbalancesdailyid IS NOT NULL;

QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using index_transactions_transaction_balance_details on
transactions (cost=0.00..3883160.47 rows=66259403 width=130) (actual
time=0.033..18.268 rows=13962 loops=1)
Total runtime: 18.874 ms
(2 rows)

.. which makes for some silly joins when this index is part of a larger
query.

Is this expected on 9.1? Has this been fixed in more recent versions?


From: Kevin Grittner <kgrittn(at)ymail(dot)com>
To: "marko(at)joh(dot)to" <marko(at)joh(dot)to>, "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #8598: Row count estimates of partial indexes
Date: 2013-11-17 15:34:10
Message-ID: 1384702450.20685.YahooMailNeo@web162902.mail.bf1.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

"marko(at)joh(dot)to" <marko(at)joh(dot)to> wrote:

>   "index_transactions_transaction_balance_details" btree
> (transactionid)
> WHERE NOT processed AND accountbalancesdailyid IS NOT NULL
>
> However, querying with the WHERE clause completely ignores the
> pg_class.reltuples value for the index:

What is the pg_class.reltuples value for the index?

> =# EXPLAIN ANALYZE SELECT * FROM transactions WHERE NOT processed
> AND accountbalancesdailyid IS NOT NULL;

> Index Scan using index_transactions_transaction_balance_details
> on transactions  (cost=0.00..3883160.47 rows=66259403 width=130)
> (actual time=0.033..18.268 rows=13962 loops=1)

> .. which makes for some silly joins when this index is part of a
> larger query.
>
> Is this expected on 9.1?  Has this been fixed in more recent
> versions?

Please provide a little more information:

SELECT version();
SELECT name, current_setting(name), source
  FROM pg_settings
  WHERE source NOT IN ('default', 'override');

What VACUUM or ANALYZE commands are run outside of autovacuum?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: marko(at)joh(dot)to
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #8598: Row count estimates of partial indexes
Date: 2013-11-17 16:29:15
Message-ID: 28474.1384705755@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

marko(at)joh(dot)to writes:
> We have the following partial index on a small subset of a larger table:
> "index_transactions_transaction_balance_details" btree (transactionid)
> WHERE NOT processed AND accountbalancesdailyid IS NOT NULL
> However, querying with the WHERE clause completely ignores the
> pg_class.reltuples value for the index:

Yup. Row count estimates are derived by estimating the selectivity of the
given WHERE clauses and multiplying by the (estimated) current table size.
In the particular case you show here, with a partial index that *exactly*
matches the WHERE clause, we could get a better answer by looking at the
index size --- but that doesn't scale to any less simplistic case, such
as a query with additional WHERE clauses.

It's also important to realize that reltuples for an index is a whole lot
less trustworthy than it is for a table; ANALYZE doesn't update the
former, for example. And scaling from the last-reported VACUUM stats
to current reality is going to be shakier.

So on the whole, I don't think this would be a good idea.

regards, tom lane


From: Marko Tiikkaja <marko(at)joh(dot)to>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #8598: Row count estimates of partial indexes
Date: 2013-11-17 19:55:30
Message-ID: 52891F32.8010902@joh.to
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On 11/17/13, 5:29 PM, Tom Lane wrote:
> marko(at)joh(dot)to writes:
>> We have the following partial index on a small subset of a larger table:
>> "index_transactions_transaction_balance_details" btree (transactionid)
>> WHERE NOT processed AND accountbalancesdailyid IS NOT NULL
>> However, querying with the WHERE clause completely ignores the
>> pg_class.reltuples value for the index:
>
> Yup. Row count estimates are derived by estimating the selectivity of the
> given WHERE clauses and multiplying by the (estimated) current table size.
> In the particular case you show here, with a partial index that *exactly*
> matches the WHERE clause, we could get a better answer by looking at the
> index size --- but that doesn't scale to any less simplistic case, such
> as a query with additional WHERE clauses.
>
> It's also important to realize that reltuples for an index is a whole lot
> less trustworthy than it is for a table; ANALYZE doesn't update the
> former, for example. And scaling from the last-reported VACUUM stats
> to current reality is going to be shakier.
>
> So on the whole, I don't think this would be a good idea.

Any suggestions for a workaround? When reading this index as a part of
a bigger query the horrible estimate ensures that nobody's having fun.
I currently have something like:

SELECT * FROM
(SELECT * FROM transactions WHERE <partial index> LIMIT 25000)
transactions
JOIN ..

And I *really* don't like that as a workaround.

Regards,
Marko Tiikkaja


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marko Tiikkaja <marko(at)joh(dot)to>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #8598: Row count estimates of partial indexes
Date: 2013-11-17 20:18:39
Message-ID: 16493.1384719519@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Marko Tiikkaja <marko(at)joh(dot)to> writes:
> Any suggestions for a workaround? When reading this index as a part of
> a bigger query the horrible estimate ensures that nobody's having fun.

Why is the estimate so bad? I suppose the answer is that those two
columns are very strongly correlated. Maybe you could refactor your
data representation to avoid that?

The long-term answer as far as Postgres is concerned is to learn about
cross-column correlations, but that's not happening in the near future.

regards, tom lane


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Marko Tiikkaja <marko(at)joh(dot)to>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pg Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #8598: Row count estimates of partial indexes
Date: 2013-11-17 21:06:17
Message-ID: CAMkU=1wdiLuJ5e9mOD8h_tVxXj45=6DOeRPd-MCkT42CV3_44Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Sun, Nov 17, 2013 at 11:55 AM, Marko Tiikkaja <marko(at)joh(dot)to> wrote:

> On 11/17/13, 5:29 PM, Tom Lane wrote:
>
>> marko(at)joh(dot)to writes:
>>
>>> We have the following partial index on a small subset of a larger table:
>>> "index_transactions_transaction_balance_details" btree
>>> (transactionid)
>>> WHERE NOT processed AND accountbalancesdailyid IS NOT NULL
>>> However, querying with the WHERE clause completely ignores the
>>> pg_class.reltuples value for the index:
>>>
>>
>> Yup. Row count estimates are derived by estimating the selectivity of the
>> given WHERE clauses and multiplying by the (estimated) current table size.
>> In the particular case you show here, with a partial index that *exactly*
>> matches the WHERE clause, we could get a better answer by looking at the
>> index size --- but that doesn't scale to any less simplistic case, such
>> as a query with additional WHERE clauses.
>>
>> It's also important to realize that reltuples for an index is a whole lot
>> less trustworthy than it is for a table; ANALYZE doesn't update the
>> former, for example. And scaling from the last-reported VACUUM stats
>> to current reality is going to be shakier.
>>
>> So on the whole, I don't think this would be a good idea.
>>
>
> Any suggestions for a workaround? When reading this index as a part of a
> bigger query the horrible estimate ensures that nobody's having fun. I
> currently have something like:
>

Define a new column which is true iff the where condition is true? It
sounds like that one magic combination has a meaning all of its own, so it
would make sense to encode it in one column.

>
> SELECT * FROM
> (SELECT * FROM transactions WHERE <partial index> LIMIT 25000)
> transactions
> JOIN ..
>
> And I *really* don't like that as a workaround.
>

I've wanted a function that always returns true, but which the planner
things returns false most of the time, for use in such situations. It
looks like you can make one of these with a compiled module (by creating an
operator and then wrapping that in a function), but I have not found a way
to do it without using C. (CREATE FUNCTION takes a COST and ROWS, but not
a SELECTIVITY.)

Cheers,

Jeff


From: Marko Tiikkaja <marko(at)joh(dot)to>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #8598: Row count estimates of partial indexes
Date: 2013-11-18 10:05:04
Message-ID: 5289E650.5070802@joh.to
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On 11/17/13 9:18 PM, Tom Lane wrote:
> Marko Tiikkaja <marko(at)joh(dot)to> writes:
>> Any suggestions for a workaround? When reading this index as a part of
>> a bigger query the horrible estimate ensures that nobody's having fun.
>
> Why is the estimate so bad? I suppose the answer is that those two
> columns are very strongly correlated. Maybe you could refactor your
> data representation to avoid that?

I'll look into that, thanks.

> The long-term answer as far as Postgres is concerned is to learn about
> cross-column correlations, but that's not happening in the near future.

I'm completely clueless about how the planner works, but wouldn't it be
easier to have some kind of separate stats for the conditions in partial
indexes? It seems better in all cases than trying infer the stats from
cross-column correlations, even if we had that.

Regards,
Marko Tiikkaja


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marko Tiikkaja <marko(at)joh(dot)to>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #8598: Row count estimates of partial indexes
Date: 2013-11-18 14:08:53
Message-ID: 9078.1384783733@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Marko Tiikkaja <marko(at)joh(dot)to> writes:
> On 11/17/13 9:18 PM, Tom Lane wrote:
>> The long-term answer as far as Postgres is concerned is to learn about
>> cross-column correlations, but that's not happening in the near future.

> I'm completely clueless about how the planner works, but wouldn't it be
> easier to have some kind of separate stats for the conditions in partial
> indexes? It seems better in all cases than trying infer the stats from
> cross-column correlations, even if we had that.

There's been some discussion of providing a way to hint to ANALYZE about
which combinations of columns are worth gathering cross-column statistics
for. But partial index predicates seem like a pretty bad mechanism for
that.

regards, tom lane


From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marko Tiikkaja <marko(at)joh(dot)to>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #8598: Row count estimates of partial indexes
Date: 2013-11-18 15:46:08
Message-ID: CAGTBQpb5GHCFUM0+AF7_=qKH5mYe3eZMkJLonDqMy-KTWDoFTQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Mon, Nov 18, 2013 at 11:08 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Marko Tiikkaja <marko(at)joh(dot)to> writes:
>> On 11/17/13 9:18 PM, Tom Lane wrote:
>>> The long-term answer as far as Postgres is concerned is to learn about
>>> cross-column correlations, but that's not happening in the near future.
>
>> I'm completely clueless about how the planner works, but wouldn't it be
>> easier to have some kind of separate stats for the conditions in partial
>> indexes? It seems better in all cases than trying infer the stats from
>> cross-column correlations, even if we had that.
>
> There's been some discussion of providing a way to hint to ANALYZE about
> which combinations of columns are worth gathering cross-column statistics
> for. But partial index predicates seem like a pretty bad mechanism for
> that.
>
> regards, tom lane

Why?

If there's a partial index on some predicate, it does mean the
predicate is of common occurence or at least important and it's quite
expectable that more precise estimations regarding those queries
valuable.

Analyze should simply record the selectivity of partial index
predicates as it would the MFV of the boolean variable equal to the
predicate's result, and modifying the MFV estimation code to look up
for those specific stats doesn't seem too difficult.