From: Hans-Juergen Schoenig <postgres(at)cybertec(dot)at>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Planner estimates and cast operations ,...
Date: 2006-09-04 14:24:01
Message-ID: 8B9AC7D9-407E-4B72-90E0-E129AF370DD9@cybertec.at
Lists: pgsql-hackers

i am looking at a corner case which might also cause trouble for
other people.
consider the following:

SELECT some_timestamp::date FROM very_large_table GROUP BY
some_timestamp::date

my very_large_table has around 1 billion entries.
the problem is: the planner is using the (correct) per-column
estimates for the timestamp, and this rules out a HashAggregate
because the dataset seems too large for work_mem.
what the planner cannot know is that the number of distinct days is
quite limited (in my case around 1000 different values).
i wonder how to teach the planner to take the cast into consideration.

at the moment the planner uses the per-column statistics - it cannot
know that the cast might change the number of distinct values.
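
just to illustrate what the planner has to go on, this is roughly
where those statistics live (only a sketch - table and column names
are the ones from the query above):

-- per-column stats collected by ANALYZE; n_distinct describes the raw
-- timestamps, not the distinct dates the cast would produce
SELECT attname, n_distinct
FROM pg_stats
WHERE tablename = 'very_large_table' AND attname = 'some_timestamp';
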
how about the following?

Command: CREATE CAST
Description: define a new cast
Syntax:
CREATE CAST (sourcetype AS targettype)
[USING SELECTIVITY number | funcname(argtypes)]
WITH FUNCTION funcname (argtypes)
[ AS ASSIGNMENT | AS IMPLICIT ]

if it were possible to assign a constant or a function to the cast,
i think we could make the example above work. by default no
estimates would change. if somebody is doing some fancy query it
would be possible to tweak GROUP BY planning by assigning a cleverly
written function or a constant to the cast.

a constant would be useful for casts to boolean and the like.
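
just to make the idea concrete, a purely hypothetical example of the
proposed syntax (none of this exists today - the number and the cast
function are only placeholders):

-- hypothetical: tell the planner to expect at most ~1000 distinct
-- values once a timestamp has been cast to date
CREATE CAST (timestamp AS date)
    USING SELECTIVITY 1000
    WITH FUNCTION date(timestamp)
    AS ASSIGNMENT;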

does anybody have an idea that could help solve this issue?

best regards,

hans


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hans-Juergen Schoenig <postgres(at)cybertec(dot)at>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Planner estimates and cast operations ,...
Date: 2006-09-04 14:57:20
Message-ID: 23903.1157381840@sss.pgh.pa.us
Lists: pgsql-hackers

Hans-Juergen Schoenig <postgres(at)cybertec(dot)at> writes:
> consider the following:

> SELECT some_timestamp::date FROM very_large_table GROUP BY
> some_timestamp::date

> my very_large_table has around 1 billion entries.
> the problem is: the planner is using the (correct) per-column
> estimates for the timestamp, and this rules out a HashAggregate
> because the dataset seems too large for work_mem.
> what the planner cannot know is that the number of distinct days is
> quite limited (in my case around 1000 different values).
> i wonder how to teach the planner to take the cast into consideration.

Create an index on that expression.

regression=# create table foo(x) as select x * '864 sec'::interval + now()::timestamp from generate_series(1,10000) x;
SELECT
regression=# analyze foo;
ANALYZE
regression=# explain select x::date from foo group by x::date;
QUERY PLAN
---------------------------------------------------------------
HashAggregate (cost=205.00..330.00 rows=10000 width=8)
-> Seq Scan on foo (cost=0.00..180.00 rows=10000 width=8)
(2 rows)

regression=# create index fooi on foo((x::date));
CREATE INDEX
regression=# analyze foo;
ANALYZE
regression=# explain select x::date from foo group by x::date;
QUERY PLAN
---------------------------------------------------------------
HashAggregate (cost=205.00..206.26 rows=101 width=8)
-> Seq Scan on foo (cost=0.00..180.00 rows=10000 width=8)
(2 rows)

regression=#

I had to cheat a little bit here: I tried to do this example with a
timestamptz column, and the index creation failed because timestamptz to
date isn't immutable (it depends on TimeZone). If yours is too, you
could perhaps do something involving AT TIME ZONE to generate an
immutable conversion to date.
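
Something like this, perhaps (an untested sketch; assumes a timestamptz
column x as in the example above, pins the conversion to UTC, and the
index name is arbitrary):

-- AT TIME ZONE makes the result independent of the TimeZone setting,
-- so the whole expression is immutable and can be indexed
create index foo_date_idx on foo (((x at time zone 'UTC')::date));
analyze foo;
-- the query has to use the identical expression to pick up the stats
explain select (x at time zone 'UTC')::date from foo
  group by (x at time zone 'UTC')::date;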

It would perhaps make sense to provide a way to cue ANALYZE to compute
stats on expressions that aren't actually being indexed, but I see no
good reason to limit our attention to cast expressions.

regards, tom lane


From: Hans-Juergen Schoenig <postgres(at)cybertec(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Planner estimates and cast operations ,...
Date: 2006-09-04 15:19:37
Message-ID: 14AF07E7-CEFC-4BC3-96C5-CFC394D7DE87@cybertec.at
Lists: pgsql-hackers

hi tom ...

i thought about creating an index on the expression, but the problem
is that this is hardly feasible.
in 8.0 (what i have here) this would block the table, and i would run
out of disk space as well. this is a 600 GB beast :(

what about the planner approach?
this would solve some other issues as well. an index
might not be flexible enough :(.

many thanks,

hans

On Sep 4, 2006, at 4:57 PM, Tom Lane wrote:

> Hans-Juergen Schoenig <postgres(at)cybertec(dot)at> writes:
>> consider the following:
>
>> SELECT some_timestamp::date FROM very_large_table GROUP BY
>> some_timestamp::date
>
>> my very_large_table has around 1 billion entries.
>> the problem is: the planner is using the (correct) per-column
>> estimates for the timestamp, and this rules out a HashAggregate
>> because the dataset seems too large for work_mem.
>> what the planner cannot know is that the number of distinct days is
>> quite limited (in my case around 1000 different values).
>> i wonder how to teach the planner to take the cast into
>> consideration.
>
> Create an index on that expression.
>
> regression=# create table foo(x) as select x * '864 sec'::interval
> + now()::timestamp from generate_series(1,10000) x;
> SELECT
> regression=# analyze foo;
> ANALYZE
> regression=# explain select x::date from foo group by x::date;
> QUERY PLAN
> ---------------------------------------------------------------
> HashAggregate (cost=205.00..330.00 rows=10000 width=8)
> -> Seq Scan on foo (cost=0.00..180.00 rows=10000 width=8)
> (2 rows)
>
> regression=# create index fooi on foo((x::date));
> CREATE INDEX
> regression=# analyze foo;
> ANALYZE
> regression=# explain select x::date from foo group by x::date;
> QUERY PLAN
> ---------------------------------------------------------------
> HashAggregate (cost=205.00..206.26 rows=101 width=8)
> -> Seq Scan on foo (cost=0.00..180.00 rows=10000 width=8)
> (2 rows)
>
> regression=#
>
> I had to cheat a little bit here: I tried to do this example with a
> timestamptz column, and the index creation failed because
> timestamptz to
> date isn't immutable (it depends on TimeZone). If yours is too, you
> could perhaps do something involving AT TIME ZONE to generate an
> immutable conversion to date.
>
> It would perhaps make sense to provide a way to cue ANALYZE to compute
> stats on expressions that aren't actually being indexed, but I see no
> good reason to limit our attention to cast expressions.
>
> regards, tom lane


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Hans-Juergen Schoenig <postgres(at)cybertec(dot)at>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Planner estimates and cast operations ,...
Date: 2006-09-04 17:04:44
Message-ID: 20060904170444.GA8543@wolff.to
Lists: pgsql-hackers

On Mon, Sep 04, 2006 at 17:19:37 +0200,
Hans-Juergen Schoenig <postgres(at)cybertec(dot)at> wrote:
>
> i thought about creating an index on the expression, but the problem
> is that this is hardly feasible.
> in 8.0 (what i have here) this would block the table, and i would run

That may be hard to deal with.

> out of disk space as well. this is a 600 GB beast :(

I wouldn't expect this to be a problem. If you have 10^9 rows, I would expect
the index to be less than 10% of your current size. If you are so close to
your disk space limit that that is a problem, you have a problem in any case.

>
> what about the planner approach?
> this would solve the problem for some other issues as well. an index
> might not be flexible enough :(.

If you disable sorting you might be able to get it to switch plans. Lying
about the amount of work memory so that the planner thinks the hash
will fit in memory despite its misguessing the number of buckets might also
help.
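
Roughly like this (only a sketch; the work_mem figure is a placeholder,
and on 8.0 the value is in kB):

-- discourage the sort + GroupAggregate plan
SET enable_sort = off;
-- overstate available memory so the hash is expected to fit
SET work_mem = 1000000;
EXPLAIN SELECT some_timestamp::date
  FROM very_large_table
  GROUP BY some_timestamp::date;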


From: Hans-Juergen Schoenig <postgres(at)cybertec(dot)at>
To: Bruno Wolff III <bruno(at)wolff(dot)to>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Planner estimates and cast operations ,...
Date: 2006-09-04 17:09:16
Message-ID: 89BBA24F-84F9-4A12-934A-F26AC74B29AB@cybertec.at
Lists: pgsql-hackers


On Sep 4, 2006, at 7:04 PM, Bruno Wolff III wrote:

> On Mon, Sep 04, 2006 at 17:19:37 +0200,
> Hans-Juergen Schoenig <postgres(at)cybertec(dot)at> wrote:
>>
>> i thought about creating an index on the expression, but the problem
>> is that this is hardly feasible.
>> in 8.0 (what i have here) this would block the table, and i would run
>
> That may be hard to deal with.
>

it is ...
but the problem is not primarily that i have a problem with one
particular query - that can be worked around somehow. i am thinking
about GROUP BY and estimates in general here ...
just wondering if there is a chance to improve ...

>> out of disk space as well. this is a 600 GB beast :(
>
> I wouldn't expect this to be a problem. If you have 10^9 rows, I
> would expect
> the index to be less than 10% of your current size. If you are so
> close to
> your disk space limit that that is a problem, you have a problem in
> any case.
>

the index itself is not too large, but while building it up the data
gets written several times. that is no fun when dealing with so much
data ...

>>
>> what about the planner approach?
>> this would solve the problem for some other issues as well. an index
>> might not be flexible enough :(.
>
> If you disable sorting you might be able to get it to switch plans.
> Lying
> about the amount of work memory so that the planner thinks the hash
> will fit in memory despite its misguessing the number of buckets
> might also
> help.

setting work_mem to 2 GB does not help here ;)
i set it to the maximum value allowed on 8.0.
this was my first try too.
the problem is - there is no magic switch to mislead the planner a
little without hacking the system stats (which is not what people
should do, i would say ;) ).

my question is: is adding hooks for selectivity a feasible way of
dealing with things like that?

hans


From: Bruno Wolff III <bruno(at)wolff(dot)to>
To: Hans-Juergen Schoenig <postgres(at)cybertec(dot)at>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Planner estimates and cast operations ,...
Date: 2006-09-04 18:10:48
Message-ID: 20060904181048.GA26868@wolff.to
Lists: pgsql-hackers

On Mon, Sep 04, 2006 at 19:09:16 +0200,
Hans-Juergen Schoenig <postgres(at)cybertec(dot)at> wrote:
>
> setting work_mem to 2 GB does not help here ;)
> i set it to the maximum value allowed on 8.0.
> this was my first try too.
> the problem is - there is no magic switch to mislead the planner a
> little without hacking the system stats (which is not what people
> should do, i would say ;) ).

Did you combine that with telling it not to use sorts? I am not sure that
will really work for GROUP BY, but it is probably an easy test. You can
do an explain to see what it will try without actually running the query
in case it picks the poor plan again.

> my question is: is adding hooks for selectivity a feasible way of
> dealing with things like that?

I think the expectation is that you create a functional index and that's
how you would tell the system to keep stats for particular functions. I
don't think data on the most common values are kept now for functional
indexes, but the index itself will still have clues about the data.
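
One way to check what ANALYZE actually stored for an expression index
is to look it up in pg_stats under the index's name (a sketch, using
the fooi index from Tom's example):

-- expression-index statistics are listed under the index name,
-- not under the table
SELECT tablename, attname, n_distinct, most_common_vals
  FROM pg_stats
 WHERE tablename = 'fooi';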