Re: Bad optimizer data for xml (WAS: xml data type implications of no =)

Lists: pgsql-bugs
From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: pgsql-bugs(at)postgresql(dot)org
Subject: xml data type implications of no =
Date: 2010-05-25 04:43:46
Message-ID: 4BFB5582.8000605@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Today I ran into some interesting consequences of the xml data type
being without an "=" operator. One I thought I'd post here because it
has a *possible* planner impact. I'm not sure it is actually a bug as
such, but this seemed the best forum to post in initially:

test=# \d bug
Table "public.bug"
Column | Type | Modifiers
--------+---------+-----------
id | integer |
val | xml |

test=# explain select val::text from bug;
QUERY PLAN
--------------------------------------------------------------
Seq Scan on bug (cost=0.00..58127.78 rows=1000278 width=32)

Note the width estimate. However a more realistic estimate for width is:

test=# select 8192/(reltuples/relpages) as width from pg_class where
relname='bug';
width
------------------
394.130431739976

So we are going to massively underestimate the "size" of such a dataset.
Now this appears to be a consequence of no "=" operator (std_typanalyze
in analyze.c bails if there isn't one), so the planner has no idea about
how wide 'val' actually is. I'm wondering if it is worth having at least
an "=" operator to enable some minimal stats to be available for xml
columns.

regards

Mark


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: xml data type implications of no =
Date: 2010-05-27 01:37:50
Message-ID: 4BFDCCEE.4010103@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On 25/05/10 16:43, Mark Kirkwood wrote:
> Today I ran into some interesting consequences of the xml data type
> being without an "=" operator. One I thought I'd post here because it
> has a *possible* planner impact. I'm not sure it is actually a bug as
> such, but this seemed the best forum to post in initially:
>
> test=# \d bug
> Table "public.bug"
> Column | Type | Modifiers
> --------+---------+-----------
> id | integer |
> val | xml |
>
> test=# explain select val::text from bug;
> QUERY PLAN
> --------------------------------------------------------------
> Seq Scan on bug (cost=0.00..58127.78 rows=1000278 width=32)
>
>
> Note the width estimate. However a more realistic estimate for width is:
>
> test=# select 8192/(reltuples/relpages) as width from pg_class where
> relname='bug';
> width
> ------------------
> 394.130431739976
>
> So we are going to massively underestimate the "size" of such a
> dataset. Now this appears to be a consequence of no "=" operator
> (std_typanalyze in analyze.c bails if there isn't one), so the planner
> has no idea about how wide 'val' actually is. I'm wondering if it is
> worth having at least an "=" operator to enable some minimal stats to
> be available for xml columns.
>

Adding a minimal = op (see attached) and an analyze results in:

test=# explain select val::text from bug;
QUERY PLAN
---------------------------------------------------------------
Seq Scan on bug (cost=0.00..62632.08 rows=1000008 width=385)

which gives a much better indication of dataset size.

Attachment Content-Type Size
xmleq.sql text/x-sql 409 bytes

From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bad optimizer data for xml (WAS: xml data type implications of no =)
Date: 2010-06-03 23:16:48
Message-ID: 4C0837E0.7060706@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On 27/05/10 13:37, Mark Kirkwood wrote:
> On 25/05/10 16:43, Mark Kirkwood wrote:
>> Today I ran into some interesting consequences of the xml data type
>> being without an "=" operator. One I thought I'd post here because it
>> has a *possible* planner impact. I'm not sure it is actually a bug as
>> such, but this seemed the best forum to post in initially:
>>
>> test=# \d bug
>> Table "public.bug"
>> Column | Type | Modifiers
>> --------+---------+-----------
>> id | integer |
>> val | xml |
>>
>> test=# explain select val::text from bug;
>> QUERY PLAN
>> --------------------------------------------------------------
>> Seq Scan on bug (cost=0.00..58127.78 rows=1000278 width=32)
>>
>>
>> Note the width estimate. However a more realistic estimate for width is:
>>
>> test=# select 8192/(reltuples/relpages) as width from pg_class where
>> relname='bug';
>> width
>> ------------------
>> 394.130431739976
>>
>> So we are going to massively underestimate the "size" of such a
>> dataset. Now this appears to be a consequence of no "=" operator
>> (std_typanalyze in analyze.c bails if there isn't one), so the
>> planner has no idea about how wide 'val' actually is. I'm wondering
>> if it is worth having at least an "=" operator to enable some minimal
>> stats to be available for xml columns.
>>
>
> Adding a minimal = op (see attached) and an analyze results in:
>
> test=# explain select val::text from bug;
> QUERY PLAN
> ---------------------------------------------------------------
> Seq Scan on bug (cost=0.00..62632.08 rows=1000008 width=385)
>
> which gives a much better indication of dataset size.
>
>
>

Maybe I gave this guy a bad title - is it a concern that the 'width'
estimate is so far off for xml datatypes (because of no = op)? It seemed
to me that this could result in some bad plan choices (e.g in subqueries
etc).

regards

Mark


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bad optimizer data for xml (WAS: xml data type implications of no =)
Date: 2010-06-09 03:22:46
Message-ID: AANLkTikXDv-OkulgAg8vegjMFiced3rgdHFUXSoNyKfD@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Thu, Jun 3, 2010 at 7:16 PM, Mark Kirkwood
<mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> wrote:
> Maybe I gave this guy a bad title - is it a concern that the 'width'
> estimate is so far off for xml datatypes (because of no = op)? It seemed to
> me that this could result in some bad plan choices (e.g in subqueries etc).

It's possible. I don't really see a reason not to add an = operator
for XML - does anyone else?

It would need to be done by updating src/include/catalog/pg_*.h,
rather than via SQL, of course.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bad optimizer data for xml (WAS: xml data type implications of no =)
Date: 2010-06-09 03:28:57
Message-ID: 4C0F0A79.7060609@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On 09/06/10 15:22, Robert Haas wrote:
> On Thu, Jun 3, 2010 at 7:16 PM, Mark Kirkwood
> <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> wrote:
>
>> Maybe I gave this guy a bad title - is it a concern that the 'width'
>> estimate is so far off for xml datatypes (because of no = op)? It seemed to
>> me that this could result in some bad plan choices (e.g in subqueries etc).
>>
> It's possible. I don't really see a reason not to add an = operator
> for XML - does anyone else?
>
> It would need to be done by updating src/include/catalog/pg_*.h,
> rather than via SQL, of course.
>
>

Heh, sure should - I merely included the SQL stuff in case anyone else
wanted to reproduce what I was seeing!

Cheers

Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bad optimizer data for xml (WAS: xml data type implications of no =)
Date: 2010-06-09 05:14:56
Message-ID: 16827.1276060496@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> It's possible. I don't really see a reason not to add an = operator
> for XML - does anyone else?

Yes, that was considered and rejected, IIRC. What is your definition
of equality for xml?

regards, tom lane


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bad optimizer data for xml (WAS: xml data type implications of no =)
Date: 2010-06-09 05:41:50
Message-ID: 4C0F299E.7000803@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On 09/06/10 17:14, Tom Lane wrote:
> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>
>> It's possible. I don't really see a reason not to add an = operator
>> for XML - does anyone else?
>>
> Yes, that was considered and rejected, IIRC. What is your definition
> of equality for xml?
>

Yes - but in that previous discussion the optimizer (lack of)
information was not considered (or known I suspect), so maybe a rethink
is worthwhile?

It seems that the nub of this issue is that there are conceptually two
types of =, one for datatype specific comparison, and one for optimizer
statistical information calculation. However the system allows only the
first, so if you don't (or can't) have one then you lose some possibly
important optimization data.

regards

Mark


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bad optimizer data for xml (WAS: xml data type implications of no =)
Date: 2010-06-09 14:17:41
Message-ID: 24204.1276093061@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> writes:
> It seems that the nub of this issue is that there are conceptually two
> types of =, one for datatype specific comparison, and one for optimizer
> statistical information calculation. However the system allows only the
> first, so if you don't (or can't) have one then you lose some possibly
> important optimization data.

Nonsense. ANALYZE and the optimizer work with the datatype's usual
notion of '=', whatever it is.

It's possible that we should install a simplified code path in analyze.c
that can collect width data for a column even in the absence of any '='
operator. I'm less than convinced that it's worth the trouble though.
Do you have an actual example where such data would have affected a
plan choice?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bad optimizer data for xml (WAS: xml data type implications of no =)
Date: 2010-06-09 14:53:03
Message-ID: AANLkTikKCRt6sGM6lpJFr-TvhTJeGpOQYbcUx6nbbPPg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On Wed, Jun 9, 2010 at 1:14 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> It's possible.  I don't really see a reason not to add an = operator
>> for XML - does anyone else?
>
> Yes, that was considered and rejected, IIRC.  What is your definition
> of equality for xml?

I'd vote for !memcmp().

There can be (and probably already are) other ways to test for other
kinds of equality, too, of course.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bad optimizer data for xml (WAS: xml data type implications of no =)
Date: 2010-06-09 15:08:22
Message-ID: 25105.1276096102@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Jun 9, 2010 at 1:14 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Yes, that was considered and rejected, IIRC. What is your definition
>> of equality for xml?

> I'd vote for !memcmp().

Surely not. xml is not text.

regards, tom lane


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-bugs(at)postgresql(dot)org
Subject: Re: Bad optimizer data for xml (WAS: xml data type implications of no =)
Date: 2010-06-09 22:03:15
Message-ID: 4C100FA3.8050408@catalyst.net.nz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-bugs

On 10/06/10 02:17, Tom Lane wrote:
> Mark Kirkwood<mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> writes:
>
>> It seems that the nub of this issue is that there are conceptually two
>> types of =, one for datatype specific comparison, and one for optimizer
>> statistical information calculation. However the system allows only the
>> first, so if you don't (or can't) have one then you lose some possibly
>> important optimization data.
>>
> Nonsense. ANALYZE and the optimizer work with the datatype's usual
> notion of '=', whatever it is.
>
>

Slow down the reading Tom... and read what I was actually saying - note
the"conceptually". Of course the code uses the datatype's defined "=".

> It's possible that we should install a simplified code path in analyze.c
> that can collect width data for a column even in the absence of any '='
> operator.
>

Yeah I was thinking along the same lines.

> Do you have an actual example where such data would have affected a
> plan choice?
>
>
>

Not at the moment, I was thinking that anywhere that used such datatypes
in a subquery of similar might be a likely case. I guess I was looking
at this as a case of "this is an area where we have less accurate
optimizer data that we could have", and thinking of ways to improve it.

regards

Mark