texteq/byteaeq: avoid detoast

Lists: pgsql-hackers
From: Noah Misch <noah(at)leadboat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: texteq/byteaeq: avoid detoast
Date: 2010-12-20 18:19:53
Message-ID: 20101220181953.GB29282@tornado.gateway.2wire.net

texteq, textne, byteaeq and byteane detoast their arguments, then check for
equality of length. Unequal lengths imply the answer trivially; given equal
lengths, the functions proceed to compare the actual bytes. We can skip
detoasting entirely when the lengths are unequal. The attached patch implements
this. As submitted, it applies atop of my recent strncmp->memcmp patch, but
they are logically independent. To benchmark some optimal and pessimal cases, I
used the attached "bench-skip-texteq.sql". It uses a few datum sizes and varies
whether the length check succeeds:

bench-skip-texteq.sql, 10 MiB nomatch: 58.4s previous, 0.00664s patched
bench-skip-texteq.sql,  144 B   match: 73.0s previous, 71.9s patched
bench-skip-texteq.sql,    3 B   match: 68.8s previous, 67.3s patched
bench-skip-texteq.sql,    3 B nomatch: 45.0s previous, 46.0s patched

The timing differences in the smaller-length test cases are probably not
statistically significant.
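
As a rough illustration of the idea (not the attached patch itself), the following standalone C sketch hoists the length check ahead of the expensive fetch. MockToasted, mock_detoast and mock_texteq are hypothetical names invented for this sketch; the real functions work on varlena datums and the real detoast step reads and decompresses out-of-line chunks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a toasted value: the raw length is known
 * without fetching the payload, while fetching is the expensive step. */
typedef struct MockToasted
{
    size_t      raw_len;        /* length of the original, untoasted data */
    const char *stored;         /* pretend this lives out of line */
    int        *fetch_count;    /* counts simulated detoast calls */
} MockToasted;

/* Simulated detoast: only bumps a counter and hands back the bytes. */
static const char *
mock_detoast(const MockToasted *v)
{
    assert(v != NULL && v->fetch_count != NULL);
    (*v->fetch_count)++;
    return v->stored;
}

/* Equality with the length check done before any detoasting. */
static bool
mock_texteq(const MockToasted *a, const MockToasted *b)
{
    if (a->raw_len != b->raw_len)
        return false;           /* answer is already known; no fetch needed */

    return memcmp(mock_detoast(a), mock_detoast(b), a->raw_len) == 0;
}
```

Comparing a 6-byte value with a 3-byte value in this sketch returns false with the fetch counter still at zero; only equal-length inputs pay for the two fetches.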

Thanks,
nm

Attachment                     Content-Type  Size
varlena-avoid-detoast.patch    text/plain    5.1 KB
bench-skip-texteq.sql          text/plain    2.5 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast
Date: 2011-01-04 03:23:03
Message-ID: AANLkTinBFfeWxcaW8=4fcL72ErUoHA+KuTxYbbr+3mZ_@mail.gmail.com

On Mon, Dec 20, 2010 at 1:19 PM, Noah Misch <noah(at)leadboat(dot)com> wrote:
> texteq, textne, byteaeq and byteane detoast their arguments, then check for
> equality of length.  Unequal lengths imply the answer trivially; given equal
> lengths, the functions proceed to compare the actual bytes.  We can skip
> detoasting entirely when the lengths are unequal.  The attached patch implements
> this.  As submitted, it applies atop of my recent strncmp->memcmp patch, but
> they are logically independent.  To benchmark some optimal and pessimal cases, I
> used the attached "bench-skip-texteq.sql".  It uses a few datum sizes and varies
> whether the length check succeeds:
>
> bench-skip-texteq.sql, 10 MiB nomatch: 58.4s previous, 0.00664s patched
> bench-skip-texteq.sql,  144 B   match: 73.0s previous, 71.9s patched
> bench-skip-texteq.sql,    3 B   match: 68.8s previous, 67.3s patched
> bench-skip-texteq.sql,    3 B nomatch: 45.0s previous, 46.0s patched
>
> The timing differences in the smaller-length test cases are probably not
> statistically significant.

Can you add this to the currently-open CommitFest, so we don't lose track of it?

https://commitfest.postgresql.org/action/commitfest_view/open

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast
Date: 2011-01-04 13:46:29
Message-ID: 20110104134629.GA13716@tornado.leadboat.com

On Mon, Jan 03, 2011 at 10:23:03PM -0500, Robert Haas wrote:
> Can you add this to the currently-open CommitFest, so we don't lose track of it?
>
> https://commitfest.postgresql.org/action/commitfest_view/open

Done.


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast
Date: 2011-01-04 14:13:11
Message-ID: AANLkTinmnuWGUE1gUxBMZrMhJYgRUto2EMap1w02g3HK@mail.gmail.com

Hello

I looked at the patch.

Does toast_raw_datum_size work correctly on packed varlena?

regards

Pavel Stehule



From: Noah Misch <noah(at)leadboat(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast
Date: 2011-01-04 14:23:40
Message-ID: 20110104142340.GA13981@tornado.leadboat.com

Hi Pavel,

On Tue, Jan 04, 2011 at 03:13:11PM +0100, Pavel Stehule wrote:
> I looked at the patch.

Thanks.

> Does toast_raw_datum_size work correctly on packed varlena?

Yes, as best I can tell.
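
To make the concern concrete: a length comparison is only meaningful if values stored with a short (packed) header and values stored with a full header report sizes on the same basis. The sketch below shows that normalization with a deliberately simplified, hypothetical layout. MockVarlena and mock_raw_datum_size are made-up names, and the struct does not reproduce PostgreSQL's actual header bits.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_HDRSZ        4   /* full header size, in bytes */
#define MOCK_HDRSZ_SHORT  1   /* packed header size, in bytes */

/* Hypothetical, simplified value descriptor: is_short says which header
 * the stored size includes. Not PostgreSQL's real bit layout. */
typedef struct MockVarlena
{
    int     is_short;       /* nonzero: stored with the 1-byte header */
    size_t  total_size;     /* header plus payload, as stored */
} MockVarlena;

/* Report every size as if the value carried the full header, so packed
 * and unpacked copies of the same payload yield the same number. */
static size_t
mock_raw_datum_size(const MockVarlena *v)
{
    assert(v != NULL);
    if (v->is_short)
        return v->total_size - MOCK_HDRSZ_SHORT + MOCK_HDRSZ;
    return v->total_size;
}
```

With this normalization, a 3-byte payload reports 7 whether it was stored packed (total size 4) or unpacked (total size 7), so comparing the two results is safe.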


From: Andy Colson <andy(at)squeakycode(dot)net>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-16 19:05:11
Message-ID: 4D334167.7060108@squeakycode.net

This is a review of:
https://commitfest.postgresql.org/action/patch_view?id=468

Purpose:
========
Equal and not-equal _may_ be quickly determined if their lengths are different. This _may_ be a huge speedup if we don't have to detoast.

The Patch:
==========
I was able to read and understand the patch; it's a simple change and looked correct to me (a non PG hacker).
It applies clean to git head, compiles and runs fine with debug enabled.

make check passes

Usability:
==========
I used _may_ above. The benchmark included with the patch, showing huge speedups, is really contrived. It uses a where clause with a thousand-character constant: (where c = 'long...long...long...long...ConstantText...etc'). In my opinion this is very uncommon (the author does note this is a "best case"). If you have a field large enough to be toasted, you are not going to be using that to search on; you are going to have an ID field that is indexed. (select c where id = 7)

This also only touches = and <>. The >, < and LIKE operators won't be touched, so I think the scope of this is limited.

THAT being said, the patch is simple, and if you do happen to hit the code, it will speed things up. As a user of PG I'd like to have this included. It's a corner case, but a big corner, and it's a small, simple change, and it won't slow anything else down.

Performance:
============
I created a more real-world test for myself: a table with indexes, ids and a large toasted field.

create table junk(id serial primary key, xgroup integer, c text);
create index junk_group on junk(xgroup);

I filled it full of junk:

do $$
       declare i integer;
       declare j integer;
begin
       for i in 1..100 loop
               for j in 1..500 loop
                       insert into junk(xgroup, c) values (j, 'c'||i);
                       insert into junk (xgroup, c) select j, repeat('abc', 2000)|| n from generate_series(1, 5) n;
               end loop;
       end loop;
end$$;

This will make about 600 records within the same xgroup, as well as a simple 'c15' type of value in c that we can search for. My thinking is you may not know the exact unique id, but you do know what group it's in, so that'll cut out 90% of the records, and then you'll have to add " and c = 'c15'" to get the exact one you want.

I still saw a nice performance boost.

Old PG:
$ psql < bench3.sql
Timing is on.
DO
Time: 2010.412 ms

Patched:
$ psql < bench3.sql
Timing is on.
DO
Time: 184.602 ms

bench3.sql:
do $$
       declare i integer;
begin
       for i in 1..400 loop
               perform count(*) from junk where xgroup = i and c like 'c' || i;
       end loop;
end$$;

Summary:
========
Performance speed-up: Oh yeah! If you just happen to hit it, and if you do hit it, you might want to re-think your layout a little bit.

Do I want it? Yes please.


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Andy Colson <andy(at)squeakycode(dot)net>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-16 21:07:13
Message-ID: AANLkTimBy4QCXodj7XHqQhEqCqWZhme=WirzNa1vAEbF@mail.gmail.com

Hello

I looked at this patch too.

It's a good idea.

I think we could have a function or macro that compares varlena
sizes. Something like:

Datum texteq(..)
{
    if (!datumsHasSameLength(PG_GETARG_DATUM(0), PG_GETARG_DATUM(1)))
        PG_RETURN_BOOL(false);

    ... actual code ..
}

Regards

Pavel Stehule


From: Noah Misch <noah(at)leadboat(dot)com>
To: Andy Colson <andy(at)squeakycode(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-16 22:23:39
Message-ID: 20110116222339.GC4600@tornado.leadboat.com

On Sun, Jan 16, 2011 at 01:05:11PM -0600, Andy Colson wrote:
> This is a review of:
> https://commitfest.postgresql.org/action/patch_view?id=468

Thanks!

> I created a more real-world test for myself: a table with indexes, ids and a large toasted field.

> This will make about 600 records within the same xgroup, as well as a simple 'c15' type of value in c that we can search for. My thinking is you may not know the exact unique id, but you do know what group it's in, so that'll cut out 90% of the records, and then you'll have to add " and c = 'c15'" to get the exact one you want.

Good to have a benchmark like that, rather than just looking at the extrema.


From: Noah Misch <noah(at)leadboat(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Andy Colson <andy(at)squeakycode(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-16 22:47:38
Message-ID: 20110116224738.GD4600@tornado.leadboat.com

On Sun, Jan 16, 2011 at 10:07:13PM +0100, Pavel Stehule wrote:
> I think we could have a function or macro that compares varlena
> sizes. Something like:
>
> Datum texteq(..)
> {
>     if (!datumsHasSameLength(PG_GETARG_DATUM(0), PG_GETARG_DATUM(1)))
>         PG_RETURN_BOOL(false);
>
>     ... actual code ..
> }

Good point. Is this something that would be useful many places? One thing that
bugged me slightly writing this patch is that texteq, textne, byteaeq and
byteane all follow the same pattern rather tightly. (Indeed, I think one could
easily implement texteq and byteaeq with the exact same C function.) I like how
we handle this for tsvector (see TSVECTORCMPFUNC in tsvector_op.c) to avoid the
redundancy. If datumHasSameLength would mainly apply to these four functions or
ones very similar to them, maybe we should abstract out the entire function body
like we do for tsvector?

A topic for a different patch in any case, I think.
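
For illustration only, here is a loose standalone analogue of that "abstract out the entire function body" approach. It is not a reproduction of TSVECTORCMPFUNC; MockBytes, mock_bytes_equal and MOCK_EQ_FUNCS are hypothetical names, and the generated functions take plain pointers rather than fmgr arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical minimal value type shared by both families of functions. */
typedef struct MockBytes
{
    size_t      len;
    const char *data;
} MockBytes;

/* Shared body: length check first, then a bytewise comparison. */
static bool
mock_bytes_equal(const MockBytes *a, const MockBytes *b)
{
    assert(a != NULL && b != NULL);
    return a->len == b->len && memcmp(a->data, b->data, a->len) == 0;
}

/* Stamp out an eq/ne pair per name prefix so the bodies cannot drift
 * apart, while each type still gets its own entry points. */
#define MOCK_EQ_FUNCS(prefix) \
    static bool prefix##eq(const MockBytes *a, const MockBytes *b) \
    { return mock_bytes_equal(a, b); } \
    static bool prefix##ne(const MockBytes *a, const MockBytes *b) \
    { return !mock_bytes_equal(a, b); }

MOCK_EQ_FUNCS(mock_text)
MOCK_EQ_FUNCS(mock_bytea)
```

The two invocations define mock_texteq, mock_textne, mock_byteaeq and mock_byteane. Because each type keeps a separate function, one family could later be given a different body without touching the other.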


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Andy Colson <andy(at)squeakycode(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-16 23:44:37
Message-ID: AANLkTimW_=tSVrMCygOY5Nf36jf-Y-jpV=VK0gjDVxFo@mail.gmail.com

2011/1/16 Noah Misch <noah(at)leadboat(dot)com>:
> On Sun, Jan 16, 2011 at 10:07:13PM +0100, Pavel Stehule wrote:
>> I think we could have a function or macro that compares varlena
>> sizes. Something like:
>>
>> Datum texteq(..)
>> {
>>     if (!datumsHasSameLength(PG_GETARG_DATUM(0), PG_GETARG_DATUM(1)))
>>         PG_RETURN_BOOL(false);
>>
>>     ... actual code ..
>> }
>
> Good point.  Is this something that would be useful many places?  One thing that
> bugged me slightly writing this patch is that texteq, textne, byteaeq and
> byteane all follow the same pattern rather tightly.  (Indeed, I think one could
> easily implement texteq and byteaeq with the exact same C function.)

It isn't a good idea. Theoretically, there could be differences between
text and bytea in the future; collations may become important. These
types are distinct now, and some basic methods should be distinct
too. The situation is different at the varlena level.

Regards

Pavel Stehule

> I like how
> we handle this for tsvector (see TSVECTORCMPFUNC in tsvector_op.c) to avoid the
> redundancy.  If datumHasSameLength would mainly apply to these four functions or
> ones very similar to them, maybe we should abstract out the entire function body
> like we do for tsvector?
>
> A topic for a different patch in any case, I think.
>


From: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
To: Andy Colson <andy(at)squeakycode(dot)net>
Cc: Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 05:51:59
Message-ID: AANLkTinURwPsfyaqJRY6P0=fp5ma-ai4NJ2kvF40f+8w@mail.gmail.com

On Mon, Jan 17, 2011 at 04:05, Andy Colson <andy(at)squeakycode(dot)net> wrote:
> This is a review of:
> https://commitfest.postgresql.org/action/patch_view?id=468
>
> Purpose:
> ========
> Equal and not-equal _may_ be quickly determined if their lengths are
> different.   This _may_ be a huge speed up if we don't have to detoast.

We can skip detoast to compare lengths of two text/bytea values
with the patch, but we still need detoast to compare the contents
of the values.

If we always generate same toasted byte sequences from the same raw
values, we don't need to detoast at all to compare the contents.
Is it possible or not?

--
Itagaki Takahiro


From: KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc: Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 06:22:40
Message-ID: 4D33E030.20800@ak.jp.nec.com

(2011/01/17 14:51), Itagaki Takahiro wrote:
> On Mon, Jan 17, 2011 at 04:05, Andy Colson<andy(at)squeakycode(dot)net> wrote:
>> This is a review of:
>> https://commitfest.postgresql.org/action/patch_view?id=468
>>
>> Purpose:
>> ========
>> Equal and not-equal _may_ be quickly determined if their lengths are
>> different. This _may_ be a huge speed up if we don't have to detoast.
>
> We can skip detoast to compare lengths of two text/bytea values
> with the patch, but we still need detoast to compare the contents
> of the values.
>
> If we always generate same toasted byte sequences from the same raw
> values, we don't need to detoast at all to compare the contents.
> Is it possible or not?
>
Are you talking about an idea to apply toast id as an alternative key?

I used a similar idea to represent security labels on user tables for row
level security in the v8.4/9.0 based implementation. Because a small
number of security labels are shared by a massive number of tuples,
toasting all the text data individually wastes space, in addition to the
performance loss in toast/detoast.

In this case, I represented the security label (text) using a security-id (oid),
which is a primary key pointing to a certain text datum in a catalog table.
It reduced storage consumption considerably and achieved good performance in
comparison operations.

One challenge was reclaiming orphan texts. In this case, we needed to
lock out a user table referencing the toast values, then delete all
the orphan entries.

Thanks,
--
KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc: Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 06:35:52
Message-ID: AANLkTik9dbiQRwBsysuJ_iOFim74s_bwOnqxgjNR_zka@mail.gmail.com

On Mon, Jan 17, 2011 at 06:51, Itagaki Takahiro
<itagaki(dot)takahiro(at)gmail(dot)com> wrote:
> On Mon, Jan 17, 2011 at 04:05, Andy Colson <andy(at)squeakycode(dot)net> wrote:
>> This is a review of:
>> https://commitfest.postgresql.org/action/patch_view?id=468
>>
>> Purpose:
>> ========
>> Equal and not-equal _may_ be quickly determined if their lengths are
>> different.   This _may_ be a huge speed up if we don't have to detoast.
>
> We can skip detoast to compare lengths of two text/bytea values
> with the patch, but we still need detoast to compare the contents
> of the values.
>
> If we always generate same toasted byte sequences from the same raw
> values, we don't need to detoast at all to compare the contents.
> Is it possible or not?

For bytea, it seems it would be possible.

For text, I think locales may make that impossible. Aren't there
locale rules where two different characters can "behave the same" when
comparing them? I know in Swedish at least w and v behave the same
when sorting (but not when comparing) in some variants of the locale.

In fact, aren't there cases where the *length test* also fails? I
don't know this for sure, but unless we know for certain that two
different length strings can never be the same *independent of
locale*, this whole patch has a big problem...

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 07:13:28
Message-ID: AANLkTinGy6+DracG-0+V7kTuiiQUQ4rDGo3ZEjyZeAWU@mail.gmail.com

2011/1/17 Magnus Hagander <magnus(at)hagander(dot)net>:
> On Mon, Jan 17, 2011 at 06:51, Itagaki Takahiro
> <itagaki(dot)takahiro(at)gmail(dot)com> wrote:
>> On Mon, Jan 17, 2011 at 04:05, Andy Colson <andy(at)squeakycode(dot)net> wrote:
>>> This is a review of:
>>> https://commitfest.postgresql.org/action/patch_view?id=468
>>>
>>> Purpose:
>>> ========
>>> Equal and not-equal _may_ be quickly determined if their lengths are
>>> different.   This _may_ be a huge speed up if we don't have to detoast.
>>
>> We can skip detoast to compare lengths of two text/bytea values
>> with the patch, but we still need detoast to compare the contents
>> of the values.
>>
>> If we always generate same toasted byte sequences from the same raw
>> values, we don't need to detoast at all to compare the contents.
>> Is it possible or not?
>
> For bytea, it seems it would be possible.
>
> For text, I think locales may make that impossible. Aren't there
> locale rules where two different characters can "behave the same" when
> comparing them? I know in Swedish at least w and v behave the same
> when sorting (but not when comparing) in some variants of the locale.
>
> In fact, aren't there cases where the *length test* also fails? I
> don't know this for sure, but unless we know for certain that two
> different length strings can never be the same *independent of
> locale*, this whole patch has a big problem...
>

Some string comparison operations are binary now too. But it is an open
question what will change with collation support.

Regards

Pavel Stehule



From: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
To: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 07:51:56
Message-ID: AANLkTinB9=Yd_ci8=AZqOEk7Mr1Ze+L6k+ubpKbEbfoX@mail.gmail.com

On Mon, Jan 17, 2011 at 16:13, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>>> If we always generate same toasted byte sequences from the same raw
>>> values, we don't need to detoast at all to compare the contents.
>>> Is it possible or not?
>>
>> For bytea, it seems it would be possible.
>>
>> For text, I think locales may make that impossible. Aren't there
>> locale rules where two different characters can "behave the same" when
>> comparing them? I know in Swedish at least w and v behave the same
>> when sorting (but not when comparing) in some variants of the locale.
>>
> Some string comparison operations are binary now too. But it is an open
> question what will change with collation support.

Right. We are using memcmp() in texteq and textne now. We consider
collations only in <, <=, >=, > and the comparison support functions.
So, I think there is no regression here as long as raw values and
toasted byte sequences have a one-to-one correspondence.

--
Itagaki Takahiro


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 07:56:18
Message-ID: 1295250978.1455.2.camel@fsopti579.F-Secure.com

On mån, 2011-01-17 at 07:35 +0100, Magnus Hagander wrote:
> For text, I think locales may make that impossible. Aren't there
> locale rules where two different characters can "behave the same" when
> comparing them? I know in Swedish at least w and v behave the same
> when sorting (but not when comparing) in some variants of the locale.
>
> In fact, aren't there cases where the *length test* also fails? I
> don't know this for sure, but unless we know for certain that two
> different length strings can never be the same *independent of
> locale*, this whole patch has a big problem...

Currently, two text values are only equal if strcoll() considers them
equal and the bits are the same. So this patch is safe in that regard.
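
A minimal sketch of why that rule makes the length shortcut safe, using NUL-terminated C strings as a simplification (real text values are length-counted, and sketch_text_equal and sketch_text_equal_fast are hypothetical names). Since equality requires identical bytes, values of different lengths can never be equal, whatever strcoll() says:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Equality as described above: the collation must see no difference
 * AND the bytes must be identical. */
static bool
sketch_text_equal(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcoll(a, b) == 0 && strcmp(a, b) == 0;
}

/* Same answer, but bails out early on a length mismatch. Identical
 * bytes imply identical lengths, so the shortcut cannot change results. */
static bool
sketch_text_equal_fast(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    if (strlen(a) != strlen(b))
        return false;
    return sketch_text_equal(a, b);
}
```

If the bytewise condition were ever dropped, the two functions could disagree, which is the risk raised in the next paragraph.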

There is, however, some desire to loosen this. Possible applications
are case-insensitive comparison and Unicode normalization. It's not
going to happen soon, but it may be worth considering not putting in an
optimization that we'll end up having to rip out again in a year
perhaps.


From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 07:56:38
Message-ID: AANLkTimTQFvamwjBeb8RcmqghQf6CtJauROVo7Mm-H5T@mail.gmail.com

2011/1/17 Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>:
> On Mon, Jan 17, 2011 at 16:13, Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com> wrote:
>>>> If we always generate same toasted byte sequences from the same raw
>>>> values, we don't need to detoast at all to compare the contents.
>>>> Is it possible or not?
>>>
>>> For bytea, it seems it would be possible.
>>>
>>> For text, I think locales may make that impossible. Aren't there
>>> locale rules where two different characters can "behave the same" when
>>> comparing them? I know in Swedish at least w and v behave the same
>>> when sorting (but not when comparing) in some variants of the locale.
>>>
>> Some string comparison operations are binary now too. But it is an open
>> question what will change with collation support.
>
> Right. We are using memcmp() in texteq and textne now. We consider
> collations only in <, <=, >=, > and the comparison support functions.
> So, I think there is no regression here as long as raw values and
> toasted byte sequences have a one-to-one correspondence.
>

I am sure this isn't a problem in the Czech locale, but I am not sure
about German or Turkish.

Wasn't there an issue with the German "ss" character, if I remember correctly?

Pavel



From: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
To: KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>
Cc: Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 08:13:29
Message-ID: AANLkTinXDJ-sQJNAYXRbF4B=msN4MzLMU2GLHUGcFBEi@mail.gmail.com

2011/1/17 KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>:
> Are you talking about an idea to apply toast id as an alternative key?

Probably not. I'm just talking about whether "diff -q A.txt B.txt" and
"diff -q A.gz B.gz" always return the same result or not.

... I found that it depends on the version of gzip. So, if we use such logic,
we cannot improve the toast compression logic, because the data is migrated
by pg_upgrade.

> I used a similar idea to represent security labels on user tables for row
> level security in the v8.4/9.0 based implementation. Because a small
> number of security labels are shared by a massive number of tuples,
> toasting all the text data individually wastes space, in addition to the
> performance loss in toast/detoast.

That looks like the same issue as with large objects, rather than the topic
discussed here. We have vacuumlo in contrib to solve that issue. So, we could
have vacuumlo-like special sweeping logic for the security label table.

--
Itagaki Takahiro


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc: KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 10:05:09
Message-ID: AANLkTikbYnSHyrp_1eonqOMMgnj8YHjJQaH40G9ak8Am@mail.gmail.com

On Mon, Jan 17, 2011 at 09:13, Itagaki Takahiro
<itagaki(dot)takahiro(at)gmail(dot)com> wrote:
> 2011/1/17 KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>:
>> Are you talking about an idea to apply toast id as an alternative key?
>
> Probably not. I'm just talking about whether "diff -q A.txt B.txt" and
> "diff -q A.gz  B.gz" always return the same result or not.
>
> ... I found that it depends on the version of gzip. So, if we use such logic,
> we cannot improve the toast compression logic, because the data is migrated
> by pg_upgrade.

Yeah, that might be a bad tradeoff.

I wonder if we can trust the *equality* test, but not the inequality?
E.g. if compressed(A) == compressed(B) we know they're the same, but
if compressed(A) != compressed(B) we don't know that they differ; they
still might be the same.

I guess with two different versions or even completely different
algorithms we could end up with exactly the same compressed value for
different plaintexts (it's not a cryptographic hash after all), so
that's probably not an acceptable comparison either.
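
The version dependence is easy to reproduce with a toy compressor. The sketch below is a made-up run-length encoder (toy_rle is a hypothetical name, unrelated to the compression PostgreSQL actually uses for TOAST) whose only "version" difference is the maximum run length. The same plaintext then yields different compressed bytes, so unequal compressed forms say nothing about the plaintexts.

```c
#include <assert.h>
#include <stddef.h>

/* Toy run-length encoder: emits (count, byte) pairs, with runs capped
 * at max_run. Returns the number of bytes written to out. */
static size_t
toy_rle(const unsigned char *in, size_t n, unsigned char max_run,
        unsigned char *out, size_t out_cap)
{
    size_t i = 0;
    size_t o = 0;

    assert(max_run > 0);
    while (i < n)
    {
        unsigned char run = 1;

        /* Extend the run while the next byte matches and the cap allows. */
        while (i + run < n && in[i + run] == in[i] && run < max_run)
            run++;

        assert(o + 2 <= out_cap);
        out[o++] = run;
        out[o++] = in[i];
        i += run;
    }
    return o;
}
```

Encoding the six bytes "aaaaaa" with a cap of 4 gives the pairs (4,'a') (2,'a'), while a cap of 8 gives the single pair (6,'a'): one plaintext, two different stored forms.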

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 12:55:15
Message-ID: AANLkTi=df3s6zVh8v1bnzou29gEMsFXDVS3KwodpM5cn@mail.gmail.com

On Mon, Jan 17, 2011 at 2:56 AM, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On mån, 2011-01-17 at 07:35 +0100, Magnus Hagander wrote:
>> For text, I think locales may make that impossible. Aren't there
>> locale rules where two different characters can "behave the same" when
>> comparing them? I know in Swedish at least w and v behave the same
>> when sorting (but not when comparing) in some variants of the locale.
>>
>> In fact, aren't there cases where the *length test* also fails? I
>> don't know this for sure, but unless we know for certain that two
>> different length strings can never be the same *independent of
>> locale*, this whole patch has a big problem...
>
> Currently, two text values are only equal if strcoll() considers them
> equal and the bits are the same.  So this patch is safe in that regard.
>
> There is, however, some desire to loosen this.  Possible applications
> are case-insensitive comparison and Unicode normalization.  It's not
> going to happen soon, but it may be worth considering not putting in an
> optimization that we'll end up having to rip out again in a year
> perhaps.

Hmm. I hate to give up on this - it's a nice optimization for the
cases to which it applies. Would it be possible to jigger things so
that we can still do it byte-for-byte when case-insensitive comparison
or Unicode normalization AREN'T in use?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 15:22:51
Message-ID: 20110117152251.GA19587@tornado.leadboat.com
Lists: pgsql-hackers

On Mon, Jan 17, 2011 at 07:35:52AM +0100, Magnus Hagander wrote:
> On Mon, Jan 17, 2011 at 06:51, Itagaki Takahiro
> <itagaki(dot)takahiro(at)gmail(dot)com> wrote:
> > On Mon, Jan 17, 2011 at 04:05, Andy Colson <andy(at)squeakycode(dot)net> wrote:
> >> This is a review of:
> >> https://commitfest.postgresql.org/action/patch_view?id=468
> >>
> >> Purpose:
> >> ========
> >> Equal and not-equal _may_ be quickly determined if their lengths are
> >> different.  This _may_ be a huge speed up if we don't have to detoast.
> >
> > We can skip detoast to compare lengths of two text/bytea values
> > with the patch, but we still need detoast to compare the contents
> > of the values.
> >
> > If we always generate the same toasted byte sequences from the same raw
> > values, we don't need to detoast at all to compare the contents.
> > Is it possible or not?
>
> For bytea, it seems it would be possible.
>
> For text, I think locales may make that impossible. Aren't there
> locale rules where two different characters can "behave the same" when
> comparing them? I know in Swedish at least w and v behave the same
> when sorting (but not when comparing) in some variants of the locale.
>
> In fact, aren't there cases where the *length test* also fails? I
> don't know this for sure, but unless we know for certain that two
> different length strings can never be the same *independent of
> locale*, this whole patch has a big problem...

Just to be clear, the code already has these length tests today. This patch
just moves them before the detoast.
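
A minimal sketch of that reordering, written as a toy Python model rather than
PostgreSQL's C code. `ToastedValue`, `raw_size`, `payload` and `values_equal`
are hypothetical names, and zlib merely stands in for the TOAST compressor. The
only point illustrated is that the length check can run before any payload is
expanded:

```python
import zlib
from dataclasses import dataclass


@dataclass
class ToastedValue:
    """Toy stand-in for a toasted datum; not PostgreSQL's real layout."""
    raw_size: int           # uncompressed length, readable without expanding
    payload: bytes          # compressed bytes, expensive to expand
    detoast_calls: int = 0  # counts how often the payload was expanded

    @classmethod
    def store(cls, data: bytes) -> "ToastedValue":
        return cls(raw_size=len(data), payload=zlib.compress(data))

    def detoast(self) -> bytes:
        self.detoast_calls += 1
        return zlib.decompress(self.payload)


def values_equal(a: ToastedValue, b: ToastedValue) -> bool:
    # Cheap check first: unequal raw lengths settle the answer.
    if a.raw_size != b.raw_size:
        return False
    # Equal lengths: the bytes themselves still have to be compared.
    return a.detoast() == b.detoast()


big = ToastedValue.store(b"x" * 1_000_000)
small = ToastedValue.store(b"x" * 3)
print(values_equal(big, small), big.detoast_calls, small.detoast_calls)
```

In this model the mismatched pair is rejected with zero `detoast()` calls,
while an equal-length pair still pays for expansion on both sides.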


From: Noah Misch <noah(at)leadboat(dot)com>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 15:28:27
Message-ID: 20110117152827.GB19587@tornado.leadboat.com
Lists: pgsql-hackers

On Mon, Jan 17, 2011 at 11:05:09AM +0100, Magnus Hagander wrote:
> On Mon, Jan 17, 2011 at 09:13, Itagaki Takahiro
> <itagaki(dot)takahiro(at)gmail(dot)com> wrote:
> > 2011/1/17 KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>:
> >> Are you talking about an idea to apply toast id as an alternative key?
> >
> > No, probably not. I'm just asking whether "diff -q A.txt B.txt" and
> > "diff -q A.gz  B.gz" always return the same result.

Interesting.

> > ... I found that it depends on the version of gzip. So, if we use such logic,
> > we cannot improve the toast compression logic, because the data is migrated
> > by pg_upgrade.
>
> Yeah, that might be a bad tradeoff.
>
> I wonder if we can trust the *equality* test, but not the inequality?
> E.g. if compressed(A) == compressed(B) we know they're the same, but
> if compressed(A) != compressed(B) we don't know that they differ; they still
> might be the same.

Exactly.

> I guess with two different versions or even completely different
> algorithms we could end up with exactly the same compressed value for
> different plaintexts (it's not a cryptographic hash after all), so
> that's probably not an acceptable comparison either.

It's safe to assume that will never happen. If compressed(A) == compressed(B)
when A != B, we would have a lossy compression algorithm.

As you say, though, _inequality_ implies nothing for an arbitrary decompressor.
One can trivially construct many inputs to the zlib decompressor that yield the
same output. "gzip -1" ... "gzip -9" do this, for example. So the main win
here would come if we tightly controlled the compressor, such that we could
infer something from compressed(A) != compressed(B). That would be an
intriguing path to explore.

nm
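
A minimal sketch of the asymmetry discussed above, assuming Python's zlib
module as a stand-in for the compressor (TOAST uses its own compression code,
so this is an analogy only). The same plaintext is compressed at several
levels, much like "gzip -1" ... "gzip -9":

```python
import zlib

plaintext = b"texteq/byteaeq: avoid detoast\n" * 1000

# Same input, three compression levels.
blobs = [zlib.compress(plaintext, level) for level in (1, 6, 9)]

# Every stream decompresses back to the same plaintext (lossless) ...
round_trips = [zlib.decompress(blob) == plaintext for blob in blobs]

# ... yet the compressed byte strings are not all identical, so
# compressed(A) != compressed(B) does not prove A != B.
distinct_streams = len(set(blobs))

print(all(round_trips), distinct_streams)
```

Inequality of the compressed bytes would only carry information if the
compressor were pinned down so that one plaintext always yields one stream.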


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 19:12:57
Message-ID: 1295291577.12898.1.camel@vanquo.pezone.net
Lists: pgsql-hackers

On mån, 2011-01-17 at 07:55 -0500, Robert Haas wrote:
> > There is, however, some desire to loosen this. Possible applications
> > are case-insensitive comparison and Unicode normalization. It's not
> > going to happen soon, but it may be worth considering not putting in an
> > optimization that we'll end up having to rip out again in a year
> > perhaps.
>
> Hmm. I hate to give up on this - it's a nice optimization for the
> cases to which it applies. Would it be possible to jigger things so
> that we can still do it byte-for-byte when case-insensitive comparison
> or Unicode normalization AREN'T in use?

Well, at the moment it's all theoretical anyway. These features aren't
on the table, and no one has implemented them.

It's quite possible, however, that if we go that way and investigate
which locales are safe for this, we might end up with the same answer as
for which locales are safe for LIKE optimization, namely none.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 20:33:57
Message-ID: 20235.1295296437@sss.pgh.pa.us
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> On mån, 2011-01-17 at 07:35 +0100, Magnus Hagander wrote:
>> In fact, aren't there cases where the *length test* also fails?

> Currently, two text values are only equal if strcoll() considers them
> equal and the bits are the same. So this patch is safe in that regard.

> There is, however, some desire to loosen this.

That isn't ever going to happen, unless you'd like to give up hash joins
and hash aggregation on text values.

regards, tom lane


From: Jim Nasby <jim(at)nasby(dot)net>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 20:36:56
Message-ID: A5370FA2-AB83-48E9-83CB-4F3683CB26A2@nasby.net
Lists: pgsql-hackers

On Jan 17, 2011, at 9:22 AM, Noah Misch wrote:
> On Mon, Jan 17, 2011 at 07:35:52AM +0100, Magnus Hagander wrote:
>> On Mon, Jan 17, 2011 at 06:51, Itagaki Takahiro
>> <itagaki(dot)takahiro(at)gmail(dot)com> wrote:
>>> On Mon, Jan 17, 2011 at 04:05, Andy Colson <andy(at)squeakycode(dot)net> wrote:
>>>> This is a review of:
>>>> https://commitfest.postgresql.org/action/patch_view?id=468
>>>>
>>>> Purpose:
>>>> ========
>>>> Equal and not-equal _may_ be quickly determined if their lengths are
>>>> different.  This _may_ be a huge speed up if we don't have to detoast.
>>>
>>> We can skip detoast to compare lengths of two text/bytea values
>>> with the patch, but we still need detoast to compare the contents
>>> of the values.
>>>
>>> If we always generate the same toasted byte sequences from the same raw
>>> values, we don't need to detoast at all to compare the contents.
>>> Is it possible or not?
>>
>> For bytea, it seems it would be possible.
>>
>> For text, I think locales may make that impossible. Aren't there
>> locale rules where two different characters can "behave the same" when
>> comparing them? I know in Swedish at least w and v behave the same
>> when sorting (but not when comparing) in some variants of the locale.
>>
>> In fact, aren't there cases where the *length test* also fails? I
>> don't know this for sure, but unless we know for certain that two
>> different length strings can never be the same *independent of
>> locale*, this whole patch has a big problem...
>
> Just to be clear, the code already has these length tests today. This patch
> just moves them before the detoast.

Any reason we can't do this for other varlena? I'm wondering if it makes more sense to have some generic toast comparison functions that don't care what the data in toast actually is...
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 20:39:56
Message-ID: 20347.1295296796@sss.pgh.pa.us
Lists: pgsql-hackers

Magnus Hagander <magnus(at)hagander(dot)net> writes:
> I wonder if we can trust the *equality* test, but not the inequality?
> E.g. if compressed(A) == compressed(B) we know they're the same, but
> if compressed(A) != compressed(B) we don't know that they differ; they still
> might be the same.

I haven't looked at this patch, but it seems to me that it would be
reasonable to conclude A != B if the va_extsize values in the toast
pointers don't agree. Once you've fetched the toasted values, you've
spent enough cycles that there's not going to be much point in
trying to do any cute optimizations beyond that. So if the patch is
doing a memcmp on the compressed data, I'd be inclined to get rid of
that part.

regards, tom lane


From: Noah Misch <noah(at)leadboat(dot)com>
To: Jim Nasby <jim(at)nasby(dot)net>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-17 21:13:12
Message-ID: 20110117211312.GA14843@tornado.leadboat.com
Lists: pgsql-hackers

On Mon, Jan 17, 2011 at 02:36:56PM -0600, Jim Nasby wrote:
> On Jan 17, 2011, at 9:22 AM, Noah Misch wrote:
> > Just to be clear, the code already has these length tests today. This patch
> > just moves them before the detoast.
>
> Any reason we can't do this for other varlena? I'm wondering if it makes more sense to have some generic toast comparison functions that don't care what the data in toast actually is...

We could not apply the optimization to varlenas generically. For example,
bpchareq() ignores trailing white space during comparison, so "foo" = "foo ".
It would work for biteq(), though I'm not sure how often large-scale varbits
come up. numericeq() does not qualify, because you might have a NumericLong in
a binary-upgraded table that would now become a NumericShort. So, there very
well may be other places where we should apply the same optimization, but each
one needs individual consideration.

Thanks,
nm
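
A minimal sketch of why the shortcut cannot be applied blindly to every
varlena type, using a toy Python model of blank-padded comparison. The function
names are hypothetical and the model ignores everything about bpchar except
that trailing spaces are not significant:

```python
def bpchar_equal(a: str, b: str) -> bool:
    """Toy model of blank-padded equality: trailing spaces are ignored."""
    return a.rstrip(" ") == b.rstrip(" ")


def length_shortcut_rejects(a: str, b: str) -> bool:
    """The shortcut under discussion: answer 'not equal' on a length mismatch."""
    return len(a) != len(b)


a, b = "foo", "foo   "
# The values are equal under the type's own rules, yet their lengths differ,
# so a length-only rejection would give the wrong answer for this type.
shortcut_wrong = length_shortcut_rejects(a, b) and bpchar_equal(a, b)
print(shortcut_wrong)
```

The shortcut is only sound for types where equality means byte-for-byte
identity, which is the per-type consideration described above.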


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-18 04:18:55
Message-ID: 1295324335.12898.8.camel@vanquo.pezone.net
Lists: pgsql-hackers

On mån, 2011-01-17 at 15:33 -0500, Tom Lane wrote:
> Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> > On mån, 2011-01-17 at 07:35 +0100, Magnus Hagander wrote:
> >> In fact, aren't there cases where the *length test* also fails?
>
> > Currently, two text values are only equal if strcoll() considers them
> > equal and the bits are the same. So this patch is safe in that regard.
>
> > There is, however, some desire to loosen this.
>
> That isn't ever going to happen, unless you'd like to give up hash joins
> and hash aggregation on text values.

Since citext exists and supports hashing, it's obviously possible
nevertheless.


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-18 08:03:01
Message-ID: 4D354935.7090702@enterprisedb.com
Lists: pgsql-hackers

On 17.01.2011 22:33, Tom Lane wrote:
> Peter Eisentraut<peter_e(at)gmx(dot)net> writes:
>> On mån, 2011-01-17 at 07:35 +0100, Magnus Hagander wrote:
>>> In fact, aren't there cases where the *length test* also fails?
>
>> Currently, two text values are only equal if strcoll() considers them
>> equal and the bits are the same. So this patch is safe in that regard.
>
>> There is, however, some desire to loosen this.
>
> That isn't ever going to happen, unless you'd like to give up hash joins
> and hash aggregation on text values.

You could canonicalize the string first in the hash function. I'm not
sure if we have all the necessary information at hand there, but at
least with some encoding/locale-specific support functions it'd be possible.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
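
A minimal sketch of the canonicalize-then-hash idea, assuming (purely for
illustration) that the canonical form is NFC normalization followed by case
folding. The function names are hypothetical and this says nothing about how
PostgreSQL would obtain locale information inside a hash function:

```python
import unicodedata


def canonical_key(s: str) -> str:
    # Assumed canonical form: NFC normalization, then case folding.
    return unicodedata.normalize("NFC", s).casefold()


def loose_equal(a: str, b: str) -> bool:
    # "Loosened" equality: compare canonical forms instead of raw bytes.
    return canonical_key(a) == canonical_key(b)


def loose_hash(s: str) -> int:
    # Hash the canonical form, so values that compare equal hash alike.
    return hash(canonical_key(s))


composed = "Caf\u00e9"      # precomposed e-acute, capital C
decomposed = "cafe\u0301"   # 'e' plus combining acute accent, lower case
print(loose_equal(composed, decomposed),
      loose_hash(composed) == loose_hash(decomposed))
```

Hash joins and hash aggregation need exactly this property: any two values the
equality operator accepts must land in the same hash bucket. Note that under
such an equality, strings of different byte lengths can be equal, which is why
the length shortcut would no longer be valid.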


From: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-18 08:06:24
Message-ID: AANLkTimDY5tiwoSMGVGmAr18CXzF5q__b6rnhZYuTTcb@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 18, 2011 at 05:39, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I haven't looked at this patch, but it seems to me that it would be
> reasonable to conclude A != B if the va_extsize values in the toast
> pointers don't agree.

It's a very lightweight alternative to memcmp'ing the byte data,
but there is still the same issue -- we might get different
compressed results if we use a different algorithm for TOASTing.

So, it would be better to apply the present patch as-is.
We can improve the comparison logic over the patch in another
development cycle if possible.

--
Itagaki Takahiro


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc: Magnus Hagander <magnus(at)hagander(dot)net>, KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-18 16:15:02
Message-ID: 11083.1295367302@sss.pgh.pa.us
Lists: pgsql-hackers

Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com> writes:
> On Tue, Jan 18, 2011 at 05:39, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> I haven't looked at this patch, but it seems to me that it would be
>> reasonable to conclude A != B if the va_extsize values in the toast
>> pointers don't agree.

> It's a very lightweight alternative to memcmp'ing the byte data,
> but there is still the same issue -- we might get different
> compressed results if we use a different algorithm for TOASTing.

Which makes it a lightweight waste of cycles.

> So, it would be better to apply the present patch as-is.

No, I don't think so. Has any evidence been submitted that that part of
the patch is of benefit?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-18 16:32:10
Message-ID: AANLkTikz260CPvn0wnU-Lu3X_ErYiAF-0D4fq6+6i8R5@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 18, 2011 at 11:15 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> It's a very lightweight alternative to memcmp'ing the byte data,
>> but there is still the same issue -- we might get different
>> compressed results if we use a different algorithm for TOASTing.
>
> Which makes it a lightweight waste of cycles.
>
>> So, it would be better to apply the present patch as-is.
>
> No, I don't think so.  Has any evidence been submitted that that part of
> the patch is of benefit?

I think you might be mixing up what's actually in the patch with
another idea that was proposed but isn't actually in the patch. The
patch itself does nothing controversial.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-18 16:44:16
Message-ID: 11705.1295369056@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Jan 18, 2011 at 11:15 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> No, I don't think so. Has any evidence been submitted that that part of
>> the patch is of benefit?

> I think you might be mixing up what's actually in the patch with
> another idea that was proposed but isn't actually in the patch. The
> patch itself does nothing controversial.

Oh, I misread Itagaki-san's comment to imply that that *was* in the
patch. Maybe I should go read it.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-18 16:45:05
Message-ID: AANLkTikM85u8v+ViscYVGPOiPpH69urs3v=2Y-NW1C_6@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jan 18, 2011 at 11:44 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Tue, Jan 18, 2011 at 11:15 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> No, I don't think so.  Has any evidence been submitted that that part of
>>> the patch is of benefit?
>
>> I think you might be mixing up what's actually in the patch with
>> another idea that was proposed but isn't actually in the patch.  The
>> patch itself does nothing controversial.
>
> Oh, I misread Itagaki-san's comment to imply that that *was* in the
> patch.  Maybe I should go read it.

Perhaps. :-)

While you're at it you might commit it. :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Magnus Hagander <magnus(at)hagander(dot)net>, KaiGai Kohei <kaigai(at)ak(dot)jp(dot)nec(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-18 16:53:07
Message-ID: 11884.1295369587@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Jan 18, 2011 at 11:44 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Oh, I misread Itagaki-san's comment to imply that that *was* in the
>> patch. Maybe I should go read it.

> Perhaps. :-)

> While you're at it you might commit it. :-)

Yeah, as penance I'll take this one.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast
Date: 2011-01-18 19:13:15
Message-ID: 690.1295377995@sss.pgh.pa.us
Lists: pgsql-hackers

Noah Misch <noah(at)leadboat(dot)com> writes:
> texteq, textne, byteaeq and byteane detoast their arguments, then check for
> equality of length. Unequal lengths imply the answer trivially; given equal
> lengths, the functions proceed to compare the actual bytes. We can skip
> detoasting entirely when the lengths are unequal. The attached patch implements
> this.

Applied with stylistic changes.

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Magnus Hagander <magnus(at)hagander(dot)net>, Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>, Andy Colson <andy(at)squeakycode(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: texteq/byteaeq: avoid detoast [REVIEW]
Date: 2011-01-19 08:22:41
Message-ID: 20110119082241.GB11804@svana.org
Lists: pgsql-hackers

On Tue, Jan 18, 2011 at 10:03:01AM +0200, Heikki Linnakangas wrote:
>> That isn't ever going to happen, unless you'd like to give up hash joins
>> and hash aggregation on text values.
>
> You could canonicalize the string first in the hash function. I'm not
> sure if we have all the necessary information at hand there, but at
> least with some encoding/locale-specific support functions it'd be
> possible.

This is what strxfrm() was created for.

strcoll(a,b) == strcmp(strxfrm(a),strxfrm(b))

Sure, there's a cost; the question is only how much and whether it makes
hash joins infeasible. I doubt it, since by definition it must be faster
than strcoll(). I suppose a test would be interesting.
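
A minimal sketch of the strxfrm() relation, assuming Python's `locale` module
as a convenient wrapper around the C library calls. The one-liner above is
shorthand: in C, strxfrm() writes into a destination buffer, and the relation
is about the sign of the two comparisons rather than their exact values. The
"C" locale is used here only because it is always available; a meaningful
experiment would use a linguistic locale:

```python
import locale
from functools import cmp_to_key

locale.setlocale(locale.LC_COLLATE, "C")


def sign(n: int) -> int:
    return (n > 0) - (n < 0)


def compare_via_strxfrm(a: str, b: str) -> int:
    # Transform once, then compare the keys with ordinary string comparison.
    ka, kb = locale.strxfrm(a), locale.strxfrm(b)
    return (ka > kb) - (ka < kb)


words = ["pear", "Apple", "apple", "fig"]
by_strcoll = sorted(words, key=cmp_to_key(locale.strcoll))
by_strxfrm = sorted(words, key=locale.strxfrm)
print(by_strcoll == by_strxfrm)
```

Because the transformed key depends on a single string, it could in principle
be computed once per value and then hashed or compared bytewise, which is the
property a hash function would need.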

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
> - Charles de Gaulle