Re: MD5 aggregate

Lists: pgsql-hackers
From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: MD5 aggregate
Date: 2013-06-13 09:35:43
Message-ID: CAEZATCUq--+zYAQxYfyO+3oPjCY6DWS5scET8PPHA7mhn7WDxg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

Attached is a patch implementing a new aggregate function md5_agg() to
compute the aggregate MD5 sum across a number of rows. This is
something I've wished for a number of times. I think the primary use
case is to do a quick check that 2 tables, possibly on different
servers, contain the same data, using a query like

SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;

or

SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

these would be equivalent to

SELECT md5(string_agg(foo.*::text, '' ORDER BY id)) FROM foo;

but without the excessive memory consumption for the intermediate
concatenated string, and the resulting 1GB table size limit.

I've added 2 variants: md5_agg(text) and md5_agg(bytea) to match the 2
variants of md5(), so pure binary data can also be checksummed.

In passing, I've tidied up and optimised the code in md5.c a bit ---
specifically I've removed the malloc()/memcpy()/free() code that was
unnecessarily making a copy of the entire input data just to pad it
and append the bit count. This reduces the memory consumption of the
existing md5() functions for large inputs, and gives a modest
performance boost. As a result, the md5() function can no longer throw
an out-of-memory error.

Regards,
Dean

Attachment Content-Type Size
md5_agg.patch application/octet-stream 29.8 KB

From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-13 21:46:48
Message-ID: 51BA3DC8.4070607@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/13/13 5:35 AM, Dean Rasheed wrote:
> Attached is a patch implementing a new aggregate function md5_agg() to
> compute the aggregate MD5 sum across a number of rows.

That seems somewhat useful.

> In passing, I've tidied up and optimised the code in md5.c a bit ---
> specifically I've removed the malloc()/memcpy()/free() code that was
> unnecessarily making a copy of the entire input data just to pad it
> and append the bit count. This reduces the memory consumption of the
> existing md5() functions for large inputs, and gives a modest
> performance boost. As a result, the md5() function can no longer throw
> an out-of-memory error.

I think it would be better if you split this into two separate patches.


From: Marko Kreen <markokr(at)gmail(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-14 12:00:29
Message-ID: CACMqXCKZJmPPsFm2G8wp-0BZ93ua06uz9m+1txMHrM1yEcmSZw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> Attached is a patch implementing a new aggregate function md5_agg() to
> compute the aggregate MD5 sum across a number of rows. This is
> something I've wished for a number of times. I think the primary use
> case is to do a quick check that 2 tables, possibly on different
> servers, contain the same data, using a query like
>
> SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;
>
> or
>
> SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;
>
> these would be equivalent to
>
> SELECT md5(string_agg(foo.*::text, '' ORDER BY id)) FROM foo;
>
> but without the excessive memory consumption for the intermediate
> concatenated string, and the resulting 1GB table size limit.

It's more efficient to calculate per-row md5, and then sum() them.
This avoids the need for ORDER BY.

--
marko


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marko Kreen <markokr(at)gmail(dot)com>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-14 13:14:32
Message-ID: 8110.1371215672@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Marko Kreen <markokr(at)gmail(dot)com> writes:
> On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>> Attached is a patch implementing a new aggregate function md5_agg() to
>> compute the aggregate MD5 sum across a number of rows.

> It's more efficient to calculate per-row md5, and then sum() them.
> This avoids the need for ORDER BY.

Good point. The aggregate md5 function also fails to distinguish the
case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
from the case where they contain 'xyz' followed by 'zyxyz'.

Now, as against that, you lose any sensitivity to the ordering of the
values.

Personally I'd be a bit inclined to xor the per-row md5's rather than
sum them, but that's a small matter.

regards, tom lane


From: Benedikt Grundmann <bgrundmann(at)janestreet(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marko Kreen <markokr(at)gmail(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-14 13:20:45
Message-ID: CADbMkNPFHbevxzfDN5iCnR+pF18cPTbf0SC+GcVC0+7epZO3fw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jun 14, 2013 at 2:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Marko Kreen <markokr(at)gmail(dot)com> writes:
> > On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
> wrote:
> >> Attached is a patch implementing a new aggregate function md5_agg() to
> >> compute the aggregate MD5 sum across a number of rows.
>
> > It's more efficient to calculate per-row md5, and then sum() them.
> > This avoids the need for ORDER BY.
>
> Good point. The aggregate md5 function also fails to distinguish the
> case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
> from the case where they contain 'xyz' followed by 'zyxyz'.
>
> Now, as against that, you lose any sensitivity to the ordering of the
> values.
>
> Personally I'd be a bit inclined to xor the per-row md5's rather than
> sum them, but that's a small matter.
>
> regards, tom lane
>
>
xor works but only if each row is different (e.g. at the very least all
columns together make a unique key).

>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marko Kreen <markokr(at)gmail(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-14 13:40:51
Message-ID: 20130614134051.GJ6417@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> Marko Kreen <markokr(at)gmail(dot)com> writes:
> > On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> >> Attached is a patch implementing a new aggregate function md5_agg() to
> >> compute the aggregate MD5 sum across a number of rows.
>
> > It's more efficient to calculate per-row md5, and then sum() them.
> > This avoids the need for ORDER BY.
>
> Good point. The aggregate md5 function also fails to distinguish the
> case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
> from the case where they contain 'xyz' followed by 'zyxyz'.
>
> Now, as against that, you lose any sensitivity to the ordering of the
> values.
>
> Personally I'd be a bit inclined to xor the per-row md5's rather than
> sum them, but that's a small matter.

Where I'd take this is actually in a completely different direction..
I'd like the aggregate to be able to match the results of running the
'md5sum' unix utility on a file that's been COPY'd out. Yes, that means
we'd need a way to get back "what would this row look like if it was
sent through COPY with these parameters", but I've long wanted that
also.

No, no clue about how to put all that together. Yes, having this would
be better than nothing, so I'm still for adding this even if we can't
make it match COPY output. :)

Thanks,

Stephen


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <markokr(at)gmail(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-14 13:59:01
Message-ID: 51BB21A5.3060300@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


On 06/14/2013 09:40 AM, Stephen Frost wrote:
> * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
>> Marko Kreen <markokr(at)gmail(dot)com> writes:
>>> On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>>>> Attached is a patch implementing a new aggregate function md5_agg() to
>>>> compute the aggregate MD5 sum across a number of rows.
>>> It's more efficient to calculate per-row md5, and then sum() them.
>>> This avoids the need for ORDER BY.
>> Good point. The aggregate md5 function also fails to distinguish the
>> case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
>> from the case where they contain 'xyz' followed by 'zyxyz'.
>>
>> Now, as against that, you lose any sensitivity to the ordering of the
>> values.
>>
>> Personally I'd be a bit inclined to xor the per-row md5's rather than
>> sum them, but that's a small matter.
> Where I'd take this is actually in a completely different direction..
> I'd like the aggregate to be able to match the results of running the
> 'md5sum' unix utility on a file that's been COPY'd out. Yes, that means
> we'd need a way to get back "what would this row look like if it was
> sent through COPY with these parameters", but I've long wanted that
> also.
>
> No, no clue about how to put all that together. Yes, having this would
> be better than nothing, so I'm still for adding this even if we can't
> make it match COPY output. :)
>
>

I'd rather go the other way, processing the records without having to
process them otherwise at all. Turning things into text must slow things
down, surely.

cheers

andrew


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <markokr(at)gmail(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-14 14:19:09
Message-ID: 20130614141909.GK6417@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Andrew Dunstan (andrew(at)dunslane(dot)net) wrote:
> I'd rather go the other way, processing the records without having
> to process them otherwise at all. Turning things into text must slow
> things down, surely.

That's certainly an interesting idea also..

Thanks,

Stephen


From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marko Kreen <markokr(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-14 14:20:25
Message-ID: CAEZATCWZcij6dpv=q9KjxqpPtdh9pbwDebqd=u5YboAMjZyrDA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 14 June 2013 14:14, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Marko Kreen <markokr(at)gmail(dot)com> writes:
>> On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>>> Attached is a patch implementing a new aggregate function md5_agg() to
>>> compute the aggregate MD5 sum across a number of rows.
>
>> It's more efficient to calculate per-row md5, and then sum() them.
>> This avoids the need for ORDER BY.
>
> Good point. The aggregate md5 function also fails to distinguish the
> case where we have 'xyzzy' followed by 'xyz' in two adjacent rows
> from the case where they contain 'xyz' followed by 'zyxyz'.
>

Well, if you aggregated foo.*::text as in my original example, then
the textual representation of the row would protect you from that. But
yes, if you were just doing it with a single text column that might be
a risk.

> Now, as against that, you lose any sensitivity to the ordering of the
> values.
>
> Personally I'd be a bit inclined to xor the per-row md5's rather than
> sum them, but that's a small matter.
>

But this would be a much riskier thing to do with a single column,
because if you updated multiple rows in the same way (e.g., UPDATE t
SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
there were an even number of matches.

Regards,
Dean


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Marko Kreen <markokr(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-14 14:47:25
Message-ID: 9910.1371221245@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
> On 14 June 2013 14:14, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Personally I'd be a bit inclined to xor the per-row md5's rather than
>> sum them, but that's a small matter.

> But this would be a much riskier thing to do with a single column,
> because if you updated multiple rows in the same way (e.g., UPDATE t
> SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
> there were an even number of matches.

I was implicitly thinking that the sum would be a modulo sum so that the
final result is still the size of an md5 signature. If that's true,
then leaking bits via carry out is just as bad as xor's deficiencies.
Now, you could certainly make it a non-modulo sum and not lose any
information to carries, if you're willing to do the arithmetic in
NUMERIC and have a variable-width result. Sounds a bit slow though.

regards, tom lane


From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <markokr(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-14 14:49:31
Message-ID: CAEZATCXUfLS3U1nTFynvAtuhmRTBOV96Jo3-WX3gvcefQo-qWQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 14 June 2013 15:19, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> * Andrew Dunstan (andrew(at)dunslane(dot)net) wrote:
>> I'd rather go the other way, processing the records without having
>> to process them otherwise at all. Turning things into text must slow
>> things down, surely.
>
> That's certainly an interesting idea also..
>

md5_agg(record) ?

Yes, I like it.

Regards,
Dean


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <markokr(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-14 14:56:57
Message-ID: 20130614145657.GH19500@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-06-14 15:49:31 +0100, Dean Rasheed wrote:
> On 14 June 2013 15:19, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > * Andrew Dunstan (andrew(at)dunslane(dot)net) wrote:
> >> I'd rather go the other way, processing the records without having
> >> to process them otherwise at all. Turning things into text must slow
> >> things down, surely.
> >
> > That's certainly an interesting idea also..
> >
>
> md5_agg(record) ?
>
> Yes, I like it.

It's more complex than just memcmp()ing HeapTupleData though. At least
if the Datum contains varlena columns there's so many different
representations (short, long, compressed, external, external compressed)
of the same data that a md5 without normalizing that wouldn't be very
interesting.
So you would at least need a normalizing version of
toast_flatten_tuple() that also deals with short/long varlenas. But even
after that, you would still need to deal with Datums that can have
different representation (like short numerics, old style hstore, ...).

It might be more realistic to use the binary output functions, but I am
not sure whether all of those are sufficiently reproduceable.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Marko Kreen <markokr(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-14 15:09:46
Message-ID: 51BB323A.3000604@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06/14/2013 04:47 PM, Tom Lane wrote:
> Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
>> On 14 June 2013 14:14, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Personally I'd be a bit inclined to xor the per-row md5's rather than
>>> sum them, but that's a small matter.
>> But this would be a much riskier thing to do with a single column,
>> because if you updated multiple rows in the same way (e.g., UPDATE t
>> SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
>> there were an even number of matches.
> I was implicitly thinking that the sum would be a modulo sum so that the
> final result is still the size of an md5 signature. If that's true,
> then leaking bits via carry out is just as bad as xor's deficiencies.
> Now, you could certainly make it a non-modulo sum and not lose any
> information to carries, if you're willing to do the arithmetic in
> NUMERIC and have a variable-width result. Sounds a bit slow though.
What skytools/pgq/londiste uses for comparing tables on master
and slave is query like this

select sum(hashtext(t.*::text)) from <yourtable> t;

This is non-modulo sum and does not use md5 but relies on
whatever the hashtext() du jour is :)

So it is not comparable to anything external (like the md5sum
compatible idea above) but is usually good enough for fast
checks of compatible tables.

As tables are unordered by definition anyway, this should be
good enough for most SQL.

The speed comes from both fast(er) hashtext() function and
avoiding the sort.

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ


From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Hannu Krosing <hannu(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <markokr(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-14 16:45:49
Message-ID: CAEZATCUzc6V1t+rmQtyhssGAPJ6g=s6YwsROTVgq+Hrqi90TYQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 14 June 2013 16:09, Hannu Krosing <hannu(at)2ndquadrant(dot)com> wrote:
> What skytools/pgq/londiste uses for comparing tables on master
> and slave is query like this
>
> select sum(hashtext(t.*::text)) from <yourtable> t;
>
> This is non-modulo sum and does not use md5 but relies on
> whatever the hashtext() du jour is :)
>
> So it is not comparable to anything external (like the md5sum
> compatible idea above) but is usually good enough for fast
> checks of compatible tables.
>
> As tables are unordered by definition anyway, this should be
> good enough for most SQL.
>
> The speed comes from both fast(er) hashtext() function and
> avoiding the sort.
>

That sounds like a pretty good approach. We could do that if we had a
version of md5() that returned numeric. My impression is that numeric
computations are pretty fast compared to the sorting overhead.

On the other hand, if there is a usable index, select md5_agg(..) from
(sub-query) will do and index scan rather than a sort, making it much
faster than using an ORDER BY in the aggregate.

Regards,
Dean


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <markokr(at)gmail(dot)com>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-15 03:55:56
Message-ID: 51BBE5CC.1000104@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 06/14/2013 09:40 PM, Stephen Frost wrote:
> Where I'd take this is actually in a completely different direction..
> I'd like the aggregate to be able to match the results of running the
> 'md5sum' unix utility on a file that's been COPY'd out.
Until I started looking at the follow-up discussion I didn't realise it
wasn't supposed to.

If it isn't the md5sum of the ordered rows, it shouldn't be called
'md5'. It might still be useful, but please don't call it md5.

The proposals to make it produce the same result with different row
orderings sound useful in the context of SQL; if it produced a different
result with different orderings I'd want a way to force an explicit
ORDER BY clause on the aggregate and error out if one wasn't present.
Making it ignore order means that it's no longer md5, though.

row_sums(...) ?

- --
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBAgAGBQJRu+XMAAoJELBXNkqjr+S2mkkH/j8gi8d07dI6+G742f0U+v0J
u8DGhtDQuuWHalqlaUDOssmi4fRDg99OzLlR+Mid0yGL/UfFMoL47H+kNRoMkuzV
stUz3vf5rp8TbqEnikT3EwEKIuzaWrae0Fn3TKIYXVSRVvWjGzRSZsvJZsdfcS7T
7lZ9sf6QGekT9bAi6BIFsG7Z1bFLb6Q6AeTsX04++dLBCrjm96CSyisBswY5J2qg
zD0WrK6IOsSn9ljlIZRGSTtP+tdM5mOi/DHdeEd+glGx5YKQ9t9yq++oayoqb9mp
hPtsBo6UwcMaylPA2vQnhVi0q2bl9FMa+QGpaWBe6YfXLPF4PWhET2OixkRat1w=
=ncT/
-----END PGP SIGNATURE-----


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-15 04:03:31
Message-ID: 51BBE793.6040401@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 06/13/2013 05:35 PM, Dean Rasheed wrote:
> Hi,
>
> Attached is a patch implementing a new aggregate function md5_agg() to
> compute the aggregate MD5 sum across a number of rows. This is
> something I've wished for a number of times. I think the primary use
> case is to do a quick check that 2 tables, possibly on different
> servers, contain the same data, using a query like
>
> SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;
>
> or
>
> SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

That's a very useful thing to be able to do, but I'm hesitant to make
the fact that it uses md5 too prominent in the name if it doesn't
produce a result that an external user could reasonably expect from
md5'ing the same data.

I imagine having an md5_agg(text) and md5(bytea) that was the more
efficient, streaming equivalent of:

md5(string_agg(the_col,''))

would be rather handy.

It'd be less useful for other types (floats, integers, etc) unless we
had a way to get the binary representations of those in a well defined
form, like int8le(1) . Casting to 'text' would be sufficient for most of
the purposes I can imagine, though, and for those that it wouldn't
things would quickly get so complicated that you'd want to be using a
PL/something function anyway.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-15 09:22:21
Message-ID: CAEZATCUvYXfuDBszdknjKXF0z58hpV4-8SfRo6AhR4i6EJ+Vdw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 13 June 2013 10:35, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> Hi,
>
> Attached is a patch implementing a new aggregate function md5_agg() to
> compute the aggregate MD5 sum across a number of rows. This is
> something I've wished for a number of times. I think the primary use
> case is to do a quick check that 2 tables, possibly on different
> servers, contain the same data, using a query like
>
> SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;
>
> or
>
> SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

There seem to be 2 separate directions that this could go, which
really meet different requirements:

1). Produce an unordered sum for SQL to compare 2 tables regardless of
the order in which they are scanned. A possible approach to this might
be something like an aggregate

md5_total(text/bytea) returns text

that returns the sum of the md5 values of each input value, treating
each md5 value as an unsigned 128-bit integer, and then producing the
hexadecimal representation of the final sum. This should out-perform a
solution based on numeric addition, and in typical cases, the result
wouldn't be much longer than a regular md5 sum, and so would be easy
to eyeball for differences.

2). Produce an ordered MD5 sum compatible with COPY, whose result
would match that of running unix md5sum on the COPY output. Given all
the possible COPY options that would affect the result (format,
delimiters, headers, quoting, escaping, ...), I think that such a
thing would only reasonably be possible as an extension to the COPY
command itself.

I guess in its simplest form this would just be a new option "MD5" to
COPY that would cause it to pipe its output to the md5 aggregator and
then send the final sum to the COPY destination at the end (e.g.,
"COPY foo TO STDOUT MD5" would produce the ordered MD5 sum of the data
in foo).

I still think my original md5_agg() has its uses, since what it
produces is comparable with external md5 sums, and is directly
available to SQL, but (1) is probably the most useful for quickly
comparing 2 tables. I'm much less convinced about the value of (2),
but on the face of it, it doesn't seem like it would be hard to
implement.

Thoughts?

Regards,
Dean


From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-17 10:34:52
Message-ID: CAEZATCVaEtpBfbKN1ub72w+G_KQAqjQ_2X5yYFhV8L3giJrupg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 15 June 2013 10:22, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> There seem to be 2 separate directions that this could go, which
> really meet different requirements:
>
> 1). Produce an unordered sum for SQL to compare 2 tables regardless of
> the order in which they are scanned. A possible approach to this might
> be something like an aggregate
>
> md5_total(text/bytea) returns text
>
> that returns the sum of the md5 values of each input value, treating
> each md5 value as an unsigned 128-bit integer, and then producing the
> hexadecimal representation of the final sum. This should out-perform a
> solution based on numeric addition, and in typical cases, the result
> wouldn't be much longer than a regular md5 sum, and so would be easy
> to eyeball for differences.
>

I've been playing around with the idea of an aggregate that computes
the sum of the md5 hashes of each of its inputs, which I've called
md5_total() for now, although I'm not particularly wedded to that
name. Comparing it with md5_agg() on a 100M row table (see attached
test script) produces interesting results:

SELECT md5_agg(foo.*::text)
FROM (SELECT * FROM foo ORDER BY id) foo;

50bc42127fb9b028c9708248f835ed8f

Time: 92960.021 ms

SELECT md5_total(foo.*::text) FROM foo;

02faea7fafee4d253fc94cfae031afc43c03479c

Time: 96190.343 ms

Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
result is longer) but it seems like it would be very useful for
quickly comparing data in SQL, since its value is not dependent on the
row-order making it easier to use and better performing if there is no
usable index for ordering.

Note, however, that if there is an index that can be used for
ordering, the performance is not necessarily better than md5_agg(), as
this example shows. There is a small additional overhead per row for
initialising the MD5 sums, and adding the results to the total, but I
think the biggest factor is that md5_total() is processing more data.
The reason is that MD5 works on 64-byte blocks, so the total amount of
data going through the core MD5 algorithm is each row's size is
rounded up to a multiple of 64. In this simple case it ends up
processing around 1.5 times as much data:

SELECT sum(length(foo.*::text)) AS md5_agg,
sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;

md5_agg | md5_total
------------+-------------
8103815438 | 12799909248

although of course that overhead won't be as large on wider tables,
and even in this case the overall performance is still on a par with
md5_agg().

ISTM that both aggregates are potentially useful in different
situations. I would probably typically use md5_total() because of its
simplicity/order-independence and consistent performance, but
md5_agg() might also be useful when comparing with external data.

Regards,
Dean

Attachment Content-Type Size
md5_agg_v2.patch application/octet-stream 37.4 KB
md5-100m-row-test.sql application/octet-stream 553 bytes

From: Marko Kreen <markokr(at)gmail(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-17 11:53:30
Message-ID: 20130617115330.GA21009@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
> On 15 June 2013 10:22, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> > There seem to be 2 separate directions that this could go, which
> > really meet different requirements:
> >
> > 1). Produce an unordered sum for SQL to compare 2 tables regardless of
> > the order in which they are scanned. A possible approach to this might
> > be something like an aggregate
> >
> > md5_total(text/bytea) returns text
> >
> > that returns the sum of the md5 values of each input value, treating
> > each md5 value as an unsigned 128-bit integer, and then producing the
> > hexadecimal representation of the final sum. This should out-perform a
> > solution based on numeric addition, and in typical cases, the result
> > wouldn't be much longer than a regular md5 sum, and so would be easy
> > to eyeball for differences.
> >
>
> I've been playing around with the idea of an aggregate that computes
> the sum of the md5 hashes of each of its inputs, which I've called
> md5_total() for now, although I'm not particularly wedded to that
> name. Comparing it with md5_agg() on a 100M row table (see attached
> test script) produces interesting results:
>
> SELECT md5_agg(foo.*::text)
> FROM (SELECT * FROM foo ORDER BY id) foo;
>
> 50bc42127fb9b028c9708248f835ed8f
>
> Time: 92960.021 ms
>
> SELECT md5_total(foo.*::text) FROM foo;
>
> 02faea7fafee4d253fc94cfae031afc43c03479c
>
> Time: 96190.343 ms
>
> Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
> result is longer) but it seems like it would be very useful for
> quickly comparing data in SQL, since its value is not dependent on the
> row-order making it easier to use and better performing if there is no
> usable index for ordering.
>
> Note, however, that if there is an index that can be used for
> ordering, the performance is not necessarily better than md5_agg(), as
> this example shows. There is a small additional overhead per row for
> initialising the MD5 sums, and adding the results to the total, but I
> think the biggest factor is that md5_total() is processing more data.
> The reason is that MD5 works on 64-byte blocks, so the total amount of
> data going through the core MD5 algorithm is each row's size is
> rounded up to a multiple of 64. In this simple case it ends up
> processing around 1.5 times as much data:
>
> SELECT sum(length(foo.*::text)) AS md5_agg,
> sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
>
> md5_agg | md5_total
> ------------+-------------
> 8103815438 | 12799909248
>
> although of course that overhead won't be as large on wider tables,
> and even in this case the overall performance is still on a par with
> md5_agg().
>
> ISTM that both aggregates are potentially useful in different
> situations. I would probably typically use md5_total() because of its
> simplicity/order-independence and consistent performance, but
> md5_agg() might also be useful when comparing with external data.

Few notes:

- Index-scan over whole table is *very* bad for larger tables (few times
bigger than available RAM). If you want to promote such use you should
also warn against use on loaded server.

- It's pointless to worry about overflow on 128-bit ints. Just let it
happen. Adding complexity for that does not bring any advantage.

- Using some faster 128-bit hash may be useful - check out CityHash
or SpookyHash. You can get C implementation from pghashlib.

--
marko


From: David Fetter <david(at)fetter(dot)org>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Review [was Re: MD5 aggregate]
Date: 2013-06-21 17:48:35
Message-ID: 20130621174835.GI5395@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
> On 15 June 2013 10:22, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> > There seem to be 2 separate directions that this could go, which
> > really meet different requirements:
> >
> > 1). Produce an unordered sum for SQL to compare 2 tables regardless of
> > the order in which they are scanned. A possible approach to this might
> > be something like an aggregate
> >
> > md5_total(text/bytea) returns text
> >
> > that returns the sum of the md5 values of each input value, treating
> > each md5 value as an unsigned 128-bit integer, and then producing the
> > hexadecimal representation of the final sum. This should out-perform a
> > solution based on numeric addition, and in typical cases, the result
> > wouldn't be much longer than a regular md5 sum, and so would be easy
> > to eyeball for differences.
> >
>
> I've been playing around with the idea of an aggregate that computes
> the sum of the md5 hashes of each of its inputs, which I've called
> md5_total() for now, although I'm not particularly wedded to that
> name. Comparing it with md5_agg() on a 100M row table (see attached
> test script) produces interesting results:
>
> SELECT md5_agg(foo.*::text)
> FROM (SELECT * FROM foo ORDER BY id) foo;
>
> 50bc42127fb9b028c9708248f835ed8f
>
> Time: 92960.021 ms
>
> SELECT md5_total(foo.*::text) FROM foo;
>
> 02faea7fafee4d253fc94cfae031afc43c03479c
>
> Time: 96190.343 ms
>
> Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
> result is longer) but it seems like it would be very useful for
> quickly comparing data in SQL, since its value is not dependent on the
> row-order making it easier to use and better performing if there is no
> usable index for ordering.
>
> Note, however, that if there is an index that can be used for
> ordering, the performance is not necessarily better than md5_agg(), as
> this example shows. There is a small additional overhead per row for
> initialising the MD5 sums, and adding the results to the total, but I
> think the biggest factor is that md5_total() is processing more data.
> The reason is that MD5 works on 64-byte blocks, so the total amount of
> data going through the core MD5 algorithm is each row's size is
> rounded up to a multiple of 64. In this simple case it ends up
> processing around 1.5 times as much data:
>
> SELECT sum(length(foo.*::text)) AS md5_agg,
> sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
>
> md5_agg | md5_total
> ------------+-------------
> 8103815438 | 12799909248
>
> although of course that overhead won't be as large on wider tables,
> and even in this case the overall performance is still on a par with
> md5_agg().
>
> ISTM that both aggregates are potentially useful in different
> situations. I would probably typically use md5_total() because of its
> simplicity/order-independence and consistent performance, but
> md5_agg() might also be useful when comparing with external data.
>
> Regards,
> Dean

Submission review:

Is the patch in a patch format which has context? (eg: context diff format)

Yes.

Does it apply cleanly to the current git master?

Yes.

Does it include reasonable tests, necessary doc patches, etc?

Yes.

Usability review:

Does the patch actually implement that?

Yes.

Do we want that?

Enough do.

Do we already have it?

No.

Does it follow SQL spec, or the community-agreed behavior?

No, and yes, respectively.

Does it include pg_dump support (if applicable)?

N/A

Are there dangers?

Patch changes the MD5 implementation, which could
theoretically result in backward incompatibility if the
changes are not 100% backward-compatible.

Have all the bases been covered?

Yes.

Feature test:

Does the feature work as advertised?

Yes.

Are there corner cases the author has failed to consider?

Not that I've found so far.

Are there any assertion failures or crashes?

No.

Performance review (skills needed: ability to time performance)

Does the patch slow down simple tests?

Yes. For an MD5 checksum of some random data, here are
results from master:

shackle(at)postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
postgres-# select md5(a::text) FROM t2;
shackle(at)postgres:5493=# \timing
Timing is on.
shackle(at)postgres:5493=# \g
Time: 955.393 ms
shackle(at)postgres:5493=# \g
Time: 950.909 ms
shackle(at)postgres:5493=# \g
Time: 947.955 ms
shackle(at)postgres:5493=# \g
Time: 946.193 ms
shackle(at)postgres:5493=# \g
Time: 950.591 ms
shackle(at)postgres:5493=# \g
Time: 952.346 ms
shackle(at)postgres:5493=# \g
Time: 948.623 ms
shackle(at)postgres:5493=# \g
Time: 939.819 ms

and here from master + the patch:

WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
select md5(a::text) FROM t2;
Time: 1129.178 ms
shackle(at)postgres:5494=# \g
Time: 1134.013 ms
shackle(at)postgres:5494=# \g
Time: 1124.387 ms
shackle(at)postgres:5494=# \g
Time: 1119.733 ms
shackle(at)postgres:5494=# \g
Time: 1104.906 ms
shackle(at)postgres:5494=# \g
Time: 1137.055 ms
shackle(at)postgres:5494=# \g
Time: 1128.977 ms
shackle(at)postgres:5494=# \g
Time: 1143.619 ms

Avg, StddevPopulation without patch: 948.97 ms, 4.353 ms
Avg, StddevPopulation with patch: 1127.73 ms, 11.06 ms

If it claims to improve performance, does it?

Possibly for the aggregate.

Does it slow down other things?

Not that I've found.

Coding review:

Does it follow the project coding guidelines?

Yes.

Are there portability issues?

Not that I can see.

Will it work on Windows/BSD etc?

Not yet tested.

Are the comments sufficient and accurate?

Yes.

Does it do what it says, correctly?

As far as I can tell.

Does it produce compiler warnings?

No.

Can you make it crash?

My efforts so far have failed.

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: David Fetter <david(at)fetter(dot)org>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Review [was Re: MD5 aggregate]
Date: 2013-06-21 20:04:28
Message-ID: 20130621200428.GJ5395@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote:
> On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
> > On 15 June 2013 10:22, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> > > There seem to be 2 separate directions that this could go, which
> > > really meet different requirements:
> > >
> > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of
> > > the order in which they are scanned. A possible approach to this might
> > > be something like an aggregate
> > >
> > > md5_total(text/bytea) returns text
> > >
> > > that returns the sum of the md5 values of each input value, treating
> > > each md5 value as an unsigned 128-bit integer, and then producing the
> > > hexadecimal representation of the final sum. This should out-perform a
> > > solution based on numeric addition, and in typical cases, the result
> > > wouldn't be much longer than a regular md5 sum, and so would be easy
> > > to eyeball for differences.
> > >
> >
> > I've been playing around with the idea of an aggregate that computes
> > the sum of the md5 hashes of each of its inputs, which I've called
> > md5_total() for now, although I'm not particularly wedded to that
> > name. Comparing it with md5_agg() on a 100M row table (see attached
> > test script) produces interesting results:
> >
> > SELECT md5_agg(foo.*::text)
> > FROM (SELECT * FROM foo ORDER BY id) foo;
> >
> > 50bc42127fb9b028c9708248f835ed8f
> >
> > Time: 92960.021 ms
> >
> > SELECT md5_total(foo.*::text) FROM foo;
> >
> > 02faea7fafee4d253fc94cfae031afc43c03479c
> >
> > Time: 96190.343 ms
> >
> > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
> > result is longer) but it seems like it would be very useful for
> > quickly comparing data in SQL, since its value is not dependent on the
> > row-order making it easier to use and better performing if there is no
> > usable index for ordering.
> >
> > Note, however, that if there is an index that can be used for
> > ordering, the performance is not necessarily better than md5_agg(), as
> > this example shows. There is a small additional overhead per row for
> > initialising the MD5 sums, and adding the results to the total, but I
> > think the biggest factor is that md5_total() is processing more data.
> > The reason is that MD5 works on 64-byte blocks, so the total amount of
> > data going through the core MD5 algorithm is each row's size is
> > rounded up to a multiple of 64. In this simple case it ends up
> > processing around 1.5 times as much data:
> >
> > SELECT sum(length(foo.*::text)) AS md5_agg,
> > sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
> >
> > md5_agg | md5_total
> > ------------+-------------
> > 8103815438 | 12799909248
> >
> > although of course that overhead won't be as large on wider tables,
> > and even in this case the overall performance is still on a par with
> > md5_agg().
> >
> > ISTM that both aggregates are potentially useful in different
> > situations. I would probably typically use md5_total() because of its
> > simplicity/order-independence and consistent performance, but
> > md5_agg() might also be useful when comparing with external data.
> >
> > Regards,
> > Dean

> Performance review (skills needed: ability to time performance)
>
> Does the patch slow down simple tests?
>
> Yes. For an MD5 checksum of some random data, here are
> results from master:
>
> shackle(at)postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
> postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
> postgres-# select md5(a::text) FROM t2;
> shackle(at)postgres:5493=# \timing
> Timing is on.
> shackle(at)postgres:5493=# \g
> Time: 955.393 ms
> shackle(at)postgres:5493=# \g
> Time: 950.909 ms
> shackle(at)postgres:5493=# \g
> Time: 947.955 ms
> shackle(at)postgres:5493=# \g
> Time: 946.193 ms
> shackle(at)postgres:5493=# \g
> Time: 950.591 ms
> shackle(at)postgres:5493=# \g
> Time: 952.346 ms
> shackle(at)postgres:5493=# \g
> Time: 948.623 ms
> shackle(at)postgres:5493=# \g
> Time: 939.819 ms
>
> and here from master + the patch:
>
> WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
> t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
> select md5(a::text) FROM t2;
> Time: 1129.178 ms
> shackle(at)postgres:5494=# \g
> Time: 1134.013 ms
> shackle(at)postgres:5494=# \g
> Time: 1124.387 ms
> shackle(at)postgres:5494=# \g
> Time: 1119.733 ms
> shackle(at)postgres:5494=# \g
> Time: 1104.906 ms
> shackle(at)postgres:5494=# \g
> Time: 1137.055 ms
> shackle(at)postgres:5494=# \g
> Time: 1128.977 ms
> shackle(at)postgres:5494=# \g
> Time: 1143.619 ms
>
> Avg, StddevPopulation without patch: 948.97 ms, 4.353 ms
> Avg, StddevPopulation with patch: 1127.73 ms, 11.06 ms

The above was with -O0. Please find below with vanilla ./configure (-O2):

Master:
shackle(at)postgres:5493=# WITH t1 AS (SELECT
string_agg(chr(floor(255*random()+1)::int),'') AS a FROM
generate_series(1,10000)),
t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
select md5(a::text) FROM t2;
Time: 469.101 ms
shackle(at)postgres:5493=# \g
Time: 464.122 ms
shackle(at)postgres:5493=# \g
Time: 461.411 ms
shackle(at)postgres:5493=# \g
Time: 458.222 ms
shackle(at)postgres:5493=# \g
Time: 463.616 ms
shackle(at)postgres:5493=# \g
Time: 463.983 ms
shackle(at)postgres:5493=# \g
Time: 453.770 ms
shackle(at)postgres:5493=# \g
Time: 456.729 ms

MD5:
WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a
FROM generate_series(1,10000)),
t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
select md5(a::text) FROM t2;
Time: 436.862 ms
shackle(at)postgres:5494=# \t
Showing only tuples.
shackle(at)postgres:5494=# \t
Tuples only is off.
shackle(at)postgres:5494=# \g
Time: 438.686 ms
shackle(at)postgres:5494=# \g
Time: 443.789 ms
shackle(at)postgres:5494=# \g
Time: 452.522 ms
shackle(at)postgres:5494=# \g
Time: 447.824 ms
shackle(at)postgres:5494=# \g
Time: 448.547 ms
shackle(at)postgres:5494=# \g
Time: 446.901 ms
shackle(at)postgres:5494=# \g
Time: 448.537 ms

Average, stddev population for master: 461.36925, 4.5883631545
Average, stddev population for md5: 445.4585, 4.9892286729

so slightly faster on MD5, as expected.

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Review [was Re: MD5 aggregate]
Date: 2013-06-23 14:10:15
Message-ID: CAEZATCVC_3kRiRwpNJrTSL9Gt90dNjL=pcXdXboEDBaOu-k4QQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 21 June 2013 21:04, David Fetter <david(at)fetter(dot)org> wrote:
> On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote:
>> On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
>> > On 15 June 2013 10:22, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>> > > There seem to be 2 separate directions that this could go, which
>> > > really meet different requirements:
>> > >
>> > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of
>> > > the order in which they are scanned. A possible approach to this might
>> > > be something like an aggregate
>> > >
>> > > md5_total(text/bytea) returns text
>> > >
>> > > that returns the sum of the md5 values of each input value, treating
>> > > each md5 value as an unsigned 128-bit integer, and then producing the
>> > > hexadecimal representation of the final sum. This should out-perform a
>> > > solution based on numeric addition, and in typical cases, the result
>> > > wouldn't be much longer than a regular md5 sum, and so would be easy
>> > > to eyeball for differences.
>> > >
>> >
>> > I've been playing around with the idea of an aggregate that computes
>> > the sum of the md5 hashes of each of its inputs, which I've called
>> > md5_total() for now, although I'm not particularly wedded to that
>> > name. Comparing it with md5_agg() on a 100M row table (see attached
>> > test script) produces interesting results:
>> >
>> > SELECT md5_agg(foo.*::text)
>> > FROM (SELECT * FROM foo ORDER BY id) foo;
>> >
>> > 50bc42127fb9b028c9708248f835ed8f
>> >
>> > Time: 92960.021 ms
>> >
>> > SELECT md5_total(foo.*::text) FROM foo;
>> >
>> > 02faea7fafee4d253fc94cfae031afc43c03479c
>> >
>> > Time: 96190.343 ms
>> >
>> > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
>> > result is longer) but it seems like it would be very useful for
>> > quickly comparing data in SQL, since its value is not dependent on the
>> > row-order making it easier to use and better performing if there is no
>> > usable index for ordering.
>> >
>> > Note, however, that if there is an index that can be used for
>> > ordering, the performance is not necessarily better than md5_agg(), as
>> > this example shows. There is a small additional overhead per row for
>> > initialising the MD5 sums, and adding the results to the total, but I
>> > think the biggest factor is that md5_total() is processing more data.
>> > The reason is that MD5 works on 64-byte blocks, so the total amount of
>> > data going through the core MD5 algorithm is each row's size is
>> > rounded up to a multiple of 64. In this simple case it ends up
>> > processing around 1.5 times as much data:
>> >
>> > SELECT sum(length(foo.*::text)) AS md5_agg,
>> > sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
>> >
>> > md5_agg | md5_total
>> > ------------+-------------
>> > 8103815438 | 12799909248
>> >
>> > although of course that overhead won't be as large on wider tables,
>> > and even in this case the overall performance is still on a par with
>> > md5_agg().
>> >
>> > ISTM that both aggregates are potentially useful in different
>> > situations. I would probably typically use md5_total() because of its
>> > simplicity/order-independence and consistent performance, but
>> > md5_agg() might also be useful when comparing with external data.
>> >
>> > Regards,
>> > Dean
>
>> Performance review (skills needed: ability to time performance)
>>
>> Does the patch slow down simple tests?
>>
>> Yes. For an MD5 checksum of some random data, here are
>> results from master:
>>
>> shackle(at)postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
>> postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
>> postgres-# select md5(a::text) FROM t2;
>> shackle(at)postgres:5493=# \timing
>> Timing is on.
>> shackle(at)postgres:5493=# \g
>> Time: 955.393 ms
>> shackle(at)postgres:5493=# \g
>> Time: 950.909 ms
>> shackle(at)postgres:5493=# \g
>> Time: 947.955 ms
>> shackle(at)postgres:5493=# \g
>> Time: 946.193 ms
>> shackle(at)postgres:5493=# \g
>> Time: 950.591 ms
>> shackle(at)postgres:5493=# \g
>> Time: 952.346 ms
>> shackle(at)postgres:5493=# \g
>> Time: 948.623 ms
>> shackle(at)postgres:5493=# \g
>> Time: 939.819 ms
>>
>> and here from master + the patch:
>>
>> WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
>> t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
>> select md5(a::text) FROM t2;
>> Time: 1129.178 ms
>> shackle(at)postgres:5494=# \g
>> Time: 1134.013 ms
>> shackle(at)postgres:5494=# \g
>> Time: 1124.387 ms
>> shackle(at)postgres:5494=# \g
>> Time: 1119.733 ms
>> shackle(at)postgres:5494=# \g
>> Time: 1104.906 ms
>> shackle(at)postgres:5494=# \g
>> Time: 1137.055 ms
>> shackle(at)postgres:5494=# \g
>> Time: 1128.977 ms
>> shackle(at)postgres:5494=# \g
>> Time: 1143.619 ms
>>
>> Avg, StddevPopulation without patch: 948.97 ms, 4.353 ms
>> Avg, StddevPopulation with patch: 1127.73 ms, 11.06 ms
>
> The above was with -O0. Please find below with vanilla ./configure (-O2):
>
> Master:
> shackle(at)postgres:5493=# WITH t1 AS (SELECT
> string_agg(chr(floor(255*random()+1)::int),'') AS a FROM
> generate_series(1,10000)),
> t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
> select md5(a::text) FROM t2;
> Time: 469.101 ms
> shackle(at)postgres:5493=# \g
> Time: 464.122 ms
> shackle(at)postgres:5493=# \g
> Time: 461.411 ms
> shackle(at)postgres:5493=# \g
> Time: 458.222 ms
> shackle(at)postgres:5493=# \g
> Time: 463.616 ms
> shackle(at)postgres:5493=# \g
> Time: 463.983 ms
> shackle(at)postgres:5493=# \g
> Time: 453.770 ms
> shackle(at)postgres:5493=# \g
> Time: 456.729 ms
>
> MD5:
> WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a
> FROM generate_series(1,10000)),
> t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
> select md5(a::text) FROM t2;
> Time: 436.862 ms
> shackle(at)postgres:5494=# \t
> Showing only tuples.
> shackle(at)postgres:5494=# \t
> Tuples only is off.
> shackle(at)postgres:5494=# \g
> Time: 438.686 ms
> shackle(at)postgres:5494=# \g
> Time: 443.789 ms
> shackle(at)postgres:5494=# \g
> Time: 452.522 ms
> shackle(at)postgres:5494=# \g
> Time: 447.824 ms
> shackle(at)postgres:5494=# \g
> Time: 448.547 ms
> shackle(at)postgres:5494=# \g
> Time: 446.901 ms
> shackle(at)postgres:5494=# \g
> Time: 448.537 ms
>
> Average, stddev population for master: 461.36925, 4.5883631545
> Average, stddev population for md5: 445.4585, 4.9892286729
>
> so slightly faster on MD5, as expected.
>
> Cheers,
> David.

Thanks for the review and testing!

Regards,
Dean


From: Noah Misch <noah(at)leadboat(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, david(at)fetter(dot)org
Subject: Re: MD5 aggregate
Date: 2013-06-26 18:32:01
Message-ID: 20130626183201.GA863764@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
> I've been playing around with the idea of an aggregate that computes
> the sum of the md5 hashes of each of its inputs, which I've called
> md5_total() for now, although I'm not particularly wedded to that
> name. Comparing it with md5_agg() on a 100M row table (see attached
> test script) produces interesting results:
>
> SELECT md5_agg(foo.*::text)
> FROM (SELECT * FROM foo ORDER BY id) foo;
>
> 50bc42127fb9b028c9708248f835ed8f
>
> Time: 92960.021 ms
>
> SELECT md5_total(foo.*::text) FROM foo;
>
> 02faea7fafee4d253fc94cfae031afc43c03479c
>
> Time: 96190.343 ms
>
> Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
> result is longer) but it seems like it would be very useful for
> quickly comparing data in SQL, since its value is not dependent on the
> row-order making it easier to use and better performing if there is no
> usable index for ordering.

I took a look at this patch. First, as Peter suggested upthread, it should
have been two patches: one to optimize calculateDigestFromBuffer() and
callees, another atop the first adding new SQL functions and adjusting the
infrastructure to meet their needs.

> + <primary>md5_total</primary>
> + </indexterm>
> + <function>md5_total(<replaceable class="parameter">expression</replaceable>)</function>
> + </entry>
> + <entry><type>text</type> or <type>bytea</type></entry>
> + <entry><type>text</type></entry>
> + <entry>
> + sum of the MD5 hash values of the input values, independent of their
> + order

This is not specific enough. We would need to provide either an algorithm
specification or a literature reference. However, that just leads to the
problem that we should not put a literature-unrecognized cryptographic
algorithm into core PostgreSQL. I realize that the use case inspiring this
patch wasn't concerned with cryptographic properties, but the association with
md5 immediately makes it a cryptography offering.

md5_agg() is well-defined and not cryptographically novel, and your use case
is credible. However, not every useful-sometimes function belongs in core; we
mostly stick to functions with ubiquitous value and functions that would be
onerous to implement externally. (We do carry legacy stuff that wouldn't make
the cut today.) In my estimation, md5_agg() does not make that cut. The
variety of intuitive md5_agg() definitions attested upthread doesn't bode well
for its broad applicability. It could just as well live in an extension
published on PGXN. Mine is just one opinion, though; I won't be horrified if
the community wants an md5_agg() in core.

> *** a/src/backend/libpq/md5.c
> --- b/src/backend/libpq/md5.c
> ***************
> *** 1,14 ****
> /*
> * md5.c
> *
> ! * Implements the MD5 Message-Digest Algorithm as specified in
> ! * RFC 1321. This implementation is a simple one, in that it
> ! * needs every input byte to be buffered before doing any
> ! * calculations. I do not expect this file to be used for
> ! * general purpose MD5'ing of large amounts of data, only for
> ! * generating hashed passwords from limited input.

In other words, performance wasn't a design criterion. I wonder if we would
be wasting our time with a narrow performance change like removing the data
copy. What's your opinion of copying pgcrypto's md5 implementation, which
already avoids the data copy?

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, david(at)fetter(dot)org
Subject: Re: MD5 aggregate
Date: 2013-06-26 20:04:34
Message-ID: CAEZATCXUUrx0QnLtbW5=ZmsAMQkCopWNwGqeR45qoOSOaLV9dw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 26 June 2013 19:32, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
>> I've been playing around with the idea of an aggregate that computes
>> the sum of the md5 hashes of each of its inputs, which I've called
>> md5_total() for now, although I'm not particularly wedded to that
>> name. Comparing it with md5_agg() on a 100M row table (see attached
>> test script) produces interesting results:
>>
>> SELECT md5_agg(foo.*::text)
>> FROM (SELECT * FROM foo ORDER BY id) foo;
>>
>> 50bc42127fb9b028c9708248f835ed8f
>>
>> Time: 92960.021 ms
>>
>> SELECT md5_total(foo.*::text) FROM foo;
>>
>> 02faea7fafee4d253fc94cfae031afc43c03479c
>>
>> Time: 96190.343 ms
>>
>> Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
>> result is longer) but it seems like it would be very useful for
>> quickly comparing data in SQL, since its value is not dependent on the
>> row-order making it easier to use and better performing if there is no
>> usable index for ordering.
>
> I took a look at this patch. First, as Peter suggested upthread, it should
> have been two patches: one to optimize calculateDigestFromBuffer() and
> callees, another atop the first adding new SQL functions and adjusting the
> infrastructure to meet their needs.
>

I didn't really set out with the aim of optimising the existing md5()
functions at all, but merely to restructure the code in a way that
would expose the necessary API for md5_agg(). As the timings up-thread
show, the performance gain is really very modest, although the memory
saving might be more of a factor in some setups. In fact, for the
sorts of queries shown, the vast majority of the time is spent
elsewhere, as can be seen by replacing md5() with a more trivial
function such as length().

>> + <primary>md5_total</primary>
>> + </indexterm>
>> + <function>md5_total(<replaceable class="parameter">expression</replaceable>)</function>
>> + </entry>
>> + <entry><type>text</type> or <type>bytea</type></entry>
>> + <entry><type>text</type></entry>
>> + <entry>
>> + sum of the MD5 hash values of the input values, independent of their
>> + order
>
> This is not specific enough. We would need to provide either an algorithm
> specification or a literature reference. However, that just leads to the
> problem that we should not put a literature-unrecognized cryptographic
> algorithm into core PostgreSQL. I realize that the use case inspiring this
> patch wasn't concerned with cryptographic properties, but the association with
> md5 immediately makes it a cryptography offering.
>

Yes, I can totally understand that point-of-view. I wasn't really
intending to write md5_total() at the outset - it was more of an
intellectual exercise following some of the previous points raised. So
on reflection, I think I can also agree that that probably doesn't
belong in core.

> md5_agg() is well-defined and not cryptographically novel, and your use case
> is credible. However, not every useful-sometimes function belongs in core; we
> mostly stick to functions with ubiquitous value and functions that would be
> onerous to implement externally. (We do carry legacy stuff that wouldn't make
> the cut today.) In my estimation, md5_agg() does not make that cut. The
> variety of intuitive md5_agg() definitions attested upthread doesn't bode well
> for its broad applicability. It could just as well live in an extension
> published on PGXN. Mine is just one opinion, though; I won't be horrified if
> the community wants an md5_agg() in core.
>

I disagree with this though. I see md5_agg() as a natural extension of
the already-in-core md5() functions and underlying code, satisfying a
well-defined use-case, and providing a checksum comparable with
externally generated md5 sums.

A quick google search reveals several people asking for something like
this, and people recommending md5(string_agg(...)) or
md5(string_agg(md5(...))) based solutions, which are doomed to failure
on larger tables. So I think that there is a case for having md5_agg()
in core as an alternative to such hacks, while having more
sophisticated crypto functions available as extensions.

>> *** a/src/backend/libpq/md5.c
>> --- b/src/backend/libpq/md5.c
>> ***************
>> *** 1,14 ****
>> /*
>> * md5.c
>> *
>> ! * Implements the MD5 Message-Digest Algorithm as specified in
>> ! * RFC 1321. This implementation is a simple one, in that it
>> ! * needs every input byte to be buffered before doing any
>> ! * calculations. I do not expect this file to be used for
>> ! * general purpose MD5'ing of large amounts of data, only for
>> ! * generating hashed passwords from limited input.
>
> In other words, performance wasn't a design criterion. I wonder if we would
> be wasting our time with a narrow performance change like removing the data
> copy. What's your opinion of copying pgcrypto's md5 implementation, which
> already avoids the data copy?
>

I haven't looked at pgcrypto because, as I said, performance wasn't my
primary criterion either, but removing the unnessary data copy just
seemed like good sense.

I'll take a look at it, and then as you and Peter suggest, look to
split the patch into two. I think I will withdraw md5_total() because
I was never entirely happy with that anyway.

Thanks for the review.

Regards,
Dean


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, david(at)fetter(dot)org
Subject: Re: MD5 aggregate
Date: 2013-06-26 20:46:50
Message-ID: 51CB533A.6070603@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/26/13 4:04 PM, Dean Rasheed wrote:
> A quick google search reveals several people asking for something like
> this, and people recommending md5(string_agg(...)) or
> md5(string_agg(md5(...))) based solutions, which are doomed to failure
> on larger tables.

The thread discussed several other options of checksumming tables that
did not have the air of a crytographic offering, as Noah put it.

> So I think that there is a case for having md5_agg()
> in core as an alternative to such hacks, while having more
> sophisticated crypto functions available as extensions.

Well, in general, I'd rather see the sophisticated stuff in core and the
hacks in extensions.


From: Noah Misch <noah(at)leadboat(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, david(at)fetter(dot)org
Subject: Re: MD5 aggregate
Date: 2013-06-26 21:48:14
Message-ID: 20130626214814.GA865548@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jun 26, 2013 at 09:04:34PM +0100, Dean Rasheed wrote:
> On 26 June 2013 19:32, Noah Misch <noah(at)leadboat(dot)com> wrote:
> > On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:

> > md5_agg() is well-defined and not cryptographically novel, and your use case
> > is credible. However, not every useful-sometimes function belongs in core; we
> > mostly stick to functions with ubiquitous value and functions that would be
> > onerous to implement externally. (We do carry legacy stuff that wouldn't make
> > the cut today.) In my estimation, md5_agg() does not make that cut. The
> > variety of intuitive md5_agg() definitions attested upthread doesn't bode well
> > for its broad applicability. It could just as well live in an extension
> > published on PGXN. Mine is just one opinion, though; I won't be horrified if
> > the community wants an md5_agg() in core.
>
> I disagree with this though. I see md5_agg() as a natural extension of
> the already-in-core md5() functions and underlying code, satisfying a
> well-defined use-case, and providing a checksum comparable with
> externally generated md5 sums.

All true, but I don't see those facts justifying core inclusion.

> A quick google search reveals several people asking for something like
> this, and people recommending md5(string_agg(...)) or
> md5(string_agg(md5(...))) based solutions, which are doomed to failure
> on larger tables. So I think that there is a case for having md5_agg()
> in core as an alternative to such hacks, while having more
> sophisticated crypto functions available as extensions.

My nine Google hits for "md5(string_agg" included one Stack Overflow answer,
several mirrors of that answer, and a few posts on this thread.

> I haven't looked at pgcrypto because, as I said, performance wasn't my
> primary criterion either, but removing the unnessary data copy just
> seemed like good sense.
>
> I'll take a look at it, and then as you and Peter suggest, look to
> split the patch into two. I think I will withdraw md5_total() because
> I was never entirely happy with that anyway.

Understood; feel free to back off from any performance improvements in which
you didn't wish to involve yourself.

If we do end up moving forward with md5_agg(), I note that the pgcrypto
version already has an initialize/accumulate/finalize API division. A patch
importing that code largely-unchanged would be easier to verify than a patch
restructuring the src/backend/libpq/md5.c implementation. The two patches
would then be a "use pgcrypto md5.c in core" patch and an md5_agg() patch.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, david(at)fetter(dot)org
Subject: Re: MD5 aggregate
Date: 2013-06-27 08:19:41
Message-ID: CAEZATCUFUL9FoYYraE0MSqJdcmya6=SObbh37bfyp23p=vU=Cg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 26 June 2013 22:48, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Wed, Jun 26, 2013 at 09:04:34PM +0100, Dean Rasheed wrote:
>> On 26 June 2013 19:32, Noah Misch <noah(at)leadboat(dot)com> wrote:
>> > On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
>
>> > md5_agg() is well-defined and not cryptographically novel, and your use case
>> > is credible. However, not every useful-sometimes function belongs in core; we
>> > mostly stick to functions with ubiquitous value and functions that would be
>> > onerous to implement externally. (We do carry legacy stuff that wouldn't make
>> > the cut today.) In my estimation, md5_agg() does not make that cut. The
>> > variety of intuitive md5_agg() definitions attested upthread doesn't bode well
>> > for its broad applicability. It could just as well live in an extension
>> > published on PGXN. Mine is just one opinion, though; I won't be horrified if
>> > the community wants an md5_agg() in core.
>>
>> I disagree with this though. I see md5_agg() as a natural extension of
>> the already-in-core md5() functions and underlying code, satisfying a
>> well-defined use-case, and providing a checksum comparable with
>> externally generated md5 sums.
>
> All true, but I don't see those facts justifying core inclusion.
>
>> A quick google search reveals several people asking for something like
>> this, and people recommending md5(string_agg(...)) or
>> md5(string_agg(md5(...))) based solutions, which are doomed to failure
>> on larger tables. So I think that there is a case for having md5_agg()
>> in core as an alternative to such hacks, while having more
>> sophisticated crypto functions available as extensions.
>
> My nine Google hits for "md5(string_agg" included one Stack Overflow answer,
> several mirrors of that answer, and a few posts on this thread.
>

I found more than that using Google. It's not uncommon for people to
use Google and then pick the first "accepted" answer on Stack
Overflow, in which case they might well pick one of the answers here
[1] or here [2]. Or they might find this [3]. Or if they came to our
lists they might find this [4], or deduce from this [5] that it isn't
possible.

[1] http://stackoverflow.com/questions/4020033/how-can-i-get-a-hash-of-an-entire-table-in-postgresql

[2] http://stackoverflow.com/questions/13554333/finding-out-the-hash-value-of-a-group-of-rows

[3] https://ralphholz.de/?q=node/298

[4] http://www.postgresql.org/message-id/4E5F469C.5070104@compulab.co.il

[5] http://www.postgresql.org/message-id/e94d85500801301153u6b976e31m89e311c7134a0160@mail.gmail.com

I'd say there are clearly people who want it, and the nature of some
of those answers suggests to me that we ought to have a better answer
in core.

>> I haven't looked at pgcrypto because, as I said, performance wasn't my
>> primary criterion either, but removing the unnessary data copy just
>> seemed like good sense.
>>
>> I'll take a look at it, and then as you and Peter suggest, look to
>> split the patch into two. I think I will withdraw md5_total() because
>> I was never entirely happy with that anyway.
>
> Understood; feel free to back off from any performance improvements in which
> you didn't wish to involve yourself.
>
> If we do end up moving forward with md5_agg(), I note that the pgcrypto
> version already has an initialize/accumulate/finalize API division. A patch
> importing that code largely-unchanged would be easier to verify than a patch
> restructuring the src/backend/libpq/md5.c implementation. The two patches
> would then be a "use pgcrypto md5.c in core" patch and an md5_agg() patch.
>

I'll take a look at it, although I think there is still the potential
for bugs either way.

What I've done with libpq's md5.c is just to rearrange the internal
pieces, without touching the core algorithm code or the higher level
callers. So the most likely types of bugs are boundary/size errors. If
I'd broken any of pg_md5_hash(), pg_md5_binary() or pg_md5_crypt(),
then I'd have broken them all. It's possible to get a reasonable level
of confidence in those changes using queries like this on HEAD and
with the patch:

SELECT md5(string_agg(md5(str) || repeat(' ', i), '')) FROM (
SELECT i, string_agg(chr(32+(i+j*31)%224), '') AS str
FROM generate_series(0,10000) i,
generate_series(0,i) j
GROUP BY i
) t;

and these with the patch:

SELECT md5_agg(md5(str) || repeat(' ', i)) FROM (
SELECT i, string_agg(chr(32+(i+j*31)%224), '') AS str
FROM generate_series(0,10000) i,
generate_series(0,i) j
GROUP BY i
) t;

SELECT md5_agg(md5_agg || repeat(' ', i)) FROM (
SELECT i, md5_agg(chr(32+(i+j*31)%224))
FROM generate_series(0,10000) i,
generate_series(0,i) j
GROUP BY i
) t;

which all produce the same overall sum.

Regards,
Dean


From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Noah Misch <noah(at)leadboat(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, david(at)fetter(dot)org
Subject: Re: MD5 aggregate
Date: 2013-06-27 08:28:07
Message-ID: CAEZATCUJKKw6jRz1iCOo3vOjs-y82G1Ew9amXTp4C-A-+bcjKA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 26 June 2013 21:46, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On 6/26/13 4:04 PM, Dean Rasheed wrote:
>> A quick google search reveals several people asking for something like
>> this, and people recommending md5(string_agg(...)) or
>> md5(string_agg(md5(...))) based solutions, which are doomed to failure
>> on larger tables.
>
> The thread discussed several other options of checksumming tables that
> did not have the air of a crytographic offering, as Noah put it.
>

True but md5 has the advantage of being directly comparable with the
output of Unix md5sum, which would be useful if you loaded data from
external files and wanted to confirm that your import process didn't
mangle it.

Regards,
Dean


From: Marko Kreen <markokr(at)gmail(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, david(at)fetter(dot)org
Subject: Re: MD5 aggregate
Date: 2013-06-27 11:29:22
Message-ID: CACMqXCJNrpTttpMFW8u5fvy7sEJCkYCep5278nJB3-vpGHcdcw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jun 27, 2013 at 11:28 AM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> On 26 June 2013 21:46, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
>> On 6/26/13 4:04 PM, Dean Rasheed wrote:
>>> A quick google search reveals several people asking for something like
>>> this, and people recommending md5(string_agg(...)) or
>>> md5(string_agg(md5(...))) based solutions, which are doomed to failure
>>> on larger tables.
>>
>> The thread discussed several other options of checksumming tables that
>> did not have the air of a crytographic offering, as Noah put it.
>>
>
> True but md5 has the advantage of being directly comparable with the
> output of Unix md5sum, which would be useful if you loaded data from
> external files and wanted to confirm that your import process didn't
> mangle it.

The problem with md5_agg() is that it's only useful in toy scenarios.

It's more useful give people script that does same sum(hash(row))
on dump file than try to run MD5 on ordered rows.

Also, I don't think anybody actually cares about MD5(table-as-bytes), instead
people want way to check if 2 tables or table and dump are same.

--
marko


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Marko Kreen <markokr(at)gmail(dot)com>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Peter Eisentraut <peter_e(at)gmx(dot)net>, Noah Misch <noah(at)leadboat(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, david(at)fetter(dot)org
Subject: Re: MD5 aggregate
Date: 2013-06-27 15:44:28
Message-ID: CA+Tgmoaa5kMEVZRoGQkUmZ8ykBN9Qv3iqUxyZi8o7U8Y_V_9YA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jun 27, 2013 at 7:29 AM, Marko Kreen <markokr(at)gmail(dot)com> wrote:
> On Thu, Jun 27, 2013 at 11:28 AM, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>> On 26 June 2013 21:46, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
>>> On 6/26/13 4:04 PM, Dean Rasheed wrote:
>>>> A quick google search reveals several people asking for something like
>>>> this, and people recommending md5(string_agg(...)) or
>>>> md5(string_agg(md5(...))) based solutions, which are doomed to failure
>>>> on larger tables.
>>>
>>> The thread discussed several other options of checksumming tables that
>>> did not have the air of a crytographic offering, as Noah put it.
>>>
>>
>> True but md5 has the advantage of being directly comparable with the
>> output of Unix md5sum, which would be useful if you loaded data from
>> external files and wanted to confirm that your import process didn't
>> mangle it.
>
> The problem with md5_agg() is that it's only useful in toy scenarios.
>
> It's more useful give people script that does same sum(hash(row))
> on dump file than try to run MD5 on ordered rows.
>
> Also, I don't think anybody actually cares about MD5(table-as-bytes), instead
> people want way to check if 2 tables or table and dump are same.

I think you're trying to tell Dean to write the patch that you want
instead of the patch that he wants. There are certainly other things
that could be done that some people might sometimes prefer, but that
doesn't mean what he did isn't useful.

That having been said, I basically agree with Noah: I think this would
be a useful extension (perhaps even in contrib?) but I don't think we
need to install it by default. It's useful, but it's also narrow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, david(at)fetter(dot)org
Subject: Re: MD5 aggregate
Date: 2013-06-27 16:47:49
Message-ID: 51CC6CB5.4000705@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/27/13 4:19 AM, Dean Rasheed wrote:
> I'd say there are clearly people who want it, and the nature of some
> of those answers suggests to me that we ought to have a better answer
> in core.

It's not clear what these people wanted this functionality for. They
all wanted to analyze a table to compare with another table (or the same
table later). Either, they wanted this to detect data changes, in which
case the right tool is a checksum, not a cryptographic hash. We already
have several checksum implementations in core, so we could expose on of
them. Or they wanted this to protect their data from tampering, in
which case the right tool is a cryptographic hash, but Noah argues that
a sum of MD5 hashes is not cryptographically sound. (And in any case,
we don't put cryptographic functionality into the core.)

The reason md5_agg is proposed here and in all those cited posts is
presumably because the md5() function was already there anyway. The the
md5() function is there because the md5 code was already there anyway
because of the authentication. Let's not add higher-order
already-there-anyway code. ;-)


From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Noah Misch <noah(at)leadboat(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, david(at)fetter(dot)org
Subject: Re: MD5 aggregate
Date: 2013-06-27 23:10:32
Message-ID: CAEZATCVCWObMN4Mtu1GktZXBt1L3hfSAeE17fYiLP64KxFusNw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 27 June 2013 17:47, Peter Eisentraut <peter_e(at)gmx(dot)net> wrote:
> On 6/27/13 4:19 AM, Dean Rasheed wrote:
>> I'd say there are clearly people who want it, and the nature of some
>> of those answers suggests to me that we ought to have a better answer
>> in core.
>
> It's not clear what these people wanted this functionality for. They
> all wanted to analyze a table to compare with another table (or the same
> table later). Either, they wanted this to detect data changes, in which
> case the right tool is a checksum, not a cryptographic hash. We already
> have several checksum implementations in core, so we could expose on of
> them. Or they wanted this to protect their data from tampering, in
> which case the right tool is a cryptographic hash, but Noah argues that
> a sum of MD5 hashes is not cryptographically sound. (And in any case,
> we don't put cryptographic functionality into the core.)
>
> The reason md5_agg is proposed here and in all those cited posts is
> presumably because the md5() function was already there anyway. The the
> md5() function is there because the md5 code was already there anyway
> because of the authentication. Let's not add higher-order
> already-there-anyway code. ;-)
>

OK fair enough. It's looking like there are more people who don't want
md5_agg() in core, or want something different, than who do want it.
Also, if we're taking the view that the existing md5() function is
only for hashing passwords, then it's probably not worth trying to
optimise it.

At this stage it's probably best to mark this as returned with
feedback, and I'll consider the options for rewriting it, but not
during this commitfest.

Regards,
Dean