Re: MD5 aggregate

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, david(at)fetter(dot)org
Subject: Re: MD5 aggregate
Date: 2013-06-27 08:19:41
Message-ID: CAEZATCUFUL9FoYYraE0MSqJdcmya6=SObbh37bfyp23p=vU=Cg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 26 June 2013 22:48, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Wed, Jun 26, 2013 at 09:04:34PM +0100, Dean Rasheed wrote:
>> On 26 June 2013 19:32, Noah Misch <noah(at)leadboat(dot)com> wrote:
>> > On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
>
>> > md5_agg() is well-defined and not cryptographically novel, and your use case
>> > is credible. However, not every useful-sometimes function belongs in core; we
>> > mostly stick to functions with ubiquitous value and functions that would be
>> > onerous to implement externally. (We do carry legacy stuff that wouldn't make
>> > the cut today.) In my estimation, md5_agg() does not make that cut. The
>> > variety of intuitive md5_agg() definitions attested upthread doesn't bode well
>> > for its broad applicability. It could just as well live in an extension
>> > published on PGXN. Mine is just one opinion, though; I won't be horrified if
>> > the community wants an md5_agg() in core.
>>
>> I disagree with this though. I see md5_agg() as a natural extension of
>> the already-in-core md5() functions and underlying code, satisfying a
>> well-defined use-case, and providing a checksum comparable with
>> externally generated md5 sums.
>
> All true, but I don't see those facts justifying core inclusion.
>
>> A quick google search reveals several people asking for something like
>> this, and people recommending md5(string_agg(...)) or
>> md5(string_agg(md5(...))) based solutions, which are doomed to failure
>> on larger tables. So I think that there is a case for having md5_agg()
>> in core as an alternative to such hacks, while having more
>> sophisticated crypto functions available as extensions.
>
> My nine Google hits for "md5(string_agg" included one Stack Overflow answer,
> several mirrors of that answer, and a few posts on this thread.
>

I found more than that using Google. It's not uncommon for people to
use Google and then pick the first "accepted" answer on Stack
Overflow, in which case they might well pick one of the answers here
[1] or here [2]. Or they might find this [3]. Or if they came to our
lists they might find this [4], or deduce from this [5] that it isn't
possible.

[1] http://stackoverflow.com/questions/4020033/how-can-i-get-a-hash-of-an-entire-table-in-postgresql

[2] http://stackoverflow.com/questions/13554333/finding-out-the-hash-value-of-a-group-of-rows

[3] https://ralphholz.de/?q=node/298

[4] http://www.postgresql.org/message-id/4E5F469C.5070104@compulab.co.il

[5] http://www.postgresql.org/message-id/e94d85500801301153u6b976e31m89e311c7134a0160@mail.gmail.com

I'd say there are clearly people who want it, and the nature of some
of those answers suggests to me that we ought to have a better answer
in core.

>> I haven't looked at pgcrypto because, as I said, performance wasn't my
>> primary criterion either, but removing the unnessary data copy just
>> seemed like good sense.
>>
>> I'll take a look at it, and then as you and Peter suggest, look to
>> split the patch into two. I think I will withdraw md5_total() because
>> I was never entirely happy with that anyway.
>
> Understood; feel free to back off from any performance improvements in which
> you didn't wish to involve yourself.
>
> If we do end up moving forward with md5_agg(), I note that the pgcrypto
> version already has an initialize/accumulate/finalize API division. A patch
> importing that code largely-unchanged would be easier to verify than a patch
> restructuring the src/backend/libpq/md5.c implementation. The two patches
> would then be a "use pgcrypto md5.c in core" patch and an md5_agg() patch.
>

I'll take a look at it, although I think there is still the potential
for bugs either way.

What I've done with libpq's md5.c is just to rearrange the internal
pieces, without touching the core algorithm code or the higher level
callers. So the most likely types of bugs are boundary/size errors. If
I'd broken any of pg_md5_hash(), pg_md5_binary() or pg_md5_crypt(),
then I'd have broken them all. It's possible to get a reasonable level
of confidence in those changes using queries like this on HEAD and
with the patch:

SELECT md5(string_agg(md5(str) || repeat(' ', i), '')) FROM (
SELECT i, string_agg(chr(32+(i+j*31)%224), '') AS str
FROM generate_series(0,10000) i,
generate_series(0,i) j
GROUP BY i
) t;

and these with the patch:

SELECT md5_agg(md5(str) || repeat(' ', i)) FROM (
SELECT i, string_agg(chr(32+(i+j*31)%224), '') AS str
FROM generate_series(0,10000) i,
generate_series(0,i) j
GROUP BY i
) t;

SELECT md5_agg(md5_agg || repeat(' ', i)) FROM (
SELECT i, md5_agg(chr(32+(i+j*31)%224))
FROM generate_series(0,10000) i,
generate_series(0,i) j
GROUP BY i
) t;

which all produce the same overall sum.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Dean Rasheed 2013-06-27 08:28:07 Re: MD5 aggregate
Previous Message Atri Sharma 2013-06-27 07:56:59 Group Commits Vs WAL Writes