Re: MD5 aggregate

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, david(at)fetter(dot)org
Subject: Re: MD5 aggregate
Date: 2013-06-26 20:04:34
Message-ID: CAEZATCXUUrx0QnLtbW5=ZmsAMQkCopWNwGqeR45qoOSOaLV9dw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 26 June 2013 19:32, Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
>> I've been playing around with the idea of an aggregate that computes
>> the sum of the md5 hashes of each of its inputs, which I've called
>> md5_total() for now, although I'm not particularly wedded to that
>> name. Comparing it with md5_agg() on a 100M row table (see attached
>> test script) produces interesting results:
>>
>> SELECT md5_agg(foo.*::text)
>> FROM (SELECT * FROM foo ORDER BY id) foo;
>>
>> 50bc42127fb9b028c9708248f835ed8f
>>
>> Time: 92960.021 ms
>>
>> SELECT md5_total(foo.*::text) FROM foo;
>>
>> 02faea7fafee4d253fc94cfae031afc43c03479c
>>
>> Time: 96190.343 ms
>>
>> Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
>> result is longer) but it seems like it would be very useful for
>> quickly comparing data in SQL, since its value is not dependent on the
>> row-order making it easier to use and better performing if there is no
>> usable index for ordering.
>
> I took a look at this patch. First, as Peter suggested upthread, it should
> have been two patches: one to optimize calculateDigestFromBuffer() and
> callees, another atop the first adding new SQL functions and adjusting the
> infrastructure to meet their needs.
>

I didn't really set out with the aim of optimising the existing md5()
functions at all, but merely to restructure the code in a way that
would expose the necessary API for md5_agg(). As the timings up-thread
show, the performance gain is really very modest, although the memory
saving might be more of a factor in some setups. In fact, for the
sorts of queries shown, the vast majority of the time is spent
elsewhere, as can be seen by replacing md5() with a more trivial
function such as length().

>> + <primary>md5_total</primary>
>> + </indexterm>
>> + <function>md5_total(<replaceable class="parameter">expression</replaceable>)</function>
>> + </entry>
>> + <entry><type>text</type> or <type>bytea</type></entry>
>> + <entry><type>text</type></entry>
>> + <entry>
>> + sum of the MD5 hash values of the input values, independent of their
>> + order
>
> This is not specific enough. We would need to provide either an algorithm
> specification or a literature reference. However, that just leads to the
> problem that we should not put a literature-unrecognized cryptographic
> algorithm into core PostgreSQL. I realize that the use case inspiring this
> patch wasn't concerned with cryptographic properties, but the association with
> md5 immediately makes it a cryptography offering.
>

Yes, I can totally understand that point-of-view. I wasn't really
intending to write md5_total() at the outset - it was more of an
intellectual exercise following some of the previous points raised. So
on reflection, I think I can also agree that that probably doesn't
belong in core.

> md5_agg() is well-defined and not cryptographically novel, and your use case
> is credible. However, not every useful-sometimes function belongs in core; we
> mostly stick to functions with ubiquitous value and functions that would be
> onerous to implement externally. (We do carry legacy stuff that wouldn't make
> the cut today.) In my estimation, md5_agg() does not make that cut. The
> variety of intuitive md5_agg() definitions attested upthread doesn't bode well
> for its broad applicability. It could just as well live in an extension
> published on PGXN. Mine is just one opinion, though; I won't be horrified if
> the community wants an md5_agg() in core.
>

I disagree with this though. I see md5_agg() as a natural extension of
the already-in-core md5() functions and underlying code, satisfying a
well-defined use-case, and providing a checksum comparable with
externally generated md5 sums.

A quick google search reveals several people asking for something like
this, and people recommending md5(string_agg(...)) or
md5(string_agg(md5(...))) based solutions, which are doomed to failure
on larger tables. So I think that there is a case for having md5_agg()
in core as an alternative to such hacks, while having more
sophisticated crypto functions available as extensions.

>> *** a/src/backend/libpq/md5.c
>> --- b/src/backend/libpq/md5.c
>> ***************
>> *** 1,14 ****
>> /*
>> * md5.c
>> *
>> ! * Implements the MD5 Message-Digest Algorithm as specified in
>> ! * RFC 1321. This implementation is a simple one, in that it
>> ! * needs every input byte to be buffered before doing any
>> ! * calculations. I do not expect this file to be used for
>> ! * general purpose MD5'ing of large amounts of data, only for
>> ! * generating hashed passwords from limited input.
>
> In other words, performance wasn't a design criterion. I wonder if we would
> be wasting our time with a narrow performance change like removing the data
> copy. What's your opinion of copying pgcrypto's md5 implementation, which
> already avoids the data copy?
>

I haven't looked at pgcrypto because, as I said, performance wasn't my
primary criterion either, but removing the unnessary data copy just
seemed like good sense.

I'll take a look at it, and then as you and Peter suggest, look to
split the patch into two. I think I will withdraw md5_total() because
I was never entirely happy with that anyway.

Thanks for the review.

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Szymon Guz 2013-06-26 20:08:13 Re: [PATCH] Fix conversion for Decimal arguments in plpython functions
Previous Message Peter Eisentraut 2013-06-26 19:59:36 Re: [PATCH] Fix conversion for Decimal arguments in plpython functions