Re: MD5 aggregate

From: Noah Misch <noah(at)leadboat(dot)com>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, david(at)fetter(dot)org
Subject: Re: MD5 aggregate
Date: 2013-06-26 18:32:01
Message-ID: 20130626183201.GA863764@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
> I've been playing around with the idea of an aggregate that computes
> the sum of the md5 hashes of each of its inputs, which I've called
> md5_total() for now, although I'm not particularly wedded to that
> name. Comparing it with md5_agg() on a 100M row table (see attached
> test script) produces interesting results:
>
> SELECT md5_agg(foo.*::text)
> FROM (SELECT * FROM foo ORDER BY id) foo;
>
> 50bc42127fb9b028c9708248f835ed8f
>
> Time: 92960.021 ms
>
> SELECT md5_total(foo.*::text) FROM foo;
>
> 02faea7fafee4d253fc94cfae031afc43c03479c
>
> Time: 96190.343 ms
>
> Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
> result is longer) but it seems like it would be very useful for
> quickly comparing data in SQL, since its value is not dependent on the
> row-order making it easier to use and better performing if there is no
> usable index for ordering.

I took a look at this patch. First, as Peter suggested upthread, it should
have been two patches: one to optimize calculateDigestFromBuffer() and
callees, another atop the first adding new SQL functions and adjusting the
infrastructure to meet their needs.

> + <primary>md5_total</primary>
> + </indexterm>
> + <function>md5_total(<replaceable class="parameter">expression</replaceable>)</function>
> + </entry>
> + <entry><type>text</type> or <type>bytea</type></entry>
> + <entry><type>text</type></entry>
> + <entry>
> + sum of the MD5 hash values of the input values, independent of their
> + order

This is not specific enough. We would need to provide either an algorithm
specification or a literature reference. However, that just leads to the
problem that we should not put a literature-unrecognized cryptographic
algorithm into core PostgreSQL. I realize that the use case inspiring this
patch wasn't concerned with cryptographic properties, but the association with
md5 immediately makes it a cryptography offering.

md5_agg() is well-defined and not cryptographically novel, and your use case
is credible. However, not every useful-sometimes function belongs in core; we
mostly stick to functions with ubiquitous value and functions that would be
onerous to implement externally. (We do carry legacy stuff that wouldn't make
the cut today.) In my estimation, md5_agg() does not make that cut. The
variety of intuitive md5_agg() definitions attested upthread doesn't bode well
for its broad applicability. It could just as well live in an extension
published on PGXN. Mine is just one opinion, though; I won't be horrified if
the community wants an md5_agg() in core.

> *** a/src/backend/libpq/md5.c
> --- b/src/backend/libpq/md5.c
> ***************
> *** 1,14 ****
> /*
> * md5.c
> *
> ! * Implements the MD5 Message-Digest Algorithm as specified in
> ! * RFC 1321. This implementation is a simple one, in that it
> ! * needs every input byte to be buffered before doing any
> ! * calculations. I do not expect this file to be used for
> ! * general purpose MD5'ing of large amounts of data, only for
> ! * generating hashed passwords from limited input.

In other words, performance wasn't a design criterion. I wonder if we would
be wasting our time with a narrow performance change like removing the data
copy. What's your opinion of copying pgcrypto's md5 implementation, which
already avoids the data copy?

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2013-06-26 18:46:08 Re: Kudos for Reviewers -- straw poll
Previous Message Jeff Janes 2013-06-26 18:31:04 Re: Bloom Filter lookup for hash joins