Re: MD5 aggregate

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-15 09:22:21
Message-ID: CAEZATCUvYXfuDBszdknjKXF0z58hpV4-8SfRo6AhR4i6EJ+Vdw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 13 June 2013 10:35, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> Hi,
>
> Attached is a patch implementing a new aggregate function md5_agg() to
> compute the aggregate MD5 sum across a number of rows. This is
> something I've wished for a number of times. I think the primary use
> case is to do a quick check that 2 tables, possibly on different
> servers, contain the same data, using a query like
>
> SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo;
>
> or
>
> SELECT md5_agg(foo.*::text ORDER BY id) FROM foo;

There seem to be 2 separate directions that this could go, which
really meet different requirements:

1). Produce an unordered sum for SQL to compare 2 tables regardless of
the order in which they are scanned. A possible approach to this might
be something like an aggregate

md5_total(text/bytea) returns text

that returns the sum of the md5 values of each input value, treating
each md5 value as an unsigned 128-bit integer, and then producing the
hexadecimal representation of the final sum. This should out-perform a
solution based on numeric addition, and in typical cases, the result
wouldn't be much longer than a regular md5 sum, and so would be easy
to eyeball for differences.

2). Produce an ordered MD5 sum compatible with COPY, whose result
would match that of running unix md5sum on the COPY output. Given all
the possible COPY options that would affect the result (format,
delimiters, headers, quoting, escaping, ...), I think that such a
thing would only reasonably be possible as an extension to the COPY
command itself.

I guess in its simplest form this would just be a new option "MD5" to
COPY that would cause it to pipe its output to the md5 aggregator and
then send the final sum to the COPY destination at the end (e.g.,
"COPY foo TO STDOUT MD5" would produce the ordered MD5 sum of the data
in foo).

I still think my original md5_agg() has its uses, since what it
produces is comparable with external md5 sums, and is directly
available to SQL, but (1) is probably the most useful for quickly
comparing 2 tables. I'm much less convinced about the value of (2),
but on the face of it, it doesn't seem like it would be hard to
implement.

Thoughts?

Regards,
Dean

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Craig Ringer 2013-06-15 09:31:42 Re: Hard to Use WAS: Hard limit on WAL space
Previous Message Joshua D. Drake 2013-06-15 07:55:39 Re: Hard to Use WAS: Hard limit on WAL space