Re: MD5 aggregate

From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>, Marko Kreen <markokr(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MD5 aggregate
Date: 2013-06-14 15:09:46
Message-ID: 51BB323A.3000604@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 06/14/2013 04:47 PM, Tom Lane wrote:
> Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> writes:
>> On 14 June 2013 14:14, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Personally I'd be a bit inclined to xor the per-row md5's rather than
>>> sum them, but that's a small matter.
>> But this would be a much riskier thing to do with a single column,
>> because if you updated multiple rows in the same way (e.g., UPDATE t
>> SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if
>> there were an even number of matches.
> I was implicitly thinking that the sum would be a modulo sum so that the
> final result is still the size of an md5 signature. If that's true,
> then leaking bits via carry out is just as bad as xor's deficiencies.
> Now, you could certainly make it a non-modulo sum and not lose any
> information to carries, if you're willing to do the arithmetic in
> NUMERIC and have a variable-width result. Sounds a bit slow though.
What skytools/pgq/londiste uses for comparing tables on master
and slave is query like this

select sum(hashtext(t.*::text)) from <yourtable> t;

This is non-modulo sum and does not use md5 but relies on
whatever the hashtext() du jour is :)

So it is not comparable to anything external (like the md5sum
compatible idea above) but is usually good enough for fast
checks of compatible tables.

As tables are unordered by definition anyway, this should be
good enough for most SQL.

The speed comes from both fast(er) hashtext() function and
avoiding the sort.

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Jeff Davis 2013-06-14 15:26:11 Re: pg_filedump 9.3: checksums (and a few other fixes)
Previous Message Andres Freund 2013-06-14 14:56:57 Re: MD5 aggregate