Re: Review [was Re: MD5 aggregate]

From: David Fetter <david(at)fetter(dot)org>
To: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Review [was Re: MD5 aggregate]
Date: 2013-06-21 20:04:28
Message-ID: 20130621200428.GJ5395@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote:
> On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
> > On 15 June 2013 10:22, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
> > > There seem to be 2 separate directions that this could go, which
> > > really meet different requirements:
> > >
> > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of
> > > the order in which they are scanned. A possible approach to this might
> > > be something like an aggregate
> > >
> > > md5_total(text/bytea) returns text
> > >
> > > that returns the sum of the md5 values of each input value, treating
> > > each md5 value as an unsigned 128-bit integer, and then producing the
> > > hexadecimal representation of the final sum. This should out-perform a
> > > solution based on numeric addition, and in typical cases, the result
> > > wouldn't be much longer than a regular md5 sum, and so would be easy
> > > to eyeball for differences.
> > >
> >
> > I've been playing around with the idea of an aggregate that computes
> > the sum of the md5 hashes of each of its inputs, which I've called
> > md5_total() for now, although I'm not particularly wedded to that
> > name. Comparing it with md5_agg() on a 100M row table (see attached
> > test script) produces interesting results:
> >
> > SELECT md5_agg(foo.*::text)
> > FROM (SELECT * FROM foo ORDER BY id) foo;
> >
> > 50bc42127fb9b028c9708248f835ed8f
> >
> > Time: 92960.021 ms
> >
> > SELECT md5_total(foo.*::text) FROM foo;
> >
> > 02faea7fafee4d253fc94cfae031afc43c03479c
> >
> > Time: 96190.343 ms
> >
> > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
> > result is longer) but it seems like it would be very useful for
> > quickly comparing data in SQL, since its value is not dependent on the
> > row-order making it easier to use and better performing if there is no
> > usable index for ordering.
> >
> > Note, however, that if there is an index that can be used for
> > ordering, the performance is not necessarily better than md5_agg(), as
> > this example shows. There is a small additional overhead per row for
> > initialising the MD5 sums, and adding the results to the total, but I
> > think the biggest factor is that md5_total() is processing more data.
> > The reason is that MD5 works on 64-byte blocks, so the total amount of
> > data going through the core MD5 algorithm is each row's size is
> > rounded up to a multiple of 64. In this simple case it ends up
> > processing around 1.5 times as much data:
> >
> > SELECT sum(length(foo.*::text)) AS md5_agg,
> > sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
> >
> > md5_agg | md5_total
> > ------------+-------------
> > 8103815438 | 12799909248
> >
> > although of course that overhead won't be as large on wider tables,
> > and even in this case the overall performance is still on a par with
> > md5_agg().
> >
> > ISTM that both aggregates are potentially useful in different
> > situations. I would probably typically use md5_total() because of its
> > simplicity/order-independence and consistent performance, but
> > md5_agg() might also be useful when comparing with external data.
> >
> > Regards,
> > Dean

> Performance review (skills needed: ability to time performance)
>
> Does the patch slow down simple tests?
>
> Yes. For an MD5 checksum of some random data, here are
> results from master:
>
> shackle(at)postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
> postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
> postgres-# select md5(a::text) FROM t2;
> shackle(at)postgres:5493=# \timing
> Timing is on.
> shackle(at)postgres:5493=# \g
> Time: 955.393 ms
> shackle(at)postgres:5493=# \g
> Time: 950.909 ms
> shackle(at)postgres:5493=# \g
> Time: 947.955 ms
> shackle(at)postgres:5493=# \g
> Time: 946.193 ms
> shackle(at)postgres:5493=# \g
> Time: 950.591 ms
> shackle(at)postgres:5493=# \g
> Time: 952.346 ms
> shackle(at)postgres:5493=# \g
> Time: 948.623 ms
> shackle(at)postgres:5493=# \g
> Time: 939.819 ms
>
> and here from master + the patch:
>
> WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
> t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
> select md5(a::text) FROM t2;
> Time: 1129.178 ms
> shackle(at)postgres:5494=# \g
> Time: 1134.013 ms
> shackle(at)postgres:5494=# \g
> Time: 1124.387 ms
> shackle(at)postgres:5494=# \g
> Time: 1119.733 ms
> shackle(at)postgres:5494=# \g
> Time: 1104.906 ms
> shackle(at)postgres:5494=# \g
> Time: 1137.055 ms
> shackle(at)postgres:5494=# \g
> Time: 1128.977 ms
> shackle(at)postgres:5494=# \g
> Time: 1143.619 ms
>
> Avg, StddevPopulation without patch: 948.97 ms, 4.353 ms
> Avg, StddevPopulation with patch: 1127.73 ms, 11.06 ms

The above was with -O0. Please find below with vanilla ./configure (-O2):

Master:
shackle(at)postgres:5493=# WITH t1 AS (SELECT
string_agg(chr(floor(255*random()+1)::int),'') AS a FROM
generate_series(1,10000)),
t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
select md5(a::text) FROM t2;
Time: 469.101 ms
shackle(at)postgres:5493=# \g
Time: 464.122 ms
shackle(at)postgres:5493=# \g
Time: 461.411 ms
shackle(at)postgres:5493=# \g
Time: 458.222 ms
shackle(at)postgres:5493=# \g
Time: 463.616 ms
shackle(at)postgres:5493=# \g
Time: 463.983 ms
shackle(at)postgres:5493=# \g
Time: 453.770 ms
shackle(at)postgres:5493=# \g
Time: 456.729 ms

MD5:
WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a
FROM generate_series(1,10000)),
t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
select md5(a::text) FROM t2;
Time: 436.862 ms
shackle(at)postgres:5494=# \t
Showing only tuples.
shackle(at)postgres:5494=# \t
Tuples only is off.
shackle(at)postgres:5494=# \g
Time: 438.686 ms
shackle(at)postgres:5494=# \g
Time: 443.789 ms
shackle(at)postgres:5494=# \g
Time: 452.522 ms
shackle(at)postgres:5494=# \g
Time: 447.824 ms
shackle(at)postgres:5494=# \g
Time: 448.547 ms
shackle(at)postgres:5494=# \g
Time: 446.901 ms
shackle(at)postgres:5494=# \g
Time: 448.537 ms

Average, stddev population for master: 461.36925, 4.5883631545
Average, stddev population for md5: 445.4585, 4.9892286729

so slightly faster on MD5, as expected.

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2013-06-21 20:39:50 Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
Previous Message Heikki Linnakangas 2013-06-21 19:43:01 Re: GIN improvements part2: fast scan