Re: Review [was Re: MD5 aggregate]

From: Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Review [was Re: MD5 aggregate]
Date: 2013-06-23 14:10:15
Message-ID: CAEZATCVC_3kRiRwpNJrTSL9Gt90dNjL=pcXdXboEDBaOu-k4QQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 21 June 2013 21:04, David Fetter <david(at)fetter(dot)org> wrote:
> On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote:
>> On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
>> > On 15 June 2013 10:22, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com> wrote:
>> > > There seem to be 2 separate directions that this could go, which
>> > > really meet different requirements:
>> > >
>> > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of
>> > > the order in which they are scanned. A possible approach to this might
>> > > be something like an aggregate
>> > >
>> > > md5_total(text/bytea) returns text
>> > >
>> > > that returns the sum of the md5 values of each input value, treating
>> > > each md5 value as an unsigned 128-bit integer, and then producing the
>> > > hexadecimal representation of the final sum. This should out-perform a
>> > > solution based on numeric addition, and in typical cases, the result
>> > > wouldn't be much longer than a regular md5 sum, and so would be easy
>> > > to eyeball for differences.
>> > >
>> >
>> > I've been playing around with the idea of an aggregate that computes
>> > the sum of the md5 hashes of each of its inputs, which I've called
>> > md5_total() for now, although I'm not particularly wedded to that
>> > name. Comparing it with md5_agg() on a 100M row table (see attached
>> > test script) produces interesting results:
>> >
>> > SELECT md5_agg(foo.*::text)
>> > FROM (SELECT * FROM foo ORDER BY id) foo;
>> >
>> > 50bc42127fb9b028c9708248f835ed8f
>> >
>> > Time: 92960.021 ms
>> >
>> > SELECT md5_total(foo.*::text) FROM foo;
>> >
>> > 02faea7fafee4d253fc94cfae031afc43c03479c
>> >
>> > Time: 96190.343 ms
>> >
>> > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
>> > result is longer) but it seems like it would be very useful for
>> > quickly comparing data in SQL, since its value is not dependent on the
>> > row-order making it easier to use and better performing if there is no
>> > usable index for ordering.
>> >
>> > Note, however, that if there is an index that can be used for
>> > ordering, the performance is not necessarily better than md5_agg(), as
>> > this example shows. There is a small additional overhead per row for
>> > initialising the MD5 sums, and adding the results to the total, but I
>> > think the biggest factor is that md5_total() is processing more data.
>> > The reason is that MD5 works on 64-byte blocks, so the total amount of
>> > data going through the core MD5 algorithm is each row's size is
>> > rounded up to a multiple of 64. In this simple case it ends up
>> > processing around 1.5 times as much data:
>> >
>> > SELECT sum(length(foo.*::text)) AS md5_agg,
>> > sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
>> >
>> > md5_agg | md5_total
>> > ------------+-------------
>> > 8103815438 | 12799909248
>> >
>> > although of course that overhead won't be as large on wider tables,
>> > and even in this case the overall performance is still on a par with
>> > md5_agg().
>> >
>> > ISTM that both aggregates are potentially useful in different
>> > situations. I would probably typically use md5_total() because of its
>> > simplicity/order-independence and consistent performance, but
>> > md5_agg() might also be useful when comparing with external data.
>> >
>> > Regards,
>> > Dean
>
>> Performance review (skills needed: ability to time performance)
>>
>> Does the patch slow down simple tests?
>>
>> Yes. For an MD5 checksum of some random data, here are
>> results from master:
>>
>> shackle(at)postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
>> postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
>> postgres-# select md5(a::text) FROM t2;
>> shackle(at)postgres:5493=# \timing
>> Timing is on.
>> shackle(at)postgres:5493=# \g
>> Time: 955.393 ms
>> shackle(at)postgres:5493=# \g
>> Time: 950.909 ms
>> shackle(at)postgres:5493=# \g
>> Time: 947.955 ms
>> shackle(at)postgres:5493=# \g
>> Time: 946.193 ms
>> shackle(at)postgres:5493=# \g
>> Time: 950.591 ms
>> shackle(at)postgres:5493=# \g
>> Time: 952.346 ms
>> shackle(at)postgres:5493=# \g
>> Time: 948.623 ms
>> shackle(at)postgres:5493=# \g
>> Time: 939.819 ms
>>
>> and here from master + the patch:
>>
>> WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
>> t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
>> select md5(a::text) FROM t2;
>> Time: 1129.178 ms
>> shackle(at)postgres:5494=# \g
>> Time: 1134.013 ms
>> shackle(at)postgres:5494=# \g
>> Time: 1124.387 ms
>> shackle(at)postgres:5494=# \g
>> Time: 1119.733 ms
>> shackle(at)postgres:5494=# \g
>> Time: 1104.906 ms
>> shackle(at)postgres:5494=# \g
>> Time: 1137.055 ms
>> shackle(at)postgres:5494=# \g
>> Time: 1128.977 ms
>> shackle(at)postgres:5494=# \g
>> Time: 1143.619 ms
>>
>> Avg, StddevPopulation without patch: 948.97 ms, 4.353 ms
>> Avg, StddevPopulation with patch: 1127.73 ms, 11.06 ms
>
> The above was with -O0. Please find below with vanilla ./configure (-O2):
>
> Master:
> shackle(at)postgres:5493=# WITH t1 AS (SELECT
> string_agg(chr(floor(255*random()+1)::int),'') AS a FROM
> generate_series(1,10000)),
> t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
> select md5(a::text) FROM t2;
> Time: 469.101 ms
> shackle(at)postgres:5493=# \g
> Time: 464.122 ms
> shackle(at)postgres:5493=# \g
> Time: 461.411 ms
> shackle(at)postgres:5493=# \g
> Time: 458.222 ms
> shackle(at)postgres:5493=# \g
> Time: 463.616 ms
> shackle(at)postgres:5493=# \g
> Time: 463.983 ms
> shackle(at)postgres:5493=# \g
> Time: 453.770 ms
> shackle(at)postgres:5493=# \g
> Time: 456.729 ms
>
> MD5:
> WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a
> FROM generate_series(1,10000)),
> t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
> select md5(a::text) FROM t2;
> Time: 436.862 ms
> shackle(at)postgres:5494=# \t
> Showing only tuples.
> shackle(at)postgres:5494=# \t
> Tuples only is off.
> shackle(at)postgres:5494=# \g
> Time: 438.686 ms
> shackle(at)postgres:5494=# \g
> Time: 443.789 ms
> shackle(at)postgres:5494=# \g
> Time: 452.522 ms
> shackle(at)postgres:5494=# \g
> Time: 447.824 ms
> shackle(at)postgres:5494=# \g
> Time: 448.547 ms
> shackle(at)postgres:5494=# \g
> Time: 446.901 ms
> shackle(at)postgres:5494=# \g
> Time: 448.537 ms
>
> Average, stddev population for master: 461.36925, 4.5883631545
> Average, stddev population for md5: 445.4585, 4.9892286729
>
> so slightly faster on MD5, as expected.
>
> Cheers,
> David.

Thanks for the review and testing!

Regards,
Dean

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Kevin Grittner 2013-06-23 14:14:51 Re: wrong state of patch in CF
Previous Message Dean Rasheed 2013-06-23 13:44:32 Re: FILTER for aggregates [was Re: Department of Redundancy Department: makeNode(FuncCall) division]