Re: Optimizing pglz compressor

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
Subject: Re: Optimizing pglz compressor
Date: 2013-03-05 13:32:43
Message-ID: 5135F3FB.7040706@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I spent some more time on this, and came up with the attached patch. It
includes the changes I posted earlier, to use indexes instead of
pointers in the hash table. In addition, it makes the hash table size
variable, depending on the length of the input. This further reduces the
startup cost on small inputs. I changed the hash method slightly,
because the old method would not use any bits from the 3rd byte with a
small hash table size, but fortunately that didn't seem to negative
impact with larger hash table sizes either.

I wrote a little C extension to test this. It contains a function, which
runs pglz_compress() on a bytea input, N times. I ran that with
different kinds of inputs, and got the following results:

unpatched:

testname | auto
-------------------+-----------
5k text | 1425.750
512b text | 1287.413
2k random | -1053.160
100k random | -1056.379
512b random | -1018.474
64b of text | -2547.106
64b random | -3731.700
100k of same byte | 1093.885
100k text | 849.430
(9 rows)

pglz-replace-pointers-with-indexes.patch (the patch I posted earlier):

testname | auto
-------------------+-----------
5k text | 1251.576
512b text | 946.348
2k random | -815.095
100k random | -808.356
512b random | -614.418
64b of text | -744.382
64b random | -1060.758
100k of same byte | 1192.397
100k text | 796.530
(9 rows)

pglz-variable-size-hash-table.patch:

testname | auto
-------------------+----------
5k text | 1276.905
512b text | 823.796
2k random | -844.484
100k random | -848.080
512b random | -615.039
64b of text | -393.448
64b random | -568.685
100k of same byte | 1186.759
100k text | 799.978
(9 rows)

These values are from a single run of the test, but I did repeat them
several times to make sure there isn't too much variability in them to
render the results meaningless. The negative values mean that
pglz_compress gave up on the compression in the test, ie. it shows how
long it took for pglz_compress to realize that it can't compress the
input. Compare the abs() values.

With the variable-size hash table, I'm not sure how significant the
earlier patch to use indexes instead of pointers is. But I don't think
it makes things any worse, so it's included in this.

On 01.03.2013 17:42, Heikki Linnakangas wrote:
> On 01.03.2013 17:37, Alvaro Herrera wrote:
>> My take on this is that if this patch is necessary to get Amit's patch
>> to a commitable state, it's fair game.
>
> I don't think it's necessary for that, but let's see..

With these tweaks, I was able to make pglz-based delta encoding perform
roughly as well as Amit's patch. So I think that's the approach we
should take, as it's a bit simpler and more versatile. I'll follow up on
that in the other thread.

- Heikki

Attachment Content-Type Size
pglz-variable-size-hash-table.patch text/x-diff 8.3 KB
compresstest.tar.gz application/x-gzip 2.0 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2013-03-05 13:35:16 Re: Support for REINDEX CONCURRENTLY
Previous Message Robert Haas 2013-03-05 13:30:56 Re: sql_drop Event Trigger