Re: Optimizing pglz compressor

From: Amit Kapila <amit(dot)kapila(at)huawei(dot)com>
To: "'Heikki Linnakangas'" <hlinnakangas(at)vmware(dot)com>, "'Alvaro Herrera'" <alvherre(at)2ndquadrant(dot)com>
Cc: "'PostgreSQL-development'" <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: Optimizing pglz compressor
Date: 2013-06-19 11:01:13
Message-ID: 008501ce6cdc$51981790$f4c846b0$@kapila@huawei.com
Lists: pgsql-hackers

On Tuesday, March 05, 2013 7:03 PM Heikki Linnakangas wrote:

> I spent some more time on this, and came up with the attached patch. It
> includes the changes I posted earlier, to use indexes instead of
> pointers in the hash table. In addition, it makes the hash table size
> variable, depending on the length of the input. This further reduces
> the startup cost on small inputs. I changed the hash method slightly,
> because the old method would not use any bits from the 3rd byte with a
> small hash table size, but fortunately that didn't seem to negatively
> impact performance with larger hash table sizes either.
>
> I wrote a little C extension to test this. It contains a function
> that runs pglz_compress() on a bytea input N times. I ran that with
> different kinds of inputs, and got the following results:
>

The purpose of this patch is to improve LZ compression speed by reducing the
startup cost of initializing the hash_start array.
To achieve this, it uses a variable-size hash table and reduces the size of
each history entry by replacing pointers with int16 indexes.
It achieves its purpose for small data, but for large data performance is
degraded in some cases; refer to the second set of performance data below.
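
To confirm my understanding, here is a minimal sketch of the two ideas.
The names, sizes, and bounds below are my own illustration, not the
patch's actual code:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define INVALID_ENTRY 0         /* index 0 is reserved as "no entry" */

    typedef struct
    {
        int16_t next;               /* int16 index into the entry array,
                                     * not a pointer; this shrinks each
                                     * entry on 64-bit platforms */
        int16_t prev;
        int32_t pos;                /* offset of this entry in the input */
    } hist_entry;

    /*
     * Pick a power-of-two hash table size from the input length, so that
     * resetting hash_start[] stays cheap for short inputs.  The bounds
     * (512..8192) are illustrative only.
     */
    static int
    choose_hash_size(int input_len)
    {
        int hashsz = 512;

        while (hashsz < input_len && hashsz < 8192)
            hashsz <<= 1;
        return hashsz;
    }

    int
    main(void)
    {
        int16_t hash_start[8192];
        int     hashsz = choose_hash_size(300);    /* a 300-byte datum */

        /*
         * Only hashsz slots need to be cleared, and because
         * 0 == INVALID_ENTRY a plain memset is enough.
         */
        memset(hash_start, 0, hashsz * sizeof(int16_t));
        printf("hash size for a 300-byte input: %d\n", hashsz);
        return 0;
    }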

1. The patch compiles cleanly and all regression tests pass.
2. The change in the pglz_hist_idx macro is not very clear to me, nor is it
explained in the comments.
3. Why is the first entry kept as INVALID_ENTRY? It appears to me that it is
for cleaner checks in the code (see the sketch after this list).
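
My reading of point 3, as a sketch: reserving the first entry means that
"no entry" can be encoded as index 0 (INVALID_ENTRY), so the end-of-chain
test and the hash table reset both become trivial. Again, the names here
are illustrative, not the patch's code:

    #include <stdint.h>
    #include <stdio.h>

    #define INVALID_ENTRY 0

    typedef struct
    {
        int16_t next;
        int     pos;
    } hist_entry;

    static hist_entry entries[16];  /* entries[0] is reserved, never used */

    static void
    walk_chain(const int16_t *hash_start, int bucket)
    {
        int16_t e;

        /* index 0 doubles as the chain terminator; no NULL checks needed */
        for (e = hash_start[bucket]; e != INVALID_ENTRY; e = entries[e].next)
            printf("match candidate at input offset %d\n", entries[e].pos);
    }

    int
    main(void)
    {
        int16_t hash_start[4] = {0};    /* all buckets start out "empty" */

        entries[1].next = INVALID_ENTRY;
        entries[1].pos = 10;
        entries[2].next = 1;
        entries[2].pos = 42;
        hash_start[3] = 2;              /* bucket 3 -> entry 2 -> entry 1 */

        walk_chain(hash_start, 3);
        return 0;
    }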

Performance Data
------------------
I have used pglz-variable-size-hash-table.patch to collect all performance
data:

Results of compress-tests.sql -- inserting large data into tmp table
------------------------------

      testname      | unpatched | patched
 -------------------+-----------+---------
  5k text           |    4.8932 |  4.9014
  512b text         |   22.6209 | 18.6849
  256b text         |   13.9784 |  8.9342
  1K text           |   20.4969 | 20.5988
  2k random         |   10.5826 | 10.0758
  100k random       |    3.9056 |  3.8200
  500k random       |   22.4078 | 22.1971
  512b random       |   15.7788 | 12.9575
  256b random       |   18.9213 | 12.5209
  1K random         |   11.3933 |  9.8853
  100k of same byte |    5.5877 |  5.5960
  500k of same byte |    2.6853 |  2.6500

Observation
-------------
1. This clearly shows that the patch improves performance for small data
(the 256b and 512b cases) without any measurable impact for large data.

Performance data for directly calling lz_compress function (tests.sql)
---------------------------------------------------------------------------
select testname,
       (compresstest(data, nrows, 8192)::numeric / 1000)::numeric(10,3) as auto
  from tests;
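
For reference, this is my sketch of such a timing wrapper, not Heikki's
actual extension (the negative values below show the real function
returns a different metric, and the meaning of the third argument is my
guess -- an upper bound on how many bytes of the datum are compressed
per call). It assumes the 9.3-era pglz API and integer datetimes:

    #include "postgres.h"
    #include "fmgr.h"
    #include "utils/pg_lzcompress.h"
    #include "utils/timestamp.h"

    PG_MODULE_MAGIC;
    PG_FUNCTION_INFO_V1(compresstest_sketch);

    /*
     * compresstest_sketch(data bytea, loops int4, maxlen int4): compress
     * (up to maxlen bytes of) data 'loops' times, returning the elapsed
     * time in microseconds.
     */
    Datum
    compresstest_sketch(PG_FUNCTION_ARGS)
    {
        bytea       *data = PG_GETARG_BYTEA_PP(0);
        int32        loops = PG_GETARG_INT32(1);
        int32        maxlen = PG_GETARG_INT32(2);
        const char  *src = VARDATA_ANY(data);
        int32        slen = Min((int32) VARSIZE_ANY_EXHDR(data), maxlen);
        PGLZ_Header *dst = palloc(PGLZ_MAX_OUTPUT(slen));
        TimestampTz  start = GetCurrentTimestamp();
        int32        i;

        for (i = 0; i < loops; i++)
            pglz_compress(src, slen, dst, PGLZ_strategy_default);

        /* with integer datetimes this difference is in microseconds */
        PG_RETURN_INT64(GetCurrentTimestamp() - start);
    }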

Head
      testname      |   auto
 -------------------+-----------
  5k text           |  3511.879
  512b text         |  1430.990
  256b text         |  1768.796
  1K text           |  1390.134
  3K text           |  4099.304
  2k random         |  -402.916
  100k random       |   -10.311
  500k random       |    -2.019
  512b random       |  -753.317
  256b random       | -1096.999
  1K random         |  -559.931
  10k of same byte  |  3548.651
  100k of same byte | 36037.280
  500k of same byte | 25565.195
 (14 rows)

Patch (pglz-variable-size-hash-table.patch)

      testname      |   auto
 -------------------+-----------
  5k text           |  3840.207
  512b text         |  1088.897
  256b text         |   982.172
  1K text           |  1402.488
  3K text           |  4334.802
  2k random         |  -333.100
  100k random       |    -8.390
  500k random       |    -1.672
  512b random       |  -499.251
  256b random       |  -524.889
  1K random         |  -453.177
  10k of same byte  |  4754.367
  100k of same byte | 50427.735
  500k of same byte | 36163.265
 (14 rows)

Observations
--------------
1. For small data, performance is always better with the patch.
2. For random small/large data, performance is good.
3. For medium and large text and same-byte data (3K and 5K text; 10K, 100K,
and 500K of the same byte), performance is degraded.

I have used the attached compress-tests-init.sql to generate the data.
I am really not sure why the data you reported and the data I collected
differ in a few cases. I have tried multiple times, but the results are
the same. Kindly let me know if you think I am doing something wrong.

Note - To generate the data in randomhex, I used COPY from a file, using the
same command you provided to generate the file.

With Regards,
Amit Kapila.

Attachment Content-Type Size
compress-tests-init.sql application/octet-stream 12.2 KB
