pgsql: Optimize pglz compressor for small inputs.

Lists: pgsql-committerspgsql-hackers
From: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
To: pgsql-committers(at)postgresql(dot)org
Subject: pgsql: Optimize pglz compressor for small inputs.
Date: 2013-07-01 08:01:26
Message-ID: E1UtZ38-0006xD-AF@gemulon.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Optimize pglz compressor for small inputs.

The pglz compressor has a significant startup cost, because it has to
initialize to zeros the history-tracking hash table. On a 64-bit system, the
hash table was 64kB in size. While clearing memory is pretty fast, for very
short inputs the relative cost of that was quite large.

This patch alleviates that in two ways. First, instead of storing pointers
in the hash table, store 16-bit indexes into the hist_entries array. That
slashes the size of the hash table to 1/2 or 1/4 of the original, depending
on the pointer width. Secondly, adjust the size of the hash table based on
input size. For very small inputs, you don't need a large hash table to
avoid collisions.

Review by Amit Kapila.

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/031cc55bbea6b3a6b67c700498a78fb1d4399476

Modified Files
--------------
src/backend/utils/adt/pg_lzcompress.c | 95 +++++++++++++++++++++++----------
1 file changed, 66 insertions(+), 29 deletions(-)


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: pgsql: Optimize pglz compressor for small inputs.
Date: 2013-07-14 17:12:12
Message-ID: 20130714171212.GD2511@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

Heikki,

* Heikki Linnakangas (heikki(dot)linnakangas(at)iki(dot)fi) wrote:
> This patch alleviates that in two ways. First, instead of storing pointers
> in the hash table, store 16-bit indexes into the hist_entries array. That
> slashes the size of the hash table to 1/2 or 1/4 of the original, depending
> on the pointer width. Secondly, adjust the size of the hash table based on
> input size. For very small inputs, you don't need a large hash table to
> avoid collisions.

The coverity scanner has a bit of an issue with this patch which, at
least on first blush, looks like a valid concern.

While the change in pg_lzcompress.c:pglz_find_match() to loop on:

while (hent != INVALID_ENTRY_PTR)
{
const char *ip = input;
const char *hp = hent->pos;

looks good, and INVALID_ENTRY_PTR is the address of the first entry in
the array (and can't be NULL), towards the end of the loop we do:

hent = hent->next;
if (hent)
...

Should we really be checking for 'hent != INVALID_ENTRY_PTR' here? If
not, and hent really can end up as NULL, then we're going to segfault
on the next loop due to the unchecked 'hent->pos' early in the loop.
If hent can never be NULL, then we probably don't need this check at
all.

Thanks,

Stephen


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: pgsql: Optimize pglz compressor for small inputs.
Date: 2013-07-17 17:36:44
Message-ID: 51E6D62C.7060302@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-committers pgsql-hackers

On 14.07.2013 20:12, Stephen Frost wrote:
> * Heikki Linnakangas (heikki(dot)linnakangas(at)iki(dot)fi) wrote:
>> This patch alleviates that in two ways. First, instead of storing pointers
>> in the hash table, store 16-bit indexes into the hist_entries array. That
>> slashes the size of the hash table to 1/2 or 1/4 of the original, depending
>> on the pointer width. Secondly, adjust the size of the hash table based on
>> input size. For very small inputs, you don't need a large hash table to
>> avoid collisions.
>
> The coverity scanner has a bit of an issue with this patch which, at
> least on first blush, looks like a valid concern.
>
> While the change in pg_lzcompress.c:pglz_find_match() to loop on:
>
> while (hent != INVALID_ENTRY_PTR)
> {
> const char *ip = input;
> const char *hp = hent->pos;
>
> looks good, and INVALID_ENTRY_PTR is the address of the first entry in
> the array (and can't be NULL), towards the end of the loop we do:
>
> hent = hent->next;
> if (hent)
> ...
>
> Should we really be checking for 'hent != INVALID_ENTRY_PTR' here? If
> not, and hent really can end up as NULL, then we're going to segfault
> on the next loop due to the unchecked 'hent->pos' early in the loop.
> If hent can never be NULL, then we probably don't need this check at
> all.

hent can never be NULL, the code should indeed check for 'hent !=
INVALID_ENTRY_PTR'. The check is not required from a correctness point
of view, the idea is just to avoid calculating the 'good_match'
variable, if you're going to fall out of the loop anyway.

I'm actually a bit surprised the compiler doesn't optimize it that way
anyway, without the explicit if-block, but at least "gcc -O2" (version
4.7.3) seem to do that. So, I guess we should keep the check.

Committed '(hent)' -> '(hent != INVALID_ENTRY_PTR)'. Thanks for the report!

- Heikki