Re: Faster compression, again

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Daniel Farina <daniel(at)heroku(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-15 01:44:56
Message-ID: 10658.1331775896@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Mar 14, 2012 at 6:08 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> Doesn't it always start with a header of two int32 values where the
>> first must be smaller than the second? That seems like enough to
>> get traction for an identifiably different header for an alternative
>> compression technique.

> The first of those words is vl_len_, which we can't fiddle with too
> much, but the second is rawsize, which we can definitely fiddle with.
> Right now, rawsize < vl_len_ means it's compressed; and rawsize ==
> vl_len_ means it's uncompressed. As you point out, rawsize > vl_len_
> is undefined; also, and maybe simpler, rawsize < 0 is undefined.

Well, let's please not make the same mistake again of assuming that
there will never again be any other ideas in this space. IOW, let's
find a way to shoehorn in an actual compression-method ID value of some
sort. I don't particularly care for trying to push that into rawsize,
because you don't really have more than about one bit to work with
there, unless you eat the entire word for ID purposes, which seems
excessive.

After looking at pg_lzcompress.c for a bit, it appears to me that the
LSB of the first byte of compressed data must always be zero, because
the very first control bit has to say "copy a literal byte"; you can't
have a back-reference until there's some data in the output buffer.
So what I suggest is that we keep rawsize the same as it is, but peek at
the first byte after that to decide what we have: an even value means the
existing compression method, and an odd value is an ID byte selecting
some new method. This gives us room for 128 new methods before we have
trouble again, while consuming only one byte, which seems like acceptable
overhead for the purpose.
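
A minimal sketch of that first-byte check might look like the following.
The function name, the COMPRESSION_METHOD_PGLZ marker, and the mapping of
odd bytes onto method numbers 0..127 are all hypothetical choices made for
illustration; nothing here is taken from pg_lzcompress.c, and it assumes
only that `data` points at the first byte following rawsize.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker meaning "existing pglz format, no ID byte present". */
#define COMPRESSION_METHOD_PGLZ (-1)

/*
 * Inspect the first byte after rawsize.
 *
 * Even byte: it is the first control byte of existing pglz data, whose
 *            lowest bit is zero because the first item must be a literal.
 * Odd byte:  it is an ID byte selecting a new method.  Mapping the 128
 *            odd values onto 0..127 is an assumption of this sketch.
 */
int
compression_method_id(const uint8_t *data, size_t len)
{
    assert(data != NULL && len >= 1);

    if ((data[0] & 1) == 0)
        return COMPRESSION_METHOD_PGLZ;

    return data[0] >> 1;
}
```

With this mapping, a first byte of 0x01 selects hypothetical method 0 and
0xFF selects method 127, while any even byte is handed to the existing
decompressor unchanged, so already-stored data needs no rewriting.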

regards, tom lane
