Re: Faster compression, again

Lists: pgsql-hackers
From: Daniel Farina <daniel(at)heroku(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Faster compression, again
Date: 2012-03-14 18:06:16
Message-ID: CAAZKuFb59sABSa7gCG0vnVnGb-mJCUBBbrKiyPraNXHnis7KMw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

For 9.3 at a minimum.

The topic of LZO became mired in doubts about:

* Potential Patents
* The author's intention for the implementation to be GPL

Since then, Google released "Snappy," also an LZ77-class
implementation, and it has been ported to C (recently, and with some
quirks, like no LICENSE file...yet, although it is linked from the
original Snappy project). The original Snappy (C++) has a BSD license
and a patent grant (which shields you from Google, at least). Do we
want to investigate a very-fast compression algorithm inclusion again
in the 9.3 cycle?

I've been using the similar implementation "LZO" for WAL archiving and
it is a significant savings (not as much as pg_lesslog, but also less
invasive). It is also fast enough that even if one were not to uproot
TOAST's compression that it would probably be very close to a complete
win for protocol traffic, whereas SSL's standardized zlib can
definitely be a drag in some cases.

This idea resurfaces often, but the reason why I wrote in about it is
because I have a table which I categorized as "small" but was, in
fact, 1.5MB, which made transferring it somewhat slow over a remote
link. zlib compression takes it down to about 550K and lzo (similar,
but not identical) 880K. If we're curious how it affects replication
traffic, I could probably gather statistics on LZO-compressed WAL
traffic, of which we have a pretty huge amount captured.

--
fdr


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-14 18:13:27
Message-ID: 20120314181327.GH7440@aart.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 14, 2012 at 11:06:16AM -0700, Daniel Farina wrote:
> For 9.3 at a minimum.
>
> The topic of LZO became mired in doubts about:
>
> * Potential Patents
> * The author's intention for the implementation to be GPL
>
> Since then, Google released "Snappy," also an LZ77-class
> implementation, and it has been ported to C (recently, and with some
> quirks, like no LICENSE file...yet, although it is linked from the
> original Snappy project). The original Snappy (C++) has a BSD license
> and a patent grant (which shields you from Google, at least). Do we
> want to investigate a very-fast compression algorithm inclusion again
> in the 9.3 cycle?
>

+1 for Snappy and a very fast compression algorithm.

Regards,
Ken


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-14 20:10:29
Message-ID: CAHyXU0xaouYYRLNSTkpAXPReO3imT3YDWd7L39TgN6dKfF8t8w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 14, 2012 at 1:06 PM, Daniel Farina <daniel(at)heroku(dot)com> wrote:
> For 9.3 at a minimum.
>
> The topic of LZO became mired in doubts about:
>
> * Potential Patents
> * The author's intention for the implementation to be GPL
>
> Since then, Google released "Snappy," also an LZ77-class
> implementation, and it has been ported to C (recently, and with some
> quirks, like no LICENSE file...yet, although it is linked from the
> original Snappy project).  The original Snappy (C++) has a BSD license
> and a patent grant (which shields you from Google, at least).  Do we
> want to investigate a very-fast compression algorithm inclusion again
> in the 9.3 cycle?
>
> I've been using the similar implementation "LZO" for WAL archiving and
> it is a significant savings (not as much as pg_lesslog, but also less
> invasive).  It is also fast enough that even if one were not to uproot
> TOAST's compression that it would probably be very close to a complete
> win for protocol traffic, whereas SSL's standardized zlib can
> definitely be a drag in some cases.
>
> This idea resurfaces often, but the reason why I wrote in about it is
> because I have a table which I categorized as "small" but was, in
> fact, 1.5MB, which made transferring it somewhat slow over a remote
> link.  zlib compression takes it down to about 550K and lzo (similar,
> but not identical) 880K.  If we're curious how it affects replication
> traffic, I could probably gather statistics on LZO-compressed WAL
> traffic, of which we have a pretty huge amount captured.

there are plenty of on gpl lz based libraries out there (for example:
http://www.fastlz.org/) and always have been. they are all much
faster than zlib. the main issue is patents...you have to be careful
even though all the lz77/78 patents seem to have expired or apply to
specifics not relevant to general use.

see here for the last round of talks on this:
http://archives.postgresql.org/pgsql-performance/2009-08/msg00052.php

lzo is nearing its 20th birthday, so even if you are paranoid about
patents (admittedly, there is good reason to be), the window is
closing fast to have patent issues that aren't A expired or B covered
by prior art on that or the various copycat implementations, at least
in the US.

snappy looks amazing.

merlin


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-14 20:43:55
Message-ID: 4F61030B.10702@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 03/14/2012 04:10 PM, Merlin Moncure wrote:
> there are plenty of on gpl lz based libraries out there (for example:
> http://www.fastlz.org/) and always have been. they are all much
> faster than zlib. the main issue is patents...you have to be careful
> even though all the lz77/78 patents seem to have expired or apply to
> specifics not relevant to general use.
>

We're not going to include GPL code in the backend. We have enough
trouble with readline and that's only for psql. SO the fact that there
are GPL libraries can't help us, whether there are patent issues or not.

cheers

andrew


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Daniel Farina <daniel(at)heroku(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-14 20:53:45
Message-ID: 20120314205345.GL7440@aart.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 14, 2012 at 04:43:55PM -0400, Andrew Dunstan wrote:
>
>
> On 03/14/2012 04:10 PM, Merlin Moncure wrote:
> >there are plenty of on gpl lz based libraries out there (for example:
> >http://www.fastlz.org/) and always have been. they are all much
> >faster than zlib. the main issue is patents...you have to be careful
> >even though all the lz77/78 patents seem to have expired or apply to
> >specifics not relevant to general use.
> >
>
> We're not going to include GPL code in the backend. We have enough
> trouble with readline and that's only for psql. SO the fact that
> there are GPL libraries can't help us, whether there are patent
> issues or not.
>
> cheers
>
> andrew
>

That is what makes Google's Snappy so appealing, a BSD license.

Regards,
Ken


From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-14 21:03:02
Message-ID: CAHyXU0zXfQgRCWGYo-CSR8Jxrj8xeCKi5wwh9WiAMNdO+v65=w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 14, 2012 at 3:43 PM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>
>
> On 03/14/2012 04:10 PM, Merlin Moncure wrote:
>>
>> there are plenty of on gpl lz based libraries out there (for example:
>> http://www.fastlz.org/) and always have been.  they are all much
>> faster than zlib.  the main issue is patents...you have to be careful
>> even though all the lz77/78 patents seem to have expired or apply to
>> specifics not relevant to general use.
>>
>
> We're not going to include GPL code in the backend. We have enough trouble
> with readline and that's only for psql. SO the fact that there are GPL
> libraries can't help us, whether there are patent issues or not.

er, typo: I meant to say: "*non-gpl* lz based..." :-).

merlin


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-14 21:11:21
Message-ID: CAAZKuFay4rsvFykYx8DNjKCk=CNRCJrV+hSCkVmTRO5CUNqd8g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 14, 2012 at 2:03 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
> er, typo: I meant to say: "*non-gpl* lz based..."  :-).

Given that, few I would say have had the traction that LZO and Snappy
have had, even though in many respects they are interchangeable in the
general trade-off spectrum. The question is: what burden of proof is
required to convince the project that Snappy does not have exorbitant
patent issues, in proportion to the utility of having a compression
scheme of this type integrated?

One would think Google's lawyers did their homework to ensure they
would not be trolled for hideous sums of money by disclosing and
releasing the exact implementation of the compression used virtually
everywhere. If anything, that may have been a more complicated issue
than writing and releasing yet-another-LZ77 implementation.

--
fdr


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-14 21:58:42
Message-ID: 6302.1331762322@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Daniel Farina <daniel(at)heroku(dot)com> writes:
> Given that, few I would say have had the traction that LZO and Snappy
> have had, even though in many respects they are interchangeable in the
> general trade-off spectrum. The question is: what burden of proof is
> required to convince the project that Snappy does not have exorbitant
> patent issues, in proportion to the utility of having a compression
> scheme of this type integrated?

Another not-exactly-trivial requirement is to figure out how to not
break on-disk compatibility when installing an alternative compression
scheme. In hindsight it might've been a good idea if pglz_compress had
wasted a little bit of space on some sort of version identifier ...
but it didn't.

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Daniel Farina" <daniel(at)heroku(dot)com>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Merlin Moncure" <mmoncure(at)gmail(dot)com>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-14 22:08:01
Message-ID: 4F60D07102000025000462C9@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Another not-exactly-trivial requirement is to figure out how to
> not break on-disk compatibility when installing an alternative
> compression scheme. In hindsight it might've been a good idea if
> pglz_compress had wasted a little bit of space on some sort of
> version identifier ... but it didn't.

Doesn't it always start with a header of two int32 values where the
first must be smaller than the second? That seems like enough to
get traction for an identifiably different header for an alternative
compression technique.

-Kevin


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Merlin Moncure <mmoncure(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-14 22:20:42
Message-ID: CAAZKuFZFfY0nz=ttAskKAUEs_0X=ORF9gpkYgWs9B=rVh-3x+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 14, 2012 at 2:58 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Daniel Farina <daniel(at)heroku(dot)com> writes:
>> Given that, few I would say have had the traction that LZO and Snappy
>> have had, even though in many respects they are interchangeable in the
>> general trade-off spectrum. The question is: what burden of proof is
>> required to convince the project that Snappy does not have exorbitant
>> patent issues, in proportion to the utility of having a compression
>> scheme of this type integrated?
>
> Another not-exactly-trivial requirement is to figure out how to not
> break on-disk compatibility when installing an alternative compression
> scheme.  In hindsight it might've been a good idea if pglz_compress had
> wasted a little bit of space on some sort of version identifier ...
> but it didn't.

I was more thinking that the latency and throughput in LZ77 schemes
may be best first applied to protocol compression. The downside is
that requires more protocol wrangling, but at least terabytes of
on-disk format doesn't get in the picture, even though LZ77 on the
data itself may be attractive.

I'm interested allowing layering transports below FEBE (similar to how
SSL is "below", except without the complexity of being tied into auth
& auth) in a couple of respects, and this is one of them.

--
fdr


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-15 01:24:26
Message-ID: CA+Tgmob0M6p69bdrFXXGUbKLqTz4i5ytL+e_B3ZN4D51-cUFQw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 14, 2012 at 6:08 PM, Kevin Grittner
<Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Another not-exactly-trivial requirement is to figure out how to
>> not break on-disk compatibility when installing an alternative
>> compression scheme.  In hindsight it might've been a good idea if
>> pglz_compress had wasted a little bit of space on some sort of
>> version identifier ... but it didn't.
>
> Doesn't it always start with a header of two int32 values where the
> first must be smaller than the second?  That seems like enough to
> get traction for an identifiably different header for an alternative
> compression technique.

The first of those words is vl_len_, which we can't fiddle with too
much, but the second is rawsize, which we can definitely fiddle with.
Right now, rawsize < vl_len_ means it's compressed; and rawsize ==
vl_len_ means it's uncompressed. As you point out, rawsize > vl_len_
is undefined; also, and maybe simpler, rawsize < 0 is undefined.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Daniel Farina <daniel(at)heroku(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-15 01:44:56
Message-ID: 10658.1331775896@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Mar 14, 2012 at 6:08 PM, Kevin Grittner
> <Kevin(dot)Grittner(at)wicourts(dot)gov> wrote:
>> Doesn't it always start with a header of two int32 values where the
>> first must be smaller than the second? That seems like enough to
>> get traction for an identifiably different header for an alternative
>> compression technique.

> The first of those words is vl_len_, which we can't fiddle with too
> much, but the second is rawsize, which we can definitely fiddle with.
> Right now, rawsize < vl_len_ means it's compressed; and rawsize ==
> vl_len_ means it's uncompressed. As you point out, rawsize > vl_len_
> is undefined; also, and maybe simpler, rawsize < 0 is undefined.

Well, let's please not make the same mistake again of assuming that
there will never again be any other ideas in this space. IOW, let's
find a way to shoehorn in an actual compression-method ID value of some
sort. I don't particularly care for trying to push that into rawsize,
because you don't really have more than about one bit to work with
there, unless you eat the entire word for ID purposes which seems
excessive.

After looking at pg_lzcompress.c for a bit, it appears to me that the
LSB of the first byte of compressed data must always be zero, because
the very first control bit has to say "copy a literal byte"; you can't
have a back-reference until there's some data in the output buffer.
So what I suggest is that we keep rawsize the same as it is, but peek at
the first byte after that to decide what we have: even means existing
compression method, an odd value is an ID byte selecting some new
method. This gives us room for 128 new methods before we have trouble
again, while consuming only one byte which seems like acceptable
overhead for the purpose.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Daniel Farina <daniel(at)heroku(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Merlin Moncure <mmoncure(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-15 02:12:24
Message-ID: CA+Tgmob5P+hY-tNLB5jfuehASMCa38NswRo6tMc0B=Mk0k9TtA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 14, 2012 at 9:44 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Well, let's please not make the same mistake again of assuming that
> there will never again be any other ideas in this space.  IOW, let's
> find a way to shoehorn in an actual compression-method ID value of some
> sort.  I don't particularly care for trying to push that into rawsize,
> because you don't really have more than about one bit to work with
> there, unless you eat the entire word for ID purposes which seems
> excessive.

Well, you don't have to go that far. For example, you could dictate
that, when the value is negative, the most significant byte indicates
the compression algorithm is in use (128 possible compression
algorithms). The remaining 3 bytes indicate the compressed length;
but when they're all zero, the compressed length is instead stored in
the following 4-byte word. This consumes one additional 4-byte word
for values that take >= 16MB compressed, but that's presumably a
non-problem.

> After looking at pg_lzcompress.c for a bit, it appears to me that the
> LSB of the first byte of compressed data must always be zero, because
> the very first control bit has to say "copy a literal byte"; you can't
> have a back-reference until there's some data in the output buffer.
> So what I suggest is that we keep rawsize the same as it is, but peek at
> the first byte after that to decide what we have: even means existing
> compression method, an odd value is an ID byte selecting some new
> method.  This gives us room for 128 new methods before we have trouble
> again, while consuming only one byte which seems like acceptable
> overhead for the purpose.

That would work, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Daniel Farina <daniel(at)heroku(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-15 21:53:43
Message-ID: CAAZKuFaFBQ9N91ngGbt5UD1u0aFAM87T7QNMynU-KBonz7gpjA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 14, 2012 at 11:06 AM, Daniel Farina <daniel(at)heroku(dot)com> wrote:
>...and it has been ported to C (recently, and with some
> quirks, like no LICENSE file...yet, although it is linked from the
> original Snappy project).

I poked the author about the license and he fixed it in a jiffy. Now
under BSD, with Intel's Copyright. He seems to be committing a few
enhancements, but the snail's trail of the Internet suggests that this
code has made its way into Linux as well, including btrfs.

So now I guess we can have at it... https://github.com/andikleen/snappy-c/

--
fdr


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-15 22:14:12
Message-ID: CA+U5nMJH+Y4OaDOXpSuARm=3+2F_umvuGzUOpCtLw17E3j+aGg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 14, 2012 at 6:06 PM, Daniel Farina <daniel(at)heroku(dot)com> wrote:

> If we're curious how it affects replication
> traffic, I could probably gather statistics on LZO-compressed WAL
> traffic, of which we have a pretty huge amount captured.

What's the compression like for shorter chunks of data? Is it worth
considering using this for the libpq copy protocol and therefore
streaming replication also?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-15 22:34:35
Message-ID: CAAZKuFbaH2N+bSysom7gsgq+S_nffgGNb0do1KLBSHo05_QKkQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 15, 2012 at 3:14 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Wed, Mar 14, 2012 at 6:06 PM, Daniel Farina <daniel(at)heroku(dot)com> wrote:
>
>> If we're curious how it affects replication
>> traffic, I could probably gather statistics on LZO-compressed WAL
>> traffic, of which we have a pretty huge amount captured.
>
> What's the compression like for shorter chunks of data? Is it worth
> considering using this for the libpq copy protocol and therefore
> streaming replication also?

The overhead is between 1 and 5 bytes that reserve the high bit as a
continuation bit (so one byte for small data), and then straight into
data. So I think it could be applied for most payloads that are a few
bytes wide. Presumably that could be lifted, but the format
description only allows for 2**32 - 1 for the uncompressed size.

I'd really like to find a way to layer both message-oblivious and
message-aware transport under FEBE with both backend and frontend
support without committing the project to new code for-ever-and-ever.
I guess I could investigate it in brief now, unless you've already
thought about/done some work in that area.

--
fdr


From: "ktm(at)rice(dot)edu" <ktm(at)rice(dot)edu>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Daniel Farina <daniel(at)heroku(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-15 22:40:09
Message-ID: 20120315224009.GO7440@aart.rice.edu
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 15, 2012 at 10:14:12PM +0000, Simon Riggs wrote:
> On Wed, Mar 14, 2012 at 6:06 PM, Daniel Farina <daniel(at)heroku(dot)com> wrote:
>
> > If we're curious how it affects replication
> > traffic, I could probably gather statistics on LZO-compressed WAL
> > traffic, of which we have a pretty huge amount captured.
>
> What's the compression like for shorter chunks of data? Is it worth
> considering using this for the libpq copy protocol and therefore
> streaming replication also?
>
> --
>  Simon Riggs                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services

Here is a pointer to some tests with Snappy+CouchDB:

https://github.com/fdmanana/couchdb/blob/b8f806e41727ba18ed6143cee31a3242e024ab2c/snappy-couch-tests.txt

They checked compression on smaller chunks of data. I have extracted the
basic results. The first number is the original size in bytes, followed
by the compressed size in bytes, the percent compressed and the compression
ratio:

77 -> 60, 90% or 1.1:1
120 -> 104, 87% or 1.15:1
127 -> 80, 63% or 1.6:1
5942 -> 2930, 49% or 2:1

It looks like a good candidate for both the libpq copy protocol and
streaming replication. My two cents.

Regards,
Ken


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Daniel Farina <daniel(at)heroku(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Faster compression, again
Date: 2012-03-22 13:37:46
Message-ID: CA+U5nMKbH6rKNVsuG+BFAayEuVcQnOajgE4nEV3EsW=WO7L0UA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 15, 2012 at 10:34 PM, Daniel Farina <daniel(at)heroku(dot)com> wrote:

> I'd really like to find a way to layer both message-oblivious and
> message-aware transport under FEBE with both backend and frontend
> support without committing the project to new code for-ever-and-ever.
> I guess I could investigate it in brief now, unless you've already
> thought about/done some work in that area.

Not done anything in that area myself.

I think its important that we have compression for the COPY protocol
within libpq, so I'll add that to my must-do list - but would be more
than happy if you wanted to tackle that yourself.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Huchev <hugochevrain(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Faster compression, again
Date: 2012-04-03 14:29:06
Message-ID: 1333463346346-5615311.post@n5.nabble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

For a C implementation, it could interesting to consider LZ4 algorithm, since
it is written natively in this language. In contrast, Snappy has been ported
to C by Andy from the original C++ Google code, which lso translate into
less extensive usage and tests.

http://code.google.com/p/lz4/

Furthermode, LZ4 license is BSD. And it has been reported in several tests
as being faster than Snappy/LZO, especially on decompression speed.

http://article.gmane.org/gmane.comp.file-systems.btrfs/15744

And last point, there is a "high compression" mode, which could be useful
for data rarely written/modified but often read.

http://code.google.com/p/lz4hc/

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Faster-compression-again-tp5565675p5615311.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: Daniel Farina <daniel(at)heroku(dot)com>
To: Huchev <hugochevrain(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Faster compression, again
Date: 2012-04-04 23:09:37
Message-ID: CAAZKuFZ3wUT+3jRWb7ZJmybuvnVA14Hn7RR6c8YhnsNfq53V4Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 3, 2012 at 7:29 AM, Huchev <hugochevrain(at)gmail(dot)com> wrote:
> For a C implementation, it could interesting to consider LZ4 algorithm, since
> it is written natively in this language. In contrast, Snappy has been ported
> to C by Andy from the original C++ Google code, which lso translate into
> less extensive usage and tests.

From what I can tell, the C implementation of snappy has more tests
than this LZ4 implementation, including a fuzz tester. It's a
maintained part of Linux as well, and used for btrfs --- this is why
it was ported. The high compression version of LZ4 is apparently
LGPL. And, finally, there is the issue of patents: snappy has several
multi-billion dollar companies that can be held liable (originator
Google, as well as anyone connected to Linux), and to the best of my
knowledge, nobody has been held to extortion yet.

Consider me unconvinced as to this line of argument.

--
fdr


From: Huchev <hugochevrain(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Faster compression, again
Date: 2012-04-05 09:06:28
Message-ID: CAANrMRpQUSJvXkpJRqnjwTO_eUkR5+F5Ps-neNNA428oz2vn2w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Well, the patent argument, used like this, looks like a wild card, which
can be freely interpreted as a mortal danger for some, and a non-issue for
others. A perfect scare-mongerer.
Quite frankly, I don't buy that one implementation is safer because there
is "Google backing it". I can't think of any reason why byte-aligned LZ77
algorithm could face any risk. And btw, just look at the number of
companies which had to pay protection money to Microsoft or face litigation
with Apple because they were using Google's Android. It looks to me that
Google is more a magnet for such dangers than a protector.

Regarding test tools : Yes, this is correct, Snappy C has more fuzzer tools
provided within the package.

Regarding integration to BTRFS, and therefore into Linux, both
implementation look on equal terms. I haven't seen anything which tells
that one has more chances than the other being part of Linux 3.5. In fact,
maybe both will be integrated at the same time.

However, a little publicized fact is that quite a few people tried both
implementation (Snappy C and LZ4), and there were more
failures/difficulties reported on Snappy C. It doesn't mean that Snappy C
is bad, just more complex to use. It seems that the LZ4 implementation is
more straightforward : less dependancies, less risks, less time spent to
properly optimize it,
well, in a word, simpler.

Le 5 avril 2012 01:11, Daniel Farina-4 [via PostgreSQL] <
ml-node+s1045698n5619199h50(at)n5(dot)nabble(dot)com> a écrit :

> On Tue, Apr 3, 2012 at 7:29 AM, Huchev <[hidden email]<http://user/SendEmail.jtp?type=node&node=5619199&i=0>>
> wrote:
> > For a C implementation, it could interesting to consider LZ4 algorithm,
> since
> > it is written natively in this language. In contrast, Snappy has been
> ported
> > to C by Andy from the original C++ Google code, which lso translate into
> > less extensive usage and tests.
>
> From what I can tell, the C implementation of snappy has more tests
> than this LZ4 implementation, including a fuzz tester. It's a
> maintained part of Linux as well, and used for btrfs --- this is why
> it was ported. The high compression version of LZ4 is apparently
> LGPL. And, finally, there is the issue of patents: snappy has several
> multi-billion dollar companies that can be held liable (originator
> Google, as well as anyone connected to Linux), and to the best of my
> knowledge, nobody has been held to extortion yet.
>
> Consider me unconvinced as to this line of argument.
>
> --
> fdr
>
> --
> Sent via pgsql-hackers mailing list ([hidden email]<http://user/SendEmail.jtp?type=node&node=5619199&i=1>)
>
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>
> ------------------------------
> If you reply to this email, your message will be added to the discussion
> below:
>
> http://postgresql.1045698.n5.nabble.com/Faster-compression-again-tp5565675p5619199.html
> To unsubscribe from Faster compression, again, click here<http://postgresql.1045698.n5.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=5565675&code=aHVnb2NoZXZyYWluQGdtYWlsLmNvbXw1NTY1Njc1fDc3NzM5MzkwMA==>
> .
> NAML<http://postgresql.1045698.n5.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml>
>

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Faster-compression-again-tp5565675p5619870.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.