From: Joachim Wieland <joe(at)mcknight(dot)de>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: a faster compression algorithm for pg_dump
Date: 2010-04-08 23:17:45
Message-ID: s2hdc7b844e1004081617j1d3e34b7pc21bb7d7ebb04aab@mail.gmail.com
Lists: pgsql-hackers

I'd like to revive the discussion about offering another compression
algorithm than zlib to at least pg_dump. There has been a previous
discussion here:

http://archives.postgresql.org/pgsql-performance/2009-08/msg00053.php

and it ended without any real conclusion. The findings so far were:

- There exist BSD-licensed compression algorithms
- Nobody knows a patent that is in our way
- Nobody can confirm that no patent is in our way

I do see a very real demand for replacing zlib, which compresses quite
well but is slow as hell. For pg_dump, what people want is cheap
compression: they usually prefer an algorithm that compresses less
optimally but is really fast.

One question that I do not yet see answered is, do we risk violating a
patent even if we just link against a compression library, for example
liblzf, without shipping the actual code?

I have checked what other projects do, especially regarding liblzf,
which would be my favorite choice (BSD license, available for quite
some time...), and there are projects that actually ship the lzf code
(I haven't found one that just links to it). The most prominent
projects are

- KOffice (implements a derived version in
koffice-2.1.2/libs/store/KoXmlReader.cpp)
- Virtual Box (ships it in vbox-ose-1.3.8/src/libs/liblzf-1.51)
- TuxOnIce (formerly known as suspend2 - linux kernel patch, ships it
in the patch)
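
For what it's worth, the liblzf API is about as small as it gets. Just
to make concrete what "linking against it" would mean, here is a
minimal sketch (untested; passing in_len - 1 as out_len so that only
genuinely smaller output is accepted is my reading of lzf.h, not
gospel):

#include <stdio.h>
#include <string.h>
#include <lzf.h>   /* from liblzf, build with -llzf */

int
main(void)
{
    static const char in[] =
        "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa some reasonably repetitive input";
    char         out[sizeof(in)];
    char         back[sizeof(in)];
    unsigned int clen;
    unsigned int dlen;

    /* returns the number of bytes written to out, or 0 if it won't fit */
    clen = lzf_compress(in, sizeof(in), out, sizeof(in) - 1);
    if (clen == 0)
    {
        fprintf(stderr, "not compressible, would have to store it raw\n");
        return 1;
    }

    /* returns the original length, or 0 on error */
    dlen = lzf_decompress(out, clen, back, sizeof(back));

    printf("%u -> %u -> %u bytes, roundtrip %s\n",
           (unsigned int) sizeof(in), clen, dlen,
           (dlen == sizeof(in) && memcmp(in, back, dlen) == 0) ? "ok" : "BROKEN");
    return 0;
}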

We have pg_lzcompress.c which implements the compression routines for
the tuple toaster. Are we sure that we don't violate any patents with
this algorithm?

Joachim


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: a faster compression algorithm for pg_dump
Date: 2010-04-09 03:51:45
Message-ID: r2m407d949e1004082051oc6ddeb1cm5f2b2b7cf259640a@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 9, 2010 at 12:17 AM, Joachim Wieland <joe(at)mcknight(dot)de> wrote:
> One question that I do not yet see answered is, do we risk violating a
> patent even if we just link against a compression library, for example
> liblzf, without shipping the actual code?
>

Generally, patents are infringed when the process is used, so whether
we link against the code or ship it isn't really relevant. The user of
the software would need a patent license either way. We want Postgres
to be usable without being dependent on any copyright or patent
licenses.

Linking against it as an option isn't nearly as bad since the user
compiling it can choose whether to include the restricted feature or
not. That's what we do with readline. However, it's not nearly as
attractive when it restricts what file formats Postgres supports -- it
means someone might generate backup dump files that they later
discover they don't have a legal right to read and restore :(

--
greg


From: Joachim Wieland <joe(at)mcknight(dot)de>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: a faster compression algorithm for pg_dump
Date: 2010-04-10 12:18:09
Message-ID: t2zdc7b844e1004100518na576191am310f8e4313271d07@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 9, 2010 at 5:51 AM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> Linking against it as an option isn't nearly as bad since the user
> compiling it can choose whether to include the restricted feature or
> not. That's what we do with readline. However, it's not nearly as
> attractive when it restricts what file formats Postgres supports -- it
> means someone might generate backup dump files that they later
> discover they don't have a legal right to read and restore :(

If we only linked against it, we'd leave it up to the user to weigh
the risk, as long as we are not aware of any such violation.

Our top priority is to make sure that the project would not be harmed
if one day such a patent showed up. If I understood you correctly,
this is not an issue even if we included lzf, and even less so if we
only link against it. The rest is about user education: if we used lzf
only in pg_dump and not for toasting, we could show a message in
pg_dump when lzf is chosen to make the user aware of the possible
issues.

If we still cannot do this, then what I am asking is: What does the
project need to be able to at least link against such a compression
algorithm? Is it a list of 10, 20, 50 or more other projects using it,
or is it a lawyer saying "There is no patent"? But then, how can we be
sure that the lawyer is right? Or couldn't we include it even if we had
both, because again, we couldn't be sure...?

Joachim


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joachim Wieland <joe(at)mcknight(dot)de>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: a faster compression algorithm for pg_dump
Date: 2010-04-13 19:03:58
Message-ID: 8302.1271185438@sss.pgh.pa.us
Lists: pgsql-hackers

Joachim Wieland <joe(at)mcknight(dot)de> writes:
> If we still cannot do this, then what I am asking is: What does the
> project need to be able to at least link against such a compression
> algorithm?

Well, what we *really* need is a convincing argument that it's worth
taking some risk for. I find that not obvious. You can pipe the output
of pg_dump into your-choice-of-compressor, for example, and that gets
you the ability to spread the work across multiple CPUs in addition to
eliminating legal risk to the PG project. And in any case the general
impression seems to be that the main dump-speed bottleneck is on the
backend side not in pg_dump's compression.
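
For instance, just as an illustration, something like

    pg_dump mydb | gzip >mydb.dump.gz

(with any faster compressor of your choice in place of gzip) already
runs the compression on a different CPU than pg_dump and the backend.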

regards, tom lane


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: a faster compression algorithm for pg_dump
Date: 2010-04-14 08:25:17
Message-ID: 4BC57BED.1090705@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Joachim Wieland <joe(at)mcknight(dot)de> writes:
>> If we still cannot do this, then what I am asking is: What does the
>> project need to be able to at least link against such a compression
>> algorithm?
>
> Well, what we *really* need is a convincing argument that it's worth
> taking some risk for. I find that not obvious. You can pipe the output
> of pg_dump into your-choice-of-compressor, for example, and that gets
> you the ability to spread the work across multiple CPUs in addition to
> eliminating legal risk to the PG project. And in any case the general
> impression seems to be that the main dump-speed bottleneck is on the
> backend side not in pg_dump's compression.

Legal risks aside (I'm not a lawyer so I cannot comment on that), the
current situation imho is:

* for a plain pg_dump the backend is the bottleneck
* for a pg_dump -Fc with compression, compression is a huge bottleneck
* for pg_dump | gzip, it is usually compression (or bytea and some other
datatypes in <9.0)
* for a parallel dump you can either dump uncompressed and compress
afterwards, which increases diskspace requirements (and if you need a
parallel dump you usually have a large database) and complexity (because
you would have to think about how to manually parallelize the
compression)
* for a parallel dump that compresses inline you are limited by the
compression algorithm on a per-core basis, and given that the current
inline compression overhead is huge you lose a lot of the benefits of
parallel dump

Stefan


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: a faster compression algorithm for pg_dump
Date: 2010-04-14 08:33:53
Message-ID: 87iq7uv3f2.fsf@hi-media-techno.com
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> Well, what we *really* need is a convincing argument that it's worth
> taking some risk for. I find that not obvious. You can pipe the output
> of pg_dump into your-choice-of-compressor, for example, and that gets
> you the ability to spread the work across multiple CPUs in addition to
> eliminating legal risk to the PG project.

Well, I like -Fc and playing with the catalog to restore only the
"interesting" data in staging environments. I even automated all the
catalog mangling in pg_staging so that I just have to set up which
schema I want, with only the DDL or with the DATA too.
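
For anyone wondering, the manual version of that catalog dance is
roughly the following (file and database names made up, of course):

    pg_restore -l prod.dump > prod.toc
    (edit prod.toc, keeping only the entries you want restored)
    pg_restore -L prod.toc -d staging prod.dump

pg_staging just automates that editing step for me.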

The fun part is when you want to exclude functions that are used in
triggers based on the schema where the function lives rather than the
trigger's schema -- but that's another story.

So yes, having both -Fc and a compression facility other than plain
gzip would be good news. And benefiting from better compression in
TOAST would be good too, I guess (small size hit, lots faster, would
fit).

Summary: my convincing argument is using the dumps for efficiently
preparing development and testing environments from production data,
thanks to -Fc. That includes skipping data to restore.

Regards,
--
dim


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Joachim Wieland <joe(at)mcknight(dot)de>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: a faster compression algorithm for pg_dump
Date: 2010-04-15 00:29:32
Message-ID: 201004150029.o3F0TWU12704@momjian.us
Lists: pgsql-hackers

Dimitri Fontaine wrote:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
> > Well, what we *really* need is a convincing argument that it's worth
> > taking some risk for. I find that not obvious. You can pipe the output
> > of pg_dump into your-choice-of-compressor, for example, and that gets
> > you the ability to spread the work across multiple CPUs in addition to
> > eliminating legal risk to the PG project.
>
> Well, I like -Fc and playing with the catalog to restore only the
> "interesting" data in staging environments. I even automated all the
> catalog mangling in pg_staging so that I just have to set up which
> schema I want, with only the DDL or with the DATA too.
>
> The fun part is when you want to exclude functions that are used in
> triggers based on the schema where the function lives rather than the
> trigger's schema -- but that's another story.
>
> So yes, having both -Fc and a compression facility other than plain
> gzip would be good news. And benefiting from better compression in
> TOAST would be good too, I guess (small size hit, lots faster, would
> fit).
>
> Summary: my convincing argument is using the dumps for efficiently
> preparing development and testing environments from production data,
> thanks to -Fc. That includes skipping data to restore.

I assume people realize that if they are using pg_dump -Fc and then
compressing the output later, they should turn off compression in
pg_dump, or is that something we should document/suggest?
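
By "turn off compression" I mean something like

    pg_dump -Fc -Z 0 mydb >mydb.dump

i.e. --compress=0, and then running whatever external compressor you
prefer on the resulting file.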

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com


From: daveg <daveg(at)sonic(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joachim Wieland <joe(at)mcknight(dot)de>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: a faster compression algorithm for pg_dump
Date: 2010-04-15 00:54:47
Message-ID: 20100415005447.GI23641@sonic.net
Lists: pgsql-hackers

On Tue, Apr 13, 2010 at 03:03:58PM -0400, Tom Lane wrote:
> Joachim Wieland <joe(at)mcknight(dot)de> writes:
> > If we still cannot do this, then what I am asking is: What does the
> > project need to be able to at least link against such a compression
> > algorithm?
>
> Well, what we *really* need is a convincing argument that it's worth
> taking some risk for. I find that not obvious. You can pipe the output
> of pg_dump into your-choice-of-compressor, for example, and that gets
> you the ability to spread the work across multiple CPUs in addition to
> eliminating legal risk to the PG project. And in any case the general
> impression seems to be that the main dump-speed bottleneck is on the
> backend side not in pg_dump's compression.

My client uses pg_dump -Fc and produces about 700GB of compressed
PostgreSQL dumps nightly from multiple hosts. They also depend on being
able to read and
filter the dump catalog. A faster compression algorithm would be a huge
benefit for dealing with this volume.

-dg

--
David Gould daveg(at)sonic(dot)net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.