copy with compression progress n

From: Andreas Pflug <pgadmin(at)pse-consulting(dot)de>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: copy with compression progress n
Date: 2006-05-31 09:38:05
Message-ID: 447D63FD.9060609@pse-consulting.de
Lists: pgsql-hackers

I've been playing around with COPYing large binary data, and implemented
a COMPRESSION transfer format. The server-side compression saves
significant bandwidth, which may be the major limiting factor when large
amounts of data are involved (i.e. in many cases where COPY TO/FROM
STDIN/STDOUT is used).
In addition, a progress notification can be enabled using a PROGRESS
<each n lines> option.

I tested this with a table containing 2000 rows with a highly
compressible bytea column (size 1.4GB, on-disk 138MB). Numbers are as
follows (8.2 HEAD psql):
pg_dump -a -F c -t 652s, 146MB
\copy TO /dev/null 322s
\copy TO /dev/null binary 24s
\copy TO /dev/null compression 108s
\copy TO /tmp/file binary 55s, 1.4GB
\copy TO /tmp/file compression 108s, 133MB
\copy TO STDOUT binary|gzip -1 69s, 117MB

So the plain-text copy has a large overhead for this data compared to
the binary formats. OTOH, copying normal rows WITH BINARY may bloat the
result too. A typical test table gave these numbers:
COPY: 6014 bytes
BINARY: 15071 bytes
COMPRESSION: 2334 bytes

The compression (pg_lzcompress) is less efficient than a binary copy
piped to gzip, as long as the data transfer of 1.4GB from server to
client isn't limited by network bandwidth. Apparently, pg_lzcompress
takes 53s to compress to 133MB, while gzip only needs 14s for 117MB.
It might be worth looking into optimizing that, since it's also used in
the tuptoaster. Still, when network traffic is involved, it may be
better to spend some time on the server to reduce the data (e.g. for
Slony, which uses COPY to start a replication, and is likely to be
operated over lines <1GBit/s).
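The speed/ratio tradeoff described above can be reproduced in miniature with any gzip-family codec. The sketch below is purely illustrative (it uses Python's zlib on a synthetic repetitive payload standing in for the highly compressible bytea column, not pg_lzcompress itself), comparing a fast effort level (analogous to gzip -1) against the maximum level:

```python
import time
import zlib

# Synthetic stand-in for the highly compressible bytea data described
# above: a repetitive payload of a few megabytes.
payload = b"PostgreSQL COPY compression test " * 100_000

def compress_at(level):
    """Compress the payload at the given zlib level, returning
    (compressed size, elapsed seconds)."""
    start = time.perf_counter()
    out = zlib.compress(payload, level)
    return len(out), time.perf_counter() - start

fast_size, fast_time = compress_at(1)   # analogous to gzip -1
best_size, best_time = compress_at(9)   # maximum effort

print(f"level 1: {fast_size} bytes in {fast_time:.3f}s")
print(f"level 9: {best_size} bytes in {best_time:.3f}s")
```

On repetitive data like this, the low effort level already removes most of the redundancy in a fraction of the time, which matches the observation that gzip -1 beats pg_lzcompress on both speed and ratio here.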

The attached patch implements COPY ... WITH [BINARY] COMPRESSION
(compression implies BINARY). The copy data uses bit 17 of the flag
field to identify compressed data.
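The documented binary COPY header consists of an 11-byte signature, a 32-bit flags field, and a 32-bit header-extension length; bit 16 of the flags marks the presence of OIDs. The sketch below shows how a reader might detect the bit-17 compression flag the patch proposes. Note that FLAG_COMPRESSED is the patch's proposal, not part of the released format, and parse_copy_header is a hypothetical helper:

```python
import struct

# 11-byte signature of the PostgreSQL binary COPY format.
SIGNATURE = b"PGCOPY\n\xff\r\n\x00"
FLAG_OIDS = 1 << 16        # documented: OIDs included
FLAG_COMPRESSED = 1 << 17  # hypothetical: the patch's compression bit

def parse_copy_header(buf):
    """Parse the fixed binary COPY header: signature, 32-bit flags
    (network byte order), 32-bit header-extension length."""
    if buf[:11] != SIGNATURE:
        raise ValueError("not a binary COPY stream")
    (flags,) = struct.unpack_from("!I", buf, 11)
    (ext_len,) = struct.unpack_from("!I", buf, 15)
    return {
        "has_oids": bool(flags & FLAG_OIDS),
        "compressed": bool(flags & FLAG_COMPRESSED),
        "body_offset": 19 + ext_len,
    }

# A header as the patched server might emit it: compression bit set,
# no header extension.
header = SIGNATURE + struct.pack("!I", FLAG_COMPRESSED) + struct.pack("!I", 0)
print(parse_copy_header(header))
```

Since the format requires readers to reject unknown flag bits in the upper half, an unpatched client would refuse such a stream rather than misread it.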
The PROGRESS <n> option to throw notices every n lines has a caveat:
when copying TO STDOUT, data transfer will cease after the first notice
is sent. This may either mean "don't ereport(NOTICE) while COPYing data
to the client" or a bug somewhere.

Regards,
Andreas

Attachment Content-Type Size
copy-compression.patch text/plain 21.0 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andreas Pflug <pgadmin(at)pse-consulting(dot)de>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: copy with compression progress n
Date: 2006-05-31 14:27:12
Message-ID: 12096.1149085632@sss.pgh.pa.us
Lists: pgsql-hackers

Andreas Pflug <pgadmin(at)pse-consulting(dot)de> writes:
> The attached patch implements COPY ... WITH [BINARY] COMPRESSION
> (compression implies BINARY). The copy data uses bit 17 of the flag
> field to identify compressed data.

I think this is a pretty horrid idea, because it changes pg_lzcompress
from an unimportant implementation detail into a backup file format
that we have to support till the end of time. What happens if, say,
we need to abandon pg_lzcompress because we find out it has patent
problems?

It *might* be tolerable if we used gzip instead, but I really don't see
the argument for doing this inside the server at all: piping to gzip
seems like a perfectly acceptable solution, quite possibly with higher
performance than doing it all in a single process (which isn't going
to be able to use more than one CPU).

I don't see the argument for restricting it to binary only, either.

regards, tom lane


From: Andreas Pflug <pgadmin(at)pse-consulting(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: copy with compression progress n
Date: 2006-05-31 15:31:13
Message-ID: 447DB6C1.1070800@pse-consulting.de
Lists: pgsql-hackers

Tom Lane wrote:
> Andreas Pflug <pgadmin(at)pse-consulting(dot)de> writes:
>
>>The attached patch implements COPY ... WITH [BINARY] COMPRESSION
>>(compression implies BINARY). The copy data uses bit 17 of the flag
>>field to identify compressed data.
>
>
> I think this is a pretty horrid idea, because it changes pg_lzcompress
> from an unimportant implementation detail into a backup file format
> that we have to support till the end of time. What happens if, say,
> we need to abandon pg_lzcompress because we find out it has patent
> problems?
>
> It *might* be tolerable if we used gzip instead,

I used pg_lzcompress because it's present in the backend. I'm fine with
every other good compression algorithm.

> but I really don't see
> the argument for doing this inside the server at all: piping to gzip
> seems like a perfectly acceptable solution,

As I said, that only holds if it is possible to pipe the result into
gzip in a performant way. The issue already arises if psql or any other
COPY client (Slony, pg_dump) is not on the same machine: network
bandwidth will limit throughput.

> quite possibly with higher
> performance than doing it all in a single process (which isn't going
> to be able to use more than one CPU).

Which is pretty normal for pgsql.

> I don't see the argument for restricting it to binary only, either.

That's not a restriction, but a result: compressed data is binary.
Marking it as binary will make it work with older frontends as well,
as long as they don't try to interpret the data. Actually, all 8.x psql
versions should work (with COPY STDxx, not \copy).

Do you have a comment about the progress notification and its impact on
copy to stdout?

Regards,
Andreas


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andreas Pflug <pgadmin(at)pse-consulting(dot)de>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: copy with compression progress n
Date: 2006-05-31 16:26:21
Message-ID: 13166.1149092781@sss.pgh.pa.us
Lists: pgsql-hackers

Andreas Pflug <pgadmin(at)pse-consulting(dot)de> writes:
> Do you have a comment about the progress notification and its impact on
> copy to stdout?

I didn't bother to comment on it because I think it's useless, as well
as broken for the stdout case. Anyone who actually sees a use for it
will have to comment on why they want it.

regards, tom lane


From: Andreas Pflug <pgadmin(at)pse-consulting(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: copy progress notification
Date: 2006-05-31 16:43:23
Message-ID: 447DC7AB.6060109@pse-consulting.de
Lists: pgsql-hackers

Tom Lane wrote:
> Andreas Pflug <pgadmin(at)pse-consulting(dot)de> writes:
>
>>Do you have a comment about the progress notification and its impact on
>>copy to stdout?
>
>
> I didn't bother to comment on it because I think it's useless,

It's useful to see anything happening at all, and to be able to estimate
how long the whole process will take. People might find it interesting
to know whether they should go for a cup of coffee or better come back
the next day...

> as well as broken for the stdout case.

I know it's broken, but why? Is using ereport while sending copy data
illegal by design? If not, it's not the feature that's broken but
something in CVS HEAD.

Regards,
Andreas


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Andreas Pflug <pgadmin(at)pse-consulting(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: copy with compression progress n
Date: 2006-06-01 05:23:31
Message-ID: 1149139411.3839.3.camel@localhost.localdomain
Lists: pgsql-hackers

On Wed, 2006-05-31 at 17:31, Andreas Pflug wrote:
> Tom Lane wrote:
> > Andreas Pflug <pgadmin(at)pse-consulting(dot)de> writes:
> >
> >>The attached patch implements COPY ... WITH [BINARY] COMPRESSION
> >>(compression implies BINARY). The copy data uses bit 17 of the flag
> >>field to identify compressed data.
> >
> >
> > I think this is a pretty horrid idea, because it changes pg_lzcompress
> > from an unimportant implementation detail into a backup file format
> > that we have to support till the end of time. What happens if, say,
> > we need to abandon pg_lzcompress because we find out it has patent
> > problems?
> >
> > It *might* be tolerable if we used gzip instead,
>
> I used pg_lzcompress because it's present in the backend. I'm fine with
> every other good compression algorithm.
>
> > but I really don't see
> > the argument for doing this inside the server at all: piping to gzip
> > seems like a perfectly acceptable solution,
>
> As I said, this hits only if it is possible to pipe the result into gzip
> in a performant way. The issue already arises if psql or any other COPY
> client (slony, pg_dump) is not on the same machine: Network bandwidth
> will limit throughput.

Maybe provide a way to pipe COPY results through some external process
(like gzip) on the server side without giving the client shell access.

To make it secure, the external process should probably be run from a
hardwired directory via chroot.
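A minimal sketch of that idea, under stated assumptions: the server keeps an allow-list of filter commands and pipes COPY output through the chosen one, so the client never supplies an arbitrary command line. ALLOWED_FILTERS and run_filter are hypothetical names, the chroot jail mentioned above is omitted, and a gzip binary is assumed to be on the PATH:

```python
import gzip
import subprocess

# Allow-list of external filters the server is willing to run; the
# client may only name an entry, never pass its own argv.
ALLOWED_FILTERS = {"gzip": ["gzip", "-1"]}

def run_filter(name, data: bytes) -> bytes:
    """Pipe data through an allow-listed external filter and return
    its output. Raises KeyError for any filter not on the list."""
    argv = ALLOWED_FILTERS[name]
    proc = subprocess.run(argv, input=data, capture_output=True, check=True)
    return proc.stdout

# Stand-in for text-format COPY output.
copy_output = b"1\tfoo\n2\tbar\n" * 1000
packed = run_filter("gzip", copy_output)
```

This keeps the compression choice out of the COPY wire format entirely: the output is ordinary gzip data that any client can decompress, which sidesteps the pg_lzcompress lock-in objection raised earlier in the thread.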

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com