Re: Compression and on-disk sorting

Lists: pgsql-hackers pgsql-patches
From: "Bort, Paul" <pbort(at)tmwsystems(dot)com>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression and on-disk sorting
Date: 2006-05-16 13:09:51
Message-ID: DB106B1B5B8F734B8FF3E155A3A556C202D4FA85@clemail1.tmwsystems.com

>
> Compressed-filesystem extension (like e2compr, and I think either
> Fat or NTFS) can do that.
>

Windows (NT/2000/XP) can compress individual directories and files under
NTFS; new files in a compressed directory are compressed by default.

So if the 'spill-to-disk' all happened in its own specific directory, it
would be trivial to mark that directory for compression.

I don't know enough Linux/Unix to know if it has similar capabilities.


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: "Bort, Paul" <pbort(at)tmwsystems(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-16 15:53:42
Message-ID: 4469F586.3050706@dunslane.net

Bort, Paul wrote:
>> Compressed-filesystem extension (like e2compr, and I think either
>> Fat or NTFS) can do that.
>>
>>
>
> Windows (NT/2000/XP) can compress individual directories and files under
> NTFS; new files in a compressed directory are compressed by default.
>
> So if the 'spill-to-disk' all happened in its own specific directory, it
> would be trivial to mark that directory for compression.
>
> I don't know enough Linux/Unix to know if it has similar capabilities.
>
>
>

Or would want to ...

I habitually turn off all compression on my Windows boxes, because it's
a performance hit in my experience. Disk is cheap ...

cheers

andrew


From: Rod Taylor <pg(at)rbt(dot)ca>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-16 15:56:59
Message-ID: 1147795019.69863.303.camel@home

On Tue, 2006-05-16 at 11:53 -0400, Andrew Dunstan wrote:
> Bort, Paul wrote:
> >> Compressed-filesystem extension (like e2compr, and I think either
> >> Fat or NTFS) can do that.
> >>
> >>
> >
> > Windows (NT/2000/XP) can compress individual directories and files under
> > NTFS; new files in a compressed directory are compressed by default.
> >
> > So if the 'spill-to-disk' all happened in its own specific directory, it
> > would be trivial to mark that directory for compression.
> >
> > I don't know enough Linux/Unix to know if it has similar capabilities.

> Or would want to ...
>
> I habitually turn off all compression on my Windows boxes, because it's
> a performance hit in my experience. Disk is cheap ...

Disk storage is cheap. Disk bandwidth or throughput is very expensive.
--


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Rod Taylor <pg(at)rbt(dot)ca>
Cc: "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-16 16:27:42
Message-ID: 4469FD7E.2070805@dunslane.net

Rod Taylor wrote:
>> I habitually turn off all compression on my Windows boxes, because it's
>> a performance hit in my experience. Disk is cheap ...
>>
>
> Disk storage is cheap. Disk bandwidth or throughput is very expensive.
>

Sure, but in my experience using Windows File System compression is not
a win here. Presumably if it were an unqualified win they would have it
turned on everywhere. The fact that there's an option is a good
indication that it isn't in many cases. It is most commonly used for
files like executables that are in effect read-only - but that doesn't
help us.

cheers

andrew


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-16 17:31:07
Message-ID: 20060516173107.GH26212@pervasive.com

On Tue, May 16, 2006 at 12:27:42PM -0400, Andrew Dunstan wrote:
> Rod Taylor wrote:
> >>I habitually turn off all compression on my Windows boxes, because it's
> >>a performance hit in my experience. Disk is cheap ...
> >>
> >
> >Disk storage is cheap. Disk bandwidth or throughput is very expensive.

Hey, that's my line! :P

> Sure, but in my experience using Windows File System compression is not
> a win here. Presumably if it were an unqualified win they would have it
> turned on everywhere. The fact that there's an option is a good
> indication that it isn't in many cases. It is most commonly used for
> files like executables that are in effect read-only - but that doesn't
> help us.

The issue with filesystem level compression is that it has to support
things like random access, which isn't needed for on-disk sorting (not
sure about other things like hashing, etc).

In any case, my curiosity is aroused, so I'm currently benchmarking
pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
some benchmarks with pgsql_tmp compressed.

Does anyone have time to hack some kind of compression into the on-disk
sort code just to get some benchmark numbers? Unfortunately, doing so is
beyond my meager C ability...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-16 20:42:46
Message-ID: 20060516204246.GW26212@pervasive.com

On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
> In any case, my curiosity is aroused, so I'm currently benchmarking
> pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
> some benchmarks with pgsql_tmp compressed.

Results: http://jim.nasby.net/bench.log

As expected, compressing $PGDATA/base was a loss. But compressing
pgsql_tmp and then doing some disk-based sorts did show an improvement,
from 366.1 seconds to 317.3 seconds, an improvement of 13.3%. This is on
a Windows XP laptop (Dell Latitude D600) with 512MB, so it's somewhat of
a worst-case scenario. On the other hand, XP's compression algorithm
appears to be pretty aggressive, as it cut the size of the on-disk sort
file from almost 700MB to 82MB. There are probably gains to be had from a
different compression algorithm.
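
For reference, the quoted figures work out as follows (a quick sanity check of the numbers above, not a new measurement; the 700MB sort-file size is approximate):

```python
# Figures quoted in the post above.
before_s, after_s = 366.1, 317.3
improvement = (before_s - after_s) / before_s
print(f"{improvement:.1%}")  # about 13.3% faster

size_before_mb, size_after_mb = 700, 82
ratio = size_before_mb / size_after_mb
print(f"{ratio:.1f}x")  # roughly 8.5x smaller on disk
```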

> Does anyone have time to hack some kind of compression into the on-disk
> sort code just to get some benchmark numbers? Unfortunately, doing so is
> beyond my meager C ability...


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-16 21:46:15
Message-ID: 20060516214615.GM976@svana.org

On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
> Does anyone have time to hack some kind of compression into the on-disk
> sort code just to get some benchmark numbers? Unfortunately, doing so is
> beyond my meager C ability...

I had a look at this. At first glance it doesn't seem too hard, except
the whole logtape process kinda gets in the way. If it weren't for the
mark/restore it'd be trivial. Might take a stab at it some time, if I
can think of a way to handle the seeking...

--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-16 21:50:22
Message-ID: 20060516215021.GC26212@pervasive.com

On Tue, May 16, 2006 at 11:46:15PM +0200, Martijn van Oosterhout wrote:
> On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
> > Does anyone have time to hack some kind of compression into the on-disk
> > sort code just to get some benchmark numbers? Unfortunately, doing so is
> > beyond my meager C ability...
>
> I had a look at this. At first glance it doesn't seem too hard, except
> the whole logtape process kinda gets in the way. If it weren't for the
> mark/restore it'd be trivial. Might take a stab at it some time, if I
> can think of a way to handle the seeking...

Oh, do we need to randomly seek? Is that how we switch from one tape to
another?

It might be easier to switch to giving each tape its own file...


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-16 21:58:02
Message-ID: 20060516215802.GN976@svana.org

On Tue, May 16, 2006 at 04:50:22PM -0500, Jim C. Nasby wrote:
> > I had a look at this. At first glance it doesn't seem too hard, except
> > the whole logtape process kinda gets in the way. If it weren't for the
> > mark/restore it'd be trivial. Might take a stab at it some time, if I
> > can think of a way to handle the seeking...
>
> Oh, do we need to randomly seek? Is that how we switch from one tape to
> another?

Not seek, mark/restore. As the code describes, sometimes you go back a
tuple. The primary reason, I think, is the final pass: a merge step
might read the tuples multiple times, so it needs to support it there.

> It might be easier to switch to giving each tape its own file...

I don't think it would make much difference. OTOH, if this turns out to
be a win, the tuplestore could have the same optimisation.

Have a nice day,


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-16 22:48:25
Message-ID: 877j4lsi12.fsf@stark.xeocode.com

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:

> > It might be easier to switch to giving each tape its own file...
>
> I don't think it would make much difference. OTOH, if this turns out to
> be a win, the tuplestore could have the same optimisation.

Would giving each tape its own file make it easier to allow multiple temporary
sort areas and allow optimizing to avoid seeking when multiple spindles are
available?

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-16 22:48:33
Message-ID: 12911.1147819713@sss.pgh.pa.us

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> Not seek, mark/restore. As the code describes, sometimes you go back a
> tuple. The primary reason, I think, is the final pass: a merge step
> might read the tuples multiple times, so it needs to support it there.

However it'd be possible to tell logtape in advance whether a particular
tape needs to support that, and only apply compression when not; it
would work all the time for intermediate merge passes, and with the
recent executor changes to pass down you-need-to-support-mark flags,
it'd work for the output pass in a lot of cases too.
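
A toy sketch of that idea (hypothetical names, nothing like the real logtape.c interface): the caller declares up front whether mark/restore will be needed, and the tape enables compression only when it will not.

```python
import io
import zlib

class ToyTape:
    """Toy illustration only: compression is enabled only when the
    caller promises it will never mark/restore this tape."""

    def __init__(self, needs_mark_restore: bool):
        self.buf = io.BytesIO()
        # A compressed stream can't be cheaply repositioned, so a tape
        # that must support mark/restore stays uncompressed.
        self.comp = None if needs_mark_restore else zlib.compressobj(3)

    def write(self, data: bytes) -> None:
        self.buf.write(self.comp.compress(data) if self.comp else data)

    def finish(self) -> bytes:
        if self.comp:
            self.buf.write(self.comp.flush())
            self.comp = None
        return self.buf.getvalue()

tuples = b"some highly repetitive tuple data " * 500
plain = ToyTape(needs_mark_restore=True)
plain.write(tuples)
plain_bytes = plain.finish()

packed = ToyTape(needs_mark_restore=False)
packed.write(tuples)
packed_bytes = packed.finish()
```

The mark-capable tape stores the raw bytes; the other one should come out far smaller on repetitive sort data.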

If you're just trying to get some quick and dirty numbers: do
compression, replace Seek/Tell with PANICs, and only test on plain
sorts, no joins ;-)

regards, tom lane


From: Andrew Piskorski <atp(at)piskorski(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 00:14:34
Message-ID: 20060517001434.GA3222@tehun.pair.com

On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
> On Tue, May 16, 2006 at 12:27:42PM -0400, Andrew Dunstan wrote:
> > Rod Taylor wrote:
> > >>I habitually turn off all compression on my Windows boxes, because it's
> > >>a performance hit in my experience. Disk is cheap ...
> > >
> > >Disk storage is cheap. Disk bandwidth or throughput is very expensive.

> > Sure, but in my experience using Windows File System compression is not
> > a win here. Presumably if it were an unqualified win they would have it

> Does anyone have time to hack some kind of compression into the on-disk
> sort code just to get some benchmark numbers? Unfortunately, doing so is
> beyond my meager C ability...

Folks, first of all, I'm in no way an expert on data compression in
RDBMSs, but other databases DO include data compression features and
claim it as a SIGNIFICANT win in I/O reduction.

Looking at performance of the Windows File System compression, etc.,
doesn't make too much sense when there are actual RDBMS compression
implementations to compare to, on the commercial market, in open
source code, and in the academic literature.

Oracle has included "table compression" since 9iR2. They report table
size reductions of 2x to 4x as typical, with proportional reductions
in I/O, and supposedly, usually low to negligible overhead for writes:

http://download-east.oracle.com/docs/cd/B19306_01/server.102/b14211/build_db.htm#sthref289

Decision Speed: Table Compression In Action by Meikel Poess and Hermann Baer (2003):
http://www.oracle.com/technology/oramag/webcolumns/2003/techarticles/poess_tablecomp.html

Compressing Data for Space and Speed by Sanjay Mishra (2004):
http://www.oracle.com/technology/oramag/oracle/04-mar/o24tech_data.html

Order For Maximum Compression:
http://oramossoracle.blogspot.com/2005/11/table-compression-order-for-maximum.html

I don't remember whether the current (Open Source) MonetDB includes
table compression or not, but they've published papers with lots of
interesting detail on the compression and other high performance OLAP
features in their latest (not released) "X100" MonetDB research
codebase:

http://monetdb.cwi.nl/
http://homepages.cwi.nl/~mk/MonetDB/
http://sourceforge.net/projects/monetdb/
ftp://ftp.research.microsoft.com/pub/debull/A05june/issue1.htm

Now, the docs and papers above are all focused on query performance,
they say nothing directly about using compression for on-disk
sorts. But, I'd bet that similar rules of thumb will apply in both
cases.

The main tricks seem to be: One, EXTREMELY lightweight compression
schemes - basically table lookups designed to be as CPU-friendly as
possible. Two, keep the data compressed in RAM as well, so that you can
also cache more of the data, and indeed keep it compressed until as
late in the CPU processing pipeline as possible.
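
As a rough illustration of the lookup-table style of lightweight compression described above (a generic dictionary-encoding sketch, not any particular product's scheme):

```python
def dict_encode(column):
    """Lightweight 'lookup table' compression: store each distinct
    value once and replace row values with small integer codes."""
    table = {}
    codes = [table.setdefault(v, len(table)) for v in column]
    lookup = sorted(table, key=table.get)  # code -> value
    return lookup, codes

def dict_decode(lookup, codes):
    # Decompression is a plain array index: very CPU-friendly.
    return [lookup[c] for c in codes]

# 1000 rows but only 3 distinct values: the table holds 3 entries
# and each row shrinks to one small code.
column = ["2006-05-16"] * 400 + ["2006-05-17"] * 400 + ["2006-05-18"] * 200
lookup, codes = dict_encode(column)
assert dict_decode(lookup, codes) == column
```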

A corollary of that is to forget compression schemes like gzip - it
reduces data size nicely but is far too slow on the cpu to be
particularly useful in improving overall throughput rates.

Note, I have not really tested ANY of the above myself; your mileage
may well vary from what I recall from those various articles...

--
Andrew Piskorski <atp(at)piskorski(dot)com>
http://www.piskorski.com/


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Andrew Piskorski <atp(at)piskorski(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 03:48:21
Message-ID: 871wuts456.fsf@stark.xeocode.com

Andrew Piskorski <atp(at)piskorski(dot)com> writes:

> The main tricks seem to be: One, EXTREMELY lightweight compression
> schemes - basically table lookups designed to be as cpu friendly as
> possible. Two, keep the data compressed in RAM as well so that you can
> also cache more of the data, and indeed keep it compressed until
> as late in the CPU processing pipeline as possible.
>
> A corollary of that is to forget compression schemes like gzip - it
> reduces data size nicely but is far too slow on the cpu to be
> particularly useful in improving overall throughput rates.

There are some very fast decompression algorithms:

http://www.oberhumer.com/opensource/lzo/

I think most of the mileage from "lookup tables" would be better implemented
at a higher level by giving tools to data modellers that let them achieve
denser data representations. Things like convenient enum data types, 1-bit
boolean data types, short integer data types, etc.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Andrew Piskorski <atp(at)piskorski(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 04:03:15
Message-ID: 15446.1147838595@sss.pgh.pa.us

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Andrew Piskorski <atp(at)piskorski(dot)com> writes:
>> A corollary of that is to forget compression schemes like gzip - it
>> reduces data size nicely but is far too slow on the cpu to be
>> particularly useful in improving overall throughput rates.

> There are some very fast decompression algorithms:

AFAICS the only sane choice here is to use
src/backend/utils/adt/pg_lzcompress.c, on the grounds that (1) it's
already in the backend, and (2) data compression in general is such a
minefield of patents that we'd be foolish to expose ourselves in more
than one direction.

Certainly, if you can't prototype a convincing performance win using
that algorithm, it's unlikely to be worth anyone's time to look harder.

regards, tom lane


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Andrew Piskorski <atp(at)piskorski(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 08:45:59
Message-ID: 20060517084559.GC15180@svana.org

On Wed, May 17, 2006 at 12:03:15AM -0400, Tom Lane wrote:
> AFAICS the only sane choice here is to use
> src/backend/utils/adt/pg_lzcompress.c, on the grounds that (1) it's
> already in the backend, and (2) data compression in general is such a
> minefield of patents that we'd be foolish to expose ourselves in more
> than one direction.

Unfortunately, the interface provided by pg_lzcompress.c is probably
insufficient for this purpose. You want to be able to compress tuples
as they get inserted and start a new block once the output reaches a
certain size. pg_lzcompress.c only has the options compress-whole-block
and decompress-whole-block.

zlib allows you to compress as the data comes along, keeping an eye on
the output buffer while you do it. For an initial test, using zlib
directly would probably be easier. If it works out we can look into
alternatives.
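
The streaming style described above might look roughly like this (a sketch using Python's zlib bindings for brevity; the zlib C API's deflate() with an input/output buffer pair behaves the same way, and the BLOCK size and function name here are made up for illustration):

```python
import zlib

BLOCK = 8192  # stand-in for the on-disk block size

def compress_to_blocks(tuples, level=3):
    """Feed tuples to zlib incrementally, emitting a block whenever the
    compressed output reaches BLOCK bytes. No marshalling of the whole
    file into memory is required."""
    comp = zlib.compressobj(level)
    pending, blocks = b"", []
    for t in tuples:
        pending += comp.compress(t)
        while len(pending) >= BLOCK:
            blocks.append(pending[:BLOCK])
            pending = pending[BLOCK:]
    pending += comp.flush()
    # Whatever is left becomes the final (possibly short) block(s).
    blocks.extend(pending[i:i + BLOCK] for i in range(0, len(pending), BLOCK))
    return blocks

rows = [b"row %06d: some sortable payload\n" % i for i in range(20000)]
blocks = compress_to_blocks(rows)
```

Reading back is the mirror image: decompress block by block, handing out tuples as they become available.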

Have a nice day,


From: Andrew Piskorski <atp(at)piskorski(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Greg Stark <gsstark(at)mit(dot)edu>
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 08:52:30
Message-ID: 20060517085230.GA53017@tehun.pair.com

On Tue, May 16, 2006 at 11:48:21PM -0400, Greg Stark wrote:

> There are some very fast decompression algorithms:
>
> http://www.oberhumer.com/opensource/lzo/

Sure, and for some tasks in PostgreSQL perhaps it would be useful.
But at least as of July 2005, Sandor Heman, one of the MonetDB
developers, had looked at zlib, bzip2, LZRW, and LZO, and claimed that:

"... in general, it is very unlikely that we could achieve any
bandwidth gains with these algorithms. LZRW and LZO might increase
bandwidth on relatively slow disk systems, with bandwidths up to
100MB/s, but this would induce high processing overheads, which
interferes with query execution. On a fast disk system, such as our
350MB/s 12 disk RAID, all the generic algorithms will fail to achieve
any speedup."

http://www.google.com/search?q=MonetDB+LZO+Heman&btnG=Search
http://homepages.cwi.nl/~heman/downloads/msthesis.pdf

> I think most of the mileage from "lookup tables" would be better implemented
> at a higher level by giving tools to data modellers that let them achieve
> denser data representations. Things like convenient enum data types, 1-bit
> boolean data types, short integer data types, etc.

Things like enums and 1 bit booleans certainly could be useful, but
they cannot take advantage of duplicate values across multiple rows at
all, even if 1000 rows have the exact same value in their "date"
column and are all in the same disk block, right?

Thus I suspect that the exact opposite is true: a good table
compression scheme would render special denser data types largely
redundant and obsolete.

Good table compression might be a lot harder to do, of course.
Certainly Oracle's implementation of it had some bugs which made it
difficult to use reliably in practice (in certain circumstances
updates could fail, or if not fail perhaps have pathological
performance), bugs which are supposed to be fixed in 10.2.0.2, which
was only released within the last few months.

--
Andrew Piskorski <atp(at)piskorski(dot)com>
http://www.piskorski.com/


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 14:56:07
Message-ID: 20060517145607.GJ26212@pervasive.com

On Tue, May 16, 2006 at 06:48:25PM -0400, Greg Stark wrote:
> Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
>
> > > It might be easier to switch to giving each tape its own file...
> >
> > I don't think it would make much difference. OTOH, if this turns out to
> > be a win, the tuplestore could have the same optimisation.
>
> Would giving each tape its own file make it easier to allow multiple temporary
> sort areas and allow optimizing to avoid seeking when multiple spindles are
> available?

Only if those spindles weren't all in a single RAID array and if we went
through the trouble of creating all the machinery so you could tell
PostgreSQL where all those spindles were mounted in the filesystem. And
even after all that work, there's not much that says it would perform
better than a simple RAID10.

What *might* make sense would be to provide two locations for pgsql_tmp,
because a lot of operations in there involve reading and writing at the
same time:

Read from heap while writing tapes to pgsql_tmp
read from tapes while writing final version to pgsql_tmp

There's probably some gain to be had by writing the final version to a
tablespace other than the default one (which is where pgsql_tmp would
be, I think). But recent changes in -HEAD mean that the second step is
now only performed in certain cases.


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Andrew Piskorski <atp(at)piskorski(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 15:01:33
Message-ID: 20060517150132.GK26212@pervasive.com

If we're going to consider table-level compression, ISTM the logical
first step is to provide greater control over TOASTing; namely
thresholds for when to compress and/or go to external storage that can
be set on a per-field or at least per-table basis.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 15:38:05
Message-ID: 20432.1147880285@sss.pgh.pa.us

"Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
> What *might* make sense would be to provide two locations for pgsql_tmp,
> because a lot of operations in there involve reading and writing at the
> same time:

> Read from heap while writing tapes to pgsql_tmp
> read from tapes while writing final version to pgsql_tmp

Note that a large part of the reason for the current logtape.c design
is to avoid requiring 2X or more disk space to sort X amount of data.
AFAICS, any design that does the above will put us right back in the 2X
regime. That's a direct, measurable penalty; it'll take more than
handwaving arguments to convince me we should change it in pursuit of
unquantified speed benefits.

regards, tom lane


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 15:45:04
Message-ID: 20060517154504.GR26212@pervasive.com

On Wed, May 17, 2006 at 11:38:05AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
> > What *might* make sense would be to provide two locations for pgsql_tmp,
> > because a lot of operations in there involve reading and writing at the
> > same time:
>
> > Read from heap while writing tapes to pgsql_tmp
> > read from tapes while writing final version to pgsql_tmp
>
> Note that a large part of the reason for the current logtape.c design
> is to avoid requiring 2X or more disk space to sort X amount of data.
> AFAICS, any design that does the above will put us right back in the 2X
> regime. That's a direct, measurable penalty; it'll take more than
> handwaving arguments to convince me we should change it in pursuit of
> unquantified speed benefits.

I certainly agree that there's no reason to make this change without
testing, but if you've got enough spindles laying around to actually
make use of this I suspect that requiring 2X the disk space to sort X
won't bother you.

Actually, I suspect in most cases it won't matter; I don't think people
make a habit of trying to sort their entire database. :) But we'd want
to protect for the oddball cases... yech.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 15:57:24
Message-ID: 20624.1147881444@sss.pgh.pa.us

"Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
> On Wed, May 17, 2006 at 11:38:05AM -0400, Tom Lane wrote:
>> Note that a large part of the reason for the current logtape.c design
>> is to avoid requiring 2X or more disk space to sort X amount of data.

> Actually, I suspect in most cases it won't matter; I don't think people
> make a habit of trying to sort their entire database. :)

Well, some years ago we used something like 4X space to sort X amount of
data (this is the native behavior of 7-way polyphase merge, it turns out)
and we got yelled at. Which is what prompted the writing of logtape.c.
Maybe disk space has gotten so cheap since then that it no longer
matters ... but I suspect the nature of DB applications is that people
are always pushing the envelope of what their hardware can do.

regards, tom lane


From: Rod Taylor <pg(at)rbt(dot)ca>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 16:16:13
Message-ID: 1147882573.23427.3.camel@home

> Actually, I suspect in most cases it won't matter; I don't think people
> make a habit of trying to sort their entire database. :) But we'd want
> to protect for the oddball cases... yech.

I can make query result sets that are far larger than the database
itself.

create table fat_table_with_few_tuples(fat_status_id serial primary key,
fat_1 text, fat_2 text);

create table thin_table_with_many_tuples(fat_status_id integer
references fat_table_with_few_tuples, thin_1 integer, thin_2 integer);

SELECT * FROM thin_table_with_many_tuples NATURAL JOIN
fat_table_with_few_tuples order by fat_1, thin_1, thin_2, fat_2;

I would be asking the folks trying to use PostgreSQL for data
warehousing what their opinion is. A few fact tables in an audit query
could easily result in a very large amount of temporary diskspace being
required.
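As a back-of-envelope illustration of that point (the row counts and widths below are hypothetical, chosen only for the arithmetic, not taken from the thread), a thin-by-fat join can need sort space far beyond the base tables:

```python
# Hypothetical sizes, chosen only to illustrate the blow-up; not measurements.
thin_rows = 10_000_000
thin_row_bytes = 12        # three 4-byte integer columns
fat_rows = 1_000
fat_row_bytes = 2_000      # a couple of wide text columns

# On-disk size of the base tables (ignoring tuple headers and padding):
base_size = thin_rows * thin_row_bytes + fat_rows * fat_row_bytes

# The join output repeats the fat columns once per thin row, and the whole
# thing must be fed to the sort:
join_size = thin_rows * (thin_row_bytes + fat_row_bytes)

print(base_size)   # 122_000_000 bytes  (~120 MB of base data)
print(join_size)   # 20_120_000_000 bytes  (~20 GB of sort input)
```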

--


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: pgsql-patches(at)postgresql(dot)org
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: [PATCH] Compression and on-disk sorting
Date: 2006-05-17 16:17:30
Message-ID: 20060517161730.GI15180@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Pursuant to the discussions currently on -hackers, here's a patch that
uses zlib to compress the tapes as they go to disk. It defaults to
compression level 3 (think gzip -3).

Please speed test all you like, I *think* it's bug free, but you never
know.

Outstanding questions:

- I use zlib because the built-in pg_lzcompress can't do what zlib does.
Here we set up input and output buffers and zlib will process as much as
it can (input empty or output full). This means no marshalling is
required. We can compress the whole file without having it in memory.

- zlib allocates memory for compression and decompression; I don't know
how much. However, it allocates via the postgres mcxt system so it
shouldn't be too hard to find out. Simon pointed out that we'll need to
track this because we might allow hundreds of tapes.

- Each tape is compressed as one long compressed stream. Currently no
seeking is allowed, so only sorts, no joins! (As Tom said, quick and
dirty numbers.) This should show this possibility in its best light,
but if we want to support seeking we're going to need to change that.
Maybe no compression on the last pass?

- It's probable that the benefits are strongly correlated to the speed
of your disk subsystem. We need to measure this effect. I can't
accurately measure this because my compiler doesn't inline any of the
functions in tuplesort.c.

In my test of a compression ratio around 100-to-1, on 160MB of data
with tiny work_mem on my 5-year-old laptop, it speeds it up by 60%, so
it's obviously not a complete waste of time. Of course, YMMV :)
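A rough Python sketch of the streaming model described above (the patch itself is C inside logtape.c; the function names here are illustrative): input chunks go into a zlib stream and compressed bytes come out whenever zlib has output ready, so the whole file never has to sit in memory.

```python
import zlib

def compress_stream(chunks, level=3):
    """Compress an iterable of byte chunks as one continuous zlib stream."""
    comp = zlib.compressobj(level)
    for chunk in chunks:
        out = comp.compress(chunk)   # may return b'' until buffers fill
        if out:
            yield out
    tail = comp.flush()              # drain whatever zlib still buffers
    if tail:
        yield tail

def decompress_stream(chunks):
    decomp = zlib.decompressobj()
    for chunk in chunks:
        out = decomp.decompress(chunk)
        if out:
            yield out
    tail = decomp.flush()
    if tail:
        yield tail

# Highly repetitive "tuples", so the stream compresses well:
data = [b"tuple-%05d\n" % i for i in range(10000)]
compressed = list(compress_stream(data))
restored = b"".join(decompress_stream(compressed))
assert restored == b"".join(data)
assert sum(len(c) for c in compressed) < len(restored)
```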

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.

Attachment Content-Type Size
compress-sort.patch text/plain 9.3 KB

From: Greg Stark <gsstark(at)mit(dot)edu>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 16:55:53
Message-ID: 87k68kr3om.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


"Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:

> Only if those spindles weren't all in a single RAID array and if we went
> through the trouble of creating all the machinery so you could tell
> PostgreSQL where all those spindles were mounted in the filesystem.

I think the way you do this is simply by documenting that the admin should
create precisely one temp area per physical spindle (or raid array).

--
greg


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Andrew Piskorski <atp(at)piskorski(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Greg Stark <gsstark(at)mit(dot)edu>
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 17:01:11
Message-ID: 87ejysr3fs.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


Andrew Piskorski <atp(at)piskorski(dot)com> writes:

> Things like enums and 1 bit booleans certainly could be useful, but
> they cannot take advantage of duplicate values across multiple rows at
> all, even if 1000 rows have the exact same value in their "date"
> column and are all in the same disk block, right?

That's an interesting direction to go in. Generic algorithms would still help
in that case: since the identical value would occur more frequently than other
values, it would be encoded with a smaller symbol. But there's going to be a
limit to how compressed it can get the data.

The ideal way to handle the situation you're describing would be to interleave
the tuples so that you have all 1000 values of the first column, followed by
all 1000 values of the second column and so on. Then you run a generic
algorithm on this and it achieves very high compression rates since there are
a lot of repeating patterns.

I don't see how you build a working database with data in this form, however.
For example, a single insert would require updating small pieces of data
across the entire table. Perhaps there's some middle ground with interleaving
the tuples within a single compressed page, or something like that?
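A contrived sketch of the interleaving effect, with zlib standing in for the generic algorithm and a constant "date" column next to a random integer column (all values here are made up for illustration). Grouping the repeated column together tends to compress better in this setup, though as noted elsewhere in the thread an LZ lookback window narrows the gap:

```python
import random
import struct
import zlib

random.seed(42)
date = b"2006-05-17"                        # same value in every row
rows = [(date, random.getrandbits(32)) for _ in range(5000)]

# Row-major: date and integer interleaved, as tuples are stored today.
row_major = b"".join(d + struct.pack("<I", v) for d, v in rows)
# Column-major: all 5000 dates first, then all 5000 integers.
col_major = (b"".join(d for d, _ in rows) +
             b"".join(struct.pack("<I", v) for _, v in rows))

row_c = zlib.compress(row_major, 6)
col_c = zlib.compress(col_major, 6)

# The repeated column collapses to almost nothing when grouped together,
# while the random integers stay incompressible either way.
assert len(row_major) == len(col_major) == 5000 * 14
assert len(col_c) < len(row_c)
```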

--
greg


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCH] Compression and on-disk sorting
Date: 2006-05-17 17:38:47
Message-ID: 1147887527.2646.323.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Wed, 2006-05-17 at 18:17 +0200, Martijn van Oosterhout wrote:
> Pursuant to the discussions currently on -hackers, here's a patch that
> uses zlib to compress the tapes as they go to disk. It defaults to
> compression level 3 (think gzip -3).
>
> Please speed test all you like, I *think* it's bug free, but you never
> know.
>
> Outstanding questions:
>
> - I use zlib because the built-in pg_lzcompress can't do what zlib does.
> Here we set up input and output buffers and zlib will process as much as
> it can (input empty or output full). This means no marshalling is
> required. We can compress the whole file without having it in memory.

The licence is BSD-compatible and it works the way we need it to.

> - Each tape is compressed as one long compressed stream. Currently no
> seeking is allowed, so only sorts, no joins! (As tom said, quick and
> dirty numbers). This should show this possibility in its best light
> but if we want to support seeking we're going to need to change that.
> Maybe no compression on the last pass?

We should be able to do this without significant loss of compression by
redefining the lts block size to be 32k. That's the size of the
look-back window anyhow, so compressing the whole stream doesn't get us
much more.

> - It's probable that the benefits are strongly correlated to the speed
> of your disk subsystem. We need to measure this effect. I can't
> accurately measure this because my compiler doesn't inline any of the
> functions in tuplesort.c.

Please make sure any tests have trace_sort = on.

> In my test of a compression ratio around 100-to-1, on 160MB of data
> with tiny work_mem on my 5-year-old laptop, it speeds it up by 60%, so
> it's obviously not a complete waste of time. Of course, YMMV :)

Sounds good. Well done.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Andrew Piskorski <atp(at)piskorski(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 19:30:54
Message-ID: 24764.1147894254@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Greg Stark <gsstark(at)mit(dot)edu> writes:
> The ideal way to handle the situation you're describing would be to interleave
> the tuples so that you have all 1000 values of the first column, followed by
> all 1000 values of the second column and so on. Then you run a generic
> algorithm on this and it achieves very high compression rates since there are
> a lot of repeating patterns.

It's not obvious to me that that yields a form more compressible than
what we have now. As long as the previous value is within the lookback
window, an LZ-style compressor will still be able to use it. More
importantly, the layout you describe would be unable to take advantage
of any cross-column correlation, which in real data is likely to be a
useful property for compression.

regards, tom lane


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 19:50:59
Message-ID: 20060517195059.GE26212@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Wed, May 17, 2006 at 12:55:53PM -0400, Greg Stark wrote:
>
> "Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
>
> > Only if those spindles weren't all in a single RAID array and if we went
> > through the trouble of creating all the machinery so you could tell
> > PostgreSQL where all those spindles were mounted in the filesystem.
>
> I think the way you do this is simply by documenting that the admin should
> create precisely one temp area per physical spindle (or raid array).

And you still need some way to tell PostgreSQL about all of that.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Rod Taylor <pg(at)rbt(dot)ca>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 19:54:00
Message-ID: 20060517195400.GF26212@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Wed, May 17, 2006 at 12:16:13PM -0400, Rod Taylor wrote:
> > Actually, I suspect in most cases it won't matter; I don't think people
> > make a habit of trying to sort their entire database. :) But we'd want
> > to protect for the oddball cases... yech.
>
> I can make query result sets that are far larger than the database
> itself.
>
> create table fat_table_with_few_tuples(fat_status_id serial primary key,
> fat_1 text, fat_2 text);
>
> create table thin_table_with_many_tuples(fat_status_id integer
> references fat_table_with_few_tuples, thin_1 integer, thin_2 integer);
>
> SELECT * FROM thin_table_with_many_tuples NATURAL JOIN
> fat_table_with_few_tuples order by fat_1, thin_1, thin_2, fat_2;
>
>
> I would be asking the folks trying to use PostgreSQL for data
> warehousing what their opinion is. A few fact tables in an audit query
> could easily result in a very large amount of temporary diskspace being
> required.

Note my last sentence: we'd need to provide for cases where this was a
problem. How much that would complicate the code, I don't know...

This is another case where someone (with more skills than me) would need
to hack the backend enough to be able to test it and see how big a
performance gain there was.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Hannu Krosing <hannu(at)skype(dot)net>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Andrew Piskorski <atp(at)piskorski(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 19:55:19
Message-ID: 1147895720.3889.1.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Wed, 2006-05-17 at 10:01, Jim C. Nasby wrote:
> If we're going to consider table-level compression, ISTM the logical
> first step is to provide greater control over TOASTing; namely
> thresholds for when to compress and/or go to external storage that can
> be set on a per-field or at least per-table basis.

Also, we would get a lot of "compression" if we could get rid of the
index on the toast table and use the ctid directly.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Hannu Krosing <hannu(at)skype(dot)net>
Cc: Andrew Piskorski <atp(at)piskorski(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 20:20:07
Message-ID: 20060517202007.GH26212@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Wed, May 17, 2006 at 10:55:19PM +0300, Hannu Krosing wrote:
> On Wed, 2006-05-17 at 10:01, Jim C. Nasby wrote:
> > If we're going to consider table-level compression, ISTM the logical
> > first step is to provide greater control over TOASTing; namely
> > thresholds for when to compress and/or go to external storage that can
> > be set on a per-field or at least per-table basis.
>
> Also, we would get a lot of "compression" if we could get rid of the
> index on the toast table and use the ctid directly.

It'd also be damn handy to be able to ditch the 2-pass vacuum
requirement. I've often wondered if treating the toast table as a real
table was overkill; for example, it should be possible to include WAL
information for it in with the parent table. That plus using ctid as a
reference would hopefully allow for removing a lot of overhead from it.
Presumably all the MVCC info could go, since that would be taken care of
by the parent table.

Of course the downside is that it'd mean a different set of code for
handling toast storage...
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Greg Stark <gsstark(at)mit(dot)edu>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 21:44:22
Message-ID: 873bf8qqbt.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:

> On Wed, May 17, 2006 at 12:55:53PM -0400, Greg Stark wrote:
> >
> > "Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
> >
> > > Only if those spindles weren't all in a single RAID array and if we went
> > > through the trouble of creating all the machinery so you could tell
> > > PostgreSQL where all those spindles were mounted in the filesystem.
> >
> > I think the way you do this is simply by documenting that the admin should
> > create precisely one temp area per physical spindle (or raid array).
>
> And you still need some way to tell PostgreSQL about all of that.

No, my point was that you tell Postgres how many spindles you have and where to
find them by creating precisely one temp area on each spindle. It then knows
that it should strive to maximize sequential reads within one temp area and
expect switching between temp areas (which represent multiple spindles) to be
better than multiplexing multiple tapes within a single temp area (which
represents a single spindle).

--
greg


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-17 21:51:56
Message-ID: 20060517215155.GD42612@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Wed, May 17, 2006 at 05:44:22PM -0400, Greg Stark wrote:
> "Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
>
> > On Wed, May 17, 2006 at 12:55:53PM -0400, Greg Stark wrote:
> > >
> > > "Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
> > >
> > > > Only if those spindles weren't all in a single RAID array and if we went
> > > > through the trouble of creating all the machinery so you could tell
> > > > PostgreSQL where all those spindles were mounted in the filesystem.
> > >
> > > I think the way you do this is simply by documenting that the admin should
> > > create precisely one temp area per physical spindle (or raid array).
> >
> > And you still need some way to tell PostgreSQL about all of that.
>
> No, my point was that you tell Postgres how many spindles you have and where to
> find them by creating precisely one temp area on each spindle. It then knows

Which means we need all the interface bits to be able to tell PostgreSQL
where every single temp storage area is. Presumably much of the
tablespace mechanism could be used for this, but it's still a bunch of
work. And you can't just say "I have 8 spindles", you have to tell
PostgreSQL exactly where to put each temporary area (unless you just
have it put one on every tablespace you have defined).

> that it should strive to maximize sequential reads within one temp area and
> expect switching between temp areas (which represent multiple spindles) to be
> better than multiplexing multiple tapes within a single temp area (which
> represents a single spindle).

Which adds yet more complexity to all the code that uses the temp area.
And as others have brought up, you still have to allow for the case when
splitting all of this out into multiple files means you end up using
substantially more disk space. That further drives up the complexity.

My point is that unless someone shows that there's a non-trivial
performance gain here, it's not going to happen.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Greg Stark <gsstark(at)mit(dot)edu>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-18 04:24:54
Message-ID: 87r72sot7t.fsf@stark.xeocode.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

"Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:

> Which means we need all the interface bits to be able to tell PostgreSQL
> where every single temp storage area is. Presumably much of the
> tablespace mechanism could be used for this, but it's still a bunch of
> work. And you can't just say "I have 8 spindles", you have to tell
> PostgreSQL exactly where to put each temporary area (unless you just
> have it put one on every tablespace you have defined).

Yes, if you have more than one temporary area you definitely need a way to
tell Postgres where to put them since obviously they won't be in the postgres
base directory. But I think that's all Postgres really needs to know.

One could imagine a more complex version where Postgres has meta information
about the bandwidth and seek penalty for each sort area separately. Presumably
also for each table space. But that's a whole lot more complexity than
Postgres's current cost model.

> > that it should strive to maximize sequential reads within one temp area and
> > expect switching between temp areas (which represent multiple spindles) to be
> > better than multiplexing multiple tapes within a single temp area (which
> > represents a single spindle).
>
> Which adds yet more complexity to all the code that uses the temp area.
> And as others have brought up, you still have to allow for the case when
> splitting all of this out into multiple files means you end up using
> substantially more disk space. That further drives up the complexity.

You also have to consider that it won't always be a benefit to spread the sort
over multiple sort areas. If there's only one sort going on and you can reach
a 1:1 ratio between tapes and spindles, then I think it would be a huge
benefit, effectively boosting the sort speed by random_page_cost.

But if you don't have as many spindles as your algorithm needs tapes,
then it's unclear which tapes to multiplex down, and whether multiplexing
gains you anything over simply using a single sort area.

And worse, if there are multiple sorts going on in the system then you're not
going to get sequential access even if you have multiple sort areas available.
If you have N sort areas and N sorts are going on then you're probably better
off multiplexing each one down to a single sort area and letting them each
proceed without interfering with each other rather than having each one hog
all the sort areas and forcing the OS to do the multiplexing blindly.

> My point is that unless someone shows that there's a non-trivial
> performance gain here, it's not going to happen.

I think two extreme cases are well worth pursuing:

1) Use n sort areas for n tapes making everything purely sequential access.
That would be useful for large DSS systems where large sorts are running
and i/o bandwidth is high for sequential access. That gives effectively a
random_page_cost speed boost.

2) Use the current algorithm unchanged but have each sort use a different sort
   area in some sort of round-robin fashion. That helps the OLTP-type
   environment (ignoring for the moment that OLTP environments really ought
   not be doing disk sorts), where people complain about unreliable execution
   times more than slow execution times. If you can provide enough sort areas,
   it would remove one big reason concurrent queries impact the execution
   times of other queries.

--
greg


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCH] Compression and on-disk sorting
Date: 2006-05-18 08:31:03
Message-ID: 20060518083103.GD32755@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Wed, May 17, 2006 at 06:38:47PM +0100, Simon Riggs wrote:
> > - Each tape is compressed as one long compressed stream. Currently no
> > seeking is allowed, so only sorts, no joins! (As tom said, quick and
> > dirty numbers). This should show this possibility in its best light
> > but if we want to support seeking we're going to need to change that.
> > Maybe no compression on the last pass?
>
> We should be able to do this without significant loss of compression by
> redefining the lts block size to be 32k. That's the size of the
> look-back window anyhow, so compressing the whole stream doesn't get us
> much more.

The major problem is that looking back costs significantly more with
compression. If you need to look back into the previous compressed
block, you need to decompress the whole previous block. The simple
solution would be to keep a buffer of the last 32KB. Another possibility
would be to have a limit of 32KB of uncompressed data per block and
just remember the whole previous block.

Seek/Tell is not the hard part; it's the backspace. It would probably
be simplest to make backspace call Seek, rather than trying to be clever
about it.

Another issue is that currently the compression code is completely
within logtape.c. To be able to seek backwards efficiently you might
have to change the abstraction so that it knows about the records from
tuplesort.c. That's much more work, which needs a lot more thinking.

Besides, we still haven't got any reports that this actually provides a
benefit on any machine less than five years old. Anyone out there doing
tests?

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCH] Compression and on-disk sorting
Date: 2006-05-18 10:34:36
Message-ID: 1147948476.2646.426.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Thu, 2006-05-18 at 10:31 +0200, Martijn van Oosterhout wrote:
> On Wed, May 17, 2006 at 06:38:47PM +0100, Simon Riggs wrote:
> > > - Each tape is compressed as one long compressed stream. Currently no
> > > seeking is allowed, so only sorts, no joins! (As tom said, quick and
> > > dirty numbers). This should show this possibility in its best light
> > > but if we want to support seeking we're going to need to change that.
> > > Maybe no compression on the last pass?
> >
> > We should be able to do this without significant loss of compression by
> > redefining the lts block size to be 32k. That's the size of the
> > look-back window anyhow, so compressing the whole stream doesn't get us
> > much more.
>
> The major problem is looking back costs significantly more with
> compression. If you need to look back into the previous compressed
> block, you need to decompress the whole previous block. The simple
> solution would be to keep a buffer of the last 32KB. Another posibility
> would be to have a limit of 32KB of uncompressed data per block and
> just remember the whole previous block.

Just do a Z_FULL_FLUSH when you hit end of block. That way all blocks
will be independent of each other and you can rewind as much as you
like. We can choose the block size to be 32KB or even 64KB, there's no
dependency there, just memory allocation. It should be pretty simple to
make the block size variable at run time, so we can select it according
to how many files and how much memory we have.
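A small Python sketch of the Z_FULL_FLUSH behaviour (using raw deflate via wbits=-15 as a stand-in for what logtape.c would do in C): after a full flush the stream is byte-aligned and the dictionary is reset, so a block boundary is also a valid place to start decompressing after a rewind.

```python
import zlib

block1 = b"aaaa" * 4096   # 16KB of data for the first logical block
block2 = b"bbbb" * 4096   # 16KB for the second

# Raw deflate stream (no zlib header), compression level 3 as in the patch.
comp = zlib.compressobj(3, zlib.DEFLATED, -15)
c1 = comp.compress(block1) + comp.flush(zlib.Z_FULL_FLUSH)
c2 = comp.compress(block2) + comp.flush(zlib.Z_FINISH)

# A fresh decompressor can start right at the second block's boundary,
# because the full flush reset the compression state:
assert zlib.decompressobj(-15).decompress(c2) == block2

# And the concatenation still reads back as one continuous stream:
assert zlib.decompressobj(-15).decompress(c1 + c2) == block1 + block2
```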

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: Compression and on-disk sorting
Date: 2006-05-18 10:34:51
Message-ID: 1147948491.2646.427.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Tue, 2006-05-16 at 15:42 -0500, Jim C. Nasby wrote:
> On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
> > In any case, my curiosity is aroused, so I'm currently benchmarking
> > pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
> > some benchmarks with pg_tmp compressed.
>
> Results: http://jim.nasby.net/bench.log
>
> As expected, compressing $PGDATA/base was a loss. But compressing
> pgsql_tmp and then doing some disk-based sorts did show an improvement,
> from 366.1 seconds to 317.3 seconds, an improvement of 13.3%. This is on
> a Windows XP laptop (Dell Latitude D600) with 512MB, so it's somewhat of
> a worst-case scenario. On the other hand, XP's compression algorithm
> appears to be pretty aggressive, as it cut the size of the on-disk sort
> file from almost 700MB to 82MB. There are probably gains to be had from a
> different compression algorithm.

Can you re-run these tests using "SELECT aid FROM accounts ..."?
"SELECT 1 ..." is of course highly compressible.

I also note that the compressed file fits within memory and may not even
have been written out at all. That's good, but this sounds like the very
best case of what we can hope to achieve. We need to test a whole range
of cases to see if it is generally applicable, or only in certain cases
- and if so which ones.

Would you be able to write up some extensive testing of Martijn's patch?
He's followed through on your idea and written it (on -patches now...)

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCH] Compression and on-disk sorting
Date: 2006-05-18 11:42:01
Message-ID: 20060518114201.GB4359@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Thu, May 18, 2006 at 11:34:36AM +0100, Simon Riggs wrote:
> Just do a Z_FULL_FLUSH when you hit end of block. That way all blocks
> will be independent of each other and you can rewind as much as you
> like. We can choose the block size to be 32KB or even 64KB, there's no
> dependency there, just memory allocation. It should be pretty simple to
> make the block size variable at run time, so we can select it according
> to how many files and how much memory we have.

If you know you don't need to seek, there's no need to block the data
at all; one long stream is fine. So that case is easy.

For seeking, you need more work. I assume you're talking about 32KB
input block sizes (uncompressed). The output blocks will be of variable
size. These compressed blocks would be divided up into fixed 8K blocks
and written to disk.

To allow seeking, you'd have to do something like a header containing:

- length of previous compressed block
- length of this compressed block
- offset of block in uncompressed bytes (from beginning of tape)

This would allow you to scan backwards and forwards. If you want to be
able to jump to anywhere in the file, you may be better off storing the
file offsets (which would be implicit if the blocks are 32KB) in the
indirect blocks, using a search to find the right block, and then a
header in the block to find the offset.
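A sketch of one possible encoding of such a header; the field widths and the `struct` layout here are illustrative guesses, not anything the patch implements.

```python
import struct

# Hypothetical per-block header: previous compressed length, this block's
# compressed length, and the block's offset in uncompressed bytes from the
# beginning of the tape.
HDR = struct.Struct("<III")

def make_block(prev_comp_len, payload, uncomp_offset):
    """Prefix a compressed payload with the navigation header."""
    return HDR.pack(prev_comp_len, len(payload), uncomp_offset) + payload

blk1 = make_block(0, b"x" * 100, 0)        # first compressed block
blk2 = make_block(100, b"y" * 80, 32768)   # covers the next 32KB of input

# Scanning backwards: the current header says how long the previous block
# was, so its header can be found without a forward scan from the start.
prev_len, this_len, off = HDR.unpack_from(blk2)
assert (prev_len, this_len, off) == (100, 80, 32768)
```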

Still, I'd like some evidence of benefits before writing up something
like that.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-18 16:07:05
Message-ID: 200605181607.k4IG75Y09778@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches


Uh, TODO already has:

o %Add a GUC variable to control the tablespace for temporary objects
and sort files

It could start with a random tablespace from a supplied list and
cycle through the list.

Do we need to add to this?

---------------------------------------------------------------------------

Greg Stark wrote:
> "Jim C. Nasby" <jnasby(at)pervasive(dot)com> writes:
>
> > Which means we need all the interface bits to be able to tell PostgreSQL
> > where every single temp storage area is. Presumably much of the
> > tablespace mechanism could be used for this, but it's still a bunch of
> > work. And you can't just say "I have 8 spindles", you have to tell
> > PostgreSQL exactly where to put each temporary area (unless you just
> > have it put one on every tablespace you have defined).
>
> Yes, if you have more than one temporary area you definitely need a way to
> tell Postgres where to put them since obviously they won't be in the postgres
> base directory. But I think that's all Postgres really needs to know.
>
> One could imagine a more complex version where Postgres has meta information
> about the bandwidth and seek penalty for each sort area separately. Presumably
> also for each table space. But that's a whole lot more complexity than
> Postgres's current cost model.
>
>
> > > that it should strive to maximize sequential reads within one temp area and
> > > expect switching between temp areas (which represent multiple spindles) to be
> > > better than multiplexing multiple tapes within a single temp area (which
> > > represents a single spindle).
> >
> > Which adds yet more complexity to all the code that uses the temp area.
> > And as others have brought up, you still have to allow for the case when
> > splitting all of this out into multiple files means you end up using
> > substantially more disk space. That further drives up the complexity.
>
> You also have to consider that it won't always be a benefit to spread the sort
> over multiple sort areas. If there's only one sort going on and you can reach
> a 1:1 ratio between tapes and spindles then I think it would be a huge
> benefit. Effectively boosting the sort speed by random_page_cost.
>
> But if you don't have as many spindles as your algorithm needs tapes
> then it's unclear which to multiplex down and whether you gain any benefit
> once you're multiplexing over simply using a single sort area.
>
> And worse, if there are multiple sorts going on in the system then you're not
> going to get sequential access even if you have multiple sort areas available.
> If you have N sort areas and N sorts are going on then you're probably better
> off multiplexing each one down to a single sort area and letting them each
> proceed without interfering with each other rather than having each one hog
> all the sort areas and forcing the OS to do the multiplexing blindly.
>
> > My point is that unless someone shows that there's a non-trivial
> > performance gain here, it's not going to happen.
>
> I think two extreme cases are well worth pursuing:
>
> 1) Use n sort areas for n tapes making everything purely sequential access.
> That would be useful for large DSS systems where large sorts are running
> and i/o bandwidth is high for sequential access. That gives effectively a
> random_page_cost speed boost.
>
> 2) Use the current algorithm unchanged but have each sort use a different sort
> area in some sort of round-robin fashion. That helps the OLTP type
> environment (ignoring for the moment that OLTP environments really ought
> not be doing disk sorts) where people complain about unreliable execution
> times more than slow execution times. If you can provide enough sort areas
> then it would remove one big reason why concurrent queries impact the
> execution time of other queries.
>
> --
> greg
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faq
>

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org, Martijn van Oosterhout <kleptog(at)svana(dot)org>
Subject: Re: Compression and on-disk sorting
Date: 2006-05-18 16:18:51
Message-ID: 20060518161850.GI64371@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Thu, May 18, 2006 at 11:34:51AM +0100, Simon Riggs wrote:
> On Tue, 2006-05-16 at 15:42 -0500, Jim C. Nasby wrote:
> > On Tue, May 16, 2006 at 12:31:07PM -0500, Jim C. Nasby wrote:
> > > In any case, my curiosity is aroused, so I'm currently benchmarking
> > > pgbench on both a compressed and uncompressed $PGDATA/base. I'll also do
> > > some benchmarks with pg_tmp compressed.
> >
> > Results: http://jim.nasby.net/bench.log
> >
> > As expected, compressing $PGDATA/base was a loss. But compressing
> > pgsql_tmp and then doing some disk-based sorts did show an improvement,
> > from 366.1 seconds to 317.3 seconds, an improvement of 13.3%. This is on
> > a Windows XP laptop (Dell Latitude D600) with 512MB, so it's somewhat of
> > a worst-case scenario. On the other hand, XP's compression algorithm
> > appears to be pretty aggressive, as it cut the size of the on-disk sort
> > file from almost 700MB to 82MB. There's probably gains to be had from a
> > different compression algorithm.
>
> Can you re-run these tests using "SELECT aid FROM accounts ..."
> "SELECT 1 ... " is of course highly compressible.
>
> I also note that the compressed file fits within memory and may not even
> have been written out at all. That's good, but this sounds like the very
> best case of what we can hope to achieve. We need to test a whole range
> of cases to see if it is generally applicable, or only in certain cases
> - and if so which ones.
>
> Would you be able to write up some extensive testing of Martijn's patch?
> He's followed through on your idea and written it (on -patches now...)

Yes, I'm working on that. I'd rather test his stuff than XP's
compression anyway.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-18 17:42:54
Message-ID: 16105.1147974174@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Uh, TODO already has:

> o %Add a GUC variable to control the tablespace for temporary objects
> and sort files

> It could start with a random tablespace from a supplied list and
> cycle through the list.

> Do we need to add to this?

The list bit strikes me as lily-gilding, or at least designing features
well beyond proven need.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Rod Taylor <pg(at)rbt(dot)ca>, "Bort, Paul" <pbort(at)tmwsystems(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Compression and on-disk sorting
Date: 2006-05-18 17:50:06
Message-ID: 200605181750.k4IHo6m27691@candle.pha.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Uh, TODO already has:
>
> > o %Add a GUC variable to control the tablespace for temporary objects
> > and sort files
>
> > It could start with a random tablespace from a supplied list and
> > cycle through the list.
>
> > Do we need to add to this?
>
> The list bit strikes me as lily-gilding, or at least designing features
> well beyond proven need.

I have seen other databases do the round-robin idea, so it is worth
testing.
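For illustration, the start-random-then-cycle behaviour the TODO describes amounts to a cursor with modular wraparound over the supplied tablespace list; the names here are invented for the sketch, not from any PostgreSQL code:

```c
#include <stddef.h>

/* Hypothetical round-robin selector over a list of n temp tablespaces.
 * The cursor is initialized once to a randomly chosen index; each call
 * returns the current index and advances with wraparound, so successive
 * sorts land on different tablespaces. */
static int
next_temp_tablespace(int *cursor, int n)
{
    int chosen = *cursor;

    *cursor = (*cursor + 1) % n;
    return chosen;
}
```

Whether the random starting point buys anything over always starting at index 0 is exactly the part Tom questions above.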

--
Bruce Momjian http://candle.pha.pa.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCH] Compression and on-disk sorting
Date: 2006-05-18 22:10:13
Message-ID: 20060518221012.GT64371@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Thu, May 18, 2006 at 10:31:03AM +0200, Martijn van Oosterhout wrote:
> Besides, we still haven't got any reports yet that this actually
> provides a benefit on any machine less than five years old. Anyone out
> there doing tests?

Yes. I'm compiling the patched binaries right now, but the baseline
testing I've got so far is at
http://jim.nasby.net/misc/compress_sort.txt.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCH] Compression and on-disk sorting
Date: 2006-05-19 11:43:05
Message-ID: 1148038985.2646.586.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Thu, 2006-05-18 at 17:10 -0500, Jim C. Nasby wrote:
> On Thu, May 18, 2006 at 10:31:03AM +0200, Martijn van Oosterhout wrote:
> > Besides, we still haven't got any reports yet that this actually
> > provides a benefit on any machine less than five years old. Anyone out
> > there doing tests?
>
> Yes. I'm compiling the patched binaries right now, but the baseline
> testing I've got so far is at
> http://jim.nasby.net/misc/compress_sort.txt.

Looks a very good improvement. Well done Martijn/Jim.

The next question is: does it apply in all cases?

We need to test "SELECT aid from accounts" also, or some other scenarios
where the data is as uncompressible as possible. We should also try this
on a table where the rows have been inserted by different transactions,
so that the xmin value isn't the same for all tuples. We need to see if
there are cases where this causes a performance regression rather than
gain.

We still need to examine memory usage. Jim's testing so far is done on
already sorted data, so only uses 1 out of 715 tapes. If we did utilise
a much larger number of tapes, we could face difficulties with the
memory used during decompression.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, pgsql-patches(at)postgresql(dot)org
Subject: Re: [PATCH] Compression and on-disk sorting
Date: 2006-05-19 11:58:44
Message-ID: 20060519115844.GE17873@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers pgsql-patches

On Fri, May 19, 2006 at 12:43:05PM +0100, Simon Riggs wrote:
> We need to test "SELECT aid from accounts" also, or some other scenarios
> where the data is as uncompressible as possible. We should also try this
> on a table where the rows have been inserted by different transactions,
> so that the xmin value isn't the same for all tuples. We need to see if
> there are cases where this causes a performance regression rather than
> gain.

AFAIK, the xmin is not included in the sort. The only thing is maybe
the ctid which is used in updates. Actually repeating the above tests
but doing:

select xmin,xmax,cmin,cmax,ctid,* from <blah>

Would be interesting. Even that would be compressible, though.

Thinking about it, we're storing and compressing HeapTuples, right?
There are a few fields there that would compress really well.

t_tableOid
t_len (if no variable-length fields)
t_natts

Even t_hoff and the bitmask if the patterns of NULLs don't vary much
between rows.

> We still need to examine memory usage. Jim's testing so far is done on
> already sorted data, so only uses 1 out of 715 tapes. If we did utilise
> a much larger number of tapes, we could face difficulties with the
> memory used during decompression.

I'm going to see if I can make some changes to track maximum memory
usage per tape.
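Tracking maximum memory usage per tape can be as simple as a high-water mark updated on every allocation; a hypothetical sketch (not Martijn's actual change, names invented for illustration):

```c
#include <stddef.h>

/* One tracker per tape: 'current' follows live allocations,
 * 'peak' records the high-water mark ever reached. */
typedef struct TapeMemStats
{
    size_t current;
    size_t peak;
} TapeMemStats;

/* Account for an allocation of nbytes on this tape. */
static void
tape_mem_grow(TapeMemStats *s, size_t nbytes)
{
    s->current += nbytes;
    if (s->current > s->peak)
        s->peak = s->current;
}

/* Account for freeing nbytes on this tape. */
static void
tape_mem_shrink(TapeMemStats *s, size_t nbytes)
{
    s->current -= nbytes;
}
```

Summing the per-tape peaks would give a worst-case bound on decompression memory when many tapes are active at once.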

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.