Re: Compression of tables

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Thomas Munro <munro(at)ip9(dot)org>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Compression of tables
Date: 2013-12-10 15:15:51
Message-ID: CAHyXU0wpqftA+s8ctVVtzJLoJyTeNKMRK7q6cBqQd5VQqpcuCA@mail.gmail.com
Lists: pgsql-hackers

On Tue, Dec 10, 2013 at 1:18 AM, Thomas Munro <munro(at)ip9(dot)org> wrote:
> Hi
>
> I have been wondering what the minimum useful heap table compression
> system would be for Postgres, in order to reduce disk footprint of
> large mostly static datasets. Do you think an approach similar to the
> static row-level compression of DB2 could make sense? I imagine
> something like this:
>
> 1. You have a table which already has data in it.
>
> 2. You run a COMPRESS operation, which builds a static dictionary,
> and rewrites the whole table with compressed frozen tuples. Frozen
> tuples have CTIDs just like regular tuples, and can be pointed to by
> indexes. They are decompressed on the fly when needed.
>
> Clearly things get tricky once you need to update rows. Assume for
> simplicity that future UPDATEs and INSERTs produce normal,
> non-compressed tuples that would only be compressed by a subsequent
> COMPRESS operation. The question is how to deal with the existing
> compressed rows when UPDATEd or DELETEd. Some approaches:
>
> 1. Just don't allow updates of compressed rows (!).
>
> 2. Exclusively lock the whole page when trying to update any
> compressed row, while you explode it into regular uncompressed tuples
> on new pages which you can work on (!).
>
> 3. Pull the minimum header fields out of the compressed tuples so
> that the MVCC machinery can work, to support updates of compressed
> tuples. Perhaps just the t_xmax, t_ctid values (t_xmin == frozen is
> implied), so that a writer can update them. This means an overhead of
> at least 10 bytes per tuple over the compressed size (plus the item
> offsets in the page header).
>
> 4. Something far cleverer.
>
> Well, these are straw-man suggestions really and I probably don't
> understand enough about PG internals (MVCC and implications for
> VACUUM) to be making them. But I'm curious to know if anyone has
> researched something like this.
>
> For example, I have a system that occupies a couple of TB on disk, but
> only a few to a few hundred MB per day change, mostly adding data to
> an active partition. I periodically run CLUSTER on any partition that
> has pg_stats.correlation < 0.9 (this effectively just re-CLUSTERs the
> active one). At the same time I would COMPRESS, and the DB could
> potentially fit on much smaller SSDs.
>
> Most commercial database systems I encounter these days are using
> compression of some sort (more sophisticated than the above, with
> dynamic dictionaries, and sometimes column-based storage, etc.).
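
(Aside: the periodic re-CLUSTER you describe can be driven straight
from pg_stats; a rough, untested sketch, with a hypothetical
clustering column name:

  SELECT format('CLUSTER %I.%I USING %I', schemaname, tablename, indexname)
  FROM pg_stats
  JOIN pg_indexes USING (schemaname, tablename)
  WHERE attname = 'id'          -- hypothetical clustering column
    AND correlation < 0.9;

The generated commands can then be fed back to psql.)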

Postgres already compresses TOASTed data: one strategy could be to
arrange your data so that it takes advantage of TOAST.
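
For instance (a rough, untested sketch with made-up table and column
names), batching many small static rows into one wide text value
pushes the tuple over the ~2kB TOAST threshold, at which point pglz
compression kicks in:

  -- narrow source table holding the mostly-static data (hypothetical)
  CREATE TABLE readings (id int, day date, val numeric);

  -- archive table: one compressible blob per day
  CREATE TABLE readings_archive (
      day      date PRIMARY KEY,
      payload  text    -- serialized rows; compressed by the TOAST code
  );

  INSERT INTO readings_archive
  SELECT day, string_agg(format('%s,%s', id, val), E'\n')
  FROM readings
  GROUP BY day;

You lose cheap access to individual rows, of course, which may or may
not be acceptable for static partitions.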

I doubt you'll ever see generally compressed heap data in the way
you're thinking: Postgres has a strong informal policy of not
implementing features that are dubious and/or excessively complicated
for limited benefit, particularly if there are ways to handle the
problem outside the database. There are various operating-system-level
tricks that can make a compressed file, or even an entire tablespace
(an o/s directory), masquerade as a regular one. So maybe you are
asking for a feature we already have: CREATE TABLESPACE.

For example, take a look here:
https://btrfs.wiki.kernel.org/index.php/Compression#How_do_I_enable_compression.3F
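
A rough sketch of that route (paths, names, and mount options are made
up; see the btrfs docs above):

  -- as root, on a btrfs volume mounted with compression, e.g.:
  --   mount -o compress=lzo /dev/sdb1 /srv/pg_compressed
  --   mkdir /srv/pg_compressed/tblspc
  --   chown postgres /srv/pg_compressed/tblspc
  -- then, from postgres:
  CREATE TABLESPACE compressed_ts LOCATION '/srv/pg_compressed/tblspc';
  ALTER TABLE my_static_partition SET TABLESPACE compressed_ts;

ALTER TABLE ... SET TABLESPACE rewrites the table into the new
location, so you'd run it on a partition once it goes static, much
like your periodic CLUSTER.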

(Out of curiosity, if this strategy fits the bill for you, I wouldn't
mind seeing a follow-up on how it handles your static-data use case.)

merlin
