Re: Table and Index compression

From: Pierre Frédéric Caillaud <lists(at)peufeu(dot)com>
To: "Greg Stark" <gsstark(at)mit(dot)edu>, "Josh Berkus" <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Table and Index compression
Date: 2009-08-07 08:36:39
Message-ID: op.ux997d1qcke6l8@soyouz
Lists: pgsql-hackers


First, a few things that I forgot to mention in the previous message:

> I like the idea too, but I think there are some major problems to
> solve. In particular I think we need a better solution to blocks
> growing than sparse files.

Sparse files allow something great: testing this concept in real life,
with less code than the length of this email.

> The main problem with using sparse files is that currently postgres is
> careful to allocate blocks early so it can fail if there's not enough
> space. With your sparse file solution Postgres might only find out
> there's no space after it has already committed a transaction.
> bgwriter has no good course of action to take if it finds out there's
> nowhere to put the data it has in shared buffers.

Yes, that is a big problem.

Note that the same problem can happen if you use a filesystem that
supports compression :

- you preallocate a file full of empty blocks
- filesystem compresses it well, says it fits
- you fill it with data that compresses much worse than zeros
- at some point one of the write() calls fails with a disk full error

This would be an interesting argument in the "can we use compressed ZFS on
postgres" debate...

Also, about compressed NTFS: it can give you disk-full errors on read().
While this may appear stupid, it is in fact very good: when you read()
something compressed, NTFS reserves the disk space needed to re-write that
data assuming the worst compression, so if there is not enough space on the
volume, read() fails with a disk-full error and your app fails early (on
read, not on write). Many applications do not expect this.

On the other hand, filling a disk completely is never a good idea, since
it generally triggers fragmentation so extreme that the only way to
recover is to buy another hard disk and copy your data over.

As a side note, I have also tested lzjb (the ZFS compressor); lzo is
much faster and compresses much better (sometimes 2x better).

Back to the point of how to handle disk full errors :
- we could write a file the size of shared_buffers at startup
- if a write() reports disk full, delete the file above
- we now have enough space to flush all of shared_buffers
- flush and exit gracefully
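
A minimal sketch of that last idea (the file name and chunk sizes are made
up, and of course the real thing would live in the backend, not in Python):

    import os, errno

    RESERVE_PATH = "pg_compress_reserve"          # hypothetical file name

    def create_reserve(nbytes):
        # At startup, really allocate enough space to flush shared_buffers.
        fd = os.open(RESERVE_PATH, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            chunk = b"\0" * (1 << 20)
            remaining = nbytes
            while remaining > 0:
                remaining -= os.write(fd, chunk[:min(remaining, len(chunk))])
            os.fsync(fd)
        finally:
            os.close(fd)

    def write_block(fd, offset, data):
        # Write one block; on disk-full, sacrifice the reserve so that the
        # rest of shared_buffers can still be flushed before a clean exit.
        try:
            os.pwrite(fd, data, offset)
            return True
        except OSError as e:
            if e.errno != errno.ENOSPC:
                raise
            os.unlink(RESERVE_PATH)      # frees about shared_buffers of space
            os.pwrite(fd, data, offset)  # retry now that space is available
            return False                 # caller should flush and exit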

This will waste disk space. Of course... but the purpose of compression is
not to save disk space, it is to have a faster database.
Also, if you use compression,
- you'll do it because your database is much bigger than shared_buffers,
so the wasted space is not huge.
- you'll use a smaller shared_buffers than usual, because shared_buffers
contains uncompressed pages, and you'll want to use lots of OS disk cache
to cache compressed data.

Doesn't seem too crazy to me.

> But I think even if you solve that it's not really a good long-term
> solution. We don't know how the OS handles block allocation for this
> type of file. I'm actually moderately surprised it isn't skipping
> enough blocks assuming you'll allocate them eventually. Even if it
> does handle it the way you expect what happens when you do grow a
> block, it'll have to allocate it way out of the way and we have no way
> to repair that discontinuity later.

I made a quick check before implementing it, using Python scripts to play
with sparse files on ext3:

- writing a sparse file is a bit (not much) slower than writing a regular
file
- reading from a non-fragmented sparse file is as fast as reading a
regular file
- holes do not eat OS disk cache (which is the most interesting point)
- reading from cache is as fast as reading a regular file (and faster if
you don't read the holes because you know they are holes, which is the
case here)
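
Something along these lines is enough to see the effect (the sizes are made
up; it writes a "compressed" 16K payload at the start of every 32K slot and
leaves the rest of the slot as a hole):

    import os

    PATH    = "sparse_test.dat"
    BLOCK   = 32 * 1024                  # one postgres block
    NBLOCKS = 10000

    with open(PATH, "wb") as f:
        payload = os.urandom(16 * 1024)  # pretend each block compresses 2:1
        for i in range(NBLOCKS):
            f.seek(i * BLOCK)            # seeking past the end leaves a hole
            f.write(payload)

    st = os.stat(PATH)
    print("apparent size :", st.st_size)          # just under NBLOCKS * BLOCK
    print("allocated     :", st.st_blocks * 512)  # about half of that on ext3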

Also, growing a sparse file by plugging the holes in it WILL allocate
blocks all over the place and make IO extremely inefficient.
You can defrag it, of course (using fs-specific tools or just cpio), but
that's not "high-availability"...

> Also, the way you've prellocated blocks effectively nails the maximum
> compression at 2x. That seems to be leaving a lot of money on the
> table.

Yes, it does, but there are a few subtleties.

First, about preallocation: I had to implement this because of the way
the files are extended.
Since an empty block is written first, it compresses well, to 1 OS page.
When this block is later written again with real data, it compresses to a
larger number of pages, which triggers allocation all over the place. In
testing, the resulting sparse file is slow, of the "forget it" kind...

The file could also be non-sparse : in this case, no disk space is saved.

Not saving disk space is OK with me : disks are really cheap these days.

I know this sounds crazy ;) but as I said above, compression is meant to
save expensive resources. When you use H-264 to compress 1TB of raw video
to 1GB of compressed video, you save an expensive resource (0.999 TB of
harddisk). But if you use lzo to compress 40GB of database down to 20GB,
you save maybe $3 in disk space, which is ludicrous. However, you also
save much more expensive resources : disk throughput and RAM OS cache. So
it might really pay off.

So, a non-sparse file would work well for random accesses too, since we
only read the parts of it that contain compressed data, and the "wasted"
pages would never be touched.

However it would have drawbacks :

- seq scan would not be able to skip the "empty" pages, so it would have
no benefit whatsoever
- the empty pages, read-ahead by the OS, would pollute the disk cache
- the empty pages would also pollute the controller's cache and the disks'
internal caches

So, a balance has to be reached on how many OS pages (4K on Linux) to
preallocate for a Postgres block (16K or 32K to get good compression).

If you preallocate too little, adding data (INSERT, UPDATEs) will heavily
fragment the file.
If you preallocate too much, seq scan will be slower and perhaps cache
will be wasted (but it will still be much faster than the huge seek-fest
which happens if you preallocate too little).

How many OS pages to preallocate for a postgres block should be specified
by the user.

- If you intend to modify the table contents, you should preallocate a bit
more.
- If the table will be mostly read-only, a tight fit is okay.
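
To make the bookkeeping concrete, here is a small sketch of the arithmetic
(zlib stands in for lzo here; the exact compressor doesn't change the logic):

    import math, zlib

    OS_PAGE  = 4096
    PG_BLOCK = 32 * 1024

    def pages_used(block_data):
        # Number of 4K OS pages the compressed 32K block really occupies.
        compressed = zlib.compress(block_data)   # zlib as a stand-in for lzo
        return math.ceil(len(compressed) / OS_PAGE)

    def fits(block_data, prealloc_pages):
        # With prealloc_pages = 4 (16K per 32K block) the on-disk ratio is
        # capped at 2x, but a block that later grows past 16K overflows its
        # slot and fragments the file.
        return pages_used(block_data) <= prealloc_pages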

A nice thing is that it makes read-only tables a bit redundant. Basically,
if you ask Postgres to
"CLUSTER or VACUUM FULL this table, and compress it, preallocating
nothing more than necessary",
you will get a table and indexes that are well compressed, but updates and
deletes will be slow, so it is effectively a read-mostly table.

If you use partitions, the current partition can be uncompressed (or
compressed with a generous pre-allocation), but the past partitions, which
are usually read-only, can be compressed tight, and their indexes too.

Then you could change the pre-allocation value for this table so it
preallocates more, and INSERT into it with no problem, except for the
indexes, for which you would need to specify beforehand that a few more
pages should be preallocated to allow index pages to grow...

So, how many pages to preallocate is a tuning parameter that should have
a reasonable default, but be user-settable per relation (table, index...).

Note that if a page contains lots of dead rows which are similar except
for a few updated columns, compression will exploit this redundancy.

Here are some stats on some tables of more than 300K rows:

* Table with lots of columns including a TEXT column of a few hundred
bytes, containing french text, which is usually too short for TOASTING :
Size : 314.69 MB
Paged Size : 133.24 MB
Unpaged efficiency : 2.77
Paged Efficiency : 2.36
pages | count
1 | 16
2 | 848
3 | 4451
4 | 4730
5 | 25

The above is a histogram: for instance, 4730 32K PostgreSQL blocks
compressed to four 4K pages each (a 2X ratio), and only 16 blocks
compressed down to a single 4K page.

"Size" is uncompressed size
"Paged Size" is the sum of all 4K which contain compressed data
"Unpaged Efficiency" is (raw data) / (compressed data), showing compressor
performance.
"Paged Efficiency" is computed from the real number of pages used, so it
is the interesting metric.

In this case, paged efficiency is 2.36 because some blocks compress better
than 2X, but if we preallocate 16K for each 32K block, the real efficiency
will be 2X.
We can also see that 25 blocks would need some extra allocation. This is
a really small percentage of the table, so it is not significant for speed.
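
For what it's worth, the "Paged" figures above can be re-derived from the
histogram alone; a quick check:

    # histogram for the table above: {4K pages used: number of 32K blocks}
    histogram = {1: 16, 2: 848, 3: 4451, 4: 4730, 5: 25}

    PG_BLOCK = 32 * 1024
    OS_PAGE  = 4096

    nblocks    = sum(histogram.values())
    raw_size   = nblocks * PG_BLOCK
    paged_size = sum(p * n * OS_PAGE for p, n in histogram.items())

    print("Size             : %.2f MB" % (raw_size / 2**20))     # 314.69 MB
    print("Paged Size       : %.2f MB" % (paged_size / 2**20))   # 133.24 MB
    print("Paged Efficiency : %.2f" % (raw_size / paged_size))   # 2.36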

* GIN fulltext index on text field of above table

Size : 136.75 MB
Paged Size : 48.04 MB
Unpaged efficiency : 3.63
Paged Efficiency : 2.85
pages | count
1 | 1528
2 | 470
3 | 944
4 | 172
5 | 1262

* GIST fulltext index on text field of above table

annonces_ts_gist
Size : 63.84 MB
Paged Size : 33.89 MB
Unpaged efficiency : 2.14
Paged Efficiency : 1.88
pages | count
1 | 3
2 | 32
3 | 165
4 | 1103
5 | 739
6 | 1

* table with 32 columns including a few TEXT fields which contain at most
a few chars

Size : 393.50 MB
Paged Size : 127.24 MB
Unpaged efficiency : 3.83
Paged Efficiency : 3.09
pages | count
2 | 5207
3 | 7381
4 | 4

* btree on DATE column of above table

Size : 56.34 MB
Unpaged efficiency : 2.63
Paged Efficiency : 2.08
pages | count
1 | 2
2 | 1
3 | 282
4 | 1518

* btree on field "zipcode" of above table

Size : 56.34 MB
Paged Size : 27.04 MB
Unpaged efficiency : 2.90
Paged Efficiency : 2.67
pages | count
1 | 2
2 | 2
3 | 1794
4 | 5

* gist index on GPS coordinates (every row has (lat,lon)) of above table

Size : 196.47 MB
Paged Size : 39.79 MB
Unpaged efficiency : 6.93
Paged Efficiency : 4.94
pages | count
1 | 2486
2 | 3704
3 | 96
4 | 1

Note that the 32 column table is 400MB... and this gist index is almost
200 MB.
This table has 770 MB of indexes on it...
Compression shrinks the table to 127 MB and indexes to a total of about
300 MB.

* table of (increasing timestamp, 1 SERIAL, 1 increasing INT, 3 INT
columns which contain random() data)

Size : 344.06 MB
Paged Size : 215.02 MB
Unpaged efficiency : 1.97
Paged Efficiency : 1.60
pages | count
1 | 1
5 | 11009

* multicolumn btree on (int, timestamp) above

Size : 153.19 MB
Paged Size : 57.42 MB
Unpaged efficiency : 2.73
Paged Efficiency : 2.67
pages | count
1 | 3
3 | 4899

* btree on int column above

Size : 102.06 MB
Paged Size : 63.73 MB
Unpaged efficiency : 1.87
Paged Efficiency : 1.60
pages | count
1 | 2
2 | 1
4 | 3
5 | 3260

As you can see, efficiency varies widely between average (1.6), good
(2-2.5), and amazing (almost 5) depending on what is compressed.

> To handle read-write tables I think we would need to directly
> implement the kind of indirection layer that you're getting out of the
> filesystem's block layer currently. That would let you allocate enough
> blocks to hold the data uncompressed and then free up those blocks
> once you're sure the data is compressible.

Just like NTFS does, then?

> One possibility is to handle only read-only tables. That would make
> things a *lot* simpler. But it sure would be inconvenient if it's only
> useful on large static tables but requires you to rewrite the whole
> table -- just what you don't want to do with large static tables -- to
> get the benefit.

See above: a read-mostly table that you can still update (but it will be
very slow) or insert into (if you change the preallocation setting).

I forgot to talk about SSDs in the previous message. SSDs are quite
expensive, but they seek really fast. Saving disk space would be good on an
SSD, because the price per byte is much higher. And since SSDs seek so
fast, they probably wouldn't be bothered much by sparse blocks being
allocated all over the place. Also, the price of SSDs means you wouldn't
make a monster RAID out of them, so the CPU could decompress faster than
the disk can read in a seq scan. This would need a test.

Ah, one thing I would like to know: if I add a 2-byte field containing the
compressed length to the Postgres page header, will I get any problems?
(i.e. do some pages use a non-standard header?)

I need to modify a few things and add comments; then I will submit a patch
(not for committing, but for hacking).
