Re: jsonb format is pessimal for toast compression

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Arthur Silva <arthurprs(at)gmail(dot)com>
Cc: Larry White <ljw1001(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, Bruce Momjian <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Subject: Re: jsonb format is pessimal for toast compression
Date: 2014-08-15 23:19:00
Message-ID: 26064.1408144740@sss.pgh.pa.us
Lists: pgsql-hackers

Arthur Silva <arthurprs(at)gmail(dot)com> writes:
> We should add some sort of versioning to the jsonb format. This can be
> explored in the future in many ways.

If we end up making an incompatible change to the jsonb format, I would
support taking the opportunity to stick a version ID in there. But
I don't want to force a dump/reload cycle *only* to do that.
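
For concreteness, one hypothetical way to do it, assuming the 9.4-era
container layout in which the low 28 bits of the header word hold the
element count and three high bits hold the scalar/object/array flags,
would be to press the spare top bit into service.  The JBC_* names here
are illustrative, not actual code:

#include <stdint.h>

#define JBC_CMASK    0x0FFFFFFF  /* element count lives in the low 28 bits */
#define JBC_FSCALAR  0x10000000
#define JBC_FOBJECT  0x20000000
#define JBC_FARRAY   0x40000000
#define JBC_FVERSION 0x80000000  /* hypothetical "new format" marker */

static inline int
jbc_version(uint32_t header)
{
	/* old-format containers never set the top bit, so treat it as v1 */
	return (header & JBC_FVERSION) ? 2 : 1;
}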

> As for the current problem, we should explore the directory-at-the-end
> option. It should improve compression and keep good access performance.

Meh. Pushing the directory to the end is just a band-aid, and since it
would still force a dump/reload, it's not a very enticing band-aid.
The only thing it'd really fix is the first_success_by issue, which
we could fix *without* a dump/reload by using different compression
parameters for jsonb. Moving the directory to the end, by itself,
does nothing to fix the problem that the directory contents aren't
compressible --- and we now have pretty clear evidence that that is a
significant issue. (See for instance Josh's results that increasing
first_success_by did very little for the size of his dataset.)
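
To see why, remember that pglz can only shrink data by replacing a byte
run with a back-reference to an earlier occurrence of the same run.  A
toy standalone program (illustrative only, not PostgreSQL code) shows
the contrast: with N fixed-size entries, the lengths array is one value
repeated N times, while the offsets array never repeats at all:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	uint32_t	offsets[1000];
	uint32_t	lengths[1000];

	for (int i = 0; i < 1000; i++)
	{
		lengths[i] = 8;		/* same value every time: trivially compressible */
		offsets[i] = i * 8;	/* strictly increasing: no repeated runs */
	}

	/* an LZ match needs a repeated run; lengths[] is one long run,
	 * while offsets[] changes at every 4-byte step */
	printf("lengths: %u %u %u ...\n",
		   (unsigned) lengths[0], (unsigned) lengths[1], (unsigned) lengths[2]);
	printf("offsets: %u %u %u ...\n",
		   (unsigned) offsets[0], (unsigned) offsets[1], (unsigned) offsets[2]);
	return 0;
}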

I think the realistic alternatives at this point are either to
switch to all-lengths as in my test patch, or to use the hybrid approach
of Heikki's test patch. IMO the major attraction of Heikki's patch
is that it'd be upward compatible with existing beta installations,
ie no initdb required (but thus, no opportunity to squeeze in a version
identifier either). It's not showing up terribly well in the performance
tests I've been doing --- it's about halfway between HEAD and my patch on
that extract-a-key-from-a-PLAIN-stored-column test. But, just as with my
patch, there are things that could be done to micro-optimize it by
touching a bit more code.
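
To make the lookup-cost side of that tradeoff concrete, here's a
simplified sketch (the patch's actual encoding differs in detail; the
STRIDE value and helper names below are made up for illustration):

#include <stddef.h>
#include <stdint.h>

#define STRIDE 32				/* hypothetical checkpoint interval */

/* all-lengths: finding where entry i starts is an O(i) walk */
static size_t
start_all_lengths(const uint32_t *lens, int i)
{
	size_t		off = 0;

	for (int j = 0; j < i; j++)
		off += lens[j];
	return off;
}

/*
 * hybrid: jump to the nearest stored offset, then add at most
 * STRIDE - 1 lengths, bounding the walk regardless of i
 */
static size_t
start_hybrid(const uint32_t *lens, const size_t *checkpoints, int i)
{
	size_t		off = checkpoints[i / STRIDE];

	for (int j = i - i % STRIDE; j < i; j++)
		off += lens[j];
	return off;
}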

I did some quick stats comparing compressed sizes for the delicio.us
data, printing quartiles as per Josh's lead:

all-lengths     {440,569,609,655,1257}
Heikki's patch  {456,582,624,671,1274}
HEAD            {493,636,684,744,1485}

(As before, this is pg_column_size of the jsonb within a table whose rows
are wide enough to force tuptoaster.c to try to compress the jsonb;
otherwise many of these values wouldn't get compressed.) These documents
don't have enough keys to trigger the first_success_by issue, so that
HEAD doesn't look too awful, but still there's about an 11% gain from
switching from offsets to lengths. Heikki's method captures much of
that but not all.

Personally I'd prefer to go to the all-lengths approach, but a large
part of that comes from a subjective assessment that the hybrid approach
is too messy. Others might well disagree.

In case anyone else wants to do measurements on some more data sets,
attached is a copy of Heikki's patch updated to apply against git tip.

regards, tom lane

Attachment Content-Type Size
jsonb-with-offsets-and-lengths-2.patch text/x-diff 7.6 KB
