Re: jsonb format is pessimal for toast compression

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Arthur Silva <arthurprs(at)gmail(dot)com>
Cc: Larry White <ljw1001(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <kgrittn(at)ymail(dot)com>, Stephen Frost <sfrost(at)snowman(dot)net>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Peter Geoghegan <pg(at)heroku(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Subject: Re: jsonb format is pessimal for toast compression
Date: 2014-08-26 08:40:32
Message-ID: 53FC4800.2030907@vmware.com
Lists: pgsql-hackers

On 08/16/2014 02:19 AM, Tom Lane wrote:
> I think the realistic alternatives at this point are either to
> switch to all-lengths as in my test patch, or to use the hybrid approach
> of Heikki's test patch. IMO the major attraction of Heikki's patch
> is that it'd be upward compatible with existing beta installations,
> ie no initdb required (but thus, no opportunity to squeeze in a version
> identifier either). It's not showing up terribly well in the performance
> tests I've been doing --- it's about halfway between HEAD and my patch on
> that extract-a-key-from-a-PLAIN-stored-column test. But, just as with my
> patch, there are things that could be done to micro-optimize it by
> touching a bit more code.
>
> I did some quick stats comparing compressed sizes for the delicio.us
> data, printing quartiles as per Josh's lead:
>
> all-lengths {440,569,609,655,1257}
> Heikki's patch {456,582,624,671,1274}
> HEAD {493,636,684,744,1485}
>
> (As before, this is pg_column_size of the jsonb within a table whose rows
> are wide enough to force tuptoaster.c to try to compress the jsonb;
> otherwise many of these values wouldn't get compressed.) These documents
> don't have enough keys to trigger the first_success_by issue, so that
> HEAD doesn't look too awful, but still there's about an 11% gain from
> switching from offsets to lengths. Heikki's method captures much of
> that but not all.
>
> Personally I'd prefer to go to the all-lengths approach, but a large
> part of that comes from a subjective assessment that the hybrid approach
> is too messy. Others might well disagree.

It's not too pretty, no. But it would be nice to not have to make a
tradeoff between lookup speed and compressibility.
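
(To spell out why lengths compress so much better than offsets: in an
array of similarly-sized elements the lengths repeat, while the offsets
are all distinct, so pglz finds matches in the former but almost none
in the latter. A toy illustration, not the actual on-disk layout:

#include <stdint.h>

/* eight elements of ~100 bytes each */
uint32_t lengths[8] = {100, 100, 100, 100, 100, 100, 100, 100};
/* highly repetitive -> compresses well */
uint32_t offsets[8] = {0, 100, 200, 300, 400, 500, 600, 700};
/* strictly increasing, every entry distinct -> little for pglz to match */

)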

Yet another idea is to store all lengths, but also add an array of
offsets to JsonbContainer. The array would contain the offset of, say,
every 16th element. It would be very small compared to the lengths
array, but would greatly speed up random access in a large array or object.

- Heikki
