Re: jsonb format is pessimal for toast compression

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)heroku(dot)com>
Cc: Claudio Freire <klaussfreire(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "David E(dot) Wheeler" <david(at)justatheory(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Jan Wieck" <jan(at)wi3ck(dot)info>
Subject: Re: jsonb format is pessimal for toast compression
Date: 2014-09-16 18:12:49
Message-ID: 54187DA1.7060106@vmware.com
Lists: pgsql-hackers

On 09/16/2014 07:47 PM, Josh Berkus wrote:
> On 09/16/2014 06:31 AM, Robert Haas wrote:
>> On Mon, Sep 15, 2014 at 7:44 PM, Peter Geoghegan <pg(at)heroku(dot)com> wrote:
>>> On Mon, Sep 15, 2014 at 4:05 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>>> Actually, having the keys all at the same level *is* relevant for the
>>>> issue we're discussing. If those 270 keys are organized in a tree, it's
>>>> not the same as having them all on one level (and not as problematic).
>>>
>>> I believe Robert meant that the 270 keys are not at the top level, but
>>> are at some level (in other words, some object has 270 pairs). That is
>>> equivalent to having them at the top level for the purposes of this
>>> discussion.
>>
>> Yes, that's exactly what I meant.
>>
>>> FWIW, I am slightly concerned about weighing use cases around very
>>> large JSON documents too heavily. Having enormous jsonb documents just
>>> isn't going to work out that well, but neither will equivalent designs
>>> in popular document database systems for similar reasons. For example,
>>> the maximum BSON document size supported by MongoDB is 16 megabytes,
>>> and that seems to be something that their users don't care too much
>>> about. Having 270 pairs in an object isn't unreasonable, but it isn't
>>> going to be all that common either.
>
> Well, I can only judge from the use cases I personally have, none of
> which involve more than 100 keys at any level for most rows. So far
> I've seen some people argue hypothetical use cases involving hundreds of
> keys per level, but nobody who *actually* has such a use case. Also,
> note that we currently don't know where the "last value" extraction
> becomes a performance problem at this stage, except that it's somewhere
> between 200 and 100,000. Also, we don't have a test which shows the
> hybrid approach (Heikki's patch) performing better with 1000's of keys.
>
> Basically, if someone is going to make a serious case for Heikki's
> hybrid approach over the simpler lengths-only approach, then please post
> some test data showing the benefit ASAP, since I can't demonstrate it.
> Otherwise, let's get beta 3 out the door so we can get the 9.4 release
> train moving again.

Are you looking for someone with a real-life scenario, or just a
synthetic test case? The latter is easy to do.
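
Roughly, the test builds a jsonb object with N keys for a range of N and
times how long it takes to fetch the last key. Here's a minimal sketch
of that shape (the names and the exact loop are illustrative, not the
attached script itself):

-- Sketch only: for a geometrically growing key count n, build an object
-- with keys k1 .. kn and time fetching the last one.
CREATE TABLE testtimes (elem int, duration_ms float8);

DO $$
DECLARE
  n     int := 10;
  iters int := 1000;
  doc   jsonb;
  t0    timestamptz;
BEGIN
  WHILE n < 10000 LOOP
    SELECT json_object_agg('k' || i, i)::jsonb INTO doc
      FROM generate_series(1, n) i;

    t0 := clock_timestamp();
    FOR i IN 1 .. iters LOOP
      PERFORM doc -> ('k' || n);   -- always fetch the last key
    END LOOP;

    INSERT INTO testtimes
      SELECT n, extract(epoch FROM clock_timestamp() - t0) * 1000 / iters;

    n := n + greatest(1, n / 10);  -- grow the key count by ~10% per step
  END LOOP;
END;
$$;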

See the attached test program for the actual script; it's basically the
same one I posted earlier. Here are the results from my laptop with
Tom's jsonb-lengths-merged.patch:

postgres=# select * from testtimes ;
elem | duration_ms
------+-------------
11 | 0.289508
12 | 0.288122
13 | 0.290558
14 | 0.287889
15 | 0.286303
17 | 0.290415
19 | 0.289829
21 | 0.289783
23 | 0.287104
25 | 0.289834
28 | 0.290735
31 | 0.291844
34 | 0.293454
37 | 0.293866
41 | 0.291217
45 | 0.289243
50 | 0.290385
55 | 0.292085
61 | 0.290892
67 | 0.292335
74 | 0.292561
81 | 0.291416
89 | 0.295714
98 | 0.29844
108 | 0.297421
119 | 0.299471
131 | 0.299877
144 | 0.301604
158 | 0.303365
174 | 0.304203
191 | 0.303596
210 | 0.306526
231 | 0.304189
254 | 0.307782
279 | 0.307372
307 | 0.306873
338 | 0.310471
372 | 0.3151
409 | 0.320354
450 | 0.32038
495 | 0.322127
545 | 0.323256
600 | 0.330419
660 | 0.334226
726 | 0.336951
799 | 0.34108
879 | 0.347746
967 | 0.354275
1064 | 0.356696
1170 | 0.366906
1287 | 0.375352
1416 | 0.392952
1558 | 0.392907
1714 | 0.402157
1885 | 0.412384
2074 | 0.425958
2281 | 0.435415
2509 | 0.45301
2760 | 0.469983
3036 | 0.487329
3340 | 0.505505
3674 | 0.530412
4041 | 0.552585
4445 | 0.581815
4890 | 0.610509
5379 | 0.642885
5917 | 0.680395
6509 | 0.713849
7160 | 0.757561
7876 | 0.805225
8664 | 0.856142
9530 | 0.913255
(72 rows)

That's up to 9530 elements; it's easy to extrapolate from there to
higher counts, since the lookup is O(n): with lengths only, finding the
last element means walking through the lengths of all the elements
before it.

With unpatched git master, the runtime is flat at about 0.29 ms,
regardless of which element is queried. With
jsonb-with-offsets-and-lengths-2.patch, there's likewise no difference
that I could measure.
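
To spot-check a single size without the timing harness, comparing
first-key and last-key extraction on one large stored document shows the
same pattern. Again just a sketch; the table and key names are made up:

-- One row holding a ~10000-key object; with \timing on in psql, compare:
CREATE TABLE bigdoc AS
  SELECT json_object_agg('k' || i, i)::jsonb AS j
    FROM generate_series(1, 10000) i;

SELECT j -> 'k1'     FROM bigdoc;  -- first key
SELECT j -> 'k10000' FROM bigdoc;  -- last key: per the numbers above, this
                                   -- should be measurably slower with the
                                   -- lengths-only patch, and the same as
                                   -- the first key on unpatched master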

The difference starts to be meaningful at around 500 entries. In
practice, I doubt anyone's going to notice until you start talking about
tens of thousands of entries.

I'll leave it up to the jury to decide whether we care. It seems like a
fairly unusual use case: you'd have to push around arrays or objects
large enough to notice. Then again, I'm sure *someone* will do it.
People do strange things, and they find ways to abuse features that the
original developers never thought of.

- Heikki

Attachment Content-Type Size
jsonb-lengths.sql application/sql 875 bytes
