Re: additional json functionality

From: Maciej Gajewski <maciej(dot)gajewski0(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: additional json functionality
Date: 2013-11-21 10:21:24
Message-ID: CAEcSYXKuq2o6GM+2pasEsfhCz5y_nEUdPYegcncdcRyorx0w-A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi everyone

I used to work on a project storing large quantities of schema-less data,
initially using MongoDB, then Postgres with JSON, and eventually I
implemented BSON support for Postgres to get the best of both worlds:
https://github.com/maciekgajewski/postgresbson

I don't think that JSONB is a good idea. There is a lot to learn from
MongoDB's mistakes in this area.

1. As noted in this thread previously, JSON is a serialization format, not
a document format.

2. Almost any structured data type, self-describing or not, can be
serialized to/from JSON, but always using only subset of it, and
interpreting it in it's own specific way.

3. JSON greatest strength is interoperability. It is a great feature of
Postgres that JSON is stored as a text; it's basically a 'text, but you can
do something with it'. There is many JSON implementations out there, and
one should make no assumption about application's expectations.
For instance: JSON standard (RFS-4627) defines all number to be doubles.
Yet I've seen application storing 64-bit integers (wouldn't fit in double
precision), or even arbitrary precision integers. Most parsers are OK with
that.

4. JSON greatest weakness is performance. Because of 1. it needs to be
parsed before any useful information is extracted.

5. 1. and 3. are mutually exclusive; this is one of the most valuable
takeaways I have from working with Mongo and BSON in particular. BSON is an
attempt to create 'binary JSON', and a failed one. It is a poor
serialization format: faster than JSON, but less flexible. Being binary, it
is strongly typed, and it uses various gimmicks to preserve flexibility:
implicit type casts, 3 different equality comparison functions etc. And
it's not fully convertible to/from JSON; no binary format is.
It is a poor document format as well: it retains some of the JSON's
performance problems: serial nature and ineffective storage.

6. Speed matters to some, and being able to generate binary data in
application and send it to database without any serialization/parsing in
between provides great optimization opportunity. One thing that Mongo guys
got right is the fixed, well-defined binary representation. Application can
use provided library to generate objects, and doesn't need to worry about
server's version or endianess.

In the application I've mention before, switching from JSON to BSON (in
Postgres 9.2, using postgresbson) increased throughput by an order of
magnitude. It was an insert-heavy database with indexes on object fields.
Both serializing in application and desalinizing in server was faster ~10x.

7. It seems to me that JSONB is going to repeat all the mistakes of BSON,
it's going to be 'neither'. If there is an agreement that Postgres needs a
'document' format, why not acknowledge 5., and simply adopt one of the
existing formats. Or even better: adopt none, provide many, provide binary
send/recv and conversion to and from JSON, let the user choose.

The world is full of self-describing binary formats: BSON, MessagePack (
http://msgpack.org/), protobuf, hierarchical H-Store is coming along.
Adding another one would create confusion, and a situation similar to this:
http://xkcd.com/927/

And a side note:

Postgres' greatest and most under-advertised feature is it's extensibility.
People tend to notice only the features present in the core package, while
there should be a huge banner on top of http://www.postgresql.org/: "Kids,
we support all data types: we have XML, we have JSON, we have H-store, we
have BSON, and all it with build-in indexing, storage compression and full
transaction support!"

Maciej G.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message David Rowley 2013-11-21 10:52:37 b21de4e7b32f868a23bdc5507898d36cbe146164 seems to be two bricks shy of a load
Previous Message Atri Sharma 2013-11-21 10:04:52 Re: WITHIN GROUP patch