Re: jsonb and nested hstore

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Teodor Sigaev <teodor(at)sigaev(dot)ru>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: jsonb and nested hstore
Date: 2014-02-11 02:11:13
Message-ID: 20140211021113.GF15246@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Is it just me or is jsonapi.h not very well documented?

On 2014-02-06 18:47:31 -0500, Andrew Dunstan wrote:
> +/*
> + * for jsonb we always want the de-escaped value - that's what's in token
> + */
> +static void
> +jsonb_in_scalar(void *state, char *token, JsonTokenType tokentype)
> +{
> + JsonbInState *_state = (JsonbInState *) state;
> + JsonbValue v;
> +
> + v.size = sizeof(JEntry);
> +
> + switch (tokentype)
> + {
> +
> + case JSON_TOKEN_STRING:
> + v.type = jbvString;
> + v.string.len = token ? checkStringLen(strlen(token)) : 0;
> + v.string.val = token ? pnstrdup(token, v.string.len) : NULL;
> + v.size += v.string.len;
> + break;
> + case JSON_TOKEN_NUMBER:
> + v.type = jbvNumeric;
> + v.numeric = DatumGetNumeric(DirectFunctionCall3(numeric_in, CStringGetDatum(token), 0, -1));
> +
> + v.size += VARSIZE_ANY(v.numeric) +sizeof(JEntry) /* alignment */ ;
missing space.

Why does + sizeof(JEntry) change anything about alignment? If it was
aligned before, adding a statically sized value doesn't give any new
guarantees about alignment?

> +/*
> + * jsonb type recv function
> + *
> + * the type is sent as text in binary mode, so this is almost the same
> + * as the input function.
> + */
> +Datum
> +jsonb_recv(PG_FUNCTION_ARGS)
> +{
> + StringInfo buf = (StringInfo) PG_GETARG_POINTER(0);
> + text *result = cstring_to_text_with_len(buf->data, buf->len);
> +
> + return deserialize_json_text(result);
> +}

This is a bit absurd, we're receiving a string in a StringInfo buffer,
just to copy it into text, and then in makeJsonLexContext() access the
raw chars again.

> +static void
> +putEscapedValue(StringInfo out, JsonbValue *v)
> +{
> + switch (v->type)
> + {
> + case jbvNull:
> + appendBinaryStringInfo(out, "null", 4);
> + break;
> + case jbvString:
> + escape_json(out, pnstrdup(v->string.val, v->string.len));
> + break;
> + case jbvBool:
> + if (v->boolean)
> + appendBinaryStringInfo(out, "true", 4);
> + else
> + appendBinaryStringInfo(out, "false", 5);
> + break;
> + case jbvNumeric:
> + appendStringInfoString(out, DatumGetCString(DirectFunctionCall1(numeric_out, PointerGetDatum(v->numeric))));
> + break;
> + default:
> + elog(ERROR, "unknown jsonb scalar type");
> + }
> +}

Hm, will the jbvNumeric always result in correct correct quoting?
datum_to_json() does extra hangups for that case, any reason we don't
need that here?

> +char *
> +JsonbToCString(StringInfo out, char *in, int estimated_len)
> +{
...
> + while (redo_switch || ((type = JsonbIteratorGet(&it, &v, false)) != 0))
> + {
> + redo_switch = false;

Not sure if I see the advantage over the goto here. A comment explaining
what the reason for the goto is wouldhave sufficed.

> + case WJB_KEY:
> + if (first == false)
> + appendBinaryStringInfo(out, ", ", 2);
> + first = true;
> +
> + putEscapedValue(out, &v);
> + appendBinaryStringInfo(out, ": ", 2);

putEscapedValue doesn't gurantee only strings are output, but
datum_to_json does extra hangups for that case.

> + type = JsonbIteratorGet(&it, &v, false);
> + if (type == WJB_VALUE)
> + {
> + first = false;
> + putEscapedValue(out, &v);
> + }
> + else
> + {
> + Assert(type == WJB_BEGIN_OBJECT || type == WJB_BEGIN_ARRAY);
> + /*
> + * We need to rerun current switch() due to put
> + * in current place object which we just got
> + * from iterator.
> + */

"due to put"?

> +/*
> + * jsonb type send function
> + *
> + * Just send jsonb as a string of text
> + */
> +Datum
> +jsonb_send(PG_FUNCTION_ARGS)
> +{
> + Jsonb *jb = PG_GETARG_JSONB(0);
> + StringInfoData buf;
> + char *out;
> +
> + out = JsonbToCString(NULL, (JB_ISEMPTY(jb)) ? NULL : VARDATA(jb), VARSIZE(jb));
> +
> + pq_begintypsend(&buf);
> + pq_sendtext(&buf, out, strlen(out));
> + PG_RETURN_BYTEA_P(pq_endtypsend(&buf));
> +}

Why aren't you using using the stringbuf passing JsonbToCString
convention here to avoid the strlen()?

> +/*
> + * Compare two jbvString JsonbValue values, third argument
> + * 'arg', if it's not null, should be a pointer to bool
> + * value which will be set to true if strings are equal and
> + * untouched otherwise.
> + */
> +int
> +compareJsonbStringValue(const void *a, const void *b, void *arg)
> +{
> + const JsonbValue *va = a;
> + const JsonbValue *vb = b;
> + int res;
> +
> + Assert(va->type == jbvString);
> + Assert(vb->type == jbvString);
> +
> + if (va->string.len == vb->string.len)
> + {
> + res = memcmp(va->string.val, vb->string.val, va->string.len);
> + if (res == 0 && arg)
> + *(bool *) arg = true;

Should be NULL, not 0.

> +/*
> + * qsort helper to compare JsonbPair values, third argument
> + * arg will be trasferred as is to subsequent

*transferred.

> +/*
> + * some constant order of JsonbValue
> + */
> +int
> +compareJsonbValue(JsonbValue *a, JsonbValue *b)
> +{

Called recursively, needs to check for stack depth.

> +JsonbValue *
> +findUncompressedJsonbValueByValue(char *buffer, uint32 flags,
> + uint32 *lowbound, JsonbValue *key)
> +{

Functions like this *REALLY* need documentation for their
parameters. And of their actual purpose.

What's actually the uncompressed bit here? Isn't it actually the
contrary? This is navigating the compressed, non-tree form, no?

> + if (flags & JB_FLAG_ARRAY & header)
> + {
> + JEntry *array = (JEntry *) (buffer + sizeof(header));
> + char *data = (char *) (array + (header & JB_COUNT_MASK));
> + int i;

> + for (i = (lowbound) ? *lowbound : 0; i < (header & JB_COUNT_MASK); i++)
> + {
> + JEntry *e = array + i;

> + else if (JBE_ISSTRING(*e) && key->type == jbvString)
> + {
> + if (key->string.len == JBE_LEN(*e) &&
> + memcmp(key->string.val, data + JBE_OFF(*e),
> + key->string.len) == 0)
> + {

So, here we have our own undocumented! indexing system. Grand.

> + else if (flags & JB_FLAG_OBJECT & header)
> + {
> + JEntry *array = (JEntry *) (buffer + sizeof(header));
> + char *data = (char *) (array + (header & JB_COUNT_MASK) * 2);
> + uint32 stopLow = lowbound ? *lowbound : 0,
> + stopHigh = (header & JB_COUNT_MASK),
> + stopMiddle;

I don't understand what the point of the lowbound logic could be here?
If a key hasn't been found, it hasn't been found? Maybe the idea is to
use it when testing containedness or somesuch? Wouldn't iterating over
the keyspace be a better idea for that case?

> + if (key->type != jbvString)
> + return NULL;

That's not allowed, right?

> +/*
> + * Get i-th value of array or hash. if i < 0 then it counts from
> + * the end of array/hash. Note: returns pointer to statically
> + * allocated JsonbValue.
> + */
> +JsonbValue *
> +getJsonbValue(char *buffer, uint32 flags, int32 i)
> +{
> + uint32 header = *(uint32 *) buffer;
> + static JsonbValue r;

Really? And why on earth would static allocation be a good idea? Specify
it on the caller's stack if need be. Or even return by value, today's
calling convention will just allocate that on the caller's stack without
copying.
Accessing static data isn't even faster.

> + if (JBE_ISSTRING(*e))
> + {
> + r.type = jbvString;
> + r.string.val = data + JBE_OFF(*e);
> + r.string.len = JBE_LEN(*e);
> + r.size = sizeof(JEntry) + r.string.len;
> + }
> + else if (JBE_ISBOOL(*e))
> + {
> + r.type = jbvBool;
> + r.boolean = (JBE_ISBOOL_TRUE(*e)) ? true : false;
> + r.size = sizeof(JEntry);
> + }
> + else if (JBE_ISNUMERIC(*e))
> + {
> + r.type = jbvNumeric;
> + r.numeric = (Numeric) (data + INTALIGN(JBE_OFF(*e)));
> +
> + r.size = 2 * sizeof(JEntry) + VARSIZE_ANY(r.numeric);
> + }
> + else if (JBE_ISNULL(*e))
> + {
> + r.type = jbvNull;
> + r.size = sizeof(JEntry);
> + }
> + else
> + {
> + r.type = jbvBinary;
> + r.binary.data = data + INTALIGN(JBE_OFF(*e));
> + r.binary.len = JBE_LEN(*e) - (INTALIGN(JBE_OFF(*e)) - JBE_OFF(*e));
> + r.size = r.binary.len + 2 * sizeof(JEntry);
> + }

This bit of code exists pretty similarly in several places, maybe consolitate?

> +/****************************************************************************
> + * Walk on tree representation of jsonb *
> + ****************************************************************************/
> +static void
> +walkUncompressedJsonbDo(JsonbValue *v, walk_jsonb_cb cb, void *cb_arg, uint32 level)
> +{
> + int i;

check stack limit.

> +void
> +walkUncompressedJsonb(JsonbValue *v, walk_jsonb_cb cb, void *cb_arg)
> +{
> + if (v)
> + walkUncompressedJsonbDo(v, cb, cb_arg, 0);
> +}
> +
> +/****************************************************************************
> + * Iteration over binary jsonb *
> + ****************************************************************************/

This needs docs.

> +static void
> +parseBuffer(JsonbIterator *it, char *buffer)
> +{

Why invent completely independent naming conventions to the previous
functions here?

> +static bool
> +formAnswer(JsonbIterator **it, JsonbValue *v, JEntry * e, bool skipNested)
> +{

Imaginatively undescriptive name. But if it were slightly more more
abstracted away from JsonbIterator it could be the answer to my prayers
above about removing redundant code.

> +static JsonbIterator *
> +up(JsonbIterator *it)
> +{

Not a good name.

> +int
> +JsonbIteratorGet(JsonbIterator **it, JsonbValue *v, bool skipNested)
> +{
> + int res;

recursive, stack depth check.

> + switch ((*it)->type | (*it)->state)
> + {
> + case JB_FLAG_ARRAY | jbi_start:

I don't know, but I don't see the point in avoid if (), else if()
... constructs if it requires such dirty tricks.

> +/****************************************************************************
> + * Transformation from tree to binary representation of jsonb *
> + ****************************************************************************/
> +typedef struct CompressState
> +{
> + char *begin;
> + char *ptr;
> +
> + struct
> + {
> + uint32 i;
> + uint32 *header;
> + JEntry *array;
> + char *begin;
> + } *levelstate, *lptr, *pptr;
> +
> + uint32 maxlevel;
> +
> +} CompressState;
> +
> +#define curLevelState state->lptr
> +#define prevLevelState state->pptr

brrr.

I stopped looking at code at this point.

> diff --git a/src/backend/utils/adt/jsonfuncs.c b/src/backend/utils/adt/jsonfuncs.c
> index e1d8aae..50ddf50 100644
> --- a/src/backend/utils/adt/jsonfuncs.c
> +++ b/src/backend/utils/adt/jsonfuncs.c

there's lots of whitespace/tab damage in this file. Check git log/diff
--check or such.

This is still a mess, sorry:
* Large and important part continue to be undocumented. Especially in
jsonb_support.c
* Lots of naming inconsistencies.
* There's no documentation about what compressed/uncompressed jsonbs
are. The former is the ondisk representation, the latter the in-memory
tree representation.
* There's no non-code documentation about the on-disk format.

Unfortunately I can't see how this patch could get ready in time for
this CF. There's *lots* of work to be done. The code as is isn't going
to be maintainable. Much of it obvious by simply scanning through the
code, without even looking for higher level issues. And much of it has
previously been pointed out, without getting real attention.

That's not to speak of the nested hstore patch, which I didn't even
start to look at. That's twice this patches size.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2014-02-11 02:51:28 Re: newlines at end of generated SQL
Previous Message Merlin Moncure 2014-02-11 02:03:17 Re: jsonb and nested hstore