Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON

From: Joseph Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: Bernd Helmle <mailings(at)oopsware(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-05 02:22:02
Message-ID: CAARyMpCk-j-CfErYKjhxpEjiUNmwKnWBQkv9mGDRbco3t-5j4w@mail.gmail.com
Lists: pgsql-hackers

Thanks for reviewing my patch!

On Mon, Jul 4, 2011 at 7:10 AM, Bernd Helmle <mailings(at)oopsware(dot)de> wrote:
> +comment = 'data type for storing and manipulating JSON content'
>
> I'm not sure if "manipulating" is a correct description. Maybe I missed it,
> but I didn't see functions to manipulate JSON strings directly, for example,
> adding an object to a JSON string, ...

Indeed. I intend to add manipulation functions in the future. The
words "and manipulating" shouldn't be there yet.

> -- Coding
> ...
> ... It is noticeable that the parser seems to define
> its own is_space() and is_digit() functions, e.g.:
>
> +#define is_space(c) ((c) == '\t' || (c) == '\n' || (c) == '\r' || (c) == '
> ')
>
> Wouldn't it be cleaner to use isspace() instead?

isspace() accepts '\f', '\v', and sometimes other characters as well,
depending on locale. Likewise, isdigit() accepts some non-ASCII
characters in addition to [0-9], depending on the locale and platform.
Thus, to avoid parsing a superset of the JSON spec, I can't use the
<ctype.h> functions. I should add a comment with this explanation.
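
For illustration, the ASCII-only counterpart to the is_space() macro quoted
above would be something along these lines (a sketch, not copied verbatim
from the patch):

/* Accept exactly the digits the JSON grammar allows, regardless of locale. */
#define is_digit(c) ((c) >= '0' && (c) <= '9')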

> This code block in function json_escape_unicode() looks suspicious:
>
> +   /* Convert to UTF-8, if necessary. */
> +   {
> +       const char *orig = json;
> +       json = (const char *)
> +           pg_do_encoding_conversion((unsigned char *) json, length,
> +                                     GetDatabaseEncoding(), PG_UTF8);
> +       if (json != orig)
> +           length = strlen(json);
> +   }
>
> Seems it *always* wants to convert the string. Isn't there an "if" condition
> missing?

pg_do_encoding_conversion does this check already. Perhaps I should
rephrase the comment slightly:

+ /* Convert to UTF-8 (only if necessary). */

> There's a leftover commented code fragment, which should be removed (see last
> two lines of the snippet below):
>
> +static unsigned int
> +read_hex16(const char *in)
> ...
> +               Assert(is_hex_digit(c));
> +
> +               if (c >= '0' && c <= '9')
> +                       tmp = c - '0';
> +               else if (c >= 'A' && c <= 'F')
> +                       tmp = c - 'A' + 10;
> +               else /* if (c >= 'a' && c <= 'f') */
> +                       tmp = c - 'a' + 10;

That comment is there for documentation purposes: it records the range
check that's skipped because we already know c is a hex digit, so at that
line [a-f] is the only thing it could be (and must be). Should it still
be removed?

> json.c introduces new appendStringInfo* functions, e.g.
> appendStringInfoEscape() and appendStringInfoUtf8(). Maybe it's better
> to name them something different, since the backend may someday introduce
> the same notion with a different meaning. They are static, but why not
> name them something like jsonAppendStringInfoUtf8() or similar?

I agree.

> -- Regression Tests
> ...
> It also tests UNICODE sequences and input encoding with a server
> encoding other than UTF-8.

It tests with *client* encodings other than UTF-8, but not with server
encodings other than UTF-8. Is it possible to write tests for different
server encodings?

-- Self-review

There's a minor bug in the normalization code:

> select $$ "\u0009" $$::json;
   json
----------
 "\u0009"
(1 row)

This should produce "\t" instead.
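
The intended rule is roughly the following (a sketch of the normalization
logic, not the patch's actual code): control characters with dedicated JSON
escape sequences should be emitted in their short form, and only the rest
should keep the \u00XX spelling.

switch (c)
{
    case '\b': appendStringInfoString(buf, "\\b"); break;
    case '\f': appendStringInfoString(buf, "\\f"); break;
    case '\n': appendStringInfoString(buf, "\\n"); break;
    case '\r': appendStringInfoString(buf, "\\r"); break;
    case '\t': appendStringInfoString(buf, "\\t"); break;
    default:
        /* Other control characters keep the \u00XX form. */
        appendStringInfo(buf, "\\u%04X", (unsigned int) c);
        break;
}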

I'm thinking that my expect_*/next_*/pop_char/... parsing framework is
overkill, and that, for a syntax as simple as JSON's, visibly messy
parsing code like this:

if (s < e && (*s == 'E' || *s == 'e'))
{
    s++;
    if (s < e && (*s == '+' || *s == '-'))
        s++;
    if (!(s < e && is_digit(*s)))
        return false;
    do
        s++;
    while (s < e && is_digit(*s));
}

would be easier to maintain than the deceptively elegant-looking code
currently in place:

if (optional_char_cond(s, e, *s == 'E' || *s == 'e'))
{
    optional_char_cond(s, e, *s == '+' || *s == '-');
    skip1_pred(s, e, is_digit);
}
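
(To make the comparison concrete, definitions along these lines would make
the two fragments equivalent. These are hypothetical expansions, not the
patch's actual macros; note how, in this sketch, skip1_pred hides a
"return false" inside the macro, which is part of what makes the compact
version deceptive.)

#define optional_char_cond(s, e, cond) \
    (((s) < (e) && (cond)) ? ((s)++, true) : false)

#define skip1_pred(s, e, pred) \
    do { \
        if (!((s) < (e) && pred(*(s)))) \
            return false; \
        do \
            (s)++; \
        while ((s) < (e) && pred(*(s))); \
    } while (0)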

I don't plan on gutting the current macro-driven parser just yet.
When I do, the new parser will support building a parse tree (only
when needed), and the parsing functions will have signatures like
this:

static bool parse_value(const char **sp, const char *e, JsonNode **out);
static bool parse_string(const char **sp, const char *e, StringInfo *out);
...
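
For example, a validating caller could skip tree construction entirely. A
hypothetical sketch (the NULL-means-no-tree convention is just one possible
design, and trailing whitespace is ignored here for brevity):

static bool
json_validate(const char *str, size_t len)
{
    const char *s = str;
    const char *e = str + len;

    /* Passing NULL for the output would mean "don't build a parse tree". */
    return parse_value(&s, e, NULL) && s == e;
}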

The current patch doesn't provide manipulation features where a parse
tree would come in handy. I'd rather hold off on rewriting the parser
until such features are added.

I'll try to submit a revised patch within the next couple days.

Thanks!

- Joey
