Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON

From: Bernd Helmle <mailings(at)oopsware(dot)de>
To: Bernd Helmle <mailings(at)oopsware(dot)de>, Joseph Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-04 11:10:32
Message-ID: A6AFD432F5F6AAB84C0E7D1B@apophis.local
Lists: pgsql-hackers

--On 18 June 2011 12:29:38 +0200 Bernd Helmle <mailings(at)oopsware(dot)de> wrote:

>> Similar problems occur with a couple other modules I tried (hstore,
>> intarray).
>
> Hmm, works for me. Seems you have messed up your installation in some way
> (build against current -HEAD but running against a 9.1?).
>
> I'm going to review in the next couple of days again.

A bit later than expected, but here is my summary of the patch:

-- Patch Status

The patch applies cleanly. Since it's primarily targeted as an addition to
contrib/, it doesn't touch the backend source tree (besides documentation TOC
entries), but adds a new subdirectory, contrib/json. The patch is in context
diff format, as requested.

-- Functionality

The patch as it stands provides a basic implementation of a JSON datatype,
including normalization and validation of JSON strings. The datatype
implementation is packaged as an extension.

The implementation focuses on the datatype functionality only; there is no
additional support for special operators. Thus, I think the comment in the
control file (and the docs) actually promises more than the contrib module
will deliver:

+comment = 'data type for storing and manipulating JSON content'

I'm not sure "manipulating" is an accurate description. Maybe I missed it,
but I didn't see functions to manipulate JSON strings directly, for example,
adding an object to a JSON string, ...

-- Coding

The JSON datatype is implemented as a variable-length character type, thus
allowing very large JSON strings to be stored. It transparently decides when
to escape Unicode code points depending on the selected client and server
encodings.

Validation is done through its own JSON parser, a recursive descent parser.
It is noticeable that the parser defines its own is_space() and is_digit()
functions, e.g.:

+#define is_space(c) ((c) == '\t' || (c) == '\n' || (c) == '\r' || (c) == ' ')

Wouldn't it be cleaner to use isspace() instead?

This code block in function json_escape_unicode() looks suspicious:

+   /* Convert to UTF-8, if necessary. */
+   {
+       const char *orig = json;
+       json = (const char *)
+           pg_do_encoding_conversion((unsigned char *) json, length,
+                                     GetDatabaseEncoding(), PG_UTF8);
+       if (json != orig)
+           length = strlen(json);
+   }

It seems it *always* wants to convert the string. Isn't there an "if"
condition missing?

There's a leftover commented-out code fragment, which should be removed (see
the last two lines of the snippet below):

+static unsigned int
+read_hex16(const char *in)
+{
+   unsigned int i;
+   unsigned int tmp;
+   unsigned int ret = 0;
+
+   for (i = 0; i < 4; i++)
+   {
+       char c = *in++;
+
+       Assert(is_hex_digit(c));
+
+       if (c >= '0' && c <= '9')
+           tmp = c - '0';
+       else if (c >= 'A' && c <= 'F')
+           tmp = c - 'A' + 10;
+       else /* if (c >= 'a' && c <= 'f') */
+           tmp = c - 'a' + 10;

json.c introduces new appendStringInfo* functions, e.g.
appendStringInfoEscape() and appendStringInfoUtf8(). Maybe it's better to
name them something different, since the backend may someday introduce the
same notion with a different meaning. They are static, but why not name them
something like jsonAppendStringInfoUtf8() or similar?

-- Regression Tests

The patch includes a fairly complete regression test suite, which covers much
of the formatting, validation and input functionality of the JSON datatype. It
also tests Unicode sequences and input encoding with server encodings other
than UTF-8.

--
Thanks

Bernd


From: Joseph Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: Bernd Helmle <mailings(at)oopsware(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-05 02:22:02
Message-ID: CAARyMpCk-j-CfErYKjhxpEjiUNmwKnWBQkv9mGDRbco3t-5j4w@mail.gmail.com
Lists: pgsql-hackers

Thanks for reviewing my patch!

On Mon, Jul 4, 2011 at 7:10 AM, Bernd Helmle <mailings(at)oopsware(dot)de> wrote:
> +comment = 'data type for storing and manipulating JSON content'
>
> I'm not sure, if "manipulating" is a correct description. Maybe i missed it,
> but i didn't see functions to manipulate JSON strings directly, for example,
> adding an object to a JSON string, ...

Indeed. I intend to add manipulation functions in the future. The
words "and manipulating" shouldn't be there yet.

> -- Coding
> ...
> ... It is noticable that the parser seems to define
> its own is_space() and is_digit() functions, e.g.:
>
> +#define is_space(c) ((c) == '\t' || (c) == '\n' || (c) == '\r' || (c) == ' ')
>
> Wouldn't it be more clean to use isspace() instead?

isspace() accepts '\f', '\v', and sometimes other characters as well,
depending on locale. Likewise, isdigit() accepts some non-ASCII
characters in addition to [0-9], depending on the locale and platform.
Thus, to avoid parsing a superset of the JSON spec, I can't use the
<ctype.h> functions. I should add a comment with this explanation.
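
For what it's worth, here's a tiny standalone program (not from the patch)
that shows the difference even in the default C locale:

#include <ctype.h>
#include <stdio.h>

/* Strict JSON definitions (same idea as the macros in the patch) */
#define is_space(c) ((c) == '\t' || (c) == '\n' || (c) == '\r' || (c) == ' ')
#define is_digit(c) ((c) >= '0' && (c) <= '9')

int main(void)
{
    /* Even in the C locale, isspace() accepts '\f' and '\v',
     * which the JSON grammar does not treat as whitespace. */
    printf("isspace('\\f') = %d, is_space('\\f') = %d\n",
           isspace((unsigned char) '\f') != 0, is_space('\f'));
    printf("isspace('\\v') = %d, is_space('\\v') = %d\n",
           isspace((unsigned char) '\v') != 0, is_space('\v'));
    return 0;
}

With other locales, isspace() and isdigit() can accept even more characters.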

> This code block in function json_escape_unicode() looks suspicious:
>
> +   /* Convert to UTF-8, if necessary. */
> +   {
> +       const char *orig = json;
> +       json = (const char *)
> +           pg_do_encoding_conversion((unsigned char *) json, length,
> +                                     GetDatabaseEncoding(), PG_UTF8);
> +       if (json != orig)
> +           length = strlen(json);
> +   }
>
> Seems it *always* wants to convert the string. Isn't there a "if" condition
> missing?

pg_do_encoding_conversion does this check already. Perhaps I should
rephrase the comment slightly:

+ /* Convert to UTF-8 (only if necessary). */

> There's a commented code fragment missing, which should be removed (see last
> two lines of the snippet below):
>
> +static unsigned int
> +read_hex16(const char *in)
> ...
> +               Assert(is_hex_digit(c));
> +
> +               if (c >= '0' && c <= '9')
> +                       tmp = c - '0';
> +               else if (c >= 'A' && c <= 'F')
> +                       tmp = c - 'A' + 10;
> +               else /* if (c >= 'a' && c <= 'f') */
> +                       tmp = c - 'a' + 10;

That comment is there for documentation purposes, to indicate the
range check that's skipped because we know c is a hex digit, and [a-f]
is the only thing it could be (and must be) at that line. Should it
still be removed?
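
For reference, here is the helper as a self-contained sketch. The end of the
loop was trimmed from the snippet you quoted, so the shift-and-or accumulation
below is my assumption of the obvious completion, not a quote from the patch:

#include <assert.h>
#include <stdio.h>

/* stand-in for the patch's is_hex_digit() */
#define is_hex_digit(c) (((c) >= '0' && (c) <= '9') || \
                         ((c) >= 'A' && (c) <= 'F') || \
                         ((c) >= 'a' && (c) <= 'f'))

/* Parse exactly four hex digits (the XXXX of a \uXXXX escape). */
static unsigned int
read_hex16(const char *in)
{
    unsigned int i;
    unsigned int tmp;
    unsigned int ret = 0;

    for (i = 0; i < 4; i++)
    {
        char c = *in++;

        assert(is_hex_digit(c));

        if (c >= '0' && c <= '9')
            tmp = c - '0';
        else if (c >= 'A' && c <= 'F')
            tmp = c - 'A' + 10;
        else                            /* 'a'..'f', guaranteed by the assert */
            tmp = c - 'a' + 10;

        ret = (ret << 4) | tmp;         /* accumulation (my assumption) */
    }
    return ret;
}

int main(void)
{
    printf("%04X\n", read_hex16("266b"));   /* prints 266B */
    return 0;
}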

> json.c introduces new appendStringInfo* functions, e.g.
> appendStringInfoEscape() and appendStringInfoUtf8(). Maybe it's better
> to name them to something different, since it may occur someday that the
> backend
> will introduce the same notion with a different meaning. They are static,
> but why not naming them into something like jsonAppendStringInfoUtf8() or
> similar?

I agree.

> -- Regression Tests
...
> It also tests UNICODE sequences and input encoding with other server
> encoding than UTF-8.

It tests with other *client* encodings than UTF-8, but not other
server encodings than UTF-8. Is it possible to write tests for
different server encodings?

-- Self-review

There's a minor bug in the normalization code:

> select $$ "\u0009" $$::json;
json
----------
"\u0009"
(1 row)

This should produce "\t" instead.

I'm thinking that my expect_*/next_*pop_char/... parsing framework is
overkill, and that, for a syntax as simple as JSON's, visibly messy
parsing code like this:

if (s < e && (*s == 'E' || *s == 'e'))
{
    s++;
    if (s < e && (*s == '+' || *s == '-'))
        s++;
    if (!(s < e && is_digit(*s)))
        return false;
    do
        s++;
    while (s < e && is_digit(*s));
}

would be easier to maintain than the deceptively elegant-looking code
currently in place:

if (optional_char_cond(s, e, *s == 'E' || *s == 'e'))
{
    optional_char_cond(s, e, *s == '+' || *s == '-');
    skip1_pred(s, e, is_digit);
}
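
For comparison, here is the explicit style fleshed out into a complete,
standalone scanner for the whole JSON number grammar (a sketch, not code
from the patch):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define is_digit(c) ((c) >= '0' && (c) <= '9')

/* Validate a JSON number in [*sp, e); advance *sp past it on success. */
static bool
scan_number(const char **sp, const char *e)
{
    const char *s = *sp;

    if (s < e && *s == '-')
        s++;

    if (s < e && *s == '0')
        s++;                            /* a leading zero stands alone */
    else
    {
        if (!(s < e && is_digit(*s)))
            return false;
        do
            s++;
        while (s < e && is_digit(*s));
    }

    if (s < e && *s == '.')
    {
        s++;
        if (!(s < e && is_digit(*s)))
            return false;
        do
            s++;
        while (s < e && is_digit(*s));
    }

    if (s < e && (*s == 'E' || *s == 'e'))
    {
        s++;
        if (s < e && (*s == '+' || *s == '-'))
            s++;
        if (!(s < e && is_digit(*s)))
            return false;
        do
            s++;
        while (s < e && is_digit(*s));
    }

    *sp = s;
    return true;
}

int main(void)
{
    const char *num = "-12.5e+3";
    const char *end = num + strlen(num);
    const char *p = num;

    printf("%d\n", scan_number(&p, end) && p == end);   /* prints 1 */
    return 0;
}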

I don't plan on gutting the current macro-driven parser just yet.
When I do, the new parser will support building a parse tree (only
when needed), and the parsing functions will have signatures like
this:

static bool parse_value(const char **sp, const char *e, JsonNode **out);
static bool parse_string(const char **sp, const char *e, StringInfo *out);
...

The current patch doesn't provide manipulation features where a parse
tree would come in handy. I'd rather hold off on rewriting the parser
until such features are added.

I'll try to submit a revised patch within the next couple days.

Thanks!

- Joey


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Joseph Adams <joeyadams3(dot)14159(at)gmail(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-10 18:09:04
Message-ID: 4E19EAC0.7080701@agliodbs.com
Lists: pgsql-hackers

On 7/4/11 7:22 PM, Joseph Adams wrote:
> I'll try to submit a revised patch within the next couple days.

So? New patch?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: Bernd Helmle <mailings(at)oopsware(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-15 19:56:56
Message-ID: CAARyMpC2kHSrybY3SDx3G=cBzA41zCn5gApZ-Z0vN3EG7Re-9A@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 4, 2011 at 10:22 PM, Joseph Adams
<joeyadams3(dot)14159(at)gmail(dot)com> wrote:
> I'll try to submit a revised patch within the next couple days.

Sorry this is later than I said.

I addressed the issues covered in the review. I also fixed a bug
where "\u0022" would become """, which is invalid JSON, causing an
assertion failure.

However, I want to put this back into WIP for a number of reasons:

* The current code accepts invalid surrogate pairs (e.g.
"\uD800\uD800"). The problem with accepting them is that it would be
inconsistent with PostgreSQL's Unicode support, and with the Unicode
standard itself. In my opinion: as long as the server encoding is
universal (i.e. UTF-8), decoding a JSON-encoded string should not fail
(barring data corruption and resource limitations).

* I'd like to go ahead with the parser rewrite I mentioned earlier.
The new parser will be able to construct a parse tree when needed, and
it won't use those overkill parsing macros.

* I recently learned that not all supported server encodings can be
converted to Unicode losslessly. The current code, on output,
converts non-ASCII characters to Unicode escapes under some
circumstances (see the comment above json_need_to_escape_unicode).
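
Regarding the first point, the check I have in mind is roughly the following
standalone sketch (the names are placeholders, not patch code):

#include <stdbool.h>
#include <stdio.h>

/*
 * Given the code units of one or two \uXXXX escapes, decide whether they
 * form a valid Unicode scalar value, and compute it if so.  "hi" is the
 * first unit; "lo" is the following unit, or -1 if there is none (in which
 * case the caller has not consumed a second escape).
 */
static bool
combine_escapes(unsigned int hi, int lo, unsigned int *cp)
{
    if (hi >= 0xD800 && hi <= 0xDBFF)
    {
        /* High surrogate: must be followed by a low surrogate. */
        if (lo < 0xDC00 || lo > 0xDFFF)
            return false;
        *cp = 0x10000 + ((hi - 0xD800) << 10) + ((unsigned int) lo - 0xDC00);
        return true;
    }
    if (hi >= 0xDC00 && hi <= 0xDFFF)
        return false;           /* lone low surrogate */
    *cp = hi;
    return true;
}

int main(void)
{
    unsigned int cp;

    printf("%d\n", combine_escapes(0xD800, 0xD800, &cp)); /* 0: invalid pair */
    printf("%d\n", combine_escapes(0xD834, 0xDD1E, &cp)); /* 1: valid pair   */
    printf("%X\n", cp);                                   /* 1D11E           */
    return 0;
}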

I'm having a really hard time figuring out how the JSON module should
handle non-Unicode character sets. \uXXXX escapes in JSON literals
can be used to encode characters not available in the server encoding.
On the other hand, the server encoding can encode characters not
present in Unicode (see the third bullet point above). This means
JSON normalization and comparison (along with member lookup) are not
possible in general.

Even if I assume server -> UTF-8 -> server transcoding is lossless,
the situation is still ugly. Normalization could be implemented a few
ways:

* Convert from server encoding to UTF-8, and convert \uXXXX escapes
to UTF-8 characters. This is space-efficient, but the resulting text
would not be compatible with the server encoding (which may or may not
matter).
* Convert from server encoding to UTF-8, and convert all non-ASCII
characters to \uXXXX escapes, resulting in pure ASCII. This bloats
the text by a factor of three, in the worst case.
* Convert \uXXXX escapes to characters in the server encoding, but
only where possible. This would be extremely inefficient.

The parse tree (for functions that need it) will need to store JSON
member names and strings either in UTF-8 or in normalized JSON (which
could be the same thing).

Help and advice would be appreciated. Thanks!

- Joey

Attachment Content-Type Size
json-contrib-rev1-20110714.patch.gz application/x-gzip 17.4 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Cc: Bernd Helmle <mailings(at)oopsware(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-18 19:00:43
Message-ID: CA+TgmoZgKCRLTMd+vmAJh3Dgfayz4=GG19yue8U9nc-ckxK_Tw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 15, 2011 at 3:56 PM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
> On Mon, Jul 4, 2011 at 10:22 PM, Joseph Adams
> <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
>> I'll try to submit a revised patch within the next couple days.
>
> Sorry this is later than I said.
>
> I addressed the issues covered in the review.  I also fixed a bug
> where "\u0022" would become """, which is invalid JSON, causing an
> assertion failure.
>
> However, I want to put this back into WIP for a number of reasons:
>
>  * The current code accepts invalid surrogate pairs (e.g.
> "\uD800\uD800").  The problem with accepting them is that it would be
> inconsistent with PostgreSQL's Unicode support, and with the Unicode
> standard itself.  In my opinion: as long as the server encoding is
> universal (i.e. UTF-8), decoding a JSON-encoded string should not fail
> (barring data corruption and resource limitations).
>
>  * I'd like to go ahead with the parser rewrite I mentioned earlier.
> The new parser will be able to construct a parse tree when needed, and
> it won't use those overkill parsing macros.
>
>  * I recently learned that not all supported server encodings can be
> converted to Unicode losslessly.  The current code, on output,
> converts non-ASCII characters to Unicode escapes under some
> circumstances (see the comment above json_need_to_escape_unicode).
>
> I'm having a really hard time figuring out how the JSON module should
> handle non-Unicode character sets.  \uXXXX escapes in JSON literals
> can be used to encode characters not available in the server encoding.
>  On the other hand, the server encoding can encode characters not
> present in Unicode (see the third bullet point above).  This means
> JSON normalization and comparison (along with member lookup) are not
> possible in general.

I previously suggested that, instead of trying to implement JSON, you
should just try to implement
JSON-without-the-restriction-that-everything-must-be-UTF8. Most
people are going to be using UTF-8 simply because it's the default,
and if you forget about transcoding then ISTM that this all becomes a
lot simpler. We don't, in general, have the ability to support data
in multiple encodings inside PostgreSQL, and it seems to me that by
trying to invent a mechanism for making that work as part of this
patch, you are setting the bar for yourself awfully high.

One thing to think about here is that transcoding between UTF-8 and
the server encoding seems like the wrong thing all around. After all,
the user does not want the data in the server encoding; they want it
in their chosen client encoding. If you are transcoding between UTF-8
and the server encoding, then that suggests that there's some
double-transcoding going on here, which creates additional
opportunities for (1) inefficiency and (2) outright failure. I'm
guessing that's because you're dealing with an interface that expects
the internal representation of the datum on one side and the server
encoding on the other side, which gets back to the point in the
preceding paragraph. You'd probably need to revise that interface in
order to make this really work the way it should, and that might be
more than you want to get into. At any rate, it probably is a
separate project from making JSON work.

If in spite of the above you're bent on continuing down your present
course, then it seems to me that you'd better make the on-disk
representation UTF-8, with all \uXXXX escapes converted to the
corresponding characters. If you hit an invalid surrogate pair, or a
character that exists in the server encoding but not UTF-8, it's not a
legal JSON object and you throw an error on input, just as you would
for mismatched braces or similar. On output, you should probably just
use \uXXXX to represent any unrepresentable characters - i.e. option 3
from your original list. That may be slow, but I think that it's not
really worth devoting a lot of mental energy to this case. Most
people are going to be using UTF-8 because that's the default, and
those who are not shouldn't expect a data format built around UTF-8 to
work perfectly in their environment, especially if they insist on
using characters that are representable in only some of the encodings
they are using.

But, again, why not just forget about transcoding and define it as
"JSON, if you happen to be using utf-8 as the server encoding, and
otherwise some variant of JSON that uses the server encoding as its
native format?". It seems to me that that would be a heck of a lot
simpler and more reliable, and I'm not sure it's any less useful in
practice.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-18 19:19:11
Message-ID: 4091.1311016751@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Jul 15, 2011 at 3:56 PM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
>> I'm having a really hard time figuring out how the JSON module should
>> handle non-Unicode character sets.

> But, again, why not just forget about transcoding and define it as
> "JSON, if you happen to be using utf-8 as the server encoding, and
> otherwise some variant of JSON that uses the server encoding as its
> native format?". It seems to me that that would be a heck of a lot
> simpler and more reliable, and I'm not sure it's any less useful in
> practice.

Right offhand, the only argument I can see against this is that we might
someday want to pass JSON datums directly to third-party loadable
libraries that are picky about UTF8-ness. But we could just document
and/or enforce that such libraries can only be used in UTF8-encoded
databases.

BTW, could the \uNNNN problem be finessed by leaving such escapes in
source form?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-18 19:38:16
Message-ID: CA+TgmoZa_i3t6MVQBxhrwfVU-mR_BxDRHRQa=2GDL6sdzLQ7EQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 18, 2011 at 3:19 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Fri, Jul 15, 2011 at 3:56 PM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
>>> I'm having a really hard time figuring out how the JSON module should
>>> handle non-Unicode character sets.
>
>> But, again, why not just forget about transcoding and define it as
>> "JSON, if you happen to be using utf-8 as the server encoding, and
>> otherwise some variant of JSON that uses the server encoding as its
>> native format?".  It seems to me that that would be a heck of a lot
>> simpler and more reliable, and I'm not sure it's any less useful in
>> practice.
>
> Right offhand, the only argument I can see against this is that we might
> someday want to pass JSON datums directly to third-party loadable
> libraries that are picky about UTF8-ness.  But we could just document
> and/or enforce that such libraries can only be used in UTF8-encoded
> databases.

Right. Or, if someone someday implements a feature to allow multiple
server encodings within the same database, then we can tell people to
use it if they want JSON to work in a spec-canonical way.

> BTW, could the \uNNNN problem be finessed by leaving such escapes in
> source form?

Joey seems to want to canonicalize the input, which argues against
that, and to detect invalid surrogate pairs, which likewise argues
against that. You could argue that's overkill, but it seems to have
at least some merit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-18 22:17:10
Message-ID: CAARyMpApR22rU4T3FBmieAuyfidcNeC+n8PGSOjerVa9k8St1w@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 18, 2011 at 3:19 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> BTW, could the \uNNNN problem be finessed by leaving such escapes in
> source form?

Yes, it could. However, it doesn't solve the problem of comparison
(needed for member lookup), which requires canonicalizing the strings
to be compared.

Here's a Unicode handling policy I came up with. It is guaranteed to
be lossless as long as the client and database encodings are the same.

---

On input (json_in), if the text is valid JSON, it is condensed:

* Whitespace characters surrounding tokens are removed.
* Long escapes (like \u0022) are converted to short escapes (like \")
where possible.
* Unnecessary escapes of ASCII characters (e.g. \u0061 and even
\u007F) are converted to their respective characters.
* Escapes of non-ASCII characters (e.g. \u0080, \u266B, \uD834\uDD1E)
are converted to their respective characters, but only if the database
encoding is UTF-8.

On output (json_out), non-ASCII characters are converted to \uXXXX
escapes, unless one or more of these very likely circumstances hold:

* The client encoding and database encoding are the same. No
conversion is performed, so escaping characters will not prevent any
conversion errors.
* The client encoding is UTF-8. Escaping is not necessary because
the client can encode all Unicode codepoints.
* The client encoding and/or database encoding is SQL_ASCII.
SQL_ASCII tells PostgreSQL to shirk transcoding in favor of speed.

When a JSON-encoded string is unwrapped using from_json (e.g.
from_json($$ "\u00A1Hola!" $$)), escapes will be converted to the
characters they represent. If any escapes cannot be represented in
the database encoding, an error will be raised. Note that:

* If the database encoding is UTF-8, conversion will never fail.
* If the database encoding is SQL_ASCII, conversion will fail if any
escapes of non-ASCII characters are present.

---
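
To illustrate the condensing rules above for ASCII code units, here is a
standalone sketch (hypothetical helper name; the real code would append to a
StringInfo instead of a plain buffer):

#include <stdio.h>
#include <string.h>

/*
 * Append the condensed form of an escaped ASCII code unit c (0x00..0x7F)
 * to buf: a short escape where JSON requires or provides one, the literal
 * character where it may appear unescaped, and \uXXXX for other controls.
 */
static void
condense_ascii_escape(char *buf, unsigned int c)
{
    switch (c)
    {
        case '"':  strcat(buf, "\\\""); return;
        case '\\': strcat(buf, "\\\\"); return;
        case '\b': strcat(buf, "\\b");  return;
        case '\f': strcat(buf, "\\f");  return;
        case '\n': strcat(buf, "\\n");  return;
        case '\r': strcat(buf, "\\r");  return;
        case '\t': strcat(buf, "\\t");  return;
        default:
            if (c < 0x20)
                sprintf(buf + strlen(buf), "\\u%04X", c);   /* other controls */
            else
                sprintf(buf + strlen(buf), "%c", (int) c);  /* plain ASCII */
            return;
    }
}

int main(void)
{
    char buf[64] = "";

    condense_ascii_escape(buf, 0x0009);     /* \u0009 -> \t     */
    condense_ascii_escape(buf, 0x0061);     /* \u0061 -> a      */
    condense_ascii_escape(buf, 0x0022);     /* \u0022 -> \"     */
    condense_ascii_escape(buf, 0x0001);     /* \u0001 stays     */
    printf("%s\n", buf);                    /* \ta\"\u0001      */
    return 0;
}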

However, I'm having a really hard time figuring out how comparison
would work in this framework. Here are a few options:

1. Convert the strings to UTF-8, convert the escapes to characters,
and compare the strings.
2. Convert the escapes to the database encoding, then compare the strings.
3. If either string contains escapes of non-ASCII characters, do 1.
Otherwise, do 2.

Number 1 seems the most sane to me, but could lead to rare errors.

Number 3 could produce confusing results. If character set X has
three different representations of one Unicode codepoint, then we
could have scenarios like this (simplified):

"abc♫" != "aaa♫"

but:

"abc\u266b" == "aaa♫"

I suppose a simple solution would be to convert all escapes and
outright ban escapes of characters not in the database encoding. This
would have the nice property that all strings can be unescaped
server-side. Problem is, what if a browser or other program produces,
say, \u00A0 (NO-BREAK SPACE), and tries to insert it into a database
where the encoding lacks this character?

On the other hand, converting all JSON to UTF-8 would be simpler to
implement. It would probably be more intuitive, too, given that the
JSON RFC says, "JSON text SHALL be encoded in Unicode."

In any case, the documentation should say "Use UTF-8 for best
results", as there seems to be no entirely satisfactory way to handle
JSON in other database encodings.

- Joey


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-18 23:36:31
Message-ID: 6D8C16B0-E98C-48B5-899A-6566C5E9A0AD@phlo.org
Lists: pgsql-hackers

On Jul19, 2011, at 00:17 , Joey Adams wrote:
> I suppose a simple solution would be to convert all escapes and
> outright ban escapes of characters not in the database encoding.

+1. Making JSON work like TEXT when it comes to encoding issues
makes this all much simpler conceptually. It also avoids all kinds
of weird issues if you extract textual values from a JSON document
server-side.

If we really need more flexibility than that, we should look at
ways to allow different columns to have different encodings. Doing
that just for JSON seems wrong - especially because it doesn't really
reduce the complexity of the problem, as your examples show. The
essential problem here is, AFAICS, that there's really no sane way to
compare strings in two different encodings, unless both encode a
subset of Unicode only.

> This would have the nice property that all strings can be unescaped
> server-side. Problem is, what if a browser or other program produces,
> say, \u00A0 (NO-BREAK SPACE), and tries to insert it into a database
> where the encoding lacks this character?

They'll get an error - just as if they had tried to store that same
character in a TEXT column.

> On the other hand, converting all JSON to UTF-8 would be simpler to
> implement. It would probably be more intuitive, too, given that the
> JSON RFC says, "JSON text SHALL be encoded in Unicode."

Yet, the only reason I'm aware of for some people not to use UTF-8
as the server encoding is that it's pretty inefficient storage-wise for
some scripts (AFAIR some Japanese scripts are an example, but I don't
remember the details). By making JSON always store UTF-8 on disk, the
JSON type gets less appealing to those people.

best regards,
Florian Pflug


From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-20 01:03:15
Message-ID: CAARyMpDmhmtfrPwtNJBu5+i56H7DNhBsJN1gq8RGgtaBivvqOQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 18, 2011 at 7:36 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> On Jul19, 2011, at 00:17 , Joey Adams wrote:
>> I suppose a simple solution would be to convert all escapes and
>> outright ban escapes of characters not in the database encoding.
>
> +1. Making JSON work like TEXT when it comes to encoding issues
> makes this all much simpler conceptually. It also avoids all kinds
> of weird issues if you extract textual values from a JSON document
> server-side.

Thanks for the input. I'm leaning in this direction too. However, it
will be a tad tricky to implement the conversions efficiently, since
the wchar API doesn't provide a fast path for individual codepoint
conversion (that I'm aware of), and pg_do_encoding_conversion doesn't
look like a good thing to call lots of times.

My plan is to scan for escapes of non-ASCII characters, convert them
to UTF-8, and put them in a comma-delimited string like this:

a,b,c,d,

then, convert the resulting string to the server encoding (which may
fail, indicating that some codepoint(s) are not present in the
database encoding). After that, read the string and plop the
characters where they go.

It's "clever", but I can't think of a better way to do it with the existing API.

- Joey


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-20 02:01:09
Message-ID: 1311126863-sup-7396@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Joey Adams's message of Tue Jul 19 21:03:15 -0400 2011:
> On Mon, Jul 18, 2011 at 7:36 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> > On Jul19, 2011, at 00:17 , Joey Adams wrote:
> >> I suppose a simple solution would be to convert all escapes and
> >> outright ban escapes of characters not in the database encoding.
> >
> > +1. Making JSON work like TEXT when it comes to encoding issues
> > makes this all much simpler conceptually. It also avoids all kinds
> > of weird issues if you extract textual values from a JSON document
> > server-side.
>
> Thanks for the input. I'm leaning in this direction too. However, it
> will be a tad tricky to implement the conversions efficiently, since
> the wchar API doesn't provide a fast path for individual codepoint
> conversion (that I'm aware of), and pg_do_encoding_conversion doesn't
> look like a good thing to call lots of times.
>
> My plan is to scan for escapes of non-ASCII characters, convert them
> to UTF-8, and put them in a comma-delimited string like this:
>
> a,b,c,d,
>
> then, convert the resulting string to the server encoding (which may
> fail, indicating that some codepoint(s) are not present in the
> database encoding). After that, read the string and plop the
> characters where they go.

Ugh.

> It's "clever", but I can't think of a better way to do it with the existing API.

Would it work to have a separate entry point into mbutils.c that lets
you cache the conversion proc caller-side? I think the main problem is
determining the byte length of each source character beforehand.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Fwd: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-20 03:42:52
Message-ID: CAARyMpA20PBNoS6Vv5yx8-7icfNqRNx=Rxfwj4LiYCk99ykLBg@mail.gmail.com
Lists: pgsql-hackers

Forwarding because the mailing list rejected the original message.

---------- Forwarded message ----------
From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Date: Tue, Jul 19, 2011 at 11:23 PM
Subject: Re: Initial Review: JSON contrib modul was: Re: [HACKERS]
Another swing at JSON
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert
Haas <robertmhaas(at)gmail(dot)com>, Bernd Helmle <mailings(at)oopsware(dot)de>,
Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter
<david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers
<pgsql-hackers(at)postgresql(dot)org>

On Tue, Jul 19, 2011 at 10:01 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Would it work to have a separate entry point into mbutils.c that lets
> you cache the conversion proc caller-side?

That sounds like a really good idea.  There's still the overhead of
calling the proc, but I imagine it's a lot less than looking it up.

> I think the main problem is
> determining the byte length of each source character beforehand.

I'm not sure what you mean.  The idea is to convert the \uXXXX escape
to UTF-8 with unicode_to_utf8 (the length of the resulting UTF-8
sequence is easy to compute), call the conversion proc to get the
null-terminated database-encoded character, then append the result to
whatever StringInfo the string is going into.

The only question mark is how big the destination buffer will need to
be.  The maximum number of bytes per char in any supported encoding is
4, but is it possible for one Unicode character to turn into multiple
"character"s in the database encoding?

While we're at it, should we provide the same capability to the SQL
parser?  Namely, the ability to use \uXXXX escapes above U+007F when
the server encoding is not UTF-8?

- Joey


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-20 03:45:22
Message-ID: 201107200345.p6K3jMK04594@momjian.us
Lists: pgsql-hackers

Joey Adams wrote:
> Forwarding because the mailing list rejected the original message.

Yes, I am seeing email failures to the 'core' email list.

---------------------------------------------------------------------------

>
> ---------- Forwarded message ----------
> From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
> Date: Tue, Jul 19, 2011 at 11:23 PM
> Subject: Re: Initial Review: JSON contrib modul was: Re: [HACKERS]
> Another swing at JSON
> To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
> Cc: Florian Pflug <fgp(at)phlo(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert
> Haas <robertmhaas(at)gmail(dot)com>, Bernd Helmle <mailings(at)oopsware(dot)de>,
> Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter
> <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers
> <pgsql-hackers(at)postgresql(dot)org>
>
>
> On Tue, Jul 19, 2011 at 10:01 PM, Alvaro Herrera
> <alvherre(at)commandprompt(dot)com> wrote:
> > Would it work to have a separate entry point into mbutils.c that lets
> > you cache the conversion proc caller-side?
>
> That sounds like a really good idea.  There's still the overhead of
> calling the proc, but I imagine it's a lot less than looking it up.
>
> > I think the main problem is
> > determining the byte length of each source character beforehand.
>
> I'm not sure what you mean.  The idea is to convert the \uXXXX escape
> to UTF-8 with unicode_to_utf8 (the length of the resulting UTF-8
> sequence is easy to compute), call the conversion proc to get the
> null-terminated database-encoded character, then append the result to
> whatever StringInfo the string is going into.
>
> The only question mark is how big the destination buffer will need to
> be.  The maximum number of bytes per char in any supported encoding is
> 4, but is it possible for one Unicode character to turn into multiple
> "character"s in the database encoding?
>
> While we're at it, should we provide the same capability to the SQL
> parser?  Namely, the ability to use \uXXXX escapes above U+007F when
> the server encoding is not UTF-8?
>
> - Joey
>

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fwd: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-20 03:48:26
Message-ID: 201107200348.p6K3mQf05026@momjian.us
Lists: pgsql-hackers

Bruce Momjian wrote:
> Joey Adams wrote:
> > Forwarding because the mailing list rejected the original message.
>
> Yes, I am seeing email failures to the 'core' email list.

Marc says it is now fixed.

---------------------------------------------------------------------------

>
> >
> > ---------- Forwarded message ----------
> > From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
> > Date: Tue, Jul 19, 2011 at 11:23 PM
> > Subject: Re: Initial Review: JSON contrib modul was: Re: [HACKERS]
> > Another swing at JSON
> > To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
> > Cc: Florian Pflug <fgp(at)phlo(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert
> > Haas <robertmhaas(at)gmail(dot)com>, Bernd Helmle <mailings(at)oopsware(dot)de>,
> > Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter
> > <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Pg Hackers
> > <pgsql-hackers(at)postgresql(dot)org>
> >
> >
> > On Tue, Jul 19, 2011 at 10:01 PM, Alvaro Herrera
> > <alvherre(at)commandprompt(dot)com> wrote:
> > > Would it work to have a separate entry point into mbutils.c that lets
> > > you cache the conversion proc caller-side?
> >
> > That sounds like a really good idea.  There's still the overhead of
> > calling the proc, but I imagine it's a lot less than looking it up.
> >
> > > I think the main problem is
> > > determining the byte length of each source character beforehand.
> >
> > I'm not sure what you mean.  The idea is to convert the \uXXXX escape
> > to UTF-8 with unicode_to_utf8 (the length of the resulting UTF-8
> > sequence is easy to compute), call the conversion proc to get the
> > null-terminated database-encoded character, then append the result to
> > whatever StringInfo the string is going into.
> >
> > The only question mark is how big the destination buffer will need to
> > be.  The maximum number of bytes per char in any supported encoding is
> > 4, but is it possible for one Unicode character to turn into multiple
> > "character"s in the database encoding?
> >
> > While we're at it, should we provide the same capability to the SQL
> > parser?  Namely, the ability to use \uXXXX escapes above U+007F when
> > the server encoding is not UTF-8?
> >
> > - Joey
> >
>
> --
> Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
> EnterpriseDB http://enterprisedb.com
>
> + It's impossible for everything to be true. +
>

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-20 04:32:11
Message-ID: CA+TgmoaF93djL07kQZO+hxHx+_q5A2PkUjo2_5e8wYoNKFKbXg@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jul 19, 2011 at 9:03 PM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
> On Mon, Jul 18, 2011 at 7:36 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> On Jul19, 2011, at 00:17 , Joey Adams wrote:
>>> I suppose a simple solution would be to convert all escapes and
>>> outright ban escapes of characters not in the database encoding.
>>
>> +1. Making JSON work like TEXT when it comes to encoding issues
>> makes this all much simpler conceptually. It also avoids all kinds
>> of weird issues if you extract textual values from a JSON document
>> server-side.
>
> Thanks for the input.  I'm leaning in this direction too.  However, it
> will be a tad tricky to implement the conversions efficiently, ...

I'm a bit confused, because I thought what I was talking about was not
doing any conversions in the first place.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-20 04:40:29
Message-ID: CAARyMpDKzqj1dq3mfb3uBoMV4m3suNvxGrXYzEA+FmwoHwxM3w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 20, 2011 at 12:32 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Thanks for the input.  I'm leaning in this direction too.  However, it
>> will be a tad tricky to implement the conversions efficiently, ...
>
> I'm a bit confused, because I thought what I was talking about was not
> doing any conversions in the first place.

We want to be able to handle \uXXXX escapes when the database encoding
is not UTF-8. We could leave them in place, but sooner or later
they'll need to be converted in order to unwrap or compare JSON
strings.

The approach being discussed is converting escapes to the database
encoding. This means escapes of characters not available in the
database encoding (e.g. \u266B in ISO-8859-1) will be forbidden.

The PostgreSQL parser (which also supports Unicode escapes) takes a
simpler approach: don't allow non-ASCII escapes unless the server
encoding is UTF-8.

- Joey


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-20 10:49:09
Message-ID: B91ADDE3-4B8E-4E5A-BE14-D2FE39D9EF52@phlo.org
Lists: pgsql-hackers

On Jul20, 2011, at 06:40 , Joey Adams wrote:
> On Wed, Jul 20, 2011 at 12:32 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> Thanks for the input. I'm leaning in this direction too. However, it
>>> will be a tad tricky to implement the conversions efficiently, ...
>>
>> I'm a bit confused, because I thought what I was talking about was not
>> doing any conversions in the first place.
>
> We want to be able to handle \uXXXX escapes when the database encoding
> is not UTF-8. We could leave them in place, but sooner or later
> they'll need to be converted in order to unwrap or compare JSON
> strings.

Hm, I agree that we need to handle \uXXXX escapes in JSON input.
We won't ever produce them during output though, right?

> The approach being discussed is converting escapes to the database
> encoding. This means escapes of characters not available in the
> database encoding (e.g. \u266B in ISO-8859-1) will be forbidden.
>
> The PostgreSQL parser (which also supports Unicode escapes) takes a
> simpler approach: don't allow non-ASCII escapes unless the server
> encoding is UTF-8.

Maybe JSON should simply reject them too, then. At least for now.

How does the XML type handle the situation? It seems that it'd have
the same problem with Unicode entity references "&#XXXX;".

best regards,
Florian Pflug


From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-20 17:01:09
Message-ID: CAARyMpAcC7O99Nk4BGHQqMgLVi7vNTGD9Ff4tHbPVQA2MrHpqQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Jul 20, 2011 at 6:49 AM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> Hm, I agree that we need to handle \uXXXX escapes in JSON input.
> We won't ever produce them during output though, right?

We could, to prevent transcoding errors if the client encoding is
different than the server encoding (and neither is SQL_ASCII, nor is
the client encoding UTF8). For example, if the database encoding is
UTF-8 and the client encoding is WIN1252, I'd think it would be a good
idea to escape non-ASCII characters.
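
Concretely, the escaping I have in mind is something like this standalone
sketch (hypothetical helper, not patch code); code points above U+FFFF
become surrogate pairs, as JSON requires:

#include <stdio.h>
#include <string.h>

/*
 * Append the \uXXXX escape(s) for a Unicode code point cp to buf.
 * (The patch works with StringInfo rather than a plain buffer.)
 */
static void
append_unicode_escape(char *buf, unsigned int cp)
{
    if (cp > 0xFFFF)
    {
        cp -= 0x10000;
        sprintf(buf + strlen(buf), "\\u%04X\\u%04X",
                0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
    }
    else
        sprintf(buf + strlen(buf), "\\u%04X", cp);
}

int main(void)
{
    char buf[64] = "";

    append_unicode_escape(buf, 0x00A1);     /* \u00A1       */
    append_unicode_escape(buf, 0x1D11E);    /* \uD834\uDD1E */
    printf("%s\n", buf);
    return 0;
}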

> How does that XML type handle the situation? It seems that it'd have
> the same problem with unicode entity references "&#XXXX;".

From the looks of it, XML operations convert the text to UTF-8 before
passing it to libxml. The XML type does not normalize the input;
SELECT '&#9835;♫'::xml; simply yields &#9835;♫. Escapes of any
character appear to be allowed in any encoding.

- Joey


From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-22 22:04:49
Message-ID: CAARyMpDb2ZQZ8xZ1uwZHa_rf+FP+cFKK-xiUs1ELsFoE4Wea2A@mail.gmail.com
Lists: pgsql-hackers

I think I've decided to only allow escapes of non-ASCII characters
when the database encoding is UTF8. For example, $$"\u2013"$$::json
will fail if the database encoding is WIN1252, even though WIN1252 can
encode U+2013 (EN DASH). This may be somewhat draconian, given that:

* SQL_ASCII can otherwise handle "any" language according to the documentation.

* The XML type doesn't have this restriction (it just stores the
input text verbatim, and converts it to UTF-8 before doing anything
complicated with it).

However, it's simple to implement and understand. The JSON data type
will not perform any automatic conversion between character encodings.
Also, if we want to handle this any better in the future, we won't
have to support legacy data containing a mixture of encodings.
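
The rule itself is tiny; roughly this sketch, where db_is_utf8 stands in
for a GetDatabaseEncoding() == PG_UTF8 test and the real code would raise
an error instead of returning false:

#include <stdbool.h>
#include <stdio.h>

/* A \uXXXX escape above U+007F is accepted only in a UTF-8 database. */
static bool
escape_allowed(unsigned int code_unit, bool db_is_utf8)
{
    return code_unit <= 0x7F || db_is_utf8;
}

int main(void)
{
    printf("%d\n", escape_allowed(0x2013, false)); /* 0: rejected in WIN1252 etc. */
    printf("%d\n", escape_allowed(0x2013, true));  /* 1: accepted in UTF8 */
    printf("%d\n", escape_allowed(0x0041, false)); /* 1: ASCII is always fine */
    return 0;
}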

In the future, we could create functions to compensate for the issues
people encounter; for example:

* json_escape_unicode(json [, replace bool]) returns text -- convert
non-ASCII characters to escapes. Optionally, use \uFFFD for
unconvertible characters.
* json_unescape_unicode(text [, replace text]) returns json -- like
json_in, but convert Unicode escapes to characters when possible.
Optionally, replace unconvertible characters with a given string.

I've been going back and forth on how to handle encodings in the JSON
type for a while, but suggestions and objections are still welcome.
However, I plan to proceed in this direction so progress can be made.

On another matter, should the JSON type guard against duplicate member
keys? The JSON RFC says "The names within an object SHOULD be
unique," meaning JSON with duplicate members can be considered valid.
JavaScript interpreters (the ones I tried), PHP, and Python all have
the same behavior: discard the first member in favor of the second.
That is, {"key":1,"key":2} becomes {"key":2}. The XML type throws an
error if a duplicate attribute is present (e.g. '<a href="b"
href="c"/>'::xml).

Thanks for the input,
- Joey


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-22 23:12:48
Message-ID: CA+TgmoaUmjLMSzyMy9_p1QjY7iOS2mPBg-Cn=rULu7Xxk4c=OQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 22, 2011 at 6:04 PM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
> On another matter, should the JSON type guard against duplicate member
> keys?  The JSON RFC says "The names within an object SHOULD be
> unique," meaning JSON with duplicate members can be considered valid.
> JavaScript interpreters (the ones I tried), PHP, and Python all have
> the same behavior: discard the first member in favor of the second.
> That is, {"key":1,"key":2} becomes {"key":2}.  The XML type throws an
> error if a duplicate attribute is present (e.g. '<a href="b"
> href="c"/>'::xml).

Hmm. That's tricky. I lean mildly toward throwing an error as being
more consistent with the general PG philosophy.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-22 23:16:10
Message-ID: 4E2A04BA.7030408@wulczer.org
Lists: pgsql-hackers

On 23/07/11 01:12, Robert Haas wrote:
> On Fri, Jul 22, 2011 at 6:04 PM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
>> On another matter, should the JSON type guard against duplicate member
>> keys? The JSON RFC says "The names within an object SHOULD be
>> unique," meaning JSON with duplicate members can be considered valid.
>> JavaScript interpreters (the ones I tried), PHP, and Python all have
>> the same behavior: discard the first member in favor of the second.
>> That is, {"key":1,"key":2} becomes {"key":2}. The XML type throws an
>> error if a duplicate attribute is present (e.g. '<a href="b"
>> href="c"/>'::xml).
>
> Hmm. That's tricky. I lean mildly toward throwing an error as being
> more consistent with the general PG philosophy.

OTOH:

regression=# select 'key=>1,key=>2'::hstore;
   hstore
------------
 "key"=>"1"
(1 row)

Cheers,
Jan


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-22 23:19:17
Message-ID: CA+TgmoZjHL1POpt78fsQYXNDcNrktGxkDChNWEX27x9EXGzc5Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 22, 2011 at 7:16 PM, Jan Urbański <wulczer(at)wulczer(dot)org> wrote:
> On 23/07/11 01:12, Robert Haas wrote:
>> On Fri, Jul 22, 2011 at 6:04 PM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
>>> On another matter, should the JSON type guard against duplicate member
>>> keys?  The JSON RFC says "The names within an object SHOULD be
>>> unique," meaning JSON with duplicate members can be considered valid.
>>> JavaScript interpreters (the ones I tried), PHP, and Python all have
>>> the same behavior: discard the first member in favor of the second.
>>> That is, {"key":1,"key":2} becomes {"key":2}.  The XML type throws an
>>> error if a duplicate attribute is present (e.g. '<a href="b"
>>> href="c"/>'::xml).
>>
>> Hmm.  That's tricky.  I lean mildly toward throwing an error as being
>> more consistent with the general PG philosophy.
>
> OTOH:
>
> regression=# select 'key=>1,key=>2'::hstore;
>   hstore
> ------------
>  "key"=>"1"
> (1 row)

Fair point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-23 00:09:44
Message-ID: A52E8B6C-44F9-472C-B3B9-B189050D3163@phlo.org
Lists: pgsql-hackers

On Jul23, 2011, at 01:12 , Robert Haas wrote:
> On Fri, Jul 22, 2011 at 6:04 PM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
>> On another matter, should the JSON type guard against duplicate member
>> keys? The JSON RFC says "The names within an object SHOULD be
>> unique," meaning JSON with duplicate members can be considered valid.
>> JavaScript interpreters (the ones I tried), PHP, and Python all have
>> the same behavior: discard the first member in favor of the second.
>> That is, {"key":1,"key":2} becomes {"key":2}. The XML type throws an
>> error if a duplicate attribute is present (e.g. '<a href="b"
>> href="c"/>'::xml).
>
> Hmm. That's tricky. I lean mildly toward throwing an error as being
> more consistent with the general PG philosophy.

I'm usually all for throwing an error on ambiguous input - but if JavaScript,
PHP, and Python all agree, it might be wise to just yield to them.

best regards,
Florian Pflug


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-23 00:26:25
Message-ID: 83AB8CA7-1822-41FE-BF56-D3D6A88BB88A@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul23, 2011, at 00:04 , Joey Adams wrote:
> I think I've decided to only allow escapes of non-ASCII characters
> when the database encoding is UTF8. For example, $$"\u2013"$$::json
> will fail if the database encoding is WIN1252, even though WIN1252 can
> encode U+2013 (EN DASH). This may be somewhat draconian, given that:
>
> * SQL_ASCII can otherwise handle "any" language according to the documentation.

+1. It makes the handling of \uXXXX sequences consistent with the behaviour
of CHR(), which seems like a Good Thing. Clients can also work around this
restriction by de-escaping themselves, which shouldn't be too hard.
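In code, the decoder side of this presumably boils down to a single check.
A rough sketch (not the patch's actual code; the function name and the
"codepoint" argument are made up):

#include "postgres.h"
#include "mb/pg_wchar.h"

/*
 * Hypothetical guard in the \uXXXX decoder: "codepoint" is the value
 * decoded from the four hex digits.
 */
static void
check_unicode_escape(unsigned int codepoint)
{
    if (codepoint > 0x7F && GetDatabaseEncoding() != PG_UTF8)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("unsupported Unicode escape sequence"),
                 errdetail("Unicode escapes above U+007F can only be used "
                           "when the server encoding is UTF8.")));
}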

> * The XML type doesn't have this restriction (it just stores the
> input text verbatim, and converts it to UTF-8 before doing anything
> complicated with it).

Yeah. But the price the XML type pays for that is the lack of an
equality operator.

best regards,
Florian Pflug


From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-23 02:36:44
Message-ID: CAARyMpBe8OonqjcmNeEhJLZ6Kf-Ljy_mbEdHtw4K4b=qXtxZ9Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jul 22, 2011 at 7:12 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Hmm.  That's tricky.  I lean mildly toward throwing an error as being
> more consistent with the general PG philosophy.

I agree. Besides, throwing an error on duplicate keys seems like the
most logical thing to do. The most compelling reason not to, I think,
is that it would make the input function a little slower.
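That cost should be small, though. Here's a rough sketch of the kind of
check I have in mind (purely illustrative, not the patch's code; the
JsonMember struct and function names are invented, and it assumes the
parser collects an object's members into an array before re-emitting them):

#include "postgres.h"

/* Hypothetical member representation; the real parser may differ. */
typedef struct JsonMember
{
    const char *key;            /* decoded member name */
    const char *value;          /* re-encoded member value */
} JsonMember;

static int
member_key_cmp(const void *a, const void *b)
{
    return strcmp(((const JsonMember *) a)->key,
                  ((const JsonMember *) b)->key);
}

/*
 * Sort members by key (byte order, not locale order) and complain if
 * two adjacent keys compare equal.
 */
static void
check_duplicate_members(JsonMember *members, int nmembers)
{
    int         i;

    qsort(members, nmembers, sizeof(JsonMember), member_key_cmp);

    for (i = 1; i < nmembers; i++)
    {
        if (strcmp(members[i - 1].key, members[i].key) == 0)
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                     errmsg("duplicate object member \"%s\" in JSON value",
                            members[i].key)));
    }
}

Sorting by byte order like this would also hand us alphabetized members
for free, which ties into the questions below.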

On Fri, Jul 22, 2011 at 8:26 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> * The XML type doesn't have this restriction (it just stores the
>> input text verbatim, and converts it to UTF-8 before doing anything
>> complicated with it).
>
> Yeah. But the price the XML type pays for that is the lack of an
> equality operator.

Interesting. This leads to a couple more questions:

* Should the JSON data type (eventually) have an equality operator?
* Should the JSON input function alphabetize object members by key?

If we canonicalize strings and numbers and alphabetize object members,
then our equality function is just texteq. The only stumbling block
is canonicalizing numbers. Fortunately, JSON's definition of a
"number" is its decimal syntax, so the algorithm is child's play:

* Figure out the digits and exponent.
* If the exponent is greater than 20 or less than -6 (arbitrary), use
exponential notation.

The problem is: 2.718282e-1000 won't equal 0 as may be expected. I
doubt this matters much.
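To make that concrete, here's a rough sketch of the re-rendering step
(illustrative only, not actual patch code; format_json_number and its
argument convention are invented, and it assumes the parser has already
split the number into a sign, its zero-trimmed significant digits, and a
decimal exponent):

#include "postgres.h"
#include "lib/stringinfo.h"

/*
 * Hypothetical re-rendering step.  "digits" holds the significant digits
 * with leading and trailing zeros stripped, and the value is
 * digits[0].digits[1..] * 10^exponent, so 1000 arrives as ("1", 3).
 */
static void
format_json_number(StringInfo out, bool neg, const char *digits, int exponent)
{
    int         k = (int) strlen(digits);
    int         i;

    if (k == 0)
    {
        /* every digit was zero; what to do with "-0" is a separate question */
        appendStringInfoChar(out, '0');
        return;
    }

    if (neg)
        appendStringInfoChar(out, '-');

    if (exponent > 20 || exponent < -6)
    {
        /* extreme exponent: d.ddd...e<exponent> */
        appendStringInfoChar(out, digits[0]);
        if (k > 1)
            appendStringInfo(out, ".%s", digits + 1);
        appendStringInfo(out, "e%d", exponent);
    }
    else if (exponent < 0)
    {
        /* small value: "0." plus -exponent - 1 zeros, then the digits */
        appendStringInfoString(out, "0.");
        for (i = 0; i < -exponent - 1; i++)
            appendStringInfoChar(out, '0');
        appendStringInfoString(out, digits);
    }
    else if (exponent >= k - 1)
    {
        /* integral value: digits padded with exponent - k + 1 zeros */
        appendStringInfoString(out, digits);
        for (i = 0; i < exponent - k + 1; i++)
            appendStringInfoChar(out, '0');
    }
    else
    {
        /* decimal point falls inside the digit string */
        appendBinaryStringInfo(out, digits, exponent + 1);
        appendStringInfoChar(out, '.');
        appendStringInfoString(out, digits + exponent + 1);
    }
}

With this rule, 1000000 and 0.000001 keep their plain spellings, while
1e21 and 1e-7 switch to exponential form.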

It would be nice to canonicalize JSON on input, and that's the way I'd
like to go, but two caveats are:

* Input (and other operations) would require more CPU time. Instead
of being able to pass the data through a quick condense function, it'd
have to construct an AST (to sort object members) and re-encode the
JSON back into a string.
* Users, for aesthetic reasons, might not want their JSON members rearranged.

If, in the future, we add the ability to manipulate large JSON trees
efficiently (e.g. by using an auxiliary table like TOAST does), we'll
probably want unique members, so enforcing them now may be prudent.

- Joey


From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-24 03:09:45
Message-ID: CAARyMpDVTkCH2m7mhjXwrrwpEY1EVX=80FT-+NKp8Yvf_Zzayg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Also, should I forbid the escape \u0000 (in all database encodings)?

Pros:

* If \u0000 is forbidden, and the server encoding is UTF-8, then
every JSON-wrapped string will be convertible to TEXT.

* It will be consistent with the way PostgreSQL already handles text,
and with the decision to use database-encoded JSON strings.

* Some applications choke on strings with null characters. For
example, in some web browsers but not others, if you pass
"Hello\u0000world" to document.write() or assign it to a DOM object's
innerHTML, it will be truncated to "Hello". By banning \u0000, users
can catch such rogue strings early.

* It's a little easier to represent internally.

Cons:

* Means the JSON type will accept a subset of the JSON described in
RFC 4627. However, the RFC does say "An implementation may set limits
on the length and character contents of strings", so we can arguably
get away with banning \u0000 while being law-abiding citizens.

* Being able to store U+0000–U+00FF means users can use JSON strings
to hold binary data: by treating it as Latin-1.
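For what it's worth, the check itself would be a one-liner in the escape
decoder. An illustrative sketch (the function name and "codepoint"
argument are invented):

#include "postgres.h"

/* Hypothetical guard: "codepoint" is the value read from \uXXXX. */
static void
check_unicode_escape_not_nul(unsigned int codepoint)
{
    if (codepoint == 0)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_TEXT_REPRESENTATION),
                 errmsg("invalid JSON string: \\u0000 is not allowed")));
}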

- Joey


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Cc: Florian Pflug <fgp(at)phlo(dot)org>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-24 03:14:50
Message-ID: CA+TgmoYxnRg5etC6CuPf7X_Sw6pSFeJVMjUEhSm0=4fFAvxc9A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jul 22, 2011 at 10:36 PM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
> Interesting.  This leads to a couple more questions:
>
>  * Should the JSON data type (eventually) have an equality operator?

+1.

>  * Should the JSON input function alphabetize object members by key?

I think it would probably be better if it didn't. I'm wary of
overcanonicalization. It can be useful to have things come back out
in more or less the format you put them in.

> If we canonicalize strings and numbers and alphabetize object members,
> then our equality function is just texteq.  The only stumbling block
> is canonicalizing numbers.  Fortunately, JSON's definition of a
> "number" is its decimal syntax, so the algorithm is child's play:
>
>  * Figure out the digits and exponent.
>  * If the exponent is greater than 20 or less than -6 (arbitrary), use
> exponential notation.
>
> The problem is: 2.718282e-1000 won't equal 0 as may be expected.  I
> doubt this matters much.

I don't think that 2.718282e-1000 SHOULD equal 0.

> If, in the future, we add the ability to manipulate large JSON trees
> efficiently (e.g. by using an auxiliary table like TOAST does), we'll
> probably want unique members, so enforcing them now may be prudent.

I doubt you're going to want to reinvent TOAST, but I do think there
are many advantages to forbidding duplicate keys. ISTM the question
is whether to throw an error or just silently discard one of the k/v
pairs. Keeping both should not be on the table, IMHO.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-24 18:19:58
Message-ID: 2554A880-9AAB-4822-B920-9C28C614FE97@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul24, 2011, at 05:14 , Robert Haas wrote:
> On Fri, Jul 22, 2011 at 10:36 PM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
>> Interesting. This leads to a couple more questions:
>>
>> * Should the JSON data type (eventually) have an equality operator?
>
> +1.

+1.

>> * Should the JSON input function alphabetize object members by key?
>
> I think it would probably be better if it didn't. I'm wary of
> overcanonicalization. It can be useful to have things come back out
> in more or less the format you put them in.

The downside being that we'd then either need to canonicalize in
the equality operator, or live with either no equality operator or
a rather strange one.

Also, if we don't canonicalize now, we (or rather our users) are in
for some pain should we ever decide to store JSON values in some other
form than plain text. Because if we do that, we'd presumably want
to order the members in some predefined way (by their hash value,
for example).

So, from me "+1" for alphabetizing members.

>> If we canonicalize strings and numbers and alphabetize object members,
>> then our equality function is just texteq. The only stumbling block
>> is canonicalizing numbers. Fortunately, JSON's definition of a
>> "number" is its decimal syntax, so the algorithm is child's play:
>>
>> * Figure out the digits and exponent.
>> * If the exponent is greater than 20 or less than -6 (arbitrary), use
>> exponential notation.
>>
>> The problem is: 2.718282e-1000 won't equal 0 as may be expected. I
>> doubt this matters much.
>
> I don't think that 2.718282e-1000 SHOULD equal 0.

I agree. As for your proposed algorithm, I suggest instead using
exponential notation if it produces a shorter textual representation.
In other words, for values between -1 and 1, we'd switch to exponential
notation if there's more than 1 leading zero (to the right of the decimal
point, of course), and for values outside that range if there are more than
2 trailing zeros and no decimal point. All of this applies after redundant
zeros and decimal points are removed. So we'd store

0 as 0
1 as 1
0.1 as 0.1
0.01 as 0.01
0.001 as 1e-3
10 as 10
100 as 100
1000 as 1e3
1000.1 as 1000.1
1001 as 1001
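Expressed in code, the decision could be a small post-processing check on
the plain rendering. A hypothetical sketch (prefer_exponential is
invented; it only decides whether to switch, the actual re-rendering
would be a separate step):

#include "postgres.h"

/*
 * Given the plain (non-exponential) rendering, with redundant zeros and
 * decimal points already removed, return true if the exponential form
 * would be shorter.
 */
static bool
prefer_exponential(const char *plain)
{
    if (*plain == '-')
        plain++;                    /* the sign doesn't matter here */

    if (plain[0] == '0' && plain[1] == '.')
    {
        /* |value| < 1: more than one leading zero after the point? */
        const char *p = plain + 2;
        int         zeros = 0;

        while (*p == '0')
        {
            zeros++;
            p++;
        }
        return zeros > 1;           /* 0.001 -> 1e-3, 0.01 stays */
    }

    if (strchr(plain, '.') == NULL)
    {
        /* integral value: more than two trailing zeros? */
        int         len = (int) strlen(plain);
        int         zeros = 0;

        while (zeros < len && plain[len - 1 - zeros] == '0')
            zeros++;
        return zeros > 2;           /* 1000 -> 1e3, 100 stays */
    }

    return false;                   /* e.g. 1000.1 keeps its form */
}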

>> If, in the future, we add the ability to manipulate large JSON trees
>> efficiently (e.g. by using an auxiliary table like TOAST does), we'll
>> probably want unique members, so enforcing them now may be prudent.
>
> I doubt you're going to want to reinvent TOAST, but I do think there
> are many advantages to forbidding duplicate keys. ISTM the question
> is whether to throw an error or just silently discard one of the k/v
> pairs. Keeping both should not be on the table, IMHO.

+1.

best regards,
Florian Pflug


From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-24 22:48:12
Message-ID: CAARyMpCD4oDeLd5OtuPu5KZo2+C3qh-3Vs92N670NSW2Xf66Tw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jul 23, 2011 at 11:14 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I doubt you're going to want to reinvent TOAST, ...

I was thinking about making it efficient to access or update
foo.a.b.c.d[1000] in a huge JSON tree. Simply TOASTing the varlena
text means we have to unpack the entire datum to access and update
individual members. An alternative would be to split the JSON into
chunks (possibly by using the pg_toast_<id> table) and have some sort
of index that can be used to efficiently look up values by path.

This would not be trivial, and I don't plan to implement it any time soon.

On Sun, Jul 24, 2011 at 2:19 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> On Jul24, 2011, at 05:14 , Robert Haas wrote:
>> On Fri, Jul 22, 2011 at 10:36 PM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
>>> ... Fortunately, JSON's definition of a
>>> "number" is its decimal syntax, so the algorithm is child's play:
>>>
>>>  * Figure out the digits and exponent.
>>>  * If the exponent is greater than 20 or less than -6 (arbitrary), use
>>> exponential notation.
>>>
>>
>
> I agree. As for your proposed algorithm, I suggest instead using
> exponential notation if it produces a shorter textual representation.
> In other words, for values between -1 and 1, we'd switch to exponential
> notation if there's more than 1 leading zero (to the right of the decimal
> point, of course), and for values outside that range if there are more than
> 2 trailing zeros and no decimal point. All of this applies after redundant
> zeros and decimal points are removed. So we'd store
>
> 0 as 0
> 1 as 1
> 0.1 as 0.1
> 0.01 as 0.01
> 0.001 as 1e-3
> 10 as 10
> 100 as 100
> 1000 as 1e3
> 1000.1 as 1000.1
> 1001 as 1001
>

Interesting idea. The reason I suggested using exponential notation
only for extreme exponents (less than -6 or greater than +20) is
partly for presentation value. Users might be annoyed to see 1000000
turned into 1e6. Moreover, applications working solely with integers
that don't expect the floating point syntax may choke on the converted
numbers. 32-bit integers can be losslessly encoded as IEEE
double-precision floats (JavaScript's internal representation), and
JavaScript's algorithm for converting a number to a string ([1],
section 9.8.1) happens to preserve the integer syntax (I think).

Should we follow the JavaScript standard for rendering numbers (which
my suggestion approximates)? Or should we use the shortest encoding
as Florian suggests?

- Joey

[1]: http://www.ecma-international.org/publications/files/ECMA-ST-ARCH/ECMA-262%205th%20edition%20December%202009.pdf


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-25 00:03:56
Message-ID: B1409488-5A9A-4481-88D7-FE96A0BD5486@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul25, 2011, at 00:48 , Joey Adams wrote:
> On Sun, Jul 24, 2011 at 2:19 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> On Jul24, 2011, at 05:14 , Robert Haas wrote:
>>> On Fri, Jul 22, 2011 at 10:36 PM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
>>>> ... Fortunately, JSON's definition of a
>>>> "number" is its decimal syntax, so the algorithm is child's play:
>>>>
>>>> * Figure out the digits and exponent.
>>>> * If the exponent is greater than 20 or less than -6 (arbitrary), use
>>>> exponential notation.
>>
>> I agree. As for your proposed algorithm, I suggest instead using
>> exponential notation if it produces a shorter textual representation.
>> In other words, for values between -1 and 1, we'd switch to exponential
>> notation if there's more than 1 leading zero (to the right of the decimal
>> point, of course), and for values outside that range if there are more than
>> 2 trailing zeros and no decimal point. All of this applies after redundant
>> zeros and decimal points are removed. So we'd store
>>
>> 0 as 0
>> 1 as 1
>> 0.1 as 0.1
>> 0.01 as 0.01
>> 0.001 as 1e-3
>> 10 as 10
>> 100 as 100
>> 1000 as 1e3
>> 1000.1 as 1000.1
>> 1001 as 1001
>
> Interesting idea. The reason I suggested using exponential notation
> only for extreme exponents (less than -6 or greater than +20) is
> partly for presentation value. Users might be annoyed to see 1000000
> turned into 1e6.

I'm not concerned about that, but ...

> Moreover, applications working solely with integers
> that don't expect the floating point syntax may choke on the converted
> numbers.

now that you say it, that's definitely a concern.

> 32-bit integers can be losslessly encoded as IEEE
> double-precision floats (JavaScript's internal representation), and
> JavaScript's algorithm for converting a number to a string ([1],
> section 9.8.1) happens to preserve the integer syntax (I think).

Indeed. In fact, it seems to be designed to use the integer syntax
for all integral values with <= 66 binary digits. log10(2^66) ~ 19.87

> Should we follow the JavaScript standard for rendering numbers (which
> my suggestion approximates)? Or should we use the shortest encoding
> as Florian suggests?

In the light of the above, consider my suggestion withdrawn. I now think
we should just follow the JavaScript standard as closely as possible.
As you said, it's pretty much the same as your suggestion, just more precise
in the handling of some corner-cases like infinity, nan, +/-0, some
questions of leading and trailing zeros, ...

I wouldn't have made my suggestion had I realized earlier that the limit
of 20 for the exponent was carefully chosen to ensure that the full
range of a 64-bit integer value would be represented in non-exponential
notation. I assumed the bracketed "arbitrary" in your description applied
to both the upper (20) as well as the lower (-6) bound, when it really only
applies to the lower bound. Sorry for that.

(I am now curious where the seemingly arbitrary lower bound of -6 comes
from, though. The only explanation I can come up with is that somebody
figured that 0.000001 is still easily distinguished visually from
0.00001, but not so much from 0.0000001)

best regards,
Florian Pflug


From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-25 05:05:39
Message-ID: CAARyMpBmM8GvDVsY1MkS6hhs+N-KwhycrAypmtKnUCxUo-ETOA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Jul 24, 2011 at 2:19 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
> The downside being that we'd then either need to canonicalize in
> the equality operator, or live with either no equality operator or
> a rather strange one.

It just occurred to me that, even if we sort object members, texteq
might not be a sufficient way to determine equality. In particular,
IEEE floats treat +0 and -0 as two different things, but they are
equal when compared. Note that we're only dealing with a decimal
representation; we're not (currently) converting to double-precision
representation and back.

Should we mimic IEEE floats and preserve -0 versus +0 while treating
them as equal? Or should we treat JSON floats like numeric and
convert -0 to 0 on input? Or should we do something else? I think
converting -0 to 0 would be a bad idea, as it would violate the
intuitive assumption that JSON can be used to marshal double-precision
floats.

- Joey


From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-25 05:35:05
Message-ID: CAARyMpAjort+79p_vS=9Htm9YG5+=6p7JdQ_i-2Mm+ad6o4Dqg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jul 25, 2011 at 1:05 AM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
> Should we mimic IEEE floats and preserve -0 versus +0 while treating
> them as equal?  Or should we treat JSON floats like numeric and
> convert -0 to 0 on input?  Or should we do something else?  I think
> converting -0 to 0 would be a bad idea, as it would violate the
> intuitive assumption that JSON can be used to marshal double-precision
> floats.

On the other hand, JavaScript's own .toString and JSON.stringify turn
-0 into 0, so JSON can't marshal -0 around, anyway (in practice). Now
I think turning -0 into 0 would be fine for canonicalizing numbers in
json_in.
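Concretely, the canonicalizer only needs to notice when every significant
digit is zero. An illustrative sketch (json_number_is_zero is invented,
not from the patch):

#include "postgres.h"

/*
 * Return true if the already-validated JSON number "s" is some spelling
 * of zero, e.g. 0, -0, 0.000, or -0e17.
 */
static bool
json_number_is_zero(const char *s)
{
    if (*s == '-')
        s++;
    for (; *s; s++)
    {
        if (*s >= '1' && *s <= '9')
            return false;
        if (*s == 'e' || *s == 'E')
            break;                  /* a zero mantissa stays zero */
    }
    return true;
}

json_in would then emit a plain 0 whenever this returns true, regardless
of the sign of the input.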

- Joey


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-25 08:37:54
Message-ID: FD459E4C-CFD4-4AB3-A5B4-65B504C142C8@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul25, 2011, at 07:35 , Joey Adams wrote:
> On Mon, Jul 25, 2011 at 1:05 AM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
>> Should we mimic IEEE floats and preserve -0 versus +0 while treating
>> them as equal? Or should we treat JSON floats like numeric and
>> convert -0 to 0 on input? Or should we do something else? I think
>> converting -0 to 0 would be a bad idea, as it would violate the
>> intuitive assumption that JSON can be used to marshal double-precision
>> floats.
>
> On the other hand, JavaScript's own .toString and JSON.stringify turn
> -0 into 0, so JSON can't marshal -0 around, anyway (in practice). Now
> I think turning -0 into 0 would be fine for canonicalizing numbers in
> json_in.

+1.

best regards,
Florian Pflug


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Florian Pflug <fgp(at)phlo(dot)org>
Cc: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bernd Helmle <mailings(at)oopsware(dot)de>, Dimitri Fontaine <dimitri(at)2ndquadrant(dot)fr>, David Fetter <david(at)fetter(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Initial Review: JSON contrib modul was: Re: Another swing at JSON
Date: 2011-07-25 10:49:18
Message-ID: EB252E7E-9E65-45CA-B1A8-A8AE16D9DC9A@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Jul25, 2011, at 02:03 , Florian Pflug wrote:
> On Jul25, 2011, at 00:48 , Joey Adams wrote:
>> Should we follow the JavaScript standard for rendering numbers (which
>> my suggestion approximates)? Or should we use the shortest encoding
>> as Florian suggests?
>
> In the light of the above, consider my suggestion withdrawn. I now think
> we should just follow the JavaScript standard as closely as possible.
> As you said, it's pretty much the same as your suggestion, just more precise
> in the handling of some corner-cases like infinity, nan, +/-0, some
> questions of leading and trailing zeros, ...

Just FYI, I browsed through the ECMA Standard you referenced again, and realized
that they explicitly forbid JSON numeric values to be NaN or (-)Infinity
(Page 205, Step 9 at the top of the page). RFC 4627 seems to take the same stand.

I fail to see the wisdom in that, but it's what the standard says...

best regards,
Florian Pflug