Re: Duplicate JSON Object Keys

From: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
To: "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Duplicate JSON Object Keys
Date: 2013-03-07 19:48:45
Message-ID: 60885A46-5CC8-4B40-BF35-B4C28BFD5480@justatheory.com
Lists: pgsql-hackers

This behavior surprised me a bit:

david=# select '{"foo": 1, "foo": 2}'::json;
json
----------------------
{"foo": 1, "foo": 2}

I had expected something more like this:

david=# select '{"foo": 1, "foo": 2}'::json;
json
------------
{"foo": 2}

This hasn’t been much of an issue before, but with Andrew’s JSON enhancements going in, it will start to cause problems:

david=# select json_get('{"foo": 1, "foo": 2}', 'foo');
ERROR: field name is not unique in json object

Andrew tells me that the spec requires this. I think that’s fine, but I would rather that it never got to there.

In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.

Thoughts?

Thanks,

David


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-07 20:04:02
Message-ID: 5138F2B2.4000603@dunslane.net
Lists: pgsql-hackers


On 03/07/2013 02:48 PM, David E. Wheeler wrote:
> This behavior surprised me a bit:
>
> david=# select '{"foo": 1, "foo": 2}'::json;
> json
> ----------------------
> {"foo": 1, "foo": 2}
>
> I had expected something more like this:
>
> david=# select '{"foo": 1, "foo": 2}'::json;
> json
> ------------
> {"foo": 2}
>
> This hasn’t been much of an issue before, but with Andrew’s JSON enhancements going in, it will start to cause problems:
>
> david=# select json_get('{"foo": 1, "foo": 2}', 'foo');
> ERROR: field name is not unique in json object
>
> Andrew tells me that the spec requires this. I think that’s fine, but I would rather that it never got to there.

Specifically, rfc4627 says (note last sentence):

2.2. Objects

An object structure is represented as a pair of curly brackets
surrounding zero or more name/value pairs (or members). A name is a
string. A single colon comes after each name, separating the name
from the value. A single comma separates a value from a following
name. The names within an object SHOULD be unique.

cheers

andrew


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-08 20:39:06
Message-ID: CA+Tgmob8c1-Knz7JwWgsXXqf-XR4=jkME5ftVpJhPkrH5R4qCw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david(at)justatheory(dot)com> wrote:
> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.

I'm with Andrew. That's a rathole I emphatically don't want to go
down. I wrote this code originally, and I had the thought clearly in
mind that I wanted to accept JSON that was syntactically well-formed,
not JSON that met certain semantic constraints. We could add
functions like json_is_non_stupid(json) so that people can easily add
a CHECK constraint that enforces this if they so desire. But
enforcing it categorically seems like a bad plan, especially since at
this point it would require a compatibility break with previous
releases.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
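
As a rough sketch of the kind of CHECK constraint Robert describes above: json_is_non_stupid is only a name he floats, so the example below uses a hypothetical helper, json_has_unique_keys, built on the json_each() accessor from Andrew's patch. It checks only the top level of an object.

create or replace function json_has_unique_keys(j json) returns boolean
language sql immutable as $$
    -- true when no top-level key appears more than once
    select count(*) = count(distinct key) from json_each(j);
$$;

create table configs (
    doc json check (json_has_unique_keys(doc))
);

-- insert into configs values ('{"foo": 1, "foo": 2}');  -- would be rejected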


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "David E(dot) Wheeler" <david(at)justatheory(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-08 20:56:42
Message-ID: 513A508A.6040404@2ndQuadrant.com
Lists: pgsql-hackers

On 03/08/2013 09:39 PM, Robert Haas wrote:
> On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david(at)justatheory(dot)com> wrote:
>> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
> I'm with Andrew. That's a rathole I emphatically don't want to go
> down. I wrote this code originally, and I had the thought clearly in
> mind that I wanted to accept JSON that was syntactically well-formed,
> not JSON that met certain semantic constraints.

If it does not meet these "semantic" constraints, then it is not
really JSON - it is merely JSON-like.

This sounds very much like MySQL's decision to support the timestamp
"0000-00-00 00:00" - syntactically correct, but semantically wrong.

> We could add
> functions like json_is_non_stupid(json) so that people can easily add
> a CHECK constraint that enforces this if they so desire. But
> enforcing it categorically seems like a bad plan, especially since at
> this point it would require a compatibility break with previous
> releases
If we ever support "real" spec-compliant JSON (maybe based
on recursive hstore?) then there will be a compatibility break
anyway, so why not do it now?

Or do you seriously believe that somebody is using "PostgreSQL JSON"
to store this kind of "json document"?

Cheers
Hannu Krosing


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "David E(dot) Wheeler" <david(at)justatheory(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-08 21:01:10
Message-ID: 20130308210110.GE5352@alvh.no-ip.org
Lists: pgsql-hackers

Hannu Krosing escribió:
> On 03/08/2013 09:39 PM, Robert Haas wrote:
> >On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david(at)justatheory(dot)com> wrote:
> >>In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
> >I'm with Andrew. That's a rathole I emphatically don't want to go
> >down. I wrote this code originally, and I had the thought clearly in
> >mind that I wanted to accept JSON that was syntactically well-formed,
> >not JSON that met certain semantic constraints.
>
> If it does not meet these "semantic" constraints, then it is not
> really JSON - it is merely JSON-like.
>
> this sounds very much like MySQLs decision to support timestamp
> "0000-00-00 00:00" - syntactically correct, but semantically wrong.

Is it wrong? The standard cited says SHOULD, not MUST.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-08 21:14:09
Message-ID: 46479E46-F635-44BB-8B44-1357F623F1A8@justatheory.com
Lists: pgsql-hackers

On Mar 8, 2013, at 1:01 PM, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote:

>> If it does not meet these "semantic" constraints, then it is not
>> really JSON - it is merely JSON-like.
>>
>> this sounds very much like MySQLs decision to support timestamp
>> "0000-00-00 00:00" - syntactically correct, but semantically wrong.
>
> Is it wrong? The standard cited says SHOULD, not MUST.

Yes, it is wrong, because multiple keys are specifically disallowed for accessing values. Hence this new error:

david=# select json_get('{"foo": 1, "foo": 2}', 'foo');
ERROR: field name is not unique in json object

I really don’t think that should be possible.

Best,

David


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Hannu Krosing <hannu(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "David E(dot) Wheeler" <david(at)justatheory(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-08 21:19:01
Message-ID: 513A55C5.40706@archidevsys.co.nz
Lists: pgsql-hackers

Well, I would much prefer to find out sooner rather than later that there
is a problem, so I would much rather know I've created a duplicate as
soon as the system can detect it. In general, PostgreSQL appears much
better at this than MySQL.

On 09/03/13 10:01, Alvaro Herrera wrote:
> Hannu Krosing escribió:
>> On 03/08/2013 09:39 PM, Robert Haas wrote:
>>> On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david(at)justatheory(dot)com> wrote:
>>>> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
>>> I'm with Andrew. That's a rathole I emphatically don't want to go
>>> down. I wrote this code originally, and I had the thought clearly in
>>> mind that I wanted to accept JSON that was syntactically well-formed,
>>> not JSON that met certain semantic constraints.
>> If it does not meet these "semantic" constraints, then it is not
>> really JSON - it is merely JSON-like.
>>
>> this sounds very much like MySQLs decision to support timestamp
>> "0000-00-00 00:00" - syntactically correct, but semantically wrong.
> Is it wrong? The standard cited says SHOULD, not MUST.
>


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "David E(dot) Wheeler" <david(at)justatheory(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-08 21:21:46
Message-ID: 513A566A.5090909@dunslane.net
Lists: pgsql-hackers


On 03/08/2013 04:01 PM, Alvaro Herrera wrote:
> Hannu Krosing escribió:
>> On 03/08/2013 09:39 PM, Robert Haas wrote:
>>> On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david(at)justatheory(dot)com> wrote:
>>>> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
>>> I'm with Andrew. That's a rathole I emphatically don't want to go
>>> down. I wrote this code originally, and I had the thought clearly in
>>> mind that I wanted to accept JSON that was syntactically well-formed,
>>> not JSON that met certain semantic constraints.
>> If it does not meet these "semantic" constraints, then it is not
>> really JSON - it is merely JSON-like.
>>
>> this sounds very much like MySQLs decision to support timestamp
>> "0000-00-00 00:00" - syntactically correct, but semantically wrong.
> Is it wrong? The standard cited says SHOULD, not MUST.

Here's what rfc2119 says about that wording:

4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that
there may exist valid reasons in particular circumstances when the
particular behavior is acceptable or even useful, but the full
implications should be understood and the case carefully weighed
before implementing any behavior described with this label.

So we're allowed to do as Robert chose, and I think there are good
reasons for doing so (apart from anything else, checking it would slow
down the parser enormously).

Now you could argue that in that case the extractor functions should
allow it too, and it's probably fairly easy to change them to allow it.
In that case we need to decide who wins. We could treat a later field
lexically as overriding an earlier field of the same name, which I think
is what David expected. That's what plv8 does (i.e. it's how v8
interprets JSON):

andrew=# create or replace function jget(t json, fld text) returns
text language plv8 as ' return t[fld]; ';
CREATE FUNCTION
andrew=# select jget('{"f1":"x","f1":"y"}','f1');
jget
------
y
(1 row)

Or you could take the view I originally took that in view of the RFC
wording we should raise an error if this was found.

I can live with either view.

cheers

andrew


From: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-08 21:28:53
Message-ID: 137A752C-5EB9-47CB-B989-4A41FE40CE44@justatheory.com
Lists: pgsql-hackers

On Mar 8, 2013, at 1:21 PM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:

> Here's what rfc2119 says about that wording:
>
> 4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that
> there may exist valid reasons in particular circumstances when the
> particular behavior is acceptable or even useful, but the full
> implications should be understood and the case carefully weighed
> before implementing any behavior described with this label.

I suspect this was allowed for the JavaScript behavior where multiple keys are allowed, but the last key in the list wins.

> So we're allowed to do as Robert chose, and I think there are good reasons for doing so (apart from anything else, checking it would slow down the parser enormously).

Yes, but the implications are going to start biting us on the ass now.

> Now you could argue that in that case the extractor functions should allow it too, and it's probably fairly easy to change them to allow it. In that case we need to decide who wins. We could treat a later field lexically as overriding an earlier field of the same name, which I think is what David expected. That's what plv8 does (i.e. it's how v8 interprets JSON):
>
> andrew=# create or replace function jget(t json, fld text) returns
> text language plv8 as ' return t[fld]; ';
> CREATE FUNCTION
> andrew=# select jget('{"f1":"x","f1":"y"}','f1');
> jget
> ------
> y
> (1 row)
>
>
> Or you could take the view I originally took that in view of the RFC wording we should raise an error if this was found.
>
> I can live with either view.

I’m on the fence. On the one hand, I like the plv8 behavior, which is nice for a dynamic language. On the other hand, I don't much care for it in my database, where I want data storage requirements to be quite strict. I hate the idea of "0000-00-00" being allowed as a date, and am uncomfortable with allowing duplicate keys to be stored in the JSON data type.

So my order of preference for the options would be:

1. Have the JSON type collapse objects so the last instance of a key wins and is actually stored

2. Throw an error when a JSON type has duplicate keys

3. Have the accessors find the last instance of a key and return that value

4. Let things remain as they are now

On second thought, I don't like 4 at all. It means that the JSON type thinks a value is valid while the accessor does not. They contradict one another.

Best,

David


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "David E(dot) Wheeler" <david(at)justatheory(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-08 21:34:20
Message-ID: 513A595C.2090104@2ndQuadrant.com
Lists: pgsql-hackers

On 03/08/2013 10:01 PM, Alvaro Herrera wrote:
> Hannu Krosing escribió:
>> On 03/08/2013 09:39 PM, Robert Haas wrote:
>>> On Thu, Mar 7, 2013 at 2:48 PM, David E. Wheeler <david(at)justatheory(dot)com> wrote:
>>>> In the spirit of being liberal about what we accept but strict about what we store, it seems to me that JSON object key uniqueness should be enforced either by throwing an error on duplicate keys, or by flattening so that the latest key wins (as happens in JavaScript). I realize that tracking keys will slow parsing down, and potentially make it more memory-intensive, but such is the price for correctness.
>>> I'm with Andrew. That's a rathole I emphatically don't want to go
>>> down. I wrote this code originally, and I had the thought clearly in
>>> mind that I wanted to accept JSON that was syntactically well-formed,
>>> not JSON that met certain semantic constraints.
>> If it does not meet these "semantic" constraints, then it is not
>> really JSON - it is merely JSON-like.
>>
>> this sounds very much like MySQLs decision to support timestamp
>> "0000-00-00 00:00" - syntactically correct, but semantically wrong.
> Is it wrong? The standard cited says SHOULD, not MUST.

I think one MAY start an implementation with a loose interpretation of
SHOULD, but if at all possible we SHOULD implement the
SHOULD-qualified features :)

http://www.ietf.org/rfc/rfc2119.txt:

SHOULD This word, or the adjective "RECOMMENDED", mean that there
may exist valid reasons in particular circumstances to ignore a
particular item, but the full implications must be understood and
carefully weighed before choosing a different course.

We might start with just throwing a warning for duplicate keys, but I
can see no good reason to do so, except ease of implementation and
performance with the current JSON-as-text implementation.

And providing a boolean function is_really_json_object(json) to be used in check
constraints seems plain weird.

OTOH, as the spec defines JSON as being designed to be a subset of JavaScript,
it SHOULD accept select '{"foo": 1, "foo": 2}'::json; but turn it into
'{"foo": 2}'::json for storage.

I do not think it would be a good idea to leave it to the data extraction
functions to always return the last value for foo (in this case, 2).

------------------
Hannu


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-08 21:42:26
Message-ID: 513A5B42.20401@dunslane.net
Lists: pgsql-hackers


On 03/08/2013 04:28 PM, David E. Wheeler wrote:
> On Mar 8, 2013, at 1:21 PM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>
>> Here's what rfc2119 says about that wording:
>>
>> 4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that
>> there may exist valid reasons in particular circumstances when the
>> particular behavior is acceptable or even useful, but the full
>> implications should be understood and the case carefully weighed
>> before implementing any behavior described with this label.
> I suspect this was allowed for the JavaScript behavior where multiple keys are allowed, but the last key in the list wins.
>
>> So we're allowed to do as Robert chose, and I think there are good reasons for doing so (apart from anything else, checking it would slow down the parser enormously).
> Yes, but the implications are going to start biting us on the ass now.
>
>> Now you could argue that in that case the extractor functions should allow it too, and it's probably fairly easy to change them to allow it. In that case we need to decide who wins. We could treat a later field lexically as overriding an earlier field of the same name, which I think is what David expected. That's what plv8 does (i.e. it's how v8 interprets JSON):
>>
>> andrew=# create or replace function jget(t json, fld text) returns
>> text language plv8 as ' return t[fld]; ';
>> CREATE FUNCTION
>> andrew=# select jget('{"f1":"x","f1":"y"}','f1');
>> jget
>> ------
>> y
>> (1 row)
>>
>>
>> Or you could take the view I originally took that in view of the RFC wording we should raise an error if this was found.
>>
>> I can live with either view.
> I’m on the fence. On the one hand, I like the plv8 behavior, which is nice for a dynamic language. On the other hand, I don't much care for it in my database, where I want data storage requirements to be quite strict. I hate the idea of "0000-00-00" being allowed as a date, and am uncomfortable with allowing duplicate keys to be stored in the JSON data type.
>
> So my order of preference for the options would be:
>
> 1. Have the JSON type collapse objects so the last instance of a key wins and is actually stored
>
> 2. Throw an error when a JSON type has duplicate keys
>
> 3. Have the accessors find the last instance of a key and return that value
>
> 4. Let things remain as they are now
>
> On second though, I don't like 4 at all. It means that the JSON type things a value is valid while the accessor does not. They contradict one another.
>
>

You can forget 1. We are not going to have the parser collapse anything.
Either the JSON it gets is valid or it's not. But the parser isn't going
to try to MAKE it valid.

cheers

andrew


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: "David E(dot) Wheeler" <david(at)justatheory(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-08 21:54:44
Message-ID: 513A5E24.9090404@2ndQuadrant.com
Lists: pgsql-hackers

On 03/08/2013 10:42 PM, Andrew Dunstan wrote:
>
> On 03/08/2013 04:28 PM, David E. Wheeler wrote:
>> On Mar 8, 2013, at 1:21 PM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>>
>>> Here's what rfc2119 says about that wording:
>>>
>>> 4. SHOULD NOT This phrase, or the phrase "NOT RECOMMENDED" mean that
>>> there may exist valid reasons in particular circumstances when the
>>> particular behavior is acceptable or even useful, but the full
>>> implications should be understood and the case carefully weighed
>>> before implementing any behavior described with this label.
>> I suspect this was allowed for the JavaScript behavior where multiple
>> keys are allowed, but the last key in the list wins.
>>
>>> So we're allowed to do as Robert chose, and I think there are good
>>> reasons for doing so (apart from anything else, checking it would
>>> slow down the parser enormously).
>> Yes, but the implications are going to start biting us on the ass now.
>>
>>> Now you could argue that in that case the extractor functions should
>>> allow it too, and it's probably fairly easy to change them to allow
>>> it. In that case we need to decide who wins. We could treat a later
>>> field lexically as overriding an earlier field of the same name,
>>> which I think is what David expected. That's what plv8 does (i.e.
>>> it's how v8 interprets JSON):
>>>
>>> andrew=# create or replace function jget(t json, fld text) returns
>>> text language plv8 as ' return t[fld]; ';
>>> CREATE FUNCTION
>>> andrew=# select jget('{"f1":"x","f1":"y"}','f1');
>>> jget
>>> ------
>>> y
>>> (1 row)
>>>
>>>
>>> Or you could take the view I originally took that in view of the RFC
>>> wording we should raise an error if this was found.
>>>
>>> I can live with either view.
>> I’m on the fence. On the one hand, I like the plv8 behavior, which is
>> nice for a dynamic language. On the other hand, I don't much care for
>> it in my database, where I want data storage requirements to be quite
>> strict. I hate the idea of "0000-00-00" being allowed as a date, and
>> am uncomfortable with allowing duplicate keys to be stored in the
>> JSON data type.
>>
>> So my order of preference for the options would be:
>>
>> 1. Have the JSON type collapse objects so the last instance of a key
>> wins and is actually stored
>>
>> 2. Throw an error when a JSON type has duplicate keys
>>
>> 3. Have the accessors find the last instance of a key and return that
>> value
>>
>> 4. Let things remain as they are now
>>
>> On second though, I don't like 4 at all. It means that the JSON type
>> things a value is valid while the accessor does not. They contradict
>> one another.
>>
>>
>
>
> You can forget 1. We are not going to have the parser collapse anything.
> Either the JSON it gets is valid or it's not. But the parser isn't
> going to try to MAKE it valid.
OK, so the "make valid" part will have to wait for
http://www.pgcon.org/2013/schedule/events/518.en.html, if it ever
happens ;)

Which means that all extractor functions will need to do much more work
in the case of complex json. Think of

json_get('{"a":{"b":1}, "a":{"1":"x"}, "a":[1,{"b":7}]}'::json,
["a","1","b"])

The true value in JavaScript is 7, but here the json_get function has several
options to error out early, and real confusion as to where to report the
error if in the end it is not found. Essentially, all extractor functions
have to do what we omitted doing at input time.

Cheers
Hannu


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-08 22:03:36
Message-ID: 513A6038.2030008@dunslane.net
Lists: pgsql-hackers


On 03/08/2013 04:42 PM, Andrew Dunstan wrote:
>
>>
>> So my order of preference for the options would be:
>>
>> 1. Have the JSON type collapse objects so the last instance of a key
>> wins and is actually stored
>>
>> 2. Throw an error when a JSON type has duplicate keys
>>
>> 3. Have the accessors find the last instance of a key and return that
>> value
>>
>> 4. Let things remain as they are now
>>
>> On second though, I don't like 4 at all. It means that the JSON type
>> things a value is valid while the accessor does not. They contradict
>> one another.
>>
>>
>
>
> You can forget 1. We are not going to have the parser collapse
> anything. Either the JSON it gets is valid or it's not. But the parser
> isn't going to try to MAKE it valid.

Actually, now that I think more about it, 3 is the best answer. Here's why:
even the JSON generators can produce JSON with non-unique field names:

andrew=# select row_to_json(q) from (select x as a, y as a from
generate_series(1,2) x, generate_series(3,4) y) q;
row_to_json
---------------
{"a":1,"a":3}
{"a":1,"a":4}
{"a":2,"a":3}
{"a":2,"a":4}

So I think we have no option but to say, in terms of rfc 2119, that we
have carefully considered and decided not to comply with the RFC's
recommendation (and we should note that in the docs).

cheers

andrew
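
For what it's worth, the duplicate keys in the row_to_json example above come straight from the repeated column alias; with distinct aliases the generated keys are unique. A quick check (row order may vary):

select row_to_json(q)
from (select x as a, y as b
      from generate_series(1,2) x, generate_series(3,4) y) q;
-- {"a":1,"b":3}
-- {"a":1,"b":4}
-- {"a":2,"b":3}
-- {"a":2,"b":4}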


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-08 22:54:00
Message-ID: 513A6C08.10409@agliodbs.com
Lists: pgsql-hackers


> Actually, now I think more about it 3 is the best answer. Here's why:
> even the JSON generators can produce JSON with non-unique field names:

+1

Also, I think we should add a json_normalize() function to the TODO list.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
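
A sketch of what such a json_normalize() might be expected to do for a top-level object, following the last-key-wins semantics discussed above (the function is hypothetical and does not exist yet):

-- hypothetical: json_normalize() is only a TODO suggestion at this point
select json_normalize('{"foo": 1, "foo": 2}');
-- would be expected to yield: {"foo": 2}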


From: Hannu Krosing <hannu(at)krosing(dot)net>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: "David E(dot) Wheeler" <david(at)justatheory(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-08 23:37:26
Message-ID: 513A7636.8060608@krosing.net
Lists: pgsql-hackers

On 03/08/2013 11:03 PM, Andrew Dunstan wrote:
>
> On 03/08/2013 04:42 PM, Andrew Dunstan wrote:
>>
>>>
>>> So my order of preference for the options would be:
>>>
>>> 1. Have the JSON type collapse objects so the last instance of a key
>>> wins and is actually stored
>>>
>>> 2. Throw an error when a JSON type has duplicate keys
>>>
>>> 3. Have the accessors find the last instance of a key and return
>>> that value
>>>
>>> 4. Let things remain as they are now
>>>
>>> On second though, I don't like 4 at all. It means that the JSON type
>>> things a value is valid while the accessor does not. They contradict
>>> one another.
>>>
>>>
>>
>>
>> You can forget 1. We are not going to have the parser collapse
>> anything. Either the JSON it gets is valid or it's not. But the
>> parser isn't going to try to MAKE it valid.
>
>
> Actually, now I think more about it 3 is the best answer.
> Here's why: even the JSON generators can produce JSON with non-unique
> field names:
Yes, especially if you consider popular json generators like vim and strcat() :)

It is not a "serialisation" of some existing object, but it is something
that JavaScript could interpret as a valid subset of JavaScript which
produces a JavaScript Object when interpreted.
In this sense it is way better than the MySQL timestamp 0000-00-00 00:00.

So the loose (without implementing the SHOULD part) meaning of the
JSON spec is "anything that can be read into JavaScript producing
a JS Object" and not "serialisation of a JavaScript Object", as I wanted
to read it initially.

>
> andrew=# select row_to_json(q) from (select x as a, y as a from
> generate_series(1,2) x, generate_series(3,4) y) q;
> row_to_json
> ---------------
> {"a":1,"a":3}
> {"a":1,"a":4}
> {"a":2,"a":3}
> {"a":2,"a":4}
>
>
> So I think we have no option but to say, in terms of rfc 2119, that we
> have careful considered and decided not to comply with the RFC's
> recommendation
The downside is that we have just shifted the burden of JS Object
generation to the getter functions.

I suspect that 99.98% of the time we will get valid and unique JS Object
serializations, or equivalent, as input to json_in().

If we want the getter functions to handle the "loose JSON" to Object
conversion side, assuming our stored JSON can contain non-unique keys,
then these are bound to be slower, as they have to do these checks.
They can't just grab the first matching one and return or recurse on that.

> (and we should note that in the docs).
definitely +1
>
> cheers
>
> andrew
>
>
>
>


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Hannu Krosing <hannu(at)krosing(dot)net>
Cc: "David E(dot) Wheeler" <david(at)justatheory(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hannu Krosing <hannu(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-09 00:25:03
Message-ID: 513A815F.8050500@dunslane.net
Lists: pgsql-hackers


On 03/08/2013 06:37 PM, Hannu Krosing wrote:

>
> I suspect that 99.98% of the time we will get valid and unique JS
> Object serializations or equivalent as input to json_in()
>
> If we want the getter functions to handle the "loose JSON" to Object
> conversion
> side assuming our stored JSON can contain non-unique keys then these are
> bound to be slower, as they have to do these checks. Thay can't just
> grab the first
> matching one and return or recurse on that.

No, there will be no slowdown. The parser doesn't short circuit.

Read the code.

cheers

andrew


From: Noah Misch <noah(at)leadboat(dot)com>
To: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "David E(dot) Wheeler" <david(at)justatheory(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-09 02:01:29
Message-ID: 20130309020129.GA15861@tornado.leadboat.com
Lists: pgsql-hackers

On Fri, Mar 08, 2013 at 10:34:20PM +0100, Hannu Krosing wrote:
> On 03/08/2013 10:01 PM, Alvaro Herrera wrote:
>> Hannu Krosing escribió:
>>> If it does not meet these "semantic" constraints, then it is not
>>> really JSON - it is merely JSON-like.

>> Is it wrong? The standard cited says SHOULD, not MUST.
>
>
> I think one MAY start implementation with loose interpretation of
> SHOULD, but if at all possible we SHOULD implement the
> SHOULD-qualified features :)
>
> http://www.ietf.org/rfc/rfc2119.txt:
>
> SHOULD This word, or the adjective "RECOMMENDED", mean that there
> may exist valid reasons in particular circumstances to ignore a
> particular item, but the full implications must be understood and
> carefully weighed before choosing a different course.

That "SHOULD" in section 2.2 of RFC 4627 constrains JSON data, not JSON
parsers. Section 4 addresses parsers, saying "A JSON parser MUST accept all
texts that conform to the JSON grammar."

> We might start with just throwing a warning for duplicate keys, but I
> can see no good reason to do so. Except ease of implementation and with
> current JSON-AS-TEXT implenetation performance.

Since its departure from a "SHOULD" item does not impugn the conformance of an
input text, it follows that json_in(), to be a conforming JSON parser, MUST
not reject objects with duplicate keys.

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: "David E(dot) Wheeler" <david(at)justatheory(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-13 12:17:13
Message-ID: CA+TgmobawHPh62enG52e+GkqVtTk+O_GNbTarEuSv-M4yWt7yw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 8, 2013 at 4:42 PM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>> So my order of preference for the options would be:
>>
>> 1. Have the JSON type collapse objects so the last instance of a key wins
>> and is actually stored
>>
>> 2. Throw an error when a JSON type has duplicate keys
>>
>> 3. Have the accessors find the last instance of a key and return that
>> value
>>
>> 4. Let things remain as they are now
>>
>> On second though, I don't like 4 at all. It means that the JSON type
>> things a value is valid while the accessor does not. They contradict one
>> another.
>
> You can forget 1. We are not going to have the parser collapse anything.
> Either the JSON it gets is valid or it's not. But the parser isn't going to
> try to MAKE it valid.

Why not? Because it's the wrong thing to do, or because it would be slower?

What I think is tricky here is that there's more than one way to
conceptualize what the JSON data type really is. Is it a key-value
store of sorts, or just a way to store text values that meet certain
minimalist syntactic criteria? I had imagined it as the latter, in
which case normalization isn't sensible. But if you think of it the
first way, then normalization is not only sensible, but almost
obligatory. For example, we don't feel bad about this:

rhaas=# select '1e1'::numeric;
numeric
---------
10
(1 row)

I think Andrew and I had envisioned this as basically a text data type
that enforces some syntax checking on its input, hence the current
design. But I'm not sure that's the ONLY sensible design.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, "David E(dot) Wheeler" <david(at)justatheory(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-13 12:48:57
Message-ID: 514075B9.6060306@2ndquadrant.com
Lists: pgsql-hackers

On 03/13/2013 08:17 PM, Robert Haas wrote:
> I think Andrew and I had envisioned this as basically a text data type
> that enforces some syntax checking on its input, hence the current
> design. But I'm not sure that's the ONLY sensible design.
We're probably stuck with it at this point, but it may well be worth
considering the later introduction of a compatible `jsonobj` that stores
parsed and normalized json objects in some internal format the client
doesn't have to care about, like serialized V8 JS VM objects.

I suspect that such a type is better offered as a contrib module until/unless
PL/V8 or something similar becomes a core language. It'd be nuts to try to
re-implement all of the JSON and JavaScript object functionality of a
JavaScript engine when we can just plug an existing one in and use its
JSON and JavaScript object manipulation. The minimalist approach makes
sense for the json type precisely because it's just validated text, but
I don't think it makes sense to continually extend it and slowly
reinvent a whole JavaScript engine in Pg.

If we're going to do things like normalizing json I think that's a job
for a real JavaScript engine that understands Javascript objects.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "David E(dot) Wheeler" <david(at)justatheory(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-13 13:02:25
Message-ID: 514078E1.2090905@dunslane.net
Lists: pgsql-hackers


On 03/13/2013 08:17 AM, Robert Haas wrote:
> On Fri, Mar 8, 2013 at 4:42 PM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>>> So my order of preference for the options would be:
>>>
>>> 1. Have the JSON type collapse objects so the last instance of a key wins
>>> and is actually stored
>>>
>>> 2. Throw an error when a JSON type has duplicate keys
>>>
>>> 3. Have the accessors find the last instance of a key and return that
>>> value
>>>
>>> 4. Let things remain as they are now
>>>
>>> On second though, I don't like 4 at all. It means that the JSON type
>>> things a value is valid while the accessor does not. They contradict one
>>> another.
>> You can forget 1. We are not going to have the parser collapse anything.
>> Either the JSON it gets is valid or it's not. But the parser isn't going to
>> try to MAKE it valid.
> Why not? Because it's the wrong thing to do, or because it would be slower?
>
> What I think is tricky here is that there's more than one way to
> conceptualize what the JSON data type really is. Is it a key-value
> store of sorts, or just a way to store text values that meet certain
> minimalist syntactic criteria? I had imagined it as the latter, in
> which case normalization isn't sensible. But if you think of it the
> first way, then normalization is not only sensible, but almost
> obligatory. For example, we don't feel bad about this:
>
> rhaas=# select '1e1'::numeric;
> numeric
> ---------
> 10
> (1 row)
>
> I think Andrew and I had envisioned this as basically a text data type
> that enforces some syntax checking on its input, hence the current
> design. But I'm not sure that's the ONLY sensible design.
>

I think we've moved on from this point, because a) other implementations
allow duplicate keys, b) it's trivially easy to make Postgres generate
such json, and c) there is some dispute about exactly what the spec
mandates.

I'll be posting a revised patch shortly that doesn't error out but
simply uses the value for the later key lexically.

cheers

andrew
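
In other words, under the revised behaviour the example from the start of the thread would be expected to return the later value instead of an error (a sketch using the json_get name from the first message; the exact function name in the final patch may differ):

select json_get('{"foo": 1, "foo": 2}', 'foo');
-- expected: 2
-- (instead of ERROR: field name is not unique in json object)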


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "David E(dot) Wheeler" <david(at)justatheory(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-13 13:09:18
Message-ID: 20130313130918.GA27988@alap2.anarazel.de
Lists: pgsql-hackers

On 2013-03-13 20:48:57 +0800, Craig Ringer wrote:
> On 03/13/2013 08:17 PM, Robert Haas wrote:
> > I think Andrew and I had envisioned this as basically a text data type
> > that enforces some syntax checking on its input, hence the current
> > design. But I'm not sure that's the ONLY sensible design.
> We're probably stuck with it at this point, but it may well be worth
> considering the later introduction of a compatible `jsonobj` that stores
> parsed and normalized json objects in some internal format the client
> doesn't have to care about, like serialized V8 JS VM objects.
>
> I suspect that such a type is better offered by a contrib until/unless
> PL/V8 or a similar becomes a core language. It'd be nuts to try to
> re-implement all of the JSON and javascript object functionality in a
> javascript engine when we can just plug an existing one in and use its
> JSON and javascript object manipulation. The minimalist approach makes
> sense for the json type precisely because it's just validated text, but
> I don't think it makes sense to continually extend it and slowly
> reinvent a whole javascript engine in Pg.

While I am not convinced - but not the contrary either - that using
something like V8 is a good idea, I wish the patch adding json had
reserved the first byte in the varlena for the 'json encoding' or
something similar. That would have left the road open for easily adding
different encodings in the future. Now you're left with marking it with a
null byte at the beginning or similar atrocities...

Just wanted to say that, now that we preserve cross-version compatibility
of on-disk data, we might want to think about such stuff the next time we
add a type.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-13 16:40:57
Message-ID: 3E5EE5E6-2013-46B4-8731-F82C7C9E28AD@justatheory.com
Lists: pgsql-hackers

On Mar 13, 2013, at 5:17 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> What I think is tricky here is that there's more than one way to
> conceptualize what the JSON data type really is. Is it a key-value
> store of sorts, or just a way to store text values that meet certain
> minimalist syntactic criteria? I had imagined it as the latter, in
> which case normalization isn't sensible. But if you think of it the
> first way, then normalization is not only sensible, but almost
> obligatory.

That makes a lot of sense. Given the restrictions I tend to prefer in my database data types, I had imagined it as the former. And since I'm using it now to store key/value pairs (killing off some awful EAV implementations in the process, BTW), I certainly think of it more formally as an object.

But I can live with the other interpretation, as long as the differences are clearly understood and documented. Perhaps a note could be added to the docs explaining this difference, and what one can do to adapt for it. A normalizing function would certainly help.

Best,

David


From: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "David E(dot) Wheeler" <david(at)justatheory(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-13 16:51:42
Message-ID: 5140AE9E.8010406@archidevsys.co.nz
Lists: pgsql-hackers

On 14/03/13 02:02, Andrew Dunstan wrote:
>
> On 03/13/2013 08:17 AM, Robert Haas wrote:
>> On Fri, Mar 8, 2013 at 4:42 PM, Andrew Dunstan <andrew(at)dunslane(dot)net>
>> wrote:
>>>> So my order of preference for the options would be:
>>>>
>>>> 1. Have the JSON type collapse objects so the last instance of a
>>>> key wins
>>>> and is actually stored
>>>>
>>>> 2. Throw an error when a JSON type has duplicate keys
>>>>
>>>> 3. Have the accessors find the last instance of a key and return that
>>>> value
>>>>
>>>> 4. Let things remain as they are now
>>>>
>>>> On second though, I don't like 4 at all. It means that the JSON type
>>>> things a value is valid while the accessor does not. They
>>>> contradict one
>>>> another.
>>> You can forget 1. We are not going to have the parser collapse
>>> anything.
>>> Either the JSON it gets is valid or it's not. But the parser isn't
>>> going to
>>> try to MAKE it valid.
>> Why not? Because it's the wrong thing to do, or because it would be
>> slower?
>>
>> What I think is tricky here is that there's more than one way to
>> conceptualize what the JSON data type really is. Is it a key-value
>> store of sorts, or just a way to store text values that meet certain
>> minimalist syntactic criteria? I had imagined it as the latter, in
>> which case normalization isn't sensible. But if you think of it the
>> first way, then normalization is not only sensible, but almost
>> obligatory. For example, we don't feel bad about this:
>>
>> rhaas=# select '1e1'::numeric;
>> numeric
>> ---------
>> 10
>> (1 row)
>>
>> I think Andrew and I had envisioned this as basically a text data type
>> that enforces some syntax checking on its input, hence the current
>> design. But I'm not sure that's the ONLY sensible design.
>>
>
>
> I think we've moved on from this point, because a) other
> implementations allow duplicate keys, b) it's trivially easy to make
> Postgres generate such json, and c) there is some dispute about
> exactly what the spec mandates.
>
> I'll be posting a revised patch shortly that doesn't error out but
> simply uses the value for the later key lexically.
>
> cheers
>
> andrew
>
>
>
>
How about adding a new function with '_strict' added to the existing
name, with an extra parameter 'coalesce' - or using other names, if
considered more appropriate?

That way, slower, more stringent functionality can be added where
required. This way, the existing function need not be changed.

If coalesce = true,
then: the last duplicate is used
else: an error is returned when the new key is a duplicate.

Cheers,
Gavin


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "David E(dot) Wheeler" <david(at)justatheory(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-13 17:45:43
Message-ID: 5140BB47.7050302@dunslane.net
Lists: pgsql-hackers


On 03/13/2013 12:51 PM, Gavin Flower wrote:
> On 14/03/13 02:02, Andrew Dunstan wrote:
>>
>> On 03/13/2013 08:17 AM, Robert Haas wrote:
>>> On Fri, Mar 8, 2013 at 4:42 PM, Andrew Dunstan <andrew(at)dunslane(dot)net>
>>> wrote:
>>>>> So my order of preference for the options would be:
>>>>>
>>>>> 1. Have the JSON type collapse objects so the last instance of a
>>>>> key wins
>>>>> and is actually stored
>>>>>
>>>>> 2. Throw an error when a JSON type has duplicate keys
>>>>>
>>>>> 3. Have the accessors find the last instance of a key and return that
>>>>> value
>>>>>
>>>>> 4. Let things remain as they are now
>>>>>
>>>>> On second though, I don't like 4 at all. It means that the JSON type
>>>>> things a value is valid while the accessor does not. They
>>>>> contradict one
>>>>> another.
>>>> You can forget 1. We are not going to have the parser collapse
>>>> anything.
>>>> Either the JSON it gets is valid or it's not. But the parser isn't
>>>> going to
>>>> try to MAKE it valid.
>>> Why not? Because it's the wrong thing to do, or because it would be
>>> slower?
>>>
>>> What I think is tricky here is that there's more than one way to
>>> conceptualize what the JSON data type really is. Is it a key-value
>>> store of sorts, or just a way to store text values that meet certain
>>> minimalist syntactic criteria? I had imagined it as the latter, in
>>> which case normalization isn't sensible. But if you think of it the
>>> first way, then normalization is not only sensible, but almost
>>> obligatory. For example, we don't feel bad about this:
>>>
>>> rhaas=# select '1e1'::numeric;
>>> numeric
>>> ---------
>>> 10
>>> (1 row)
>>>
>>> I think Andrew and I had envisioned this as basically a text data type
>>> that enforces some syntax checking on its input, hence the current
>>> design. But I'm not sure that's the ONLY sensible design.
>>>
>>
>>
>> I think we've moved on from this point, because a) other
>> implementations allow duplicate keys, b) it's trivially easy to make
>> Postgres generate such json, and c) there is some dispute about
>> exactly what the spec mandates.
>>
>> I'll be posting a revised patch shortly that doesn't error out but
>> simply uses the value for the later key lexically.
>>
>> cheers
>>
>> andrew
>>
>>
>>
>>
> How about adding a new function with '_strict' added to the existing
> name, with an extra parameter 'coalesce' - or using other names, if
> considered more appropriate!
>
> That way slower more stringent functionality can be added where
> required. This way, the existing function need not be changed.
>
> If coalesce = true,
> then: the last duplicate is used
> else: an error is returned when the new key is a duplicate.
>
>
>

For good or ill, we now already have a json type that will accept
strings with duplicate keys, and generator functions which can now
generate such strings. If someone wants functions to enforce a stricter
validity check (e.g. via a check constraint on a domain), or to convert
json to a canonical version which strips out prior keys of the same name
and their associated values, then these should be relatively simple to
implement given the parser API in the current patch. But they aren't
part of the current patch, and I think it's way too late to be adding
such things. I have been persuaded by arguments made upthread that the
best thing to do is exactly what other well known json-accepting
implementations do (e.g. V8), which is to accept json with duplicate
keys and to treat the later key/value as overriding the former
key/value. If I'd done that from the start nobody would now be talking
about this at all.

cheers

andrew
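
A sketch of the domain approach Andrew mentions, built on a hypothetical json_has_unique_keys() helper over the json_each() accessor from the patch (top level only; not part of the patch itself):

create or replace function json_has_unique_keys(j json) returns boolean
language sql immutable as $$
    -- true when no top-level key appears more than once
    select count(*) = count(distinct key) from json_each(j);
$$;

create domain json_strict as json
    check (json_has_unique_keys(value));

-- select '{"foo": 1, "foo": 2}'::json_strict;  -- would fail the CHECK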


From: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-13 17:50:28
Message-ID: 9A4E6ABE-1495-46B7-8C1A-7C5B675F410E@justatheory.com
Lists: pgsql-hackers

On Mar 13, 2013, at 10:45 AM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:

> If someone wants functions to enforce a stricter validity check (e.g. via a check constraint on a domain), or to convert json to a canonical version which strips out prior keys of the same name and their associated values, then these should be relatively simple to implement given the parser API in the current patch. But they aren't part of the current patch, and I think it's way too late to be adding such things.

I think it might be good to get something like this into core eventually; otherwise I suspect that there will be a different version of it for every JSON-using project out there. And my first cut at it won’t descend into sub-objects.

> I have been persuaded by arguments made upthread that the best thing to do is exactly what other well known json-accepting implementations do (e.g. V8), which is to accept json with duplicate keys and to treat the later key/value as overriding the former key/value. If I'd done that from the start nobody would now be talking about this at all.

That’s true, though I might have started thinking about a canonicalizing function before long. :-)

Best,

David


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
Cc: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-13 17:59:21
Message-ID: 5140BE79.3010405@dunslane.net
Lists: pgsql-hackers


On 03/13/2013 01:50 PM, David E. Wheeler wrote:
> On Mar 13, 2013, at 10:45 AM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
>
>> If someone wants functions to enforce a stricter validity check (e.g. via a check constraint on a domain), or to convert json to a canonical version which strips out prior keys of the same name and their associated values, then these should be relatively simple to implement given the parser API in the current patch. But they aren't part of the current patch, and I think it's way too late to be adding such things.
> I think it might be good to get something like this into core eventually, otherwise I suspect that there will be a different version of it for every JSON-using project out there. And my first cut at it won’t descend into sub-objects.

Then you wouldn't be doing it right. The whole thing about a recursive
descent parser is that it's, well, recursive.

cheers

andrew


From: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Gavin Flower <GavinFlower(at)archidevsys(dot)co(dot)nz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Hannu Krosing <hannu(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-13 18:32:06
Message-ID: 3BA2E36E-70AD-4E1B-962E-5AC91A64E486@justatheory.com
Lists: pgsql-hackers

On Mar 13, 2013, at 10:59 AM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:

>> And my first cut at it won’t descend into sub-objects.
>
>
> The you wouldn't be doing it right. The whole thing about a recursive descent parser is that it's, well, recursive.

Right, but it would serve my immediate needs. I have a column that just stores key/value pairs with no nesting. So I know I can write something like this and have it be good enough:

create or replace function json_smash(
    json
) RETURNS JSON language SQL STRICT IMMUTABLE AS $$
SELECT format('{%s}', array_to_string(ARRAY(
    SELECT format('%s: %s', to_json(key), value)
    FROM (
        SELECT key, value, row_number() OVER (
            partition by key order by rnum desc
        ) AS rnum
        FROM (
            SELECT key, value, row_number() OVER (
                partition by key
            ) AS rnum
            FROM json_each($1)
        ) a
    ) b
    WHERE rnum = 1
), ','))::json;
$$;

And do you really want to see that unloosed on the world? :-P (Yes, I know there is no guarantee on the order of rows returned by json_each()).

Best,

David


From: Hannu Krosing <hannu(at)2ndQuadrant(dot)com>
To: "David E(dot) Wheeler" <david(at)justatheory(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org Hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Duplicate JSON Object Keys
Date: 2013-03-13 20:49:57
Message-ID: 5140E675.7070308@2ndQuadrant.com
Lists: pgsql-hackers

On 03/13/2013 12:40 PM, David E. Wheeler wrote:
> On Mar 13, 2013, at 5:17 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> What I think is tricky here is that there's more than one way to
>> conceptualize what the JSON data type really is. Is it a key-value
>> store of sorts, or just a way to store text values that meet certain
>> minimalist syntactic criteria? I had imagined it as the latter, in
>> which case normalization isn't sensible. But if you think of it the
>> first way, then normalization is not only sensible, but almost
>> obligatory.
> That makes a lot of sense. Given the restrictions I tend to prefer in my database data types, I had imagined it as the former. And since I'm using it now to store key/value pairs (killing off some awful EAV implementations in the process, BTW), I certainly think of it more formally as an object.
>
>
> But I can live with the other interpretation, as long as the differences are clearly understood and documented. Perhaps a note could be added to the docs explaining this difference, and what one can do to adapt for it. A normalizing function would certainly help.
I guess the easiest and most generic way to normalize is to actually
convert to some internal representation and back.

In PL/Python this would look like this:

hannu=# create function normalize(IN ij json, OUT oj json) language
plpythonu as $$
import json
return json.dumps(json.loads(ij))
$$;
CREATE FUNCTION
hannu=# select normalize('{"a":1, "a":"b", "a":true}');
normalize
-------------
{"a": true}
(1 row)

If we wanted to be really fancy we could start storing our json in
some format which is faster to parse, like tnetstrings, but it is
probably too late in the release cycle to change this now.

Hannu
>
> Best,
>
> David
>