Re: JSON data type status?

From: Joseph Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: JSON data type status?
Date: 2011-01-21 08:58:53
Message-ID: AANLkTi=ti88X2KR035_G28WJR+-EFTgMMQYpTd5YZw1=@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jan 20, 2011 at 8:13 PM, Itagaki Takahiro
<itagaki(dot)takahiro(at)gmail(dot)com> wrote:
> On Fri, Jan 21, 2011 at 09:11, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> What happened to our work to add a JSON data type for PG 9.1?
>
> Nothing will happen in 9.1.
> I assume we are in "competition" status:
>
> http://archives.postgresql.org/pgsql-hackers/2010-11/msg00481.php
>
> Also, if PGXN will work well, we might not have to include JSON
> in the core. We can download any JSON implementations from the
> site after installing the core server. Of course, if we will
> use JSON types in the core (EXPLAIN JSON output?), we have to
> include one of them.

Thanks for the update. However, I should probably take some blame as
well for spending basically zero time on the JSON datatype for lengths
at a time.

After someone emailed me asking how to install the JSON datatype, I
fixed up my module version of it:

http://git.postgresql.org/gitweb?p=json-datatype.git

I also have a version where JSON is a core datatype (the patches I
submitted to CommitFest), but it is obsolete now. The module version
has all the same features, but also fixes a Unicode bug* and adds
PostgreSQL 8.4 compatibility. For those who want the JSON datatype
right now, I recommend using the module.

I plan to maintain and improve the JSON module (repository linked
above), albeit at a snail's pace. Patches are certainly welcome. See
roadmap.markdown in the repository for some minor and not-so-minor
design decisions.

Joey Adams

* P.S.

The "Unicode bug" involved UTF-16 surrogate pair calculation. JSON
encodes Unicode characters that aren't in the Basic Multilingual Plane
the same way UTF-16 does: using two 4-digit hex codes. The correct
formula for determining the Unicode codepoint of a surrogate pair is:

unicode = 0x10000 + (((uc & 0x3FF) << 10) | (lc & 0x3FF));

where:

0xD800 <= uc <= 0xDBFF
0xDC00 <= lc <= 0xDFFF

For example, consider the JSON string "\uD835\uDD0D":

uc = 0xD835
lc = 0xDD0D
unicode = 0x1D50D

This pair of hex indices collapses into one Unicode codepoint, 0x1D50D
(or "𝔍").

I originally used this:

/* WRONG */
unicode = 0x10000 | ((uc & 0x3FF) << 10) | (lc & 0x3FF);

The problem is, ((uc & 0x3FF) << 10) | (lc & 0x3FF) has a range of 0
to 0xFFFFF, and in some of those numbers, `0x10000 |` adds zero.

I am sharing all this because I have made this mistake twice, and I
believe it is an easy mistake to make.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Simon Riggs 2011-01-21 09:10:47 Re: SSI and Hot Standby
Previous Message Fujii Masao 2011-01-21 08:35:07 Re: pg_basebackup for streaming base backups