Unicode string literals versus the world

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Unicode string literals versus the world
Date: 2009-04-10 21:54:25
Message-ID: 1927.1239400465@sss.pgh.pa.us
Lists: pgsql-hackers

So I started to look at what might be involved in teaching plpgsql about
standard_conforming_strings, and was soon dismayed by the sheer epic
nature of its failure to act like the core lexer. It was shaky enough
before, but the recent introduction of Unicode strings and identifiers
into the core has left plpgsql hopelessly behind.

I can see two basic approaches to making things work: copy-and-paste
practically all of parser/scan.l into plpgsql's lexer (certainly all of
it that involves exclusive states); or throw out plpgsql's lexer
altogether in favor of somehow using the core lexer directly. Neither
one looks very attractive.

It gets worse though: I have seldom seen such a badly designed piece of
syntax as the Unicode string syntax --- see
http://developer.postgresql.org/pgdocs/postgres/sql-syntax-lexical.html#SQL-SYNTAX-STRINGS-UESCAPE

You scan the string, and only after that does a trailing UESCAPE clause
tell you what the escape character is!? Not to mention the obvious
ambiguity with & as an operator: U&'...' versus the expression U & '...'.
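
To make the ordering problem concrete, here is a minimal Python sketch. It is not PostgreSQL's lexer: the helper names decode_unicode_literal and _code_point and the error handling are assumptions made for illustration. The escape forms it handles (escape plus four hex digits, escape plus "+" and six hex digits, a doubled escape for a literal escape character, backslash as the default) follow the documentation linked above. The point is that the same literal body decodes to different text depending on a clause that only shows up after the closing quote.

```python
import string


def _code_point(digits: str, width: int) -> str:
    """Turn exactly `width` hex digits into the character they name."""
    if len(digits) != width or any(c not in string.hexdigits for c in digits):
        raise ValueError(f"invalid Unicode escape digits: {digits!r}")
    return chr(int(digits, 16))


def decode_unicode_literal(body: str, escape: str = "\\") -> str:
    """Decode the text between the quotes of U&'...' (hypothetical helper).

    `escape` is the character named by a trailing UESCAPE clause; it is
    not known until the whole body has already been scanned.
    """
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != escape:
            # Ordinary character: copy it through.
            out.append(ch)
            i += 1
        elif body.startswith(escape, i + 1):
            # Doubled escape character stands for itself.
            out.append(escape)
            i += 2
        elif body[i + 1:i + 2] == "+":
            # Escape, '+', then six hex digits.
            out.append(_code_point(body[i + 2:i + 8], 6))
            i += 8
        else:
            # Escape followed by four hex digits.
            out.append(_code_point(body[i + 1:i + 5], 4))
            i += 5
    return "".join(out)


body = "d!0061t!+000061"
print(decode_unicode_literal(body, "!"))  # as if followed by UESCAPE '!'
print(decode_unicode_literal(body))       # no UESCAPE clause, default backslash
```

With UESCAPE '!' the body decodes to "data"; without the clause the very same characters are plain text and come back unchanged. A scanner therefore cannot settle what is an escape until it has looked past the end of the literal.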

If we let this go into 8.4, our previous rounds with security holes
caused by careless string parsing will look like a day at the beach.
No frontend that isn't fully cognizant of the Unicode string syntax is
going to parse such things correctly --- it's going to be trivial for
a bad guy to confuse a quoting mechanism as to what's an escape and what
isn't.

I think we need to give very serious consideration to ripping out that
"feature".

regards, tom lane
