Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc:	Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [COMMITTERS] pgsql: Don't downcase non-ascii identifier chars in multi-byte encoding
Date:	2013-06-09 02:52:54
Message-ID:	20130609025254.GA452642@tornado.leadboat.com
Lists:	pgsql-committers pgsql-hackers

On Sat, Jun 08, 2013 at 08:09:15PM -0400, Robert Haas wrote:
> On Sat, Jun 8, 2013 at 10:25 AM, Andrew Dunstan <andrew(at)dunslane(dot)net> wrote:
> > Don't downcase non-ascii identifier chars in multi-byte encodings.
> >
> > Long-standing code has called tolower() on identifier character bytes
> > with the high bit set. This is clearly an error and produces junk output
> > when the encoding is multi-byte. This patch therefore restricts this
> > activity to cases where there is a character with the high bit set AND
> > the encoding is single-byte.
> >
> > There have been numerous gripes about this, most recently from Martin
> > Schäfer.
> >
> > Backpatch to all live releases.
>
> I'm all for changing this, but back-patching seems like a terrible
> idea. This could easily break queries that are working now.

If more than one encoding covers the characters used in a given application,
that application's semantics should be the same regardless of which of those
encodings is in use. We certainly don't _guarantee_ that today; PostgreSQL
leaves much to libc, which may not implement the relevant locales compatibly.
However, this change bakes into PostgreSQL itself a departure from that
principle. If a database contains tables "ä" and "Ä", which of those "SELECT
* FROM Ä" finds will be encoding-dependent. If we're going to improve the
current (granted, worse) downcase_truncate_identifier() behavior, we should
not adopt another specification bearing such surprises.

Let's return to the drawing board on this one. I would be inclined to keep
the current bad behavior until we implement the i18n-aware case folding
required by SQL. If I'm alone in thinking that, perhaps switch to downcasing
only ASCII characters regardless of the encoding. That at least gives
consistent application behavior.
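
To make the contrast concrete, here is a minimal sketch of the two folding rules discussed above. It is not the PostgreSQL source: fold_per_commit(), fold_ascii_only() and latin1_tolower() are hypothetical names, and latin1_tolower() is an assumed stand-in for what a locale-aware tolower() would do to ISO-8859-1 bytes, used only so the example stays deterministic.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumption: stand-in for tolower() under an ISO-8859-1 locale.
 * Upper-case letters occupy 0xC0..0xDE, except 0xD7 (multiplication sign). */
static unsigned char latin1_tolower(unsigned char ch)
{
    if (ch >= 0xC0 && ch <= 0xDE && ch != 0xD7)
        return (unsigned char) (ch + 0x20);
    return ch;
}

/* Rule as described in the commit message: always fold ASCII letters;
 * fold high-bit bytes only when the encoding is single-byte. */
static void fold_per_commit(unsigned char *s, size_t len, bool single_byte)
{
    for (size_t i = 0; i < len; i++)
    {
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] = (unsigned char) (s[i] + ('a' - 'A'));
        else if (single_byte && (s[i] & 0x80))
            s[i] = latin1_tolower(s[i]);
    }
}

/* Alternative rule: fold ASCII letters only, whatever the encoding. */
static void fold_ascii_only(unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        if (s[i] >= 'A' && s[i] <= 'Z')
            s[i] = (unsigned char) (s[i] + ('a' - 'A'));
    }
}
```

Under the first rule, "Ä" as the single LATIN1 byte 0xC4 folds to 0xE4 ("ä"), while the same character as the UTF8 bytes 0xC3 0x84 is left untouched, so an unquoted Ä resolves to a different table depending on the encoding. Under the second rule, neither form is changed, and the lookup is the same in both encodings.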

I apologize for not noticing this week's thread in time to comment on it.

Thanks,
nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com
