Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	Jeevan Chalke <jeevan(dot)chalke(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS
Date:	2011-06-09 19:14:24
Message-ID:	12914.1307646864@sss.pgh.pa.us
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Thu, Jun 9, 2011 at 2:58 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>>> Right. Understood. So let's look at the cases (from git grep
>>> pg_strcasecmp and pg_strncasecmp):

>> Also pg_toupper and pg_tolower. Right offhand, it looks like psql
>> *does* assume it can lower-case identifiers this way :-(

> Blarg. I dunno what to do about that - but surely we must have the
> encoding available somewhere?

Well, psql does at least, but changing all the callers to pass encoding
info would be a PITA.

The idea I'd been mulling was to keep a static variable in
pgstrcasecmp.c, which would control whether to allow the isupper/tolower
path to be tried. The static initializer value would be false, meaning
you get ASCII-only case folding by default. We'd also add a callable
function to set the flag, which could be called during startup by
whatever bit of logic is aware of the encoding we're using. psql,
pg_dump, and the backend could do that; offhand I suspect it doesn't
matter for anything else.
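
[Editor's sketch] The static-flag idea described above might look roughly like the following. This is only an illustration under stated assumptions, not the actual pgstrcasecmp.c: the names allow_locale_case_folding, sketch_set_locale_case_folding, sketch_tolower, and sketch_strcasecmp are hypothetical, and the high-bit test is written out inline rather than with any PostgreSQL macro.

```c
#include <ctype.h>
#include <stdbool.h>

/*
 * Hypothetical flag. The static initializer is false, so callers that never
 * set it get ASCII-only case folding.
 */
static bool allow_locale_case_folding = false;

/*
 * Hypothetical setter, meant to be called once at startup by whatever code
 * knows that the encoding in use is safe for the libc isupper/tolower path.
 */
void
sketch_set_locale_case_folding(bool allow)
{
    allow_locale_case_folding = allow;
}

/* Fold one byte to lower case; high-bit bytes are touched only if allowed. */
unsigned char
sketch_tolower(unsigned char ch)
{
    if (ch >= 'A' && ch <= 'Z')
        ch = (unsigned char) (ch + ('a' - 'A'));
    else if (allow_locale_case_folding && (ch & 0x80) != 0 && isupper(ch))
        ch = (unsigned char) tolower(ch);
    return ch;
}

/* Case-insensitive comparison built on the folding function above. */
int
sketch_strcasecmp(const char *s1, const char *s2)
{
    for (;;)
    {
        unsigned char ch1 = (unsigned char) *s1++;
        unsigned char ch2 = (unsigned char) *s2++;

        if (ch1 != ch2)
        {
            ch1 = sketch_tolower(ch1);
            ch2 = sketch_tolower(ch2);
            if (ch1 != ch2)
                return (int) ch1 - (int) ch2;
        }
        if (ch1 == 0)
            break;
    }
    return 0;
}
```

With the flag left at its default, a byte such as 0xC3 (the first byte of many two-byte UTF-8 sequences) passes through unchanged, so a multibyte character cannot be corrupted by byte-wise folding. A program that knows it is running in a single-byte encoding matching its libc locale would call sketch_set_locale_case_folding(true) during startup.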

However, this is all still dependent on the assumption that our idea of
the encoding matches the libc locale setting. I'm prepared to believe
that we have that locked down pretty well now in the backend, but I
don't think I believe it for either psql or pg_dump.
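
[Editor's sketch] One way a client program could test the assumption above, on POSIX systems only, is to compare the codeset reported by libc with the encoding it believes it is using. This is a hedged illustration, not PostgreSQL code: codeset_names_match and locale_codeset_matches are hypothetical names, nl_langinfo(CODESET) is not available on Windows, and the name comparison is a loose heuristic that does not handle aliases such as "ISO-8859-1" versus "LATIN1".

```c
#include <langinfo.h>
#include <stdbool.h>

/* ASCII-only helpers, so the comparison itself never depends on the locale. */
static unsigned char
ascii_lower(unsigned char ch)
{
    return (ch >= 'A' && ch <= 'Z') ? (unsigned char) (ch + ('a' - 'A')) : ch;
}

static bool
ascii_alnum(unsigned char ch)
{
    return (ch >= '0' && ch <= '9') ||
           (ch >= 'a' && ch <= 'z') ||
           (ch >= 'A' && ch <= 'Z');
}

/*
 * Compare two codeset names, ignoring ASCII case and any character that is
 * not an ASCII letter or digit, so that "UTF-8" matches "utf8".
 */
bool
codeset_names_match(const char *a, const char *b)
{
    for (;;)
    {
        while (*a != '\0' && !ascii_alnum((unsigned char) *a))
            a++;
        while (*b != '\0' && !ascii_alnum((unsigned char) *b))
            b++;
        if (*a == '\0' || *b == '\0')
            return *a == *b;    /* equal only if both names are exhausted */
        if (ascii_lower((unsigned char) *a) != ascii_lower((unsigned char) *b))
            return false;
        a++;
        b++;
    }
}

/*
 * Report whether the current LC_CTYPE locale's codeset matches the expected
 * encoding name. The caller must already have called setlocale().
 */
bool
locale_codeset_matches(const char *expected)
{
    return codeset_names_match(nl_langinfo(CODESET), expected);
}
```

A program could run such a check at startup and enable locale-aware case folding only when it succeeds, leaving ASCII-only folding in place otherwise.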

There's also the meta-problem that psql's locale might not match the
backend's, leading to wrong case folding of identifiers compared to what
the backend does, even if it's entirely correct for psql's locale. I'm
not sure that's a huge deal for psql, since most likely anybody who's
typing non-ASCII identifiers has taken the trouble to set the locale the
way she wants, and anyway the effects would only be seen in interactive
commands and so are easily recovered from. However, if pg_dump is
trying to do this, the possible downsides seem a lot worse.

regards, tom lane

In response to

Re: Invalid byte sequence for encoding "UTF8", caused due to non wide-char-aware downcase_truncate_identifier() function on WINDOWS at 2011-06-09 18:59:57 from Robert Haas
