BUG #5661: The character encoding in logfile is confusing.

From:	"Mikio" <tkbysh2000(at)yahoo(dot)co(dot)jp>
To:	pgsql-bugs(at)postgresql(dot)org
Subject:	BUG #5661: The character encoding in logfile is confusing.
Date:	2010-09-16 11:12:34
Message-ID:	201009161112.o8GBCYak052654@wwwmaster.postgresql.org
Lists:	pgsql-bugs pgsql-hackers

The following bug has been logged online:

Bug reference: 5661
Logged by: Mikio
Email address: tkbysh2000(at)yahoo(dot)co(dot)jp
PostgreSQL version: 9.0 RC1
Operating system: Windows XP SP3 Japanese
Description: The character encoding in logfile is confusing.
Details:

I'm using PostgreSQL 9.0 RC1 on Japanese Windows XP.
I found that the character encoding in the log files in the pg_log directory
is confusing.
The default character encoding of all of my databases is UTF-8, and almost all
message strings in the log files are correctly written in UTF-8.
But a few lines are written in EUC_JP.
So strings in two character encodings exist in one log file, and I can't read
those parts of the log messages.
Incidentally, client_encoding in postgresql.conf is commented out.

Thank you.

From:	Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To:	Mikio <tkbysh2000(at)yahoo(dot)co(dot)jp>
Cc:	pgsql-bugs(at)postgresql(dot)org
Subject:	Re: BUG #5661: The character encoding in logfile is confusing.
Date:	2010-09-17 02:53:45
Message-ID:	4C92D839.8020800@postnewspapers.com.au
Lists:	pgsql-bugs pgsql-hackers

On 09/16/2010 07:12 PM, Mikio wrote:
>
> The following bug has been logged online:
>
> Bug reference: 5661
> Logged by: Mikio
> Email address: tkbysh2000(at)yahoo(dot)co(dot)jp
> PostgreSQL version: 9.0 RC1
> Operating system: Windows XP SP3 Japanese
> Description: The character encoding in logfile is confusing.
> Details:
>
> I'm using PostgreSQL 9.0 RC1 on Japanese Windows XP.
> I found that the character encoding in the log files in the pg_log directory
> is confusing.
> The default character encoding of all of my databases is UTF-8, and almost all
> message strings in the log files are correctly written in UTF-8.
> But a few lines are written in EUC_JP.
> So strings in two character encodings exist in one log file, and I can't read
> those parts of the log messages.
> Incidentally, client_encoding in postgresql.conf is commented out.

Thank you for your report. This certainly sounds like a potential bug -
but to do anything about it, we will need to see the contents of the
actual log file in question and the contents of postgresql.conf.

Only partial log file contents should be necessary, showing the EUC_JP
encoded parts of the logs and say ten lines either side. If the EUC_JP
contents were generated by client code (say, RAISE NOTICE statements in
PL/PgSQL) then you will also need to supply the client code.

Please bundle all the files up in a zip file to protect them from
possible text encoding conversion during transfer, and post them to a
file hosting site. If you don't want them to be public, just collect the
logs up and wait for people to ask you to send them to them by private
email. Please send a copy to me, as I've dealt with encoding issues in
software (though not PostgreSQL) quite a bit.

--
Craig Ringer

From:	tkbysh2000(at)yahoo(dot)co(dot)jp
To:	Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc:	pgsql-bugs(at)postgresql(dot)org
Subject:	Re: BUG #5661: The character encoding in logfile is confusing.
Date:	2010-09-17 05:10:13
Message-ID:	20100917133155.BB5F.A495B709@yahoo.co.jp
Lists:	pgsql-bugs pgsql-hackers

Hi Craig,

Thank you very much for your quick response.
I'm happy to help improve RC1.

This is my first report to the PostgreSQL team, so I'm not sure which
file hosting site to use.
I'm attaching the log file and postgresql.conf to this email.
If this is not convenient for the team, could you tell me the URL of
an appropriate upload site? I'll upload the files there.
I don't mind if they are public.

BTW, I found a third character encoding in the file: Shift_JIS. The attached
file includes lines in all 3 character encodings.
For your reference:
Shift_JIS: The default encoding of Japanese Windows. I found this problem
on a PostgreSQL server running as a Windows service.
EUC_JP: A very common encoding on Japanese Unix. I guess that the
developer who worked on this was on some Unix or Linux.
UTF-8: A common encoding in Japan, especially in relation to Java. It is also
what I specified as the default encoding for all of my databases.

I didn't edit the log file, to avoid changing any data when a text editor
saves it. So the attached log file covers one service run from start to end.
But the log file is very small. The total size is 7 kB.
The client code is not attached, because the messages with the bad character
encoding are the startup and shutdown messages.
So you can find the problem easily. They are at the top and the end of the log
file.

Please let me know if you need additional information.

Regards.

--
<tkbysh2000(at)yahoo(dot)co(dot)jp>

Attachment	Content-Type	Size
postgresql-2010-09-16_000000.zip	application/x-zip-compressed	6.8 KB

From:	Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To:	tkbysh2000(at)yahoo(dot)co(dot)jp
Cc:	pgsql-bugs(at)postgresql(dot)org
Subject:	Re: BUG #5661: The character encoding in logfile is confusing.
Date:	2010-09-18 02:17:29
Message-ID:	4C942139.4090706@postnewspapers.com.au
Lists:	pgsql-bugs pgsql-hackers

On 09/17/2010 01:10 PM, tkbysh2000(at)yahoo(dot)co(dot)jp wrote:

> BTW, I found a third character encoding in the file: Shift_JIS. The attached
> file includes lines in all 3 character encodings.
> For your reference:
> Shift_JIS: The default encoding of Japanese Windows. I found this problem
> on a PostgreSQL server running as a Windows service.
> EUC_JP: A very common encoding on Japanese Unix. I guess that the
> developer who worked on this was on some Unix or Linux.
> UTF-8: A common encoding in Japan, especially in relation to Java. It is also
> what I specified as the default encoding for all of my databases.

Thanks for that.

> I didn't edit the log file, to avoid changing any data when a text editor
> saves it. So the attached log file covers one service run from start to end.
> But the log file is very small. The total size is 7 kB.

Good plan. Thanks.

> The client code is not attached, because the messages with the bad character
> encoding are the startup and shutdown messages.
> So you can find the problem easily. They are at the top and the end of the log
> file.

Yes, the mismatched encodings in the data are clear and obvious.

Given that the messages are coming purely from postgresql, not client
code, I'm now wondering if what we're dealing with is mismatched
encodings in the translation files, where some messages were translated
with a different encoding to other messages.

One of the correctly encoded messages is "Unexpected EOF received on
client connection"

One of the incorrectly encoded (shift-JIS) messages is: "Fast Shutdown
request received". Another is "Aborting any active transactions".

I can find the correctly encoded messages in
share/locale/ja/LC_MESSAGES/postgres-9.0.mo

The incorrectly encoded messages appear in the same file, but are
encoded in utf-8 in that file despite being output to the logs in
shift-JIS. For example, with the badly encoded data from the logs
extracted into the file 'x':

$ python
>>> x = open("x").read()
>>> x
'\x8d\x82\x91\xac\x83V\x83\x83\x83b\x83g\x83_\x83E\x83\x93\x97v\x8b\x81\x82\xf0\x8e\xf3\x82\xaf\x8e\xe6\x82\xe8\x82\xdc\x82\xb5\x82\xbd\r\n'
>>> print x.decode("shift-jis")
高速シャットダウン要求を受け取りました

$ grep '高速シャットダウン要求を受け取りました' *
Binary file postgres-9.0.mo matches
$

So - either something in the pipeline is "helpfully" converting your
error messages, or your locale files aren't the same as mine. I doubt
the latter; it seems almost impossible that just a few messages would be
converted to shift-JIS by accident in the Windows release only. So the
question now is where the messages are converted from UTF-8 to shift-JIS
and why that conversion is being applied inconsistently.
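
For anyone repeating the check above on Python 3, here is a minimal sketch. It is not part of the original session: the helper name `decodes_as` and the candidate list are illustrative assumptions, and the sample bytes are the ones shown in the session. Many byte strings are valid in more than one Japanese encoding, so a clean decode is only a hint, not proof.

```python
# Byte string copied from the session above (the mis-encoded log line).
SAMPLE = (b"\x8d\x82\x91\xac\x83V\x83\x83\x83b\x83g\x83_\x83E\x83\x93"
          b"\x97v\x8b\x81\x82\xf0\x8e\xf3\x82\xaf\x8e\xe6\x82\xe8"
          b"\x82\xdc\x82\xb5\x82\xbd\r\n")

# Candidate encodings named in this thread; the list itself is an assumption.
CANDIDATES = ("utf-8", "shift_jis", "euc_jp")


def decodes_as(raw, candidates=CANDIDATES):
    """Return the candidate encodings in which raw decodes without error."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


if __name__ == "__main__":
    print(decodes_as(SAMPLE))
    print(SAMPLE.decode("shift_jis").strip())
```

Running it over each line of a log file, read in binary mode, gives a rough map of which lines are not valid UTF-8.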

I'll try to have a look and see what I can find.

--
Craig Ringer

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc:	tkbysh2000(at)yahoo(dot)co(dot)jp, pgsql-bugs(at)postgresql(dot)org
Subject:	Re: BUG #5661: The character encoding in logfile is confusing.
Date:	2010-09-19 19:10:07
Message-ID:	7243.1284923407@sss.pgh.pa.us
Lists:	pgsql-bugs pgsql-hackers

Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> writes:
> Yes, the mismatched encodings in the data are clear and obvious.

> Given that the messages are coming purely from postgresql, not client
> code, I'm now wondering if what we're dealing with is mismatched
> encodings in the translation files, where some messages were translated
> with a different encoding to other messages.

The examples you give don't seem to support that idea. I don't read
Japanese, but at least these cases look like they are all UTF8 as
expected in the .po files.

> One of the correctly encoded messages is "Unexpected EOF received on
> client connection"

> One of the incorrectly encoded (shift-JIS) messages is: "Fast Shutdown
> request received". Another is "Aborting any active transactions".

> ... question now is where the messages are converted from UTF-8 to shift-JIS
> and why that conversion is being applied inconsistently.

Given those three examples, I wonder whether all the mis-encoded
messages are emitted by the postmaster, rather than backends.
Anyway it seems that you ought to look for some pattern in which
messages are correctly vs incorrectly encoded.

regards, tom lane

From:	Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	tkbysh2000(at)yahoo(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: BUG #5661: The character encoding in logfile is confusing.
Date:	2010-09-22 08:25:33
Message-ID:	4C99BD7D.1080409@postnewspapers.com.au
Lists:	pgsql-bugs pgsql-hackers

[moving to pgsql-hackers; this isn't the simple bug I initially
suspected it might be]

On 20/09/10 03:10, Tom Lane wrote:
> Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> writes:
>> One of the correctly encoded messages is "Unexpected EOF received on
>> client connection"
>
>> One of the incorrectly encoded (shift-JIS) messages is: "Fast Shutdown
>> request received". Another is "Aborting any active transactions".
>
>> ... question now is where the messages are converted from UTF-8 to shift-JIS
>> and why that conversion is being applied inconsistently.
>
> Given those three examples, I wonder whether all the mis-encoded
> messages are emitted by the postmaster, rather than backends.
> Anyway it seems that you ought to look for some pattern in which
> messages are correctly vs incorrectly encoded.

I think you're right. Looking into it more, though, I'm not even sure
what the correct behaviour even is. I don't think this is a simple bug
where Pg fails to convert between encodings in a few places; rather,
it's a design oversight where the effect of having a system encoding
different from the encoding of the database(s) isn't considered.

A single log file should obviously be in a single encoding, it's the
only sane way to do things. But which encoding is it in? And which
*should* it be in?

- The system text encoding? This is what the postmaster will have from
  its environment, and is what the user will expect the logs to be in.
  Postmaster will emit messages in this encoding at least during
  startup, as it doesn't know what encoding the cluster uses yet.
  (In fact it seems to stick to the system encoding throughout its
  life).

- The default database encoding supplied to initdb during cluster
  creation?

- The encoding of the database emitting a message? This makes sense
  when considering RAISE messages, for example. Backends will currently
  use this encoding when emitting log messages, whether user-supplied
  or translated from po files.

This confusion leads to the mixed encoding issues reported by the OP.
It's not a simple bug, it's a design issue.
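
A small sketch of how two writers with different encodings produce a file that no single decoder can read. It only simulates the situation described above: the message text comes from this thread, while `cp932` as the system encoding and the split into one "postmaster" line and one "backend" line are simplifying assumptions.

```python
# Illustrative only: which process writes which line is a simplification.
SHUTDOWN_MSG = "高速シャットダウン要求を受け取りました"

SYSTEM_ENCODING = "cp932"    # assumed: Japanese Windows system code page
DATABASE_ENCODING = "utf-8"  # the reporter's database encoding


def build_mixed_log():
    """Concatenate one 'postmaster' line and one 'backend' line as raw bytes."""
    postmaster_line = SHUTDOWN_MSG.encode(SYSTEM_ENCODING) + b"\n"
    backend_line = SHUTDOWN_MSG.encode(DATABASE_ENCODING) + b"\n"
    return postmaster_line + backend_line


def readable_lines(log_bytes, encoding):
    """Count the lines that decode cleanly in the given encoding."""
    count = 0
    for line in log_bytes.splitlines():
        try:
            line.decode(encoding)
            count += 1
        except UnicodeDecodeError:
            pass
    return count


if __name__ == "__main__":
    log = build_mixed_log()
    print(readable_lines(log, "utf-8"), "of 2 lines are valid UTF-8")
```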

Unfortunately, it's not as simple as picking one of the above encodings
for all logging.

The system encoding isn't a good choice, because it might not be capable
of representing all characters emitted by user RAISE statements in
databases with a different encoding, nor all "double quoted"
identifiers, parameter values, etc etc etc. For example, if the system
encoding is SHIFT-JIS, but user databases emit messages with Chinese,
Cyrillic, extended latin, or pretty much any non-Japanese characters,
there's no sane way to convert messages containing any user text to
shift-JIS for logging. The same applies with a latin-1 (iso-8859-1)
system encoding and a utf-8 or shift-jis database emitting Japanese
messages. Scratch using the system encoding for logging.
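
To make the latin-1 case concrete, here is a hedged sketch of what happens when a message containing Japanese text is forced into a narrower system encoding. The function name and the sample message are made up for illustration; the "?" substitution is Python's own `errors="replace"` behaviour, not anything PostgreSQL does.

```python
def to_system_encoding(message, system_encoding):
    """Encode a log message, replacing unrepresentable characters with '?'."""
    return message.encode(system_encoding, errors="replace")


# Hypothetical message with a double-quoted identifier in Japanese.
MESSAGE = 'relation "広告掲載" does not exist'

if __name__ == "__main__":
    try:
        MESSAGE.encode("latin-1")  # strict encoding fails outright
    except UnicodeEncodeError as exc:
        print("strict encode failed:", exc.reason)
    # The lossy version succeeds but the identifier is gone.
    print(to_system_encoding(MESSAGE, "latin-1"))
```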

What about the encoding used by initdb to create the cluster? It seems
sensible, but:
- The postmaster doesn't know what it is when it's doing its initial
  startup. How can the postmaster complain that it can't find / open
  the cluster datadir when it doesn't know what encoding to use for the
  complaint?
- If the cluster isn't created as utf-8, the same issue as with the
  system encoding applies.

Using the encoding of the emitting database will permit all messages to
be represented, but will give rise to mixed encodings in the log file,
and still won't help the postmaster know what to do before it's found
and read the cluster.

I'm now inclined to propose that all logging be done unconditionally in
utf-8, with a BOM written to the start of every log file. Backends with
non-utf-8 databases should convert messages to utf-8 for logging.
Because PostgreSQL supports the use of different encodings in different
databases this is the only way to ensure sane logging with consistent
encoding in a single log file.
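
A rough sketch of the proposal, assuming each message arrives tagged with the encoding of the process that produced it. `write_log` and the `(encoding, bytes)` record shape are hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec are standard Python.

```python
import codecs
import io


def write_log(stream, records):
    """Write (encoding, raw_bytes) records to a binary stream as UTF-8 with a BOM."""
    stream.write(codecs.BOM_UTF8)  # BOM once, at the start of the file
    for source_encoding, raw in records:
        text = raw.decode(source_encoding)
        stream.write(text.encode("utf-8") + b"\n")


if __name__ == "__main__":
    buf = io.BytesIO()
    write_log(buf, [
        ("cp932", "高速".encode("cp932")),  # e.g. from a non-UTF-8 source
        ("utf-8", "要求".encode("utf-8")),  # already UTF-8
    ])
    # "utf-8-sig" strips the BOM again when reading the file back.
    print(buf.getvalue().decode("utf-8-sig"))
```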

The only alternative I see is to break logging out into separate files:
- postmaster.log for postmaster etc
- [databasename].log for each database, in that database's encoding
... but I'm not confident that'd be worth the confusion.

Neither scheme solves the question of what to do when logging to syslog,
though. Syslog expects messages in the system encoding, and Pg would be
wrong to log in any other encoding. Yet as databases may have characters
that cannot be represented in the system encoding, the system encoding
isn't good enough. Should syslog messages be converted to the system
encoding with non-representable characters replaced by "?" or some other
placeholder? Blech.

Ideas? Suggestions?

--
Craig Ringer

Tech-related writing: http://soapyfrogs.blogspot.com/

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, tkbysh2000(at)yahoo(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: BUG #5661: The character encoding in logfile is confusing.
Date:	2010-09-22 09:45:22
Message-ID:	1285148722.15691.19.camel@vanquo.pezone.net
Lists:	pgsql-bugs pgsql-hackers

On ons, 2010-09-22 at 16:25 +0800, Craig Ringer wrote:
> A single log file should obviously be in a single encoding, it's the
> only sane way to do things. But which encoding is it in? And which
> *should* it be in?

We need to produce the log output in the server encoding, because that's
how we need to send it to the client. If you have different databases
with different server encodings, you will get inconsistently encoded
output in the log file.

Conceivably, we could create a configuration option that specifies the
encoding for the log file, and strings are recoded from whatever gettext()
produces to the specified encoding. initdb could initialize that option
suitably, so in most cases users won't have to do anything.

From:	Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, tkbysh2000(at)yahoo(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: BUG #5661: The character encoding in logfile is confusing.
Date:	2010-09-22 11:25:47
Message-ID:	4C99E7BB.40402@postnewspapers.com.au
Lists:	pgsql-bugs pgsql-hackers

On 22/09/2010 5:45 PM, Peter Eisentraut wrote:
> On ons, 2010-09-22 at 16:25 +0800, Craig Ringer wrote:
>> A single log file should obviously be in a single encoding, it's the
>> only sane way to do things. But which encoding is it in? And which
>> *should* it be in?
>
> We need to produce the log output in the server encoding, because that's
> how we need to send it to the client.

That doesn't mean it can't be recoded for writing to the log file,
though. Perhaps it needs to be. It should be reasonably practical to
detect when the database and log encoding are the same and avoid the
transcoding performance penalty, not that it's big anyway.
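
A sketch of the "skip the transcode when the encodings already match" idea. The helper names are hypothetical; the comparison relies on `codecs.lookup()`, which resolves aliases such as "UTF8" and "utf-8" to one canonical codec name.

```python
import codecs


def same_encoding(a, b):
    """True when two encoding names resolve to the same Python codec."""
    return codecs.lookup(a).name == codecs.lookup(b).name


def recode_for_log(raw, source_encoding, log_encoding):
    """Return raw re-encoded for the log, skipping the work when encodings match."""
    if same_encoding(source_encoding, log_encoding):
        return raw  # no decode/encode round trip needed
    return raw.decode(source_encoding).encode(log_encoding)


if __name__ == "__main__":
    utf8_msg = "要求".encode("utf-8")
    print(recode_for_log(utf8_msg, "UTF8", "utf-8") is utf8_msg)
    print(recode_for_log("要求".encode("euc_jp"), "euc_jp", "utf-8") == utf8_msg)
```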

> If you have different databases
> with different server encodings, you will get inconsistently encoded
> output in the log file.

I don't think that's an OK answer, myself. Mixed encodings with no
delineation in one file = bug as far as I'm concerned. You can't even
rely on being able to search the log anymore. You'll only get away with
it when using languages that mostly stick to the 7-bit ASCII subset, so
most text is still readable; with most other languages you'll get logs
full of what looks to the user like garbage.

> Conceivably, we could create a configuration option that specifies the
> encoding for the log file, and strings are recoded from whatever gettext()
> produces to the specified encoding. initdb could initialize that option
> suitably, so in most cases users won't have to do anything.

Yep, I tend to think that'd be the right way to go. It'd still be a bit
of a pain, though, as messages written to stdout/stderr by the
postmaster should be in the system encoding, but messages written to the
log files should be in the encoding specified for logs, unless logging
is being done to syslog, in which case it has to be in the system
encoding after all...

And, of course, the postmaster still doesn't know how to log anything it
might emit before reading postgresql.conf, because it doesn't know what
encoding to use.

I still wonder if, rather than making this configurable, the right
choice is to force logging to UTF-8 (with BOM) across the board, right
from postmaster startup. It's consistent, all messages in all other
encodings can be converted to UTF-8 for logging, it's platform
independent, and text editors etc tend to understand and recognise UTF-8
especially with the BOM.

Unfortunately, because many unix utilities (grep etc) aren't encoding
aware, that'll cause problems when people go to search log files. For
(eg) "広告掲載" the log files will contain the utf-8 bytes:

\xe5\xba\x83\xe5\x91\x8a\xe6\x8e\xb2\xe8\xbc\x89

but grep on a shift-jis system will be looking for:

\x8d\x4c\x8d\x90\x8cf\x8d\xda

so it won't match.
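
The byte sequences above can be reproduced with a few lines of Python. This is only a sketch of a byte-wise search, standing in for a grep that is not encoding-aware; the "LOG:" prefix on the sample line is an assumption.

```python
TERM = "広告掲載"

utf8_bytes = TERM.encode("utf-8")      # what a UTF-8 log file would contain
sjis_bytes = TERM.encode("shift_jis")  # what a Shift-JIS terminal would send

# One hypothetical log line, stored as UTF-8.
log_line = b"LOG:  " + utf8_bytes + b"\n"

if __name__ == "__main__":
    print(utf8_bytes)
    print(sjis_bytes)
    # Byte-wise containment, like a grep that knows nothing about encodings.
    print("shift-jis needle found:", sjis_bytes in log_line)
    print("utf-8 needle found:", utf8_bytes in log_line)
```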

Ugh. If only we could say "PostgreSQL requires a system locale with a
UTF-8 encoding". Alas, I don't think that'd go down very well with
packagers or installers. [Insert rant about how stupid it is that *nix
systems still aren't all UTF-8 here].

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/

From:	Dave Page <dpage(at)pgadmin(dot)org>
To:	Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, tkbysh2000(at)yahoo(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: BUG #5661: The character encoding in logfile is confusing.
Date:	2010-09-22 11:30:06
Message-ID:	AANLkTinX6kKqwXZ1G3JwLkZBZQriNYrKQ=f5T_FQ6vGA@mail.gmail.com
Lists:	pgsql-bugs pgsql-hackers

On Wed, Sep 22, 2010 at 12:25 PM, Craig Ringer
<craig(at)postnewspapers(dot)com(dot)au> wrote:
> I don't think that's an OK answer, myself. Mixed encodings with no
> delineation in one file = bug as far as I'm concerned. You can't even rely
> on being able to search the log anymore. You'll only get away with it when
> using languages that mostly stick to the 7-bit ASCII subset, so most text is
> still readable; with most other languages you'll get logs full of what looks
> to the user like garbage.

This issue crops up periodically in the pgAdmin lists as well, as the
mixed encodings sometimes break the log viewer.

> I still wonder if, rather than making this configurable, the right choice is
> to force logging to UTF-8 (with BOM) across the board, right from postmaster
> startup. It's consistent, all messages in all other encodings can be
> converted to UTF-8 for logging, it's platform independent, and text editors
> etc tend to understand and recognise UTF-8 especially with the BOM.

That would be ideal for us.

> Unfortunately, because many unix utilities (grep etc) aren't encoding aware,
> that'll cause problems when people go to search log files. For (eg) "広告掲載"

But not for others!

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise Postgres Company

From:	Peter Eisentraut <peter_e(at)gmx(dot)net>
To:	Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, tkbysh2000(at)yahoo(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: BUG #5661: The character encoding in logfile is confusing.
Date:	2010-09-22 12:43:02
Message-ID:	1285159382.15691.44.camel@vanquo.pezone.net
Lists:	pgsql-bugs pgsql-hackers

On ons, 2010-09-22 at 19:25 +0800, Craig Ringer wrote:
> Yep, I tend to think that'd be the right way to go. It'd still be a bit
> of a pain, though, as messages written to stdout/stderr by the
> postmaster should be in the system encoding, but messages written to the
> log files should be in the encoding specified for logs, unless logging
> is being done to syslog, in which case it has to be in the system
> encoding after all...

I think that should not be a problem to implement. Those two go through
different routines anyway.

> And, of course, the postmaster still doesn't know how to log anything it
> might emit before reading postgresql.conf, because it doesn't know what
> encoding to use.

That should also not be a big issue. The postmaster needs the
configuration file to know where to write the log file anyway.

> I still wonder if, rather than making this configurable, the right
> choice is to force logging to UTF-8 (with BOM) across the board, right
> from postmaster startup. It's consistent, all messages in all other
> encodings can be converted to UTF-8 for logging, it's platform
> independent, and text editors etc tend to understand and recognise UTF-8
> especially with the BOM.

I don't think this would make things better or easier. At some point
you're going to have to insert a recode call, and it doesn't matter much
whether the destination argument is a constant or a variable.

From:	tkbysh2000(at)yahoo(dot)co(dot)jp
To:	Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: BUG #5661: The character encoding in logfile is confusing.
Date:	2010-09-22 13:30:40
Message-ID:	20100922212552.93B2.A495B709@yahoo.co.jp
Lists:	pgsql-bugs pgsql-hackers

Hi Craig,

Almost all Japanese software writes log files in the encoding of the server
the software is running on. I'm not sure whether that is the best way, but
Japanese users take it for granted.
So I feel that Japanese users would hope that the PostgreSQL server behaves
the same way as other software, because many administrators in Japan are
familiar with and experienced in that approach.

On Unix, the user can specify the default character encoding at installation.
Software can get it by reading the environment variable $LANG, e.g.
> % echo $LANG
> ja_JP.eucJP
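
As a sketch of reading that value from a program: the helper below pulls the codeset out of a `$LANG`-style string. The function name is made up, and the `language[_territory][.codeset][@modifier]` layout is the usual POSIX convention, assumed here rather than taken from this thread.

```python
import codecs
import os


def encoding_from_lang(lang):
    """Extract the codeset part of a locale string such as 'ja_JP.eucJP'."""
    # Assumed layout: language[_territory][.codeset][@modifier]
    without_modifier = lang.split("@", 1)[0]
    if "." not in without_modifier:
        return None  # e.g. "C" or "ja_JP" carry no codeset
    return without_modifier.split(".", 1)[1]


if __name__ == "__main__":
    codeset = encoding_from_lang(os.environ.get("LANG", "ja_JP.eucJP"))
    print(codeset)
    if codeset is not None:
        try:
            # Map the locale codeset to Python's canonical codec name.
            print(codecs.lookup(codeset).name)
        except LookupError:
            print("no Python codec for this codeset")
```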

On Japanese Windows, the default encoding is MS-932 (also called cp-932 or
Windows-31J).
This is fixed.
MS-932 is almost the same as Shift-JIS, but a very few characters have
different character codes in MS-932 and Shift-JIS. And Shift-JIS lacks
some characters that MS-932 has.
This is a very important issue.
It has caused a lot of related bugs, e.g. the one below:
http://bugs.mysql.com/bug.php?id=7607
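
Both kinds of difference can be seen with Python's codecs, used here as stand-ins: `cp932` for MS-932 and `shift_jis` for plain Shift-JIS. The two characters are examples chosen for this sketch, not ones named in the thread.

```python
# 0x81 0x60 is valid in both codecs, but maps to different Unicode characters.
WAVE = b"\x81\x60"


def encodable(ch, encoding):
    """True when ch can be encoded in the given encoding."""
    try:
        ch.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # Same bytes, different code points (WAVE DASH vs FULLWIDTH TILDE).
    print(ascii(WAVE.decode("shift_jis")), ascii(WAVE.decode("cp932")))
    # U+2460 (circled digit one) exists in cp932 only.
    print(encodable("\u2460", "cp932"), encodable("\u2460", "shift_jis"))
```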

And if PostgreSQL could be configured to write the log file with raw English
messages, some users would choose that if translating the messages to
Japanese has some cost. Some administrators in Japan don't mind reading
English messages. (Much software is not friendly to non-English
users. Many Japanese users are surprised and impressed that PostgreSQL
writes Japanese messages in the log file.)

Thank you.

=Mikio

--
<tkbysh2000(at)yahoo(dot)co(dot)jp>

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, tkbysh2000(at)yahoo(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: BUG #5661: The character encoding in logfile is confusing.
Date:	2010-09-22 13:41:42
Message-ID:	23765.1285162902@sss.pgh.pa.us
Lists:	pgsql-bugs pgsql-hackers

Craig Ringer <craig(at)postnewspapers(dot)com(dot)au> writes:
> On 22/09/2010 5:45 PM, Peter Eisentraut wrote:
>> We need to produce the log output in the server encoding, because that's
>> how we need to send it to the client.

> That doesn't mean it can't be recoded for writing to the log file,
> though. Perhaps it needs to be. It should be reasonably practical to
> detect when the database and log encoding are the same and avoid the
> transcoding performance penalty, not that it's big anyway.

We have seen ... and rejected ... such proposals before. The problem is
that "transcode to some other encoding" is not a simple and guaranteed
error-free operation. As an example, if you choose to name some table
using a character that doesn't exist in the log encoding, you have just
ensured that no message about that table will ever get to the log.
Nice way to hide your activities from the DBA ;-) Transcoding also
eats memory, which might be in exceedingly short supply while trying
to report an "out of memory" error; and IIRC there are some other
failure scenarios to be concerned about.

We could maybe accept a design for this that included a sufficiently
well-thought-out set of fallback behaviors. But we haven't seen one
yet.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc:	Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, tkbysh2000(at)yahoo(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: BUG #5661: The character encoding in logfile is confusing.
Date:	2010-09-22 13:55:54
Message-ID:	24027.1285163754@sss.pgh.pa.us
Lists:	pgsql-bugs pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> On ons, 2010-09-22 at 19:25 +0800, Craig Ringer wrote:
>> I still wonder if, rather than making this configurable, the right
>> choice is to force logging to UTF-8 (with BOM) across the board,

> I don't think this would make things better or easier. At some point
> you're going to have to insert a recode call, and it doesn't matter much
> whether the destination argument is a constant or a variable.

It'd avoid the problem of having possibly-unconvertible messages ...
at the cost of pissing off users who have a uniform server encoding
selection already and don't see why they should be forced to deal with
UTF8 in the log.

It's pretty much just one step from here to deciding that the server
should work exclusively in UTF8 and never mind all those other legacy
encodings. We've resisted that attitude for quite some years now,
and are probably not really ready to adopt it for the log either.

regards, tom lane

From:	Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, tkbysh2000(at)yahoo(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: BUG #5661: The character encoding in logfile is confusing.
Date:	2010-09-25 03:33:03
Message-ID:	4C9D6D6F.4050806@postnewspapers.com.au
Lists:	pgsql-bugs pgsql-hackers

On 09/22/2010 09:55 PM, Tom Lane wrote:
> Peter Eisentraut<peter_e(at)gmx(dot)net> writes:
>> On ons, 2010-09-22 at 19:25 +0800, Craig Ringer wrote:
>>> I still wonder if, rather than making this configurable, the right
>>> choice is to force logging to UTF-8 (with BOM) across the board,
>
>> I don't think this would make things better or easier. At some point
>> you're going to have to insert a recode call, and it doesn't matter much
>> whether the destination argument is a constant or a variable.
>
> It'd avoid the problem of having possibly-unconvertible messages ...
> at the cost of pissing off users who have a uniform server encoding
> selection already and don't see why they should be forced to deal with
> UTF8 in the log.
>
> It's pretty much just one step from here to deciding that the server
> should work exclusively in UTF8 and never mind all those other legacy
> encodings. We've resisted that attitude for quite some years now,
> and are probably not really ready to adopt it for the log either.

Fair enough. The current approach is broken, though. Mis-encoded
messages the user can't read are little more use to them than messages
that are never logged.

I see four options here, plus the status quo (two of which are practical IMO):

(1) Log in UTF-8, convert everything to UTF-8. Better for admin tools &
apps, sucks for OS utilities/grep/etc on non-utf-8 locales. Preserves
all messages no matter what the database and system encodings are.

(2) Log in default encoding for locale, convert all messages to that
encoding. Where characters cannot be represented in the target encoding
replace them with a placeholder (? or something). Better - but far from
good - for OS utilities/grep/etc, sucks for admin tools and apps.
Doesn't preserve all messages properly if user has databases in
encodings other than the system encoding.

(3) Have a log for the postmaster in the default locale for the system.
Have a log file for each database that's in the encoding for that
database. IMO this is the worst of both worlds, but it does preserve
original encodings without transcoding or forcing a particular encoding
and does preserve messages. Horribly complicated for admin tools,
inconsistent and horrid for grep etc.

(4) Keep things much as they are, but log an encoding identifier prefix
for each line. Lets GUI/admin tools post-process the logs into something
sane, permits automated log processing because line encodings are known.
Sucks for shell tools, which can't tell which lines are which; we'd need
to provide a "pggrep" and "pgless" for reliable log search! Preserves
all messages, but not in a reliably searchable manner.

(0) Change nothing. Log all messages in the original encoding they were
generated in. Perform no conversion. Logs contain mixed encodings.
Horrible for admin/gui tools (broken text). Horrible for shell
utilities/OS tools (can't trust grep results etc). Automatic log
processing impossible as the encoding for each line isn't known and
can't be reliably discovered.

As far as I'm concerned, (3) is out. It's horrible. I don't think the
status quo (0) is OK either, it's producing broken log files. (4) is
pretty awful too, but it's the smallest change that kind-of fixes the
issue to the point where it's at least possible for PgAdmin etc to
convert the logs into a consistent encoding.
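
To make option (4) a little more concrete, here is a minimal Python sketch of how a log viewer might post-process tagged lines into one consistent encoding. The line format (`<ENCODING>`, a tab, then the message bytes) is purely an assumption for illustration, as are the names `CODECS` and `decode_tagged_line`; nothing like this exists in PostgreSQL.

```python
# Map PostgreSQL encoding names to Python codec names (a few, for illustration).
CODECS = {"UTF8": "utf-8", "EUC_JP": "euc_jp", "SJIS": "shift_jis"}


def decode_tagged_line(raw: bytes) -> str:
    """Decode one line of the hypothetical form b'<ENCODING>\\t<message>'."""
    tag, sep, payload = raw.partition(b"\t")
    codec = CODECS.get(tag.decode("ascii", errors="replace")) if sep else None
    if codec is None:
        # No tag or unknown tag: keep the line, with odd bytes shown as escapes.
        return raw.decode("ascii", errors="backslashreplace")
    return payload.decode(codec, errors="replace")


# Two made-up lines carrying the same text in different encodings.
raw_log = [
    b"UTF8\t" + "エラー".encode("utf-8"),
    b"EUC_JP\t" + "エラー".encode("euc_jp"),
]

decoded = [decode_tagged_line(line) for line in raw_log]
print(decoded)
```

Once every line is a Unicode string the viewer can search or re-encode it uniformly, which is the part shell tools cannot do on their own.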

IMO it's down to (1) and (2). There's no clear consensus between those
two, so I'd be inclined to offer the admin the choice between them as a
config option, depending on the trade-off they prefer to make.

For sensible systems in a utf-8 locale (1) and (2) are equivalent, and
(2) is fine for systems where the database encoding is always the same
as the system encoding. It's only for systems with a non-utf-8 locale
that use databases in encodings other than the system locale's encoding
that problems arise. In this case they're going to get suboptimal
results one way or the other, it's just a matter of letting them pick how.

Thoughts?

I should ask on the various language-specific mailing lists and see what
people there have to say about it. Maybe it doesn't affect people enough
in practice for them to care.

--
Craig Ringer

From:	Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Peter Eisentraut <peter_e(at)gmx(dot)net>, tkbysh2000(at)yahoo(dot)co(dot)jp, pgsql-hackers(at)postgreSQL(dot)org
Subject:	Re: BUG #5661: The character encoding in logfile is confusing.
Date:	2010-09-25 06:48:34
Message-ID:	4C9D9B42.5060007@postnewspapers.com.au
Lists:	pgsql-bugs pgsql-hackers

On 22/09/2010 9:41 PM, Tom Lane wrote:
> Craig Ringer<craig(at)postnewspapers(dot)com(dot)au> writes:
>> On 22/09/2010 5:45 PM, Peter Eisentraut wrote:
>>> We need to produce the log output in the server encoding, because that's
>>> how we need to send it to the client.
>
>> That doesn't mean it can't be recoded for writing to the log file,
>> though. Perhaps it needs to be. It should be reasonably practical to
>> detect when the database and log encoding are the same and avoid the
>> transcoding performance penalty, not that it's big anyway.
>
> We have seen ... and rejected ... such proposals before. The problem is
> that "transcode to some other encoding" is not a simple and guaranteed
> error-free operation. As an example, if you choose to name some table
> using a character that doesn't exist in the log encoding, you have just
> ensured that no message about that table will ever get to the log.

Well, an arguably reasonable if still suboptimal approach is to mask out
characters without any representation in the target encoding, replacing
them with a substitute ("?" or whatever). The rest of the log message is
still emitted that way.
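
A minimal Python sketch of that masking approach, assuming a Latin-1 log encoding and a made-up error message (`to_log_encoding` is a hypothetical name). Python's codec error handlers stand in here for whatever the server's conversion routines would do.

```python
def to_log_encoding(message: str, log_encoding: str) -> bytes:
    """Encode for the log, substituting '?' for unrepresentable characters."""
    return message.encode(log_encoding, errors="replace")


# Made-up message naming a table whose name has no Latin-1 representation.
message = 'ERROR:  relation "顧客" does not exist'

# Strict transcoding fails outright, so the whole message would be lost.
try:
    message.encode("latin-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With substitution, the message still reaches the log, minus the name.
masked = to_log_encoding(message, "latin-1")

print(strict_ok)
print(masked)
```

The table name is lost, but the rest of the record survives, which is the trade-off being argued for.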

Currently, Pg may as well be emitting "!@#!#!#!@#$!@#$" for these log
records. It's garbage unless the user's editor/log viewer/whatever
happens to use the encoding of that set of messages, turning all the
others into garbage instead. To interpret them, I had to

It's not a big deal with languages that mostly use the 7-bit ASCII range
that most encodings share, but for Russian, Chinese, Japanese, Thai, the
various Indian languages, etc. it's pretty awful, as seen in
Mikio's example log files.

> Nice way to hide your activities from the DBA ;-)

Emitting messages in the wrong encoding doesn't do the DBA any favours
either. Automated log analysis and reporting will have a hard time
dealing with the logs, and the DBA will have to keep on switching
encodings in their editor/viewer to interpret or search the logs.
Assuming they know how, and know they need to.

> Transcoding also
> eats memory, which might be in exceedingly short supply while trying
> to report an "out of memory" error; and IIRC there are some other
> failure scenarios to be concerned about.

Yep, that's certainly a problem. Pre-transcoding them on backend start
isn't particularly desirable (wasted startup time, memory) and neither
is pre-allocating extra memory for use on fatal exit paths.

OTOH, don't the current message translations cost at least some
memory, too?

I don't have a good answer for this issue. Only rather less-than-good
ideas like: mmap() a file the postmaster generates that contains various
fatal messages, already in the right encodings/translations, with an
offset table at the front? Icky, but effective and doesn't waste
precious shared memory or produce new unsharable allocations in the
backends that'll only ever get used when something breaks.
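
A rough Python sketch of that file layout, just to show the shape of the idea. Everything here is an assumption: the header and offset-table format, the names `write_message_file` and `lookup`, and the sample messages. A real backend would do this in C and would have to get the details right for error paths.

```python
import mmap
import os
import struct
import tempfile

HEADER = struct.Struct("<I")   # number of messages
ENTRY = struct.Struct("<II")   # (offset, length) of one message


def write_message_file(path: str, messages: list[bytes]) -> None:
    """Write the count, the offset table, then the pre-encoded messages."""
    offset = HEADER.size + ENTRY.size * len(messages)
    table = b""
    for msg in messages:
        table += ENTRY.pack(offset, len(msg))
        offset += len(msg)
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(messages)) + table + b"".join(messages))


def lookup(mapped: mmap.mmap, index: int) -> bytes:
    """Fetch message `index` via the offset table, with no transcoding."""
    (count,) = HEADER.unpack_from(mapped, 0)
    if not 0 <= index < count:
        raise IndexError(index)
    offset, length = ENTRY.unpack_from(mapped, HEADER.size + ENTRY.size * index)
    return mapped[offset:offset + length]


# Made-up fatal messages, already encoded for the log.
messages = ["FATAL:  out of memory".encode("utf-8"),
            "FATAL:  メモリ不足".encode("utf-8")]

fd, path = tempfile.mkstemp()
os.close(fd)
write_message_file(path, messages)

with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first = lookup(mapped, 0)
    second = lookup(mapped, 1)
    mapped.close()
os.remove(path)

print(first.decode("utf-8"))
```

The file is written once and only read afterwards, so the conversion cost is paid up front rather than on the failure path.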

--
Craig Ringer

Tech-related writing at http://soapyfrogs.blogspot.com/