Re: Solution of the file name problem of copy on windows.

Lists: pgsql-hackers
From: "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Solution of the file name problem of copy on windows.
Date: 2009-04-07 16:40:49
Message-ID: C96B1306C7B243E9BFDFE88F8DB405C6@HIRO57887DE653

Hi Tom-san.

I want to solve one problem before the release of 8.4.
However, since it could also be considered a new feature,
if it is too late for 8.4, you may suggest deferring it to 8.5.

In Japan, local file names on a Windows server are handled in SJIS.
For example, with current Postgres configured as follows:

server_encoding = UTF-8
client_encoding = SJIS

In this case, the file name given to COPY is in UTF-8 on the server,
which is troublesome to handle. :-(
So I have made this proposed patch.

Regression test result:

=======================
All 120 tests passed.
=======================

With a UTF-8 database:

HIROSHI=# \l
                         List of databases
   Name    |  Owner  | Encoding | Collation | Ctype | Access privileges
-----------+---------+----------+-----------+-------+-------------------
 HIROSHI   | HIROSHI | UTF8     | C         | C     |
 eucdb     | HIROSHI | EUC_JP   | C         | C     |

HIROSHI=# create table 日本語てすと (きー text);
CREATE TABLE
HIROSHI=# insert into 日本語てすと values('わーい');
INSERT 0 1
HIROSHI=# copy 日本語てすと to 'C:/tmp/日本語UTF8.txt';
COPY 1
HIROSHI=# delete from 日本語てすと;
DELETE 1
HIROSHI=# copy 日本語てすと from 'C:/tmp/日本語UTF8.txt';
COPY 1
HIROSHI=# select * from 日本語てすと;
きー
--------
わーい
(1 row)

With an EUC_JP database:

HIROSHI=# \c eucdb
psql (8.4devel)
You are now connected to database "eucdb".
eucdb=# \d
           List of relations
 Schema |     Name     | Type  |  Owner
--------+--------------+-------+---------
 public | 日本語てすと | table | HIROSHI
(1 row)

eucdb=# select * from 日本語てすと;
きー
--------
わーい
(1 row)

eucdb=# copy 日本語てすと to 'C:/tmp/日本語eucdb.txt';
COPY 1
eucdb=# delete from 日本語てすと;
DELETE 1
eucdb=# copy 日本語てすと from 'C:/tmp/日本語eucdb.txt';
COPY 1
eucdb=# select * from 日本語てすと;
きー
--------
わーい
(1 row)

C:\tmp>dir 日本語*
 Volume in drive C is SYS
 Volume Serial Number is 1433-2C7C

 Directory of C:\tmp

2009/04/07  13:58                 8 日本語eucdb.txt
2009/04/07  13:58                 8 日本語utf8.txt
               2 File(s)             16 bytes

It seems to work very comfortably!!
What do you think?

Regards,
Hiroshi Saito

Attachment Content-Type Size
copy_patch3 application/octet-stream 3.1 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp>
Cc: "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Solution of the file name problem of copy on windows.
Date: 2009-04-07 17:46:16
Message-ID: 18205.1239126376@sss.pgh.pa.us
Lists: pgsql-hackers

"Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp> writes:
> I want to solve one problem before the release of 8.4.
> However, since it could also be considered a new feature,
> if it is too late for 8.4, you may suggest deferring it to 8.5.

I'm not too clear on what this is really supposed to accomplish, but
we are hardly going to put code like that into every single file access
in Postgres, which is what seems to be the logical implication.
Shouldn't we just tell people to use a database encoding that matches
their system environment?

regards, tom lane


From: Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Solution of the file name problem of copy on windows.
Date: 2009-04-08 02:19:33
Message-ID: 20090408105426.8FBC.52131E4D@oss.ntt.co.jp
Lists: pgsql-hackers

Hi,

"Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp> wrote:

> In this case, the file name given to COPY is in UTF-8 on the server,
> which is troublesome to handle. :-(
> So I have made this proposed patch.

I think the problem is not only on Windows but on all platforms
where the database encoding doesn't match the OS's encoding.

Instead of Windows-specific code, how about adding GetPlatformEncoding()
and converting all *absolute* paths? The conversion would be performed at
the lowest API layer, i.e., BasicOpenFile(). Standard database file accesses
with RelFileNode are not affected because they use *relative* paths.

There are some issues:
* Is it possible to determine the platform encoding?
* The above cannot handle non-ASCII paths under $PGDATA.
  Is that acceptable?
* On Windows, the native encoding is UTF-16, but we will use SJIS
  if we adopt the above method. Is that limitation acceptable?

Comments welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
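
A minimal sketch of the decision proposed above, not the code of any posted patch: convert a path only when it is absolute and the database and platform encodings differ. The helper names path_is_absolute() and path_needs_conversion() and the plain int encoding ids are hypothetical; PostgreSQL's own is_absolute_path() and encoding ids are not reproduced here.

```c
#include <ctype.h>
#include <stdbool.h>

/* Hypothetical helper: true for "/x", "\x", or a drive path such as "C:/x". */
static bool
path_is_absolute(const char *path)
{
    if (path[0] == '/' || path[0] == '\\')
        return true;
    return isalpha((unsigned char) path[0]) && path[1] == ':' &&
           (path[2] == '/' || path[2] == '\\');
}

/*
 * Hypothetical decision function. The int encoding ids are placeholders,
 * not PostgreSQL's real encoding identifiers.
 */
bool
path_needs_conversion(const char *path, int db_encoding, int platform_encoding)
{
    if (db_encoding == platform_encoding)
        return false;           /* same encoding: nothing to convert */
    return path_is_absolute(path);  /* relative paths are left untouched */
}
```

With this rule, a relative path such as "base/1/1259" is never converted, while "C:/tmp/x.txt" is converted only when the two encodings differ.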


From: Hiroshi Inoue <inoue(at)tpf(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Hiroshi Saito <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Solution of the file name problem of copy on windows.
Date: 2009-04-08 10:31:08
Message-ID: 49DC7CEC.80703@tpf.co.jp
Lists: pgsql-hackers

Tom Lane wrote:
> "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp> writes:
>> I want to solve one problem before the release of 8.4.
>> However, since it could also be considered a new feature,
>> if it is too late for 8.4, you may suggest deferring it to 8.5.
>
> I'm not too clear on what this is really supposed to accomplish, but
> we are hardly going to put code like that into every single file access
> in Postgres, which is what seems to be the logical implication.
> Shouldn't we just tell people to use a database encoding that matches
> their system environment?

Unfortunately (as usual) under Japanese Windows there's no database
encoding that matches the system environment.
As for the file name in the COPY command, there's little point in
converting it to the server encoding because the file name is irrelevant
to the database. Because Windows is Unicode (UTF-16) based, it seems
natural to convert the file name to wide characters once.

regards,
Hiroshi Inoue
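
A rough sketch of the wide-character approach described above, assuming the file name is already UTF-8 (other database encodings would need a different first step). The function names utf8_to_wide() and open_utf8_path() are hypothetical. On Windows it uses the Win32 call MultiByteToWideChar() and the CRT call _wfopen(); on other systems it simply falls back to fopen().

```c
#include <stdio.h>
#include <stdlib.h>

#ifdef _WIN32
#include <windows.h>

/* Convert a NUL-terminated UTF-8 string to a malloc'd UTF-16 string, or NULL. */
static wchar_t *
utf8_to_wide(const char *utf8)
{
    /* First call asks for the required length, including the terminator. */
    int      n = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    wchar_t *w;

    if (n <= 0)
        return NULL;
    w = malloc((size_t) n * sizeof(wchar_t));
    if (w == NULL)
        return NULL;
    if (MultiByteToWideChar(CP_UTF8, 0, utf8, -1, w, n) <= 0)
    {
        free(w);
        return NULL;
    }
    return w;
}
#endif

/* Hypothetical wrapper: open a file whose name is given in UTF-8. */
FILE *
open_utf8_path(const char *utf8_path, const char *mode)
{
#ifdef _WIN32
    wchar_t *wpath = utf8_to_wide(utf8_path);
    wchar_t *wmode = utf8_to_wide(mode);
    FILE    *fp = NULL;

    if (wpath != NULL && wmode != NULL)
        fp = _wfopen(wpath, wmode);     /* wide-character open */
    free(wpath);
    free(wmode);
    return fp;
#else
    return fopen(utf8_path, mode);      /* name is passed through as bytes */
#endif
}
```

Because the name goes to the wide-character API directly, the ANSI code page (SJIS on Japanese Windows) never enters the picture.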


From: "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp>
To: "Hiroshi Inoue" <inoue(at)tpf(dot)co(dot)jp>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Solution of the file name problem of copy on windows.
Date: 2009-04-08 14:00:28
Message-ID: F8979E9BE3F34CF0B3390EDD1821AE99@HIRO57887DE653
Lists: pgsql-hackers

Hi.

----- Original Message -----
From: "Hiroshi Inoue" <inoue(at)tpf(dot)co(dot)jp>

> Tom Lane wrote:
>> "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp> writes:
>>> I want to solve one problem before the release of 8.4.
>>> However, since it could also be considered a new feature,
>>> if it is too late for 8.4, you may suggest deferring it to 8.5.
>>
>> I'm not too clear on what this is really supposed to accomplish, but
>> we are hardly going to put code like that into every single file access
>> in Postgres, which is what seems to be the logical implication.
>> Shouldn't we just tell people to use a database encoding that matches
>> their system environment?
>
> Unfortunately (as usual) under Japanese Windows there's no database
> encoding that matches the system environment.
> As for the file name in the COPY command, there's little point in
> converting it to the server encoding because the file name is irrelevant
> to the database. Because Windows is Unicode (UTF-16) based, it seems
> natural to convert the file name to wide characters once.

Yes. If the server encoding could be chosen to match Windows, the existing
facilities would work fine. Regrettably, that is not possible.

Regards,
Hiroshi Saito


From: "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp>
To: "Itagaki Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Solution of the file name problem of copy on windows.
Date: 2009-04-08 16:01:59
Message-ID: 7793F2D8201847ADAFAED8615C69A9F3@HIRO57887DE653
Lists: pgsql-hackers

Hi Itagaki-san.

Um, my focus was on helping with the problem that cannot be avoided.
I am less concerned with problems that can be avoided depending on usage.
However, I would be glad to work on it voluntarily if it helps a lot.

Regards,
Hiroshi Saito

----- Original Message -----
From: "Itagaki Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>

> Hi,
>
> "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp> wrote:
>
>> In this case, the file name given to COPY is in UTF-8 on the server,
>> which is troublesome to handle. :-(
>> So I have made this proposed patch.
>
> I think the problem is not only on Windows but on all platforms
> where the database encoding doesn't match the OS's encoding.
>
> Instead of Windows-specific code, how about adding GetPlatformEncoding()
> and converting all *absolute* paths? The conversion would be performed at
> the lowest API layer, i.e., BasicOpenFile(). Standard database file accesses
> with RelFileNode are not affected because they use *relative* paths.
>
> There are some issues:
> * Is it possible to determine the platform encoding?
> * The above cannot handle non-ASCII paths under $PGDATA.
>   Is that acceptable?
> * On Windows, the native encoding is UTF-16, but we will use SJIS
>   if we adopt the above method. Is that limitation acceptable?
>
> Comments welcome.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center


From: Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp>
Cc: "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Solution of the file name problem of copy on windows.
Date: 2009-04-09 00:22:58
Message-ID: 20090409091252.B123.52131E4D@oss.ntt.co.jp
Lists: pgsql-hackers


"Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp> wrote:

> Um, my focus was on helping with the problem that cannot be avoided.
> I am less concerned with problems that can be avoided depending on usage.
> However, I would be glad to work on it voluntarily if it helps a lot.

I'll investigate whether the encoding of filesystem paths is affected by
locale settings on some platforms. We also need to work out where to get
the system encoding from when the locale is set to "C", which is popular
among Japanese users.

I'll report my progress to you :)

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


From: Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Solution of the file name problem of copy on windows.
Date: 2009-04-13 10:13:15
Message-ID: 20090413184335.39BE.52131E4D@oss.ntt.co.jp
Lists: pgsql-hackers

Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> wrote:

> "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp> wrote:
>
> > Um, my focus was on helping with the problem that cannot be avoided.
> > I am less concerned with problems that can be avoided depending on usage.
> > However, I would be glad to work on it voluntarily if it helps a lot.
>
> I'll investigate whether the encoding of filesystem paths is affected by
> locale settings on some platforms. We also need to work out where to get
> the system encoding from when the locale is set to "C", which is popular
> among Japanese users.

Here is a patch to implement GetPlatformEncoding() and convert absolute
file paths from database encoding to platform encoding. Since the encoding
of paths is converted in AllocateFile() and BasicOpenFile(), the patch covers
not only COPY TO/FROM but almost all file operations.
Callers of the file access functions don't have to modify their code.

Please test the patch on a variety of platforms. I tested it on Windows
and Linux, and found that {PG_UTF8, "ANSI_X3.4-1968"} is required in
encoding_match_list in src/port/chklocale.c on Linux (FC6).

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment Content-Type Size
GetPlatformEncoding.patch application/octet-stream 6.3 KB
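
For context on where a name like "ANSI_X3.4-1968" comes from: on POSIX systems the codeset name of a locale can be queried with setlocale() and nl_langinfo(CODESET), and glibc reports ASCII under that name for the "C" locale. The sketch below only shows that query. The wrapper name codeset_for_locale() is hypothetical, and the string returned depends on the C library in use.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical wrapper: switch LC_CTYPE to the named locale and return the
 * codeset name the C library reports for it, or NULL if the locale is not
 * available. Note that this changes the process-wide locale.
 */
const char *
codeset_for_locale(const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return NULL;
    return nl_langinfo(CODESET);
}
```

Calling codeset_for_locale("C") and printing the result is a quick way to see which codeset name a given platform would hand to an encoding lookup table.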

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Solution of the file name problem of copy on windows.
Date: 2009-04-13 14:14:37
Message-ID: 734.1239632077@sss.pgh.pa.us
Lists: pgsql-hackers

Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> Here is a patch to implement GetPlatformEncoding() and convert absolute
> file paths from database encoding to platform encoding.

This seems like a fairly significant overhead added to solve a really
minor problem (if it's not minor why has it never come up before?).

I'm also not convinced by any of the details --- why are GetACP and
pg_get_encoding_from_locale the things to look at, and why is fd.c an
appropriate place to hook in? Surely if we need it here, we need it in
places like initdb as well. But really this is much too low a level to
be solving the problem at. If we have to convert path encodings in the
backend, we should be doing it once somewhere around the place where we
identify the value of PGDATA. It should not be necessary to repeat all
this for every file access within the database directory.

regards, tom lane


From: "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp>
To: "Itagaki Takahiro" <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Solution of the file name problem of copy on windows.
Date: 2009-04-13 15:12:36
Message-ID: 023558DBC4B14E82A186799942144A65@HIRO57887DE653
Lists: pgsql-hackers

Hi.

Anyhow, I appreciate the discussion.

----- Original Message -----
From: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>

> Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
>> Here is a patch to implement GetPlatformEncoding() and convert absolute
>> file paths from database encoding to platform encoding.
>
> This seems like a fairly significant overhead added to solve a really
> minor problem (if it's not minor why has it never come up before?).
>
> I'm also not convinced by any of the details --- why are GetACP and
> pg_get_encoding_from_locale the things to look at, and why is fd.c an
> appropriate place to hook in? Surely if we need it here, we need it in
> places like initdb as well. But really this is much too low a level to
> be solving the problem at. If we have to convert path encodings in the
> backend, we should be doing it once somewhere around the place where we
> identify the value of PGDATA. It should not be necessary to repeat all
> this for every file access within the database directory.

Ahh, I agree this is a delicate problem that requires careful handling.
However, the following test is shown to help your understanding.
This is a case that does not work unless Itagaki-san's patch is applied.

C:\work>set PGDATA=C:\tmp\日本語 data

C:\work>set PGPORT=5444

C:\work>set PGHOME=C:\MinGW\local\pgsql

C:\work>cmd.exe
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\work>initdb -E UTF-8 --no-locale
The files belonging to this database system will be owned by user "HIROSHI".
This user must also own the server process.

The database cluster will be initialized with locale C.
The default text search configuration will be set to "english".

fixing permissions on existing directory C:/tmp/日本語 data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 32MB
creating configuration files ... ok
creating template1 database in C:/tmp/日本語 data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating conversions ... ok
creating directories ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the -A option the
next time you run initdb.

Success. You can now start the database server using:

"postmaster" -D "C:/tmp/日本語 data"
or
"pg_ctl" -D "C:/tmp/日本語 data" -l logfile start

C:\work>set PGCLIENTENCODING=SJIS

C:\work>psql postgres
psql (8.4beta1)
Type "help" for help.

postgres=# create table 日本語(きー text);
CREATE TABLE
postgres=# insert into 日本語 values('いれた');
INSERT 0 1
postgres=# copy 日本語 to 'C:/tmp/日本語 data/日本語utf8.txt';
COPY 1
postgres=# delete from 日本語;
DELETE 1
postgres=# copy 日本語 from 'C:/tmp/日本語 data/日本語utf8.txt';
COPY 1
postgres=# select * from 日本語;
きー
--------
いれた
(1 row)

C:\work>dir "C:\tmp\日本語 data"
 Volume in drive C is SYS
 Volume Serial Number is 1433-2C7C

 Directory of C:\tmp\日本語 data

2009/04/13 23:22 <DIR> .
2009/04/13 23:22 <DIR> ..
2009/04/13 23:18 <DIR> base
2009/04/13 23:19 <DIR> global
2009/04/13 23:17 <DIR> pg_clog
2009/04/13 23:17 3,616 pg_hba.conf
2009/04/13 23:17 1,611 pg_ident.conf
2009/04/13 23:17 <DIR> pg_multixact
2009/04/13 23:23 <DIR> pg_stat_tmp
2009/04/13 23:17 <DIR> pg_subtrans
2009/04/13 23:17 <DIR> pg_tblspc
2009/04/13 23:17 <DIR> pg_twophase
2009/04/13 23:17 4 PG_VERSION
2009/04/13 23:17 <DIR> pg_xlog
2009/04/13 23:17 17,112 postgresql.conf
2009/04/13 23:19 38 postmaster.opts
2009/04/13 23:19 24 postmaster.pid
2009/04/13 23:22 8 日本語utf8.txt
7 File(s) 22,413 bytes
11 Dir(s) 42,780,246,016 bytes free


From: Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Solution of the file name problem of copy on windows.
Date: 2009-04-14 00:51:54
Message-ID: 20090414093452.91E3.52131E4D@oss.ntt.co.jp
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> > Here is a patch to implement GetPlatformEncoding() and convert absolute
> > file paths from database encoding to platform encoding.
>
> This seems like a fairly significant overhead added to solve a really
> minor problem (if it's not minor why has it never come up before?).

It's not always a minor problem in Japan. It has been discussed in
the Japanese users group several times. However, I should certainly pay
attention to performance. One solution might be to cache the encoding
in GetPlatformEncoding(). There will be no overhead when the database
encoding and the platform encoding are the same, which would be the typical case.

> It should not be necessary to repeat all
> this for every file access within the database directory.

That's why I added the is_absolute_path() check there. We can
avoid conversion in normal file access under PGDATA because relative
paths are used for it. But I should have checked all file accesses,
not only in backends but also in client programs. I'll look into them...

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
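
The caching idea mentioned above could look roughly like this. It is only a sketch: detect_platform_encoding() is a hypothetical stand-in for the real lookup (which might consult the locale or the Windows code page), the int encoding id is a placeholder, and the call counter exists only to make the effect of the cache visible.

```c
/* -1 means "not looked up yet"; real encoding ids are assumed non-negative. */
static int cached_platform_encoding = -1;
static int detect_calls = 0;

/* Hypothetical stand-in for the expensive lookup. */
static int
detect_platform_encoding(void)
{
    detect_calls++;
    return 0;                   /* placeholder encoding id */
}

/* Look the encoding up once, then serve it from the cache. */
int
get_platform_encoding(void)
{
    if (cached_platform_encoding < 0)
        cached_platform_encoding = detect_platform_encoding();
    return cached_platform_encoding;
}

/* Exposed only so the caching can be observed. */
int
get_detect_calls(void)
{
    return detect_calls;
}
```

After the first call, every later file access pays only for one integer comparison before deciding whether any conversion is needed.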


From: Sergey Burladyan <eshkinkot(at)gmail(dot)com>
To: Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc: "Hiroshi Saito" <z-saito(at)guitar(dot)ocn(dot)ne(dot)jp>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Solution of the file name problem of copy on windows.
Date: 2009-04-14 20:41:52
Message-ID: 873acb7ysv.fsf@seb.progtech.ru
Lists: pgsql-hackers

Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> writes:

> There are some issues:
> * Is it possible to determine the platform encoding?

There is no platform encoding on Linux. File name encoding depends on the
user's locale, so different users can have file names in different encodings.

--
Sergey Burladyan
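
To illustrate the point above: a file name on such a system is just a byte string, and the same visible name yields different bytes in different encodings. The sketch below writes out the bytes of "日本語" in UTF-8, EUC-JP and SJIS and compares them the way a byte-oriented file system would. The variable and function names are made up for this example.

```c
#include <stdbool.h>
#include <string.h>

/* "日本語" in three encodings, written as explicit byte values. */
const char name_utf8[]  = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";
const char name_eucjp[] = "\xC6\xFC\xCB\xDC\xB8\xEC";
const char name_sjis[]  = "\x93\xFA\x96\x7B\x8C\xEA";

/* A byte-oriented file system treats two names as equal only if the bytes match. */
bool
same_file_name(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Here same_file_name(name_utf8, name_sjis) is false, so a file created under one locale's encoding will not be found by that name from a session using another.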