Quick Links

Re: pg_read_file() and non-ascii input file

Lists:	pgsql-hackers

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	pg_read_file() and non-ascii input file
Date:	2009-10-28 02:27:35
Message-ID:	20091028112735.7F97.52131E4D@oss.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

pg_read_file() takes byte-offset and length as arguments,
but we don't check the result text with pg_verify_mbstr().
Should pg_read_file() return bytea instead of text or adding
some codes to verify the input? Only superusers are allowed
to use the function, but it is still dangerous.

If we leave the result in text type and add verifier, we also need to
consider how to handle multi-byte text. Offset and length should not
split one multi-byte character. We can assume the offset as a correct
boundary if we can trust users, but no one knows correct length before
the function call.

An idea is to have binary and text versions of pg_read_file:
* pg_read_binary_file(filename, offset, length) : bytea
* pg_read_text_file(filename, offset) : ROW( text, nextline_offset )
-- it returns the next line starting with 'offset'.
but such changes could bring on compatibility problems.

Comments, better ideas?

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

From:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: pg_read_file() and non-ascii input file
Date:	2009-11-30 09:36:05
Message-ID:	20091130183605.64E5.52131E4D@oss.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> wrote:

> pg_read_file() takes byte-offset and length as arguments,
> but we don't check the result text with pg_verify_mbstr().
> Should pg_read_file() return bytea instead of text or adding
> some codes to verify the input? Only superusers are allowed
> to use the function, but it is still dangerous.

Here is a patch to modify the result type of pg_read_file to bytea.
I think it is a possibly-security hole -- it might be safe because only
supersusers can use the function, but it is still dangerous.
We can still use the function to read a text file:
SELECT convert_from(pg_read_file(...), 'encoding')

If we want to keep backward compatibility, the issue can be fixed
by adding pg_verifymbstr() to the function. We can also have the
binary version in another name, like pg_read_binary_file().

Which solution is better? Comments welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment	Content-Type	Size
pg_read_file_as_bytea.patch	application/octet-stream	1.3 KB

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Itagaki Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: pg_read_file() and non-ascii input file
Date:	2009-12-30 21:13:21
Message-ID:	603c8f070912301313u5b413581qa1ff6da0a7f27cf7@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Nov 30, 2009 at 4:36 AM, Itagaki Takahiro
<itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
> If we want to keep backward compatibility, the issue can be fixed
> by adding pg_verifymbstr() to the function. We can also have the
> binary version in another name, like pg_read_binary_file().

I don't feel good about changing the return type of an existing
function, so I guess +1 from me on the approach quoted above.

...Robert

From:	Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To:	Robert Haas <robertmhaas(at)gmail(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: pg_read_file() and non-ascii input file
Date:	2010-01-04 02:10:07
Message-ID:	20100104111007.98BE.52131E4D@oss.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> > If we want to keep backward compatibility, the issue can be fixed
> > by adding pg_verifymbstr() to the function.
>
> I don't feel good about changing the return type of an existing
> function, so I guess +1 from me on the approach quoted above.

Ok, I just added pg_verifymbstr() instead of changing the result type.

I didn't add any additinal file reading functions in the patch, but
I'm willing to add them if someone want them:
- pg_read_file_with_encoding()
- pg_read_binary_file() RETURNS bytea
- pg_read_text_file() RETURNS SETOF text -- returns set of lines

One thing bothering me is the HINT message on error is just pointless;
The encoding is controlled by "server_encoding" here. We will have the
same error message in server-side COPY commands. We'd better improving
the message, though it should be done by another patch.

=# SELECT pg_read_file('invalid.txt', 0, (pg_stat_file('invalid.txt')).size);
ERROR: invalid byte sequence for encoding "UTF8": 0x93
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".

=# COPY tbl FROM 'invalid.txt'; -- server-side copy from a local file.
(the same message -- but the encoding should match with "server_encoding")

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center

Attachment	Content-Type	Size
pg_read_file_20100104.patch	application/octet-stream	785 bytes

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: pg_read_file() and non-ascii input file
Date:	2010-01-04 03:36:03
Message-ID:	603c8f071001031936t731c240dlbdcabe32a75aa72c@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sun, Jan 3, 2010 at 9:10 PM, Takahiro Itagaki
<itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
>
> Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>> > If we want to keep backward compatibility, the issue can be fixed
>> > by adding pg_verifymbstr() to the function.
>>
>> I don't feel good about changing the return type of an existing
>> function, so I guess +1 from me on the approach quoted above.
>
> Ok, I just added pg_verifymbstr() instead of changing the result type.

Sounds fine.

> I didn't add any additinal file reading functions in the patch, but
> I'm willing to add them if someone want them:
> - pg_read_file_with_encoding()
> - pg_read_binary_file() RETURNS bytea
> - pg_read_text_file() RETURNS SETOF text -- returns set of lines

OK.

> One thing bothering me is the HINT message on error is just pointless;
> The encoding is controlled by "server_encoding" here. We will have the
> same error message in server-side COPY commands. We'd better improving
> the message, though it should be done by another patch.

Interestingly, this same issue is being discussed on the thread
entitled "invalid UTF-8 via pl/perl". I suppose whatever solution is
adopted there can be used here as well. FWIW, repurposing the third
argument of pg_verifymbstr seems like the most sensible option to me.

...Robert