Re: libpq should have functions for escaping data for use in COPY FROM

Lists: pgsql-hackers
From: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: libpq should have functions for escaping data for use in COPY FROM
Date: 2012-03-10 02:16:21
Message-ID: CAARyMpA7Gq3joL=1AqRKBS0FOGAOHb6Fh-wU=hbLghHcaFjOXw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

libpq has functions for escaping values in SQL commands
(PQescapeStringConn, PQescapeByteaConn, and the new PQescapeLiteral),
and it supports parameterizing queries with PQexecParams. But it does
not (to my knowledge) have functions for escaping values for COPY
FROM.

COPY FROM is useful for inserting rows in bulk (though I wonder if
constructing massive INSERT statements and using PQexecParams is just
as efficient). It is also useful for generating .sql files which can
be run on a database elsewhere.

I think libpq should include functions for escaping with COPY FROM.
Perhaps the API could look like this:

typedef struct EscapeCopyInfo
{
/*
* If this is not NULL, the PQescapeCopy functions will take
advantage of
* connection-specific information to minimize the escaped
representation
* length.
*/
PGConn *conn;

char delimiter;

const char *null_string;
} EscapeCopyInfo;

void initEscapeCopyInfo(EscapeCopyInfo *);

/* If info is NULL, it defaults to {NULL, '\t', "\\N"} */
size_t PQescapeStringCopy(
EscapeCopyInfo *info,
char *to,
const char *from,
size_t from_length,
int *error
);

size_t PQescapeByteaCopy(
EscapeCopyInfo *info,
unsigned char *to,
const unsigned char *from,
size_t from_length,
int *error
);

Using an EscapeCopyInfo structure to specify format information cuts
down on parameter counts for the escaping functions, and makes it
possible to support more options in the future (e.g. CSV).

I'm not terribly attached to the parameter order of the functions. I
was just following the lead of PQescapeStringConn.

This API writes text into a buffer, rather than allocating a buffer
with malloc like PQescapeByteaConn and PQescapeLiteral do. This means
rows containing multiple values can be constructed efficiently. On
the other hand, it is inconvenient and error-prone, since the
programmer has to calculate how big the buffer needs to be (and this
varies depending on settings). Imagine this bug:

* A programmer allocates a buffer of size 2*length + 2 for a BYTEA,
assuming the bytestring will be escaped using the \x... format

* A user runs the program with PostgreSQL 8.4, which predates the
\x... format, and now the bytestring's escaped representation can be
up to 5*length bytes long (but very frequently isn't!)

I wrote a basic implementation for a Haskell binding, but it only
supports default COPY options (text mode, with tab delimiter and \N
null string).

https://github.com/joeyadams/haskell-libpq/blob/copy-from/cbits/escape-copy.c

Before spending a bunch of time on this, I'd like some input. A few questions:

* Should we have corresponding functions for parsing COPY TO data, or
is PQexecParams sufficient?

* Should we support CSV escaping? Can the CSV format safely encode
all characters (in particular, newlines)?

* Should we deal with encodings here, or just escape everything that
isn't printable ASCII like the code I wrote does?

Thanks,
-Joey


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: libpq should have functions for escaping data for use in COPY FROM
Date: 2012-03-15 04:07:56
Message-ID: CA+TgmoYQ8tksGPfyxx=dyrTO72xKLsgvuGodc99sxCLBXV0Gng@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Mar 9, 2012 at 9:16 PM, Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com> wrote:
> libpq has functions for escaping values in SQL commands
> (PQescapeStringConn, PQescapeByteaConn, and the new PQescapeLiteral),
> and it supports parameterizing queries with PQexecParams.  But it does
> not (to my knowledge) have functions for escaping values for COPY
> FROM.
>
> COPY FROM is useful for inserting rows in bulk (though I wonder if
> constructing massive INSERT statements and using PQexecParams is just
> as efficient).  It is also useful for generating .sql files which can
> be run on a database elsewhere.
>
> I think libpq should include functions for escaping with COPY FROM.

I'm a little bit confused about what you're getting at here, because
COPY has a huge pile of options - not just CSV or text, but also
things like QUOTE and DELIMITER. It's not like there is ONE way to
escape things for COPY. I guess we could include code that escapes
things in the manner that an optionless COPY expects, or we could
include in the API all the same options that COPY supports, but the
former sounds narrow and the latter complex.

> Before spending a bunch of time on this, I'd like some input.  A few questions:
>
>  * Should we have corresponding functions for parsing COPY TO data, or
> is PQexecParams sufficient?
>
>  * Should we support CSV escaping?  Can the CSV format safely encode
> all characters (in particular, newlines)?

The fine manual page for COPY discusses how to encode CSV data in
considerable detail.

>  * Should we deal with encodings here, or just escape everything that
> isn't printable ASCII like the code I wrote does?

I think your code will fall over badly if fed, say, UTF-8 characters
with code points greater than 0x7F.

I doubt very much that we would accept anything into libpq that
doesn't handle all the encodings we support, and that covers a lot of
territory. There are some restrictions on the set of server-side
encodings - we only allow those that have certain "safe" properties -
but IIUC client encodings are much less restricted and a lot of wacky
stuff is possible. Even if you can come up with code that handles all
cases correctly, it'll probably perform much less well in simple cases
than the quick hack you linked to here.

Considering all the above, this seems like it might be a solution in
search of a problem. It's not actually that hard to write code to do
proper escaping for a *given* encoding and a *given* set of COPY
options, but trying to write something general sounds like a job and a
half.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Joey Adams <joeyadams3(dot)14159(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: libpq should have functions for escaping data for use in COPY FROM
Date: 2012-03-15 14:17:49
Message-ID: 19641.1331821069@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Considering all the above, this seems like it might be a solution in
> search of a problem. It's not actually that hard to write code to do
> proper escaping for a *given* encoding and a *given* set of COPY
> options, but trying to write something general sounds like a job and a
> half.

Yeah, it wouldn't be easy. The other point to keep in mind here is that
one of the key requirements for anything dealing with COPY is that it be
fast. It seems likely to me that bespoke code for a particular encoding
and set of copy options would outrun something that's general-purpose.
Now maybe people would be happy to use something slower if it meant
they didn't have to write it themselves, but ...

regards, tom lane