Re: copy_from does not stop reading after an error

Lists: psycopg
From: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
To: psycopg(at)postgresql(dot)org
Subject: copy_from does not stop reading after an error
Date: 2011-02-01 10:24:33
Message-ID: AANLkTikpbc25nTVWfZ_f8L0zZuAj0Bv0UwSLtztZc_23@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

Hello,

It looks like copy_from does not stop reading after an error. When the input
file is short, it is okay. But when the input file is very long, it is
really boring to wait for the entire file to be read just to discover there
is an error on the 10th row.

Given the same input file, it looks like psql \copy command behaves
correctly and stops just after the incorrect row, without reading the entire
file. I have checked that just by looking at the command execution time that
seems proportional to the number of processed rows...

Here is a script to reproduce this bug (just create a database "test" and
run the script):
https://gist.github.com/805669

I don't know what to do with this bug. Thanks for your help and advice.

Regards,

Nicolas Grilly

Here is the same script inline:

import traceback
import psycopg2

def main():
con = psycopg2.connect(database='test', user='postgres')
cur = con.cursor()

try:
cur.execute("create temp table test (id integer primary key, data
text)")

rows = iter_rows()
copy_file = CopyFile(rows)
cur.copy_from(copy_file, 'test')

cur.execute("select count(*) from test")
print "{0} rows written to database".format(cur.fetchall()[0][0])

finally:
cur.close()
con.close()

def iter_rows():
random_data = 'x' * 100
for i in range(0, 500):
if i == 10:
i = 'Bad key'
print "Yield incorrect data row (copy_from should stop reading
after that)"
else:
print "Yield correct data row"
yield '{0}\t{1}\n'.format(i, random_data)

class CopyFile(object):

def __init__(self, row_reader):
self.next_row = row_reader.next
self.buffer = ''

def read(self, limit=-1):
print "Read {0} bytes".format(limit)

try:
buffer = self.buffer

while limit < 0 or len(buffer) < limit:
try:
buffer += self.next_row()
except StopIteration:
break

if limit < 0:
self.buffer = ''
return buffer
else:
self.buffer = buffer[limit:]
return buffer[:limit]

except:
# Report unexpected errors because psycopg2 does not report them
traceback.print_exc()
raise

# Method readline is required by psycopg2 but actually never called
def readline(self, limit=-1):
raise NotImplementedError()

if __name__ == '__main__':
main()


From: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>
To: psycopg(at)postgresql(dot)org
Subject: Re: copy_from does not stop reading after an error
Date: 2011-02-01 10:48:07
Message-ID: 4D47E4E7.7090307@dndg.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On 01/02/11 11:24, Nicolas Grilly wrote:
> It looks like copy_from does not stop reading after an error. When the
> input file is short, it is okay. But when the input file is very long,
> it is really boring to wait for the entire file to be read just to
> discover there is an error on the 10th row.

Thank you very much. I just used your email to create a bug report on
Lighthouse:

http://psycopg.lighthouseapp.com/projects/62710-psycopg/tickets/37-copy_from-does-not-stop-reading-after-an-error

federico

--
Federico Di Gregorio federico(dot)digregorio(at)dndg(dot)it
Studio Associato Di Nunzio e Di Gregorio http://dndg.it
Gli avvoltoi cinesi si nutrono di arte, ma possono anche mangiare
i `domani'. -- Haruki Murakami


From: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
To: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>
Cc: psycopg(at)postgresql(dot)org
Subject: Re: copy_from does not stop reading after an error
Date: 2011-02-01 11:34:49
Message-ID: AANLkTi=5+DHSrtoKK4KhAidaBYLFyK7=CUkMN9eJdap_@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

Thank you Federico for your answer.

I have ran the same script with pg8000, and it does not stop reading after
an error either... Maybe it is not a bug, and just a limitation of the
PostgreSQL protocol? Maybe the copy from protocol is not designed to return
errors in the middle of the data stream, and I just have to split my data
stream into many chunks and call copy_from for each chunk?

On Tue, Feb 1, 2011 at 11:48, Federico Di Gregorio <
federico(dot)digregorio(at)dndg(dot)it> wrote:

> On 01/02/11 11:24, Nicolas Grilly wrote:
> > It looks like copy_from does not stop reading after an error. When the
> > input file is short, it is okay. But when the input file is very long,
> > it is really boring to wait for the entire file to be read just to
> > discover there is an error on the 10th row.
>
> Thank you very much. I just used your email to create a bug report on
> Lighthouse:
>
>
> http://psycopg.lighthouseapp.com/projects/62710-psycopg/tickets/37-copy_from-does-not-stop-reading-after-an-error
>
> federico
>


From: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
To: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>, psycopg(at)postgresql(dot)org
Subject: Re: copy_from does not stop reading after an error
Date: 2011-02-01 12:57:03
Message-ID: AANLkTimSyph9d15GnfmyAOjC=4sMXpTWZbW1taS63ayr@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

I have tested again the \copy command of psql and, contrary to what I wrote
before, it looks like psql does not stop reading after an error either, and
has to read the complete file before reporting the error.

Conclusion: psycopg2, pg8000 and psql have the same behavior regarding the
command "copy from stdin". The input data file is read entirely, even if
there is incorrect data at the start of the file, and errors are reported
only after having read the complete file.

Therefore it is probably not a bug in psycopg2, and just a "limitation" of
PostgreSQL protocol. Here is the protocol official documentation:
http://www.postgresql.org/docs/9.0/static/protocol-flow.html#PROTOCOL-COPY

I understand we have to "end" the copy before having a chance to retrieve
PostgreSQL backend response and know if our data are correct, or not. Do you
confirm this analysis?

It means copy_from is not designed to send a 10 gigabytes stream to
PostgreSQL, with just one "copy from stdin" command. Maybe I have to split
my input stream into smaller chunks and execute a "copy from stdin" command
for each of them. Do you confirm this is the only (and adequate) solution?

Thanks for you help and advice.

PS: I've copied that email to lighthouse for future reference.

On Tue, Feb 1, 2011 at 12:34, Nicolas Grilly <nicolas(at)gardentechno(dot)com>wrote:

> Thank you Federico for your answer.
>
> I have ran the same script with pg8000, and it does not stop reading after
> an error either... Maybe it is not a bug, and just a limitation of the
> PostgreSQL protocol? Maybe the copy from protocol is not designed to return
> errors in the middle of the data stream, and I just have to split my data
> stream into many chunks and call copy_from for each chunk?
>
> On Tue, Feb 1, 2011 at 11:48, Federico Di Gregorio <
> federico(dot)digregorio(at)dndg(dot)it> wrote:
>
>> On 01/02/11 11:24, Nicolas Grilly wrote:
>> > It looks like copy_from does not stop reading after an error. When the
>> > input file is short, it is okay. But when the input file is very long,
>> > it is really boring to wait for the entire file to be read just to
>> > discover there is an error on the 10th row.
>>
>> Thank you very much. I just used your email to create a bug report on
>> Lighthouse:
>>
>>
>> http://psycopg.lighthouseapp.com/projects/62710-psycopg/tickets/37-copy_from-does-not-stop-reading-after-an-error
>>
>> federico
>>
>
>


From: Marko Kreen <markokr(at)gmail(dot)com>
To: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
Cc: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>, psycopg(at)postgresql(dot)org
Subject: Re: copy_from does not stop reading after an error
Date: 2011-02-01 13:18:29
Message-ID: AANLkTi=zm89T0uqPA2ERkrc8APiNJRAThvFo7KBs-0ig@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On Tue, Feb 1, 2011 at 2:57 PM, Nicolas Grilly <nicolas(at)gardentechno(dot)com> wrote:
> I have tested again the \copy command of psql and, contrary to what I wrote
> before, it looks like psql does not stop reading after an error either, and
> has to read the complete file before reporting the error.
> Conclusion: psycopg2, pg8000 and psql have the same behavior regarding the
> command "copy from stdin". The input data file is read entirely, even if
> there is incorrect data at the start of the file, and errors are reported
> only after having read the complete file.
> Therefore it is probably not a bug in psycopg2, and just a "limitation" of
> PostgreSQL protocol. Here is the protocol official documentation:
> http://www.postgresql.org/docs/9.0/static/protocol-flow.html#PROTOCOL-COPY
> I understand we have to "end" the copy before having a chance to retrieve
> PostgreSQL backend response and know if our data are correct, or not. Do you
> confirm this analysis?

No, the error message should arrive immediately. But it may be deficiency of
libpq that you cannot aquire it before ending the copy.

I'm not sure about that actually. It should be possible to call
select() & PQconsumeInput
between copy calls, thus also PQgetResult to get the error.

> It means copy_from is not designed to send a 10 gigabytes stream to
> PostgreSQL, with just one "copy from stdin" command. Maybe I have to split
> my input stream into smaller chunks and execute a "copy from stdin" command
> for each of them. Do you confirm this is the only (and adequate) solution?

It should work around the problem, with speed decrease.

--
marko


From: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
To: Marko Kreen <markokr(at)gmail(dot)com>
Cc: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>, psycopg(at)postgresql(dot)org
Subject: Re: copy_from does not stop reading after an error
Date: 2011-02-01 13:41:07
Message-ID: AANLkTimYOEeHNKCFfewKqKJsmt8wxAjRJMwfubD9JrK6@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

So, considering the test case I attached in my previous message, do you
think it should be possible to retrieve the error message (Invalid syntax
for integer "Bad key", CONTEXT: COPY test, line 10, column id: "Bad
key") without waiting for complete transmission?

Do you think it is possible to modify psycopg2 to implement what you
suggested? I guess the related code is in file pqpath.c, function
_pq_copy_in_v3, correct?

On Tue, Feb 1, 2011 at 14:18, Marko Kreen <markokr(at)gmail(dot)com> wrote:

> No, the error message should arrive immediately. But it may be deficiency
> of
> libpq that you cannot aquire it before ending the copy.
>
> I'm not sure about that actually. It should be possible to call
> select() & PQconsumeInput
> between copy calls, thus also PQgetResult to get the error.

--
> marko
>


From: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
To: Marko Kreen <markokr(at)gmail(dot)com>
Cc: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>, psycopg(at)postgresql(dot)org
Subject: Re: copy_from does not stop reading after an error
Date: 2011-02-07 15:06:19
Message-ID: AANLkTint+=5UrOzww9-DrS5BZBgf5gFhpshYLSSH9fWh@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On Tue, Feb 1, 2011 at 14:18, Marko Kreen <markokr(at)gmail(dot)com> wrote:

> No, the error message should arrive immediately. But it may be deficiency
> of
> libpq that you cannot aquire it before ending the copy.
>

I have analyzed the PostgreSQL protocol using Wireshark (an open source
packet analyzer), and I confirm that the PostgreSQL backend, while doing a
COPY ... FROM STDIN, reports errors as soon as possible (especially errors
related to invalid data).

Therefore, the "late" reporting of errors while doing a COPY ... FROM STDIN
is not a limitation of the underlying protocol; it is a limitation (or a
design choice) of the libpq library.

It looks like this is a well known issue because it is listed on the todo
list:
http://wiki.postgresql.org/wiki/Todo#COPY

And was discussed before:
http://archives.postgresql.org/pgsql-hackers/2008-04/msg01169.php

> I'm not sure about that actually. It should be possible to call
> select() & PQconsumeInput
> between copy calls, thus also PQgetResult to get the error.
>

I have tried that, and many other things, and I have carefully read libpq
source code, and my understanding is that it is impossible with the current
implementation of libpq.

It is impossible because the function pqParseInput3 (file fe-protocol3.c)
does not parse error responses while the connection is in PGASYNC_COPY_IN
state. We have to call PQputCopyEnd to switch the connection to PGASYNC_BUSY
and start error messages parsing.

Can Marko or someone else confirm this analysis? Any idea to improve that?

Regards,

Nicolas Grilly

PS : Considering that both psycopg2 and psql are built on top of libpq, this
a quite logical they behave the same regarding late reporting of copy
errors, contrary to pg8000 that could easily read, parse and report the
errors early.


From: Marko Kreen <markokr(at)gmail(dot)com>
To: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
Cc: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>, psycopg(at)postgresql(dot)org
Subject: Re: copy_from does not stop reading after an error
Date: 2011-02-08 12:30:31
Message-ID: AANLkTi=6fJ3vWfNQdxZRog5fzvi+hS9U2H-FxNfpr-G7@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On Mon, Feb 7, 2011 at 5:06 PM, Nicolas Grilly <nicolas(at)gardentechno(dot)com> wrote:
> On Tue, Feb 1, 2011 at 14:18, Marko Kreen <markokr(at)gmail(dot)com> wrote:
>>
>> No, the error message should arrive immediately.  But it may be deficiency
>> of
>> libpq that you cannot aquire it before ending the copy.
>
> I have analyzed the PostgreSQL protocol using Wireshark (an open source
> packet analyzer), and I confirm that the PostgreSQL backend, while doing a
> COPY ... FROM STDIN, reports errors as soon as possible (especially errors
> related to invalid data).
> Therefore, the "late" reporting of errors while doing a COPY ... FROM STDIN
> is not a limitation of the underlying protocol; it is a limitation (or a
> design choice) of the libpq library.
> It looks like this is a well known issue because it is listed on the todo
> list:
> http://wiki.postgresql.org/wiki/Todo#COPY
> And was discussed before:
> http://archives.postgresql.org/pgsql-hackers/2008-04/msg01169.php
>
>>
>> I'm not sure about that actually.  It should be possible to call
>> select() & PQconsumeInput
>> between copy calls, thus also PQgetResult to get the error.
>
> I have tried that, and many other things, and I have carefully read libpq
> source code, and my understanding is that it is impossible with the current
> implementation of libpq.
> It is impossible because the function pqParseInput3 (file fe-protocol3.c)
> does not parse error responses while the connection is in PGASYNC_COPY_IN
> state. We have to call PQputCopyEnd to switch the connection to PGASYNC_BUSY
> and start error messages parsing.
> Can Marko or someone else confirm this analysis? Any idea to improve that?
> Regards,
> Nicolas Grilly
> PS : Considering that both psycopg2 and psql are built on top of libpq, this
> a quite logical they behave the same regarding late reporting of copy
> errors, contrary to pg8000 that could easily read, parse and report the
> errors early.

Yeah, seems your analysis is right and it's libpq (design?) bug,
so no way to work around it.

--
marko


From: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
To: Marko Kreen <markokr(at)gmail(dot)com>
Cc: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>, psycopg(at)postgresql(dot)org
Subject: Re: copy_from does not stop reading after an error
Date: 2011-02-08 12:38:20
Message-ID: AANLkTin6mBtLmvOuw2SW7=iAnoO8sC3FNNH+bWj_=-2p@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On Tue, Feb 8, 2011 at 13:30, Marko Kreen <markokr(at)gmail(dot)com> wrote:
>
> Yeah, seems your analysis is right and it's libpq (design?) bug,
> so no way to work around it.
>

Nice to know that my analysis is right... but eventually it is a bad news
because the only way to fix the issue is to split input data into many
chunks or to directly modify libpq... By the way, I have posted my analysis
to pgsql-hackers and asked what the list thinks about modify pqParseInput to
handle this case. But I have never posted to this list before so I don't
know what is the good way to ask for feedback.

Nicolas Grilly


From: Marko Kreen <markokr(at)gmail(dot)com>
To: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
Cc: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>, psycopg(at)postgresql(dot)org
Subject: Re: copy_from does not stop reading after an error
Date: 2011-02-08 12:52:50
Message-ID: AANLkTi=JywGtY_+2M-fs79gM6kUCozdQ-728r5YpNdBx@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On Tue, Feb 8, 2011 at 2:38 PM, Nicolas Grilly <nicolas(at)gardentechno(dot)com> wrote:
> On Tue, Feb 8, 2011 at 13:30, Marko Kreen <markokr(at)gmail(dot)com> wrote:
>>
>> Yeah, seems your analysis is right and it's libpq (design?) bug,
>> so no way to work around it.
>
> Nice to know that my analysis is right... but eventually it is a bad news
> because the only way to fix the issue is to split input data into many
> chunks or to directly modify libpq... By the way, I have posted my analysis
> to pgsql-hackers and asked what the list thinks about modify pqParseInput to
> handle this case. But I have never posted to this list before so I don't
> know what is the good way to ask for feedback.

It's the high point of the of last flame^Wcommitfest for 9.1,
so it may take some time.

--
marko


From: Karsten Hilbert <Karsten(dot)Hilbert(at)gmx(dot)net>
To: psycopg(at)postgresql(dot)org
Subject: Re: copy_from does not stop reading after an error
Date: 2011-02-08 12:53:17
Message-ID: 20110208125317.GO2869@hermes.hilbert.loc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On Tue, Feb 08, 2011 at 01:38:20PM +0100, Nicolas Grilly wrote:

> Nice to know that my analysis is right... but eventually it is a bad news
> because the only way to fix the issue is to split input data into many
> chunks

Maybe there can be a parameter to the Python level .copy()
telling psycopg2 which way to do this. The default would be
the standard behaviour as it is now.

Karsten
--
GPG key ID E4071346 @ gpg-keyserver.de
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346


From: Nicolas Grilly <nicolas(at)gardentechno(dot)com>
To: Marko Kreen <markokr(at)gmail(dot)com>
Cc: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>, psycopg(at)postgresql(dot)org
Subject: Re: copy_from does not stop reading after an error
Date: 2011-02-08 14:24:34
Message-ID: AANLkTi=jDthBu=amGh7jQDjtZhqt6-pM_8P6i-bVxs6P@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On Tue, Feb 8, 2011 at 13:52, Marko Kreen <markokr(at)gmail(dot)com> wrote:

> It's the high point of the of last flame^Wcommitfest for 9.1,
> so it may take some time.
>

Understood. Thanks for this information.