Re: plpython SPI cursors

Lists: pgsql-hackers
From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: plpython SPI cursors
Date: 2011-10-15 23:28:56
Message-ID: 4E9A1738.6090108@wulczer.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

attached is a patch implementing the usage of SPI cursors in PL/Python.
Currently when trying to process a large table in PL/Python you have
slurp it all into memory (that's what plpy.execute does).

This patch allows reading the result set in smaller chunks, using a SPI
cursor behind the scenes.

Example usage:

cursor = plpy.cursor("select a, b from hugetable")
for row in cursor:
plpy.info("a is %s and b is %s" % (row['a'], row['b']))

The patch itself is simple, but there's a lot of boilerplate dedicated
to opening a subtransaction and handling prepared plans. I'd like to do
some refactoring of they way PL/Python uses SPI to reduce the amount of
boilerplate needed, but that'll come as a separate patch (just before
the patch to split plpython.c in smaller chunks).

This feature has been sponsored by Nomao.

Cheers,
Jan

PS: I already added it to the November CF.

J

Attachment Content-Type Size
0001-Add-cursor-support-to-plpythonu.patch text/x-diff 0 bytes

From: Steve Singer <ssinger_pg(at)sympatico(dot)ca>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plpython SPI cursors
Date: 2011-11-20 18:14:17
Message-ID: BLU0-SMTP4530C1907601DFB9AEAE778ECA0@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11-10-15 07:28 PM, Jan Urbański wrote:
> Hi,
>
> attached is a patch implementing the usage of SPI cursors in PL/Python.
> Currently when trying to process a large table in PL/Python you have
> slurp it all into memory (that's what plpy.execute does).
>
> J

I found a few bugs (see my testing section below) that will need fixing
+ a few questions about the code

Overview & Feature Review
-----------
This patch adds cursor support to plpython. SPI cursors will allow
a plpython function to read process a large results set without having to
read it all into memory at once. This is a good thing. Without this
patch I think you could accomplish the same with using SQL DECLARE CURSOR
and SQL fetch. This feature allows you to use a python cursor as
an iterator resulting in much cleaner python code than the SQL FETCH
approach. I think the feature is worth having

Usability Review
----------------------
The patch adds the user methods
cursor=plpy.cursor(query_or_plan)
cursor.fetch(100)
cursor.close()

Do we like the name plpy.cursor or would we rather call it something like
plpy.execute_cursor(...) or plpy.cursor_open(...) or plpy.create_cursor(...)
Since we will be mostly stuck with the API once we release 9.2 this is worth
some opinions on. I like cursor() but if anyone disagrees now is the time.

This patch does not provide a wrapper around SPI_cursor_move. The patch
is useful without that and I don't see anything that preculdes someone else
adding that later if they see a need.

Documentation Review
---------------------

The patch includes documentation updates that describes the new feature.
The Database Access page doesn't provide a API style list of database access
functions like the plperl
http://www.postgresql.org/docs/9.1/interactive/plperl-builtins.html page
does. I think the organization of the perl page is
clearer than the python one and we should think about a doing some
documentaiton refactoring. That should be done as a seperate patch and
shouldn't be a barrier to committing this one.

Code Review
----------------

in PLy_cursor_plan line 4080
+ PG_TRY();
+ {
+ Portal portal;
+ char *volatile nulls;
+ volatile int j;
+
+ if (nargs > 0)
+ nulls = palloc(nargs * sizeof(char));
+ else
+ nulls = NULL;
+
+ for (j = 0; j < nargs; j++)
+ {
+ PyObject *elem;
I am probably not seeing a code path or misunderstanding something
about the setjmp/longjump usages but I don't see why nulls and j need to be
volatile here.

line 444
PLy_cursor(PyObject *self, PyObject *args)
+ {
+ char *query;
+ PyObject *plan;
+ PyObject *planargs = NULL;
+
+ if (PyArg_ParseTuple(args, "s", &query))
+ return PLy_cursor_query(query);
+

Should query be freed with PyMem_free()

Testing
-----------

I tested both python 2.6 and 3 on a Linux system

create or replace function x() returns text as $$
cur=None
try:
with plpy.subtransaction():
cur=plpy.cursor('select generate_series(1,1000)')
rows=cur.fetch(10);
plpy.execute('select f()')

except plpy.SPIError:
rows=cur.fetch(10);
return rows[0]['generate_series']
return 'none'
$$ language plpythonu;
select x();

crashes the backend
test=# select x();
The connection to the server was lost. Attempting reset: LOG: server
process (PID 3166) was terminated by signal 11: Segmentation fault

The below test gives me a strange error message:

create or replace function x1() returns text as $$
plan=None
try:
with plpy.subtransaction():
plpy.execute('CREATE TEMP TABLE z AS select generate_series(1,1000)')
plan=plpy.prepare('select * FROM z')
plpy.execute('select * FROM does_not_exist')
except plpy.SPIError, e:
cur=plpy.cursor(plan)
rows=cur.fetch(10)
return rows[0]['generate_series']
return '1'
$$ language plpythonu;
select x1();

test=# select x1()
test-# ;
ERROR: TypeError: Expected sequence of 82187072 arguments, got 0: <NULL>
CONTEXT: Traceback (most recent call last):
PL/Python function "x1", line 9, in <module>
cur=plpy.cursor(plan)
PL/Python function "x1"
STATEMENT: select x1()

I was expecting an error from the function just a bit more useful one.

Performance
-------------------
I did not do any specific performance testing but I don't see this patch
as having any impact to performance


From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Steve Singer <ssinger_pg(at)sympatico(dot)ca>
Cc: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plpython SPI cursors
Date: 2011-11-20 18:22:23
Message-ID: 4EC9455F.5070209@wulczer.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 20/11/11 19:14, Steve Singer wrote:
> On 11-10-15 07:28 PM, Jan Urbański wrote:
>> Hi,
>>
>> attached is a patch implementing the usage of SPI cursors in PL/Python.

> I found a few bugs (see my testing section below) that will need fixing
> + a few questions about the code

Hi Steve,

thanks a lot for the review, I'll investigate the errors you were
getting and post a follow-up.

Good catch on trying cursors with explicit subtransactions, I didn't
think about how they would interact.

Cheers,
Jan


From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Steve Singer <ssinger_pg(at)sympatico(dot)ca>
Cc: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plpython SPI cursors
Date: 2011-11-23 18:58:55
Message-ID: 4ECD426F.5060702@wulczer.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 20/11/11 19:14, Steve Singer wrote:
> On 11-10-15 07:28 PM, Jan Urbański wrote:
>> Hi,
>>
>> attached is a patch implementing the usage of SPI cursors in PL/Python.
>> Currently when trying to process a large table in PL/Python you have
>> slurp it all into memory (that's what plpy.execute does).
>>
>> J
>
> I found a few bugs (see my testing section below) that will need fixing
> + a few questions about the code

Responding now to all questions and attaching a revised patch based on
your comments.

> Do we like the name plpy.cursor or would we rather call it something like
> plpy.execute_cursor(...) or plpy.cursor_open(...) or
> plpy.create_cursor(...)
> Since we will be mostly stuck with the API once we release 9.2 this is
> worth
> some opinions on. I like cursor() but if anyone disagrees now is the time.

We use plpy.subtransaction() to create Subxact objects, so I though
plpy.cursor() would be most appropriate.

> This patch does not provide a wrapper around SPI_cursor_move. The patch
> is useful without that and I don't see anything that preculdes someone else
> adding that later if they see a need.

My idea is to add keyword arguments to plpy.cursor() that will allow you
to decide whether you want a scrollable cursor and after that provide a
move() method.

> The patch includes documentation updates that describes the new feature.
> The Database Access page doesn't provide a API style list of database
> access
> functions like the plperl
> http://www.postgresql.org/docs/9.1/interactive/plperl-builtins.html page
> does. I think the organization of the perl page is
> clearer than the python one and we should think about a doing some
> documentaiton refactoring. That should be done as a seperate patch and
> shouldn't be a barrier to committing this one.

Yeah, the PL/Python docs are a bit chaotic right now. I haven't yet
summoned force to overhaul them.

> in PLy_cursor_plan line 4080
> + PG_TRY();
> + {
> + Portal portal;
> + char *volatile nulls;
> + volatile int j;

> I am probably not seeing a code path or misunderstanding something
> about the setjmp/longjump usages but I don't see why nulls and j need to be
> volatile here.

It looked like you could drop volatile there (and in
PLy_spi_execute_plan, where this is copied from (did I mention there's
quite some code duplication in PL/Python?)) but digging in git I found
this commit:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2789b7278c11785750dd9d2837856510ffc67000

that added the original volatile qualification, so I guess there's a reason.

> line 444
> PLy_cursor(PyObject *self, PyObject *args)
> + {
> + char *query;
> + PyObject *plan;
> + PyObject *planargs = NULL;
> +
> + if (PyArg_ParseTuple(args, "s", &query))
> + return PLy_cursor_query(query);
> +
>
> Should query be freed with PyMem_free()

No, PyArg_ParseTuple returns a string on the stack, I check that
repeatedly creating a cursor with a plan argument does not leak memory
and that adding PyMem_Free there promptly leads to a segfault.

> I tested both python 2.6 and 3 on a Linux system
>
> [test cases demonstrating bugs]

Turns out it's a really bad idea to store pointers to Portal structures,
because they get invalidated by the subtransaction abort hooks.

I switched to storing the cursor name and looking it up in the
appropriate hash table every time it's used. The examples you sent
(which I included as regression tests) now cause a ValueError to be
raised with a message stating that the cursor has been created in an
aborted subtransaction.

Not sure about the wording of the error message, though.

Thanks again for the review!

Cheers,
Jan

Attachment Content-Type Size
cursor-support-for-plpython-v2.patch text/x-diff 40.8 KB

From: Steve Singer <ssinger_pg(at)sympatico(dot)ca>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plpython SPI cursors
Date: 2011-11-26 16:21:14
Message-ID: BLU0-SMTP2363BD56F6BB1AD3E704448ECC0@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11-11-23 01:58 PM, Jan Urbański wrote:
> On 20/11/11 19:14, Steve Singer wrote:
>> On 11-10-15 07:28 PM, Jan Urbański wrote:
>>> Hi,
>>>
>>> attached is a patch implementing the usage of SPI cursors in PL/Python.
>>> Currently when trying to process a large table in PL/Python you have
>>> slurp it all into memory (that's what plpy.execute does).
>>>
>>> J
>>
>> I found a few bugs (see my testing section below) that will need fixing
>> + a few questions about the code
>
> Responding now to all questions and attaching a revised patch based on
> your comments.
>

I've looked over the revised version of the patch and it now seems fine.

Ready for committer.

>> Do we like the name plpy.cursor or would we rather call it something
>> like
>> plpy.execute_cursor(...) or plpy.cursor_open(...) or
>> plpy.create_cursor(...)
>> Since we will be mostly stuck with the API once we release 9.2 this is
>> worth
>> some opinions on. I like cursor() but if anyone disagrees now is the
>> time.
>
> We use plpy.subtransaction() to create Subxact objects, so I though
> plpy.cursor() would be most appropriate.
>
>> This patch does not provide a wrapper around SPI_cursor_move. The patch
>> is useful without that and I don't see anything that preculdes
>> someone else
>> adding that later if they see a need.
>
> My idea is to add keyword arguments to plpy.cursor() that will allow
> you to decide whether you want a scrollable cursor and after that
> provide a move() method.
>
>> The patch includes documentation updates that describes the new feature.
>> The Database Access page doesn't provide a API style list of database
>> access
>> functions like the plperl
>> http://www.postgresql.org/docs/9.1/interactive/plperl-builtins.html page
>> does. I think the organization of the perl page is
>> clearer than the python one and we should think about a doing some
>> documentaiton refactoring. That should be done as a seperate patch and
>> shouldn't be a barrier to committing this one.
>
> Yeah, the PL/Python docs are a bit chaotic right now. I haven't yet
> summoned force to overhaul them.
>
>> in PLy_cursor_plan line 4080
>> + PG_TRY();
>> + {
>> + Portal portal;
>> + char *volatile nulls;
>> + volatile int j;
>
>> I am probably not seeing a code path or misunderstanding something
>> about the setjmp/longjump usages but I don't see why nulls and j need
>> to be
>> volatile here.
>
> It looked like you could drop volatile there (and in
> PLy_spi_execute_plan, where this is copied from (did I mention there's
> quite some code duplication in PL/Python?)) but digging in git I found
> this commit:
>
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2789b7278c11785750dd9d2837856510ffc67000
>
>
> that added the original volatile qualification, so I guess there's a
> reason.
>
>> line 444
>> PLy_cursor(PyObject *self, PyObject *args)
>> + {
>> + char *query;
>> + PyObject *plan;
>> + PyObject *planargs = NULL;
>> +
>> + if (PyArg_ParseTuple(args, "s", &query))
>> + return PLy_cursor_query(query);
>> +
>>
>> Should query be freed with PyMem_free()
>
> No, PyArg_ParseTuple returns a string on the stack, I check that
> repeatedly creating a cursor with a plan argument does not leak memory
> and that adding PyMem_Free there promptly leads to a segfault.
>
>
>> I tested both python 2.6 and 3 on a Linux system
>>
>> [test cases demonstrating bugs]
>
> Turns out it's a really bad idea to store pointers to Portal
> structures, because they get invalidated by the subtransaction abort
> hooks.
>
> I switched to storing the cursor name and looking it up in the
> appropriate hash table every time it's used. The examples you sent
> (which I included as regression tests) now cause a ValueError to be
> raised with a message stating that the cursor has been created in an
> aborted subtransaction.
>
> Not sure about the wording of the error message, though.
>
> Thanks again for the review!
>
> Cheers,
> Jan
>
>
>


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Steve Singer <ssinger_pg(at)sympatico(dot)ca>
Cc: Jan Urbański <wulczer(at)wulczer(dot)org>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plpython SPI cursors
Date: 2011-11-29 05:01:41
Message-ID: 1322542901.4595.7.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On lör, 2011-11-26 at 11:21 -0500, Steve Singer wrote:
> I've looked over the revised version of the patch and it now seems
> fine.
>
> Ready for committer.

I can take it from here.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Steve Singer <ssinger_pg(at)sympatico(dot)ca>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plpython SPI cursors
Date: 2011-12-05 17:58:07
Message-ID: 1323107887.10992.10.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On ons, 2011-11-23 at 19:58 +0100, Jan Urbański wrote:
> On 20/11/11 19:14, Steve Singer wrote:
> > On 11-10-15 07:28 PM, Jan Urbański wrote:
> >> Hi,
> >>
> >> attached is a patch implementing the usage of SPI cursors in PL/Python.
> >> Currently when trying to process a large table in PL/Python you have
> >> slurp it all into memory (that's what plpy.execute does).
> >>
> >> J
> >
> > I found a few bugs (see my testing section below) that will need fixing
> > + a few questions about the code
>
> Responding now to all questions and attaching a revised patch based on
> your comments.

Committed

Please refresh the other patch.


From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Steve Singer <ssinger_pg(at)sympatico(dot)ca>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plpython SPI cursors
Date: 2011-12-05 17:59:58
Message-ID: 4EDD069E.8090006@wulczer.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 05/12/11 18:58, Peter Eisentraut wrote:
> On ons, 2011-11-23 at 19:58 +0100, Jan Urbański wrote:
>> On 20/11/11 19:14, Steve Singer wrote:
>> Responding now to all questions and attaching a revised patch based on
>> your comments.
>
> Committed
>
> Please refresh the other patch.

Great, thanks!

I'll try to send an updated version of the other patch this evening.

Cheers,
Jan


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jan Urbański <wulczer(at)wulczer(dot)org>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Steve Singer <ssinger_pg(at)sympatico(dot)ca>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plpython SPI cursors
Date: 2011-12-05 18:12:19
Message-ID: 201112051812.pB5ICJC22418@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Urbaski wrote:
> On 05/12/11 18:58, Peter Eisentraut wrote:
> > On ons, 2011-11-23 at 19:58 +0100, Jan Urba?ski wrote:
> >> On 20/11/11 19:14, Steve Singer wrote:
> >> Responding now to all questions and attaching a revised patch based on
> >> your comments.
> >
> > Committed
> >
> > Please refresh the other patch.
>
> Great, thanks!
>
> I'll try to send an updated version of the other patch this evening.

I assume this is _not_ related to this TODO item:

Add a DB-API compliant interface on top of the SPI interface

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Jan Urbański <wulczer(at)wulczer(dot)org>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Peter Eisentraut <peter_e(at)gmx(dot)net>, Steve Singer <ssinger_pg(at)sympatico(dot)ca>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plpython SPI cursors
Date: 2011-12-05 18:20:34
Message-ID: 4EDD0B72.6090804@wulczer.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 05/12/11 19:12, Bruce Momjian wrote:
> Jan Urbański wrote:
>> On 05/12/11 18:58, Peter Eisentraut wrote:
>>> On ons, 2011-11-23 at 19:58 +0100, Jan Urba?ski wrote:
>>>> On 20/11/11 19:14, Steve Singer wrote:
>>>> Responding now to all questions and attaching a revised patch based on
>>>> your comments.
>>>
>>> Committed
>>>
>>> Please refresh the other patch.
>>
>> Great, thanks!
>>
>> I'll try to send an updated version of the other patch this evening.
>
> I assume this is _not_ related to this TODO item:
>
> Add a DB-API compliant interface on top of the SPI interface

No, not related.


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Jan Urbański <wulczer(at)wulczer(dot)org>, Steve Singer <ssinger_pg(at)sympatico(dot)ca>, Postgres - Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: plpython SPI cursors
Date: 2011-12-10 17:23:32
Message-ID: 1323537812.20837.0.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On mån, 2011-12-05 at 13:12 -0500, Bruce Momjian wrote:
> Jan Urbański wrote:
> > On 05/12/11 18:58, Peter Eisentraut wrote:
> > > On ons, 2011-11-23 at 19:58 +0100, Jan Urba?ski wrote:
> > >> On 20/11/11 19:14, Steve Singer wrote:
> > >> Responding now to all questions and attaching a revised patch based on
> > >> your comments.
> > >
> > > Committed
> > >
> > > Please refresh the other patch.
> >
> > Great, thanks!
> >
> > I'll try to send an updated version of the other patch this evening.
>
> I assume this is _not_ related to this TODO item:
>
> Add a DB-API compliant interface on top of the SPI interface

No, but this is:
http://petereisentraut.blogspot.com/2011/11/plpydbapi-db-api-for-plpython.html