Proposal: efficient iter on named cursors

Lists: psycopg
From: Daniele Varrazzo <daniele(dot)varrazzo(at)gmail(dot)com>
To: psycopg(at)postgresql(dot)org
Subject: Proposal: efficient iter on named cursors
Date: 2011-01-13 16:37:47
Message-ID: AANLkTimJMR3RFpWS4-QG-d6vaKZTGY_2eEuxCqcZv=p0@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

Hello,

you may know psycopg offers server-side cursors, known in
postgres/psycopg as named cursors. If you don't, well, this is good
news :) Named cursors are useful with datasets too big to be handled
by the client, as regular cursors transfer all the data to the client
during the execute() method. Named cursors on the other hand only
transfer the required amount of records to the client (one record with
cur.fetchone(), n with cursor.fetchmany(n)).

There is a shortcoming though: iter(cursor) will fetch the records one
at a time, with a noticeable time overhead in case of large recordsets
(exactly the ones you may want to retrieve with a named cursors...)
Currently the most efficient way to iterate on a named cursor is
something like:

nrecs = 100 # or some other reasonable number
while 1:
recs = cur.fetchmany(nrecs)
if not recs:
break
for rec in recs:
# do something

This would use only the memory used by nrecs record on the client and
require just 1/nrecs of the roundtrips required by a naive operation.
But it make the named cursors harder to use and not a drop-in
replacement for regular cursors that can be idiomatically used with:

for rec in cur:
# do something

So, I'd like to modify the cursor so that in case of __iter__, a
certain number of record is fetched and iteration is performed on
them. The cursor already has the state to keep the dataset so probably
only the code would require change, not so much the data structures.

How do we make the users choose their nrecs? I think the cursor should
have an attribute with a sensible default: 100? 1000? 1024? What
attribute name?

It shouldn't be hard to implement. Does anybody want to try it? I've
open the ticket #33
<http://psycopg.lighthouseapp.com/projects/62710/tickets/33> for any
update.

Comments?

Cheers,

-- Daniele


From: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>
To: psycopg(at)postgresql(dot)org
Subject: Re: Proposal: efficient iter on named cursors
Date: 2011-01-13 16:46:41
Message-ID: 4D2F2C71.8080805@dndg.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On 13/01/2011 17:37, Daniele Varrazzo wrote:
[snip]
> There is a shortcoming though: iter(cursor) will fetch the records one
> at a time, with a noticeable time overhead in case of large recordsets
> (exactly the ones you may want to retrieve with a named cursors...)
> Currently the most efficient way to iterate on a named cursor is
> something like:
>
> nrecs = 100 # or some other reasonable number
> while 1:
> recs = cur.fetchmany(nrecs)
> if not recs:
> break
> for rec in recs:
> # do something
>
> This would use only the memory used by nrecs record on the client and
> require just 1/nrecs of the roundtrips required by a naive operation.
> But it make the named cursors harder to use and not a drop-in
> replacement for regular cursors that can be idiomatically used with:
>
> for rec in cur:
> # do something
>
> So, I'd like to modify the cursor so that in case of __iter__, a
> certain number of record is fetched and iteration is performed on
> them. The cursor already has the state to keep the dataset so probably
> only the code would require change, not so much the data structures.
>
> How do we make the users choose their nrecs? I think the cursor should
> have an attribute with a sensible default: 100? 1000? 1024? What
> attribute name?
[snip]
> Comments?

I mostly agree. I'd like to see a .fetchsize parameter both on the
module, on the connection and on the cursor (usual psycopg cascade
initialization). A good default value, supposing rows of 10-20 columns
with a mix of textual and numeric data, is ~2000 because it will
probably fetch less than 100KB of data and that seems a good compromise
between the number of rows transferred and data you'll throw away if you
decide to leave the loop early.

federico

--
Federico Di Gregorio federico(dot)digregorio(at)dndg(dot)it
Studio Associato Di Nunzio e Di Gregorio http://dndg.it
If nobody understand you, that doesn't mean you're an artist.
-- anonymous


From: Daniele Varrazzo <daniele(dot)varrazzo(at)gmail(dot)com>
To: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>
Cc: psycopg(at)postgresql(dot)org
Subject: Re: Proposal: efficient iter on named cursors
Date: 2011-02-15 12:22:10
Message-ID: AANLkTi=y7x32LTLA65JKm5BSD3kN0rDRq+wC6HhjAqcg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On Thu, Jan 13, 2011 at 4:46 PM, Federico Di Gregorio
<federico(dot)digregorio(at)dndg(dot)it> wrote:
> On 13/01/2011 17:37, Daniele Varrazzo wrote:

>> So, I'd like to modify the cursor so that in case of __iter__, a
>> certain number of record is fetched and iteration is performed on
>> them. The cursor already has the state to keep the dataset so probably
>> only the code would require change, not so much the data structures.

> I mostly agree. I'd like to see a .fetchsize parameter both on the
> module, on the connection and on the cursor (usual psycopg cascade
> initialization). A good default value, supposing rows of 10-20 columns
> with a mix of textual and numeric data, is ~2000 because it will
> probably fetch less than 100KB of data and that seems a good compromise
> between the number of rows transferred and data you'll throw away if you
> decide to leave the loop early.

The feature discussed here is included in the current betas. In
particular, in the released implementation, we are using the already
existing 'arraysize' attribute (used as default for 'fetchmany()') as
the number of records to fetch per roundtrip. But because the default
for arraysize is 1 (per DB-API), and this value basically disables the
feature, 'iter()' currently uses the proposed value of 2000 as
default.

However the ticket #41
<http://psycopg.lighthouseapp.com/projects/62710/tickets/41>, while
invalid for me as I don't see "for record in cur" as a synonym for
"fetch one record at time", has made me wonder if we are too
aggressive with the default: maybe discarding arraysize=1 is not the
best option. Albeit somebody can still use the "while True + /
fetchone() + break" pattern to force record-per-record fetching, I
think if she is careful enough to use named cursors for its task she
may also care to set an appropriate value > 1 for arraysize. Named
cursors are still easier to use, but I don't want to make people think
they can be a replacement for *all* cursors - they still require more
resources on the server, so are better used only when required.

On this reasoning, I've committed this patch
<https://github.com/dvarrazzo/psycopg/commit/1dd71947bff415f1c79a5e6b81b6ba89d717ecb9>
in a separate branch: it makes iteration respect arraysize in its
default value 1 too, and improves the documentation explaining the
complete picture. If ok, the patch will be merged in devel.

Comments?

-- Daniele


From: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>
To: Daniele Varrazzo <daniele(dot)varrazzo(at)gmail(dot)com>
Cc: psycopg(at)postgresql(dot)org
Subject: Re: Proposal: efficient iter on named cursors
Date: 2011-02-17 10:44:13
Message-ID: 4D5CFBFD.8080103@dndg.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On 15/02/11 13:22, Daniele Varrazzo wrote:
> However the ticket #41
> <http://psycopg.lighthouseapp.com/projects/62710/tickets/41>, while
> invalid for me as I don't see "for record in cur" as a synonym for
> "fetch one record at time", has made me wonder if we are too
> aggressive with the default: maybe discarding arraysize=1 is not the
> best option. Albeit somebody can still use the "while True + /
> fetchone() + break" pattern to force record-per-record fetching, I
> think if she is careful enough to use named cursors for its task she
> may also care to set an appropriate value > 1 for arraysize. Named
> cursors are still easier to use, but I don't want to make people think
> they can be a replacement for *all* cursors - they still require more
> resources on the server, so are better used only when required.
>
> On this reasoning, I've committed this patch
> <https://github.com/dvarrazzo/psycopg/commit/1dd71947bff415f1c79a5e6b81b6ba89d717ecb9>
> in a separate branch: it makes iteration respect arraysize in its
> default value 1 too, and improves the documentation explaining the
> complete picture. If ok, the patch will be merged in devel.
>
> Comments?

I think the original implementation was right because "foreach ..."
doesn't mean fetch one record at a time. IMHO,

1) .fetchone() should _always_ fetch one record
2) iter(cursor) should fetch as many records as we feel right

But we can do a little trick here and make iter(cursor) respect
.arraysize if arraysize was explicitly set so that if one really wants
to fetch one record at a time can just set .arraysize to 1.

Good or bad?

federico

--
Federico Di Gregorio federico(dot)digregorio(at)dndg(dot)it
Studio Associato Di Nunzio e Di Gregorio http://dndg.it
I did appreciate the irony that I was whining about encoding issues on
a mailing list that was unable to show those chars, too.
-- Antti S. Lankila to mono-devel-list@


From: Daniele Varrazzo <daniele(dot)varrazzo(at)gmail(dot)com>
To: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>
Cc: psycopg(at)postgresql(dot)org
Subject: Re: Proposal: efficient iter on named cursors
Date: 2011-02-17 10:57:37
Message-ID: AANLkTi=HUtBqM3Oiy0riUj9TYXC-6GB-hD6=JYc_EBXN@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On Thu, Feb 17, 2011 at 10:44 AM, Federico Di Gregorio
<federico(dot)digregorio(at)dndg(dot)it> wrote:
> On 15/02/11 13:22, Daniele Varrazzo wrote:

>> On this reasoning, I've committed this patch
>> <https://github.com/dvarrazzo/psycopg/commit/1dd71947bff415f1c79a5e6b81b6ba89d717ecb9>
>> in a separate branch: it makes iteration respect arraysize in its
>> default value 1 too, and improves the documentation explaining the
>> complete picture. If ok, the patch will be merged in devel.
>>
>> Comments?
>
> I think the original implementation was right because "foreach ..."
> doesn't mean fetch one record at a time. IMHO,
>
> 1) .fetchone() should _always_ fetch one record
> 2) iter(cursor) should fetch as many records as we feel right

Yes, this is what I think too. It is consistent with what happens with
iter(file) vs. file.readline(). The only hitch is that the DBAPI asks
for a default of 1 for arraysize.

> But we can do a little trick here and make iter(cursor) respect
> .arraysize if arraysize was explicitly set so that if one really wants
> to fetch one record at a time can just set .arraysize to 1.
>
> Good or bad?

Quite tricky as arraysize is currently a simple property. Even if we
could do it with some property trickery, it would be surprising if
"print cur.arraysize" would return 1 and iter(cur) was efficient;
then, after "cur.arraysize = 1", iter(cur) would switch to fetch one
record at time, while "print cur.arraysize" would still report 1. I
feel it violates the principle of least astonishment, as much as being
difficult for the user to predict what the library would do.

-- Daniele


From: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>
To: Daniele Varrazzo <daniele(dot)varrazzo(at)gmail(dot)com>
Cc: psycopg(at)postgresql(dot)org
Subject: Re: Proposal: efficient iter on named cursors
Date: 2011-02-17 11:09:10
Message-ID: 4D5D01D6.8030607@dndg.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On 17/02/11 11:57, Daniele Varrazzo wrote:
>> I think the original implementation was right because "foreach ..."
>> > doesn't mean fetch one record at a time. IMHO,
>> >
>> > 1) .fetchone() should _always_ fetch one record
>> > 2) iter(cursor) should fetch as many records as we feel right
> Yes, this is what I think too. It is consistent with what happens with
> iter(file) vs. file.readline(). The only hitch is that the DBAPI asks
> for a default of 1 for arraysize.
>
>
>> > But we can do a little trick here and make iter(cursor) respect
>> > .arraysize if arraysize was explicitly set so that if one really wants
>> > to fetch one record at a time can just set .arraysize to 1.
>> >
>> > Good or bad?
> Quite tricky as arraysize is currently a simple property. Even if we
> could do it with some property trickery, it would be surprising if
> "print cur.arraysize" would return 1 and iter(cur) was efficient;
> then, after "cur.arraysize = 1", iter(cur) would switch to fetch one
> record at time, while "print cur.arraysize" would still report 1. I
> feel it violates the principle of least astonishment, as much as being
> difficult for the user to predict what the library would do.

Then we need a different property: itersize?

federico

--
Federico Di Gregorio federico(dot)digregorio(at)dndg(dot)it
Studio Associato Di Nunzio e Di Gregorio http://dndg.it
Io non sono romantica. La candelina sul tavolo mi vede e si spegne.
-- sisterconfusion


From: Daniele Varrazzo <daniele(dot)varrazzo(at)gmail(dot)com>
To: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>
Cc: psycopg(at)postgresql(dot)org
Subject: Re: Proposal: efficient iter on named cursors
Date: 2011-02-17 11:12:12
Message-ID: AANLkTimiLhbme59jpbr+frx7YNW0MJpMgv-WZ1eyVprA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On Thu, Feb 17, 2011 at 11:09 AM, Federico Di Gregorio
<federico(dot)digregorio(at)dndg(dot)it> wrote:
> On 17/02/11 11:57, Daniele Varrazzo wrote:
>>> I think the original implementation was right because "foreach ..."
>>> > doesn't mean fetch one record at a time. IMHO,
>>> >
>>> > 1) .fetchone() should _always_ fetch one record
>>> > 2) iter(cursor) should fetch as many records as we feel right
>> Yes, this is what I think too. It is consistent with what happens with
>> iter(file) vs. file.readline(). The only hitch is that the DBAPI asks
>> for a default of 1 for arraysize.
>>
>>
>>> > But we can do a little trick here and make iter(cursor) respect
>>> > .arraysize if arraysize was explicitly set so that if one really wants
>>> > to fetch one record at a time can just set .arraysize to 1.
>>> >
>>> > Good or bad?
>> Quite tricky as arraysize is currently a simple property. Even if we
>> could do it with some property trickery, it would be surprising if
>> "print cur.arraysize" would return 1 and iter(cur) was efficient;
>> then, after "cur.arraysize = 1", iter(cur) would switch to fetch one
>> record at time, while "print cur.arraysize" would still report 1. I
>> feel it violates the principle of least astonishment, as much as being
>> difficult for the user to predict what the library would do.
>
> Then we need a different property: itersize?

While I don't like the multiplication of attributes and extensions,
this sounds like the cleaner option.

-- Daniele


From: Daniele Varrazzo <daniele(dot)varrazzo(at)gmail(dot)com>
To: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>
Cc: psycopg(at)postgresql(dot)org
Subject: Re: Proposal: efficient iter on named cursors
Date: 2011-02-17 12:38:53
Message-ID: AANLkTinL6dH0yp-yuicJfiPK0vhiOfSuVBoVDtzXdLwZ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On Thu, Feb 17, 2011 at 11:12 AM, Daniele Varrazzo
<daniele(dot)varrazzo(at)gmail(dot)com> wrote:
> On Thu, Feb 17, 2011 at 11:09 AM, Federico Di Gregorio

>> Then we need a different property: itersize?
>
> While I don't like the multiplication of attributes and extensions,
> this sounds like the cleaner option.

It's in my devel.

-- Daniele


From: Federico Di Gregorio <federico(dot)digregorio(at)dndg(dot)it>
To: psycopg(at)postgresql(dot)org
Subject: Re: Proposal: efficient iter on named cursors
Date: 2011-02-17 12:44:36
Message-ID: 4D5D1834.5040408@dndg.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: psycopg

On 17/02/11 13:38, Daniele Varrazzo wrote:
> On Thu, Feb 17, 2011 at 11:12 AM, Daniele Varrazzo
> <daniele(dot)varrazzo(at)gmail(dot)com> wrote:
>> > On Thu, Feb 17, 2011 at 11:09 AM, Federico Di Gregorio
>>> >> Then we need a different property: itersize?
>> >
>> > While I don't like the multiplication of attributes and extensions,
>> > this sounds like the cleaner option.
> It's in my devel.

I just reviewed and merged your devel.

federico

--
Federico Di Gregorio federico(dot)digregorio(at)dndg(dot)it
Studio Associato Di Nunzio e Di Gregorio http://dndg.it
Qu'est ce que la folie? Juste un sentiment de liberté si
fort qu'on en oublie ce qui nous rattache au monde... -- J. de Loctra