Quick Links

Speed dblink using alternate libpq tuple storage

Lists:	pgsql-hackers

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Allow substitute allocators for PGresult.
Date:	2011-11-11 09:18:43
Message-ID:	20111111.181843.88228004.horiguchi.kyotaro@oss.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hello. This message is a proposal of a pair of patches that
enables the memory allocator for PGresult in libpq to be
replaced.

The comment at the the begging of pqexpbuffer.c says that libpq
should not rely on palloc(). Besides, Tom Lane said that palloc
should not be visible outside the backend(*1) and I agree with
it.

*1: http://archives.postgresql.org/pgsql-hackers/1999-02/msg00364.php

On the other hand, in dblink, dblink-plus (our product!), and
maybe FDW's connect to other PostgreSQL servers are seem to copy
the result of the query contained in PGresult into tuple store. I
guess that this is in order to avoid memory leakage on
termination in halfway.

But it is rather expensive to copy whole PGresult, and the
significance grows as the data received gets larger. Furthermore,
it requires about twice as much memory as the net size of the
data. And it is fruitless to copy'n modify libpq or reinvent it
from scratch. So we shall be happy to be able to use palloc's in
libpq at least for PGresult for such case in spite of the policy.

For these reasons, I propose to make allocators for PGresult
replaceable.

The modifications are made up into two patches.

1. dupEvents() and pqAddTuple() get new memory block by malloc
currently, but the aquired memory block is linked into
PGresult finally. So I think it is preferable to use
pqResultAlloc() or its descendents in consistensy with the
nature of the place to link.

But there is not PQresultRealloc() and it will be costly, so
pqAddTuple() is not modified in this patch.

2. Define three function pointers
PQpgresult_(malloc|realloc|free) and replace the calls to
malloc/realloc/free in the four functions below with these
pointers.

PQmakeEmptyPGresult()
pqResultAlloc()
PQclear()
pqAddTuple()

This patches make the tools run in backend process and use libpq
possible to handle PGresult as it is with no copy, no more memory.

(Of cource, someone wants to use his/her custom allocator for
PGresult on standalone tools could do that using this feature.)

Three files are attached to this message.

First, the patch with respect to "1" above.
Second, the patch with respect to "2" above.
Third, a very simple sample program.

I have built and briefly tested on CentOS6, with the sample
program mentioned above and valgrind, but not on Windows.

How do you think about this?

Regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment	Content-Type	Size
libpq_dupevents_alloc_r1.patch	text/x-patch	2.5 KB
libpq_replasable_alloc_r1.patch	text/x-patch	3.8 KB
testlibpq.c.gz	application/octet-stream	1.7 KB

From:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-11-11 09:29:30
Message-ID:	4EBCEAFA.6010507@enterprisedb.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 11.11.2011 11:18, Kyotaro HORIGUCHI wrote:
> The comment at the the begging of pqexpbuffer.c says that libpq
> should not rely on palloc(). Besides, Tom Lane said that palloc
> should not be visible outside the backend(*1) and I agree with
> it.
>
> *1: http://archives.postgresql.org/pgsql-hackers/1999-02/msg00364.php
>
> On the other hand, in dblink, dblink-plus (our product!), and
> maybe FDW's connect to other PostgreSQL servers are seem to copy
> the result of the query contained in PGresult into tuple store. I
> guess that this is in order to avoid memory leakage on
> termination in halfway.
>
> But it is rather expensive to copy whole PGresult, and the
> significance grows as the data received gets larger. Furthermore,
> it requires about twice as much memory as the net size of the
> data. And it is fruitless to copy'n modify libpq or reinvent it
> from scratch. So we shall be happy to be able to use palloc's in
> libpq at least for PGresult for such case in spite of the policy.
>
>
> For these reasons, I propose to make allocators for PGresult
> replaceable.

You could use the resource owner mechanism to track them. Register a
callback function with RegisterResourceReleaseCallback(). Whenever a
PGresult is returned from libpq, add it to e.g a linked list, kept in
TopMemoryContext, and also store a reference to CurrentResourceOwner in
the list element. In the callback function, scan through the list and
free all the PGresults associated with the resource owner that's being
released.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-11-11 23:48:55
Message-ID:	14204.1321055335@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 11.11.2011 11:18, Kyotaro HORIGUCHI wrote:
>> The comment at the the begging of pqexpbuffer.c says that libpq
>> should not rely on palloc(). Besides, Tom Lane said that palloc
>> should not be visible outside the backend(*1) and I agree with
>> it.
>>
>> *1: http://archives.postgresql.org/pgsql-hackers/1999-02/msg00364.php
>>
>> On the other hand, in dblink, dblink-plus (our product!), and
>> maybe FDW's connect to other PostgreSQL servers are seem to copy
>> the result of the query contained in PGresult into tuple store. I
>> guess that this is in order to avoid memory leakage on
>> termination in halfway.
>>
>> But it is rather expensive to copy whole PGresult, and the
>> significance grows as the data received gets larger. Furthermore,
>> it requires about twice as much memory as the net size of the
>> data. And it is fruitless to copy'n modify libpq or reinvent it
>> from scratch. So we shall be happy to be able to use palloc's in
>> libpq at least for PGresult for such case in spite of the policy.
>>
>> For these reasons, I propose to make allocators for PGresult
>> replaceable.

> You could use the resource owner mechanism to track them.

Heikki's idea is probably superior so far as PG backend usage is
concerned in isolation, but I wonder if there are scenarios where a
client application would like to be able to manage libpq's allocations.
If so, Kyotaro-san's approach would solve more problems than just
dblink's.

However, the bigger picture here is that I think Kyotaro-san's desire to
not have dblink return a tuplestore may be misplaced. Tuplestores can
spill to disk, while PGresults don't; so the larger the result, the
more important it is to push it into a tuplestore and PQclear it as soon
as possible.

Despite that worry, it'd likely be a good idea to adopt one or the other
of these solutions anyway, because I think there are corner cases where
dblink.c can leak a PGresult --- for instance, what if dblink_res_error
fails due to out-of-memory before reaching PQclear? And we could get
rid of the awkward and none-too-cheap PG_TRY blocks that it uses to try
to defend against such leaks in other places.

So I'm in favor of making a change along that line, although I'd want
to see more evidence before considering changing dblink to not return
tuplestores.

regards, tom lane

From:	Stephen Frost <sfrost(at)snowman(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-11-12 01:34:40
Message-ID:	20111112013439.GO24234@tamriel.snowman.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> Heikki's idea is probably superior so far as PG backend usage is
> concerned in isolation, but I wonder if there are scenarios where a
> client application would like to be able to manage libpq's allocations.

The answer to that is certainly 'yes'. It was one of the first things
that I complained about when moving from Oracle to PG. With OCI, you
can bulk load results directly into application-allocated memory areas.

Haven't been following the dblink discussion, so not going to comment
about that piece.

Thanks,

Stephen

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Stephen Frost <sfrost(at)snowman(dot)net>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-11-12 05:48:26
Message-ID:	19444.1321076906@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Stephen Frost <sfrost(at)snowman(dot)net> writes:
> * Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
>> Heikki's idea is probably superior so far as PG backend usage is
>> concerned in isolation, but I wonder if there are scenarios where a
>> client application would like to be able to manage libpq's allocations.

> The answer to that is certainly 'yes'. It was one of the first things
> that I complained about when moving from Oracle to PG. With OCI, you
> can bulk load results directly into application-allocated memory areas.

Well, loading data in a form whereby the application can access it
without going through the PGresult accessor functions would be an
entirely different (and vastly larger) project. I'm not sure I want
to open that can of worms --- it seems like you could write a huge
amount of code trying to provide every format someone might want,
and still find that there were impedance mismatches for many
applications.

AIUI Kyotaro-san is just suggesting that the app should be able to
provide a substitute malloc function for use in allocating PGresult
space (and not, I think, anything else that libpq allocates internally).
Basically this would allow PGresults to be cleaned up with methods other
than calling PQclear on each one. It wouldn't affect how you'd interact
with one while you had it. That seems like pretty much exactly what we
want for preventing memory leaks in the backend; but is it going to be
useful for other apps?

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Stephen Frost <sfrost(at)snowman(dot)net>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-11-12 06:36:30
Message-ID:	CA+TgmobDGGXXOYWokkJm8Tj8R0zHKzbag_cysJNLCXNdRXS97A@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Nov 12, 2011 at 12:48 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> AIUI Kyotaro-san is just suggesting that the app should be able to
> provide a substitute malloc function for use in allocating PGresult
> space (and not, I think, anything else that libpq allocates internally).
> Basically this would allow PGresults to be cleaned up with methods other
> than calling PQclear on each one. It wouldn't affect how you'd interact
> with one while you had it. That seems like pretty much exactly what we
> want for preventing memory leaks in the backend; but is it going to be
> useful for other apps?

I think it will.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-11-12 17:29:50
Message-ID:	29692.1321118990@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> writes:
> Hello. This message is a proposal of a pair of patches that
> enables the memory allocator for PGresult in libpq to be
> replaced.

Since there seems to be rough consensus that something like this would
be a good idea, I looked more closely at the details of the patch.
I think the design could use some adjustment.

To start with, the patch proposes exposing some global variables that
affect the behavior of libpq process-wide. This seems like a pretty bad
design, because a single process could contain multiple usages of libpq
with different requirements. As an example, if dblink.c were to set
these variables inside a backend process, it would break usage of libpq
from PL/Perl via DBI. Global variables also tend to be a bad idea
whenever you think about multi-threaded applications --- they require
locking facilities, which are not in this patch.

I think it'd be better to consider the PGresult alloc/free functions to
be a property of a PGconn, which you'd set with a function call along the
lines of PQsetResultAllocator(conn, alloc_func, realloc_func, free_func)
after having successfully opened a connection. Then we just have some
more PGconn fields (and I guess PGresult will need a copy of the
free_func pointer) and no new global variables.

I am also feeling dubious about whether it's a good idea to expect the
functions to have exactly the signature of malloc/free. They are
essentially callbacks, and in most places where a library provides for
callbacks, it's customary to include a "void *" passthrough argument
in case the callback needs some context information. I am not sure that
dblink.c would need such a thing, but if we're trying to design a
general-purpose feature, then we probably should have it. The cost
would be having shim functions inside libpq for the default case, but
it doesn't seem likely that they'd cost enough to notice.

The patch lacks any user documentation, which it surely must have if
we are claiming this is a user-visible feature. And I think it could
use some attention to updating code comments, notably the large block
about PGresult space management near the top of fe-exec.c.

Usually, when writing a feature of this sort, it's a good idea to
implement a prototype use-case to make sure you've not overlooked
anything. So I'd feel happier about the patch if it came along with
a patch to make dblink.c use it to prevent memory leaks.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-11-12 18:49:21
Message-ID:	1042.1321123761@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 11.11.2011 11:18, Kyotaro HORIGUCHI wrote:
>> For these reasons, I propose to make allocators for PGresult
>> replaceable.

> You could use the resource owner mechanism to track them.

BTW, I just thought of a potentially fatal objection to making PGresult
allocation depend on palloc: libpq is absolutely not prepared to handle
losing control on out-of-memory. While I'm not certain that its
behavior with malloc is entirely desirable either (it tends to loop in
hopes of getting the memory next time), we cannot just plop in palloc
in place of malloc and imagine that we're not breaking it.

This makes me think that Heikki's approach is by far the more tenable
one, so far as dblink is concerned. Perhaps the substitute-malloc idea
is still useful for some other application, but I'm inclined to put that
idea on the back burner until we have a concrete use case for it.

regards, tom lane

From:	Matteo Beccati <php(at)beccati(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-11-12 20:38:55
Message-ID:	4EBED95F.3010901@beccati.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 12/11/2011 07:36, Robert Haas wrote:
> On Sat, Nov 12, 2011 at 12:48 AM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> AIUI Kyotaro-san is just suggesting that the app should be able to
>> provide a substitute malloc function for use in allocating PGresult
>> space (and not, I think, anything else that libpq allocates internally).
>> Basically this would allow PGresults to be cleaned up with methods other
>> than calling PQclear on each one. It wouldn't affect how you'd interact
>> with one while you had it. That seems like pretty much exactly what we
>> want for preventing memory leaks in the backend; but is it going to be
>> useful for other apps?
>
> I think it will.

Maybe I've just talking nonsense, I just have little experience hacking
the pgsql and pdo-pgsql exstensions, but to me it would seem something
that could easily avoid an extra duplication of the data returned by
pqgetvalue. To me it seems a pretty nice win.

Cheers
--
Matteo Beccati

Development & Consulting - http://www.beccati.com/

From:	Stephen Frost <sfrost(at)snowman(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-11-12 22:02:23
Message-ID:	20111112220223.GR24234@tamriel.snowman.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

* Tom Lane (tgl(at)sss(dot)pgh(dot)pa(dot)us) wrote:
> Well, loading data in a form whereby the application can access it
> without going through the PGresult accessor functions would be an
> entirely different (and vastly larger) project.

Looking through the thread, I agree that it's a different thing than
what's being discussed here.

> I'm not sure I want
> to open that can of worms --- it seems like you could write a huge
> amount of code trying to provide every format someone might want,
> and still find that there were impedance mismatches for many
> applications.

The OCI approach is actually very similar to how we handle our
catalogs internally.. Imagine you define a C struct which matched your
table structure, then you allocate 5000 (or however) of those, give the
base pointer to the 'getResult' call and a integer array of offsets into
that structure for each of the columns. There might have been a few
other minor things (like some notion of how to handle NULLs), but it was
pretty straight-forward from the C perspective, imv.

Trying to provide alternative formats (I'm guessing you were referring
to something like XML..? Or some complex structure?) would certainly be
a whole different ballgame.

Thanks,

Stephen

> AIUI Kyotaro-san is just suggesting that the app should be able to
> provide a substitute malloc function for use in allocating PGresult
> space (and not, I think, anything else that libpq allocates internally).
> Basically this would allow PGresults to be cleaned up with methods other
> than calling PQclear on each one. It wouldn't affect how you'd interact
> with one while you had it. That seems like pretty much exactly what we
> want for preventing memory leaks in the backend; but is it going to be
> useful for other apps?
>
> regards, tom lane

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To:	heikki(dot)linnakangas(at)enterprisedb(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-11-22 10:56:55
Message-ID:	20111122.195655.117506884.horiguchi.kyotaro@oss.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hello,

At Fri, 11 Nov 2011 11:29:30 +0200, Heikki Linnakangas wrote
> You could use the resource owner mechanism to track
> them. Register a callback function with
> RegisterResourceReleaseCallback().

Thank you for letting me know about it. I have dug up a message
in pg-hackers refering to the mechanism on discussion about
postgresql-fdw. I'll put further thought into dblink-plus taking
it into account.

By the way, thinking about memory management for the result in
libpq is considerable as another issue.

At Sat, 12 Nov 2011 12:29:50 -0500, Tom Lane wrote
> To start with, the patch proposes exposing some global
> variables that affect the behavior of libpq process-wide. This
> seems like a pretty bad design, because a single process could
> contain multiple usages of libpq

You're right to say the design is bad. I've designed it to have
minimal impact on libpq by limiting usage and imposing any
reponsibility on the users, that is the developers of the modules
using it. If there are any other applications that want to use
their own allocators, there are some points to be considered.

I think it is preferable consiering multi-threading to make libpq
write PGresult into memory blocks passed from the application
like OCI does, instead of letting libpq itself make request for
them.

This approach hands the responsibility of memory management to
the user and gives them the capability to avoid memory exhaustion
by their own measures.

On the other hand, this way could produce the situation that
libpq cannot write all of the data to receive from the server
onto handed memory block. So, the API must be able to return the
partial data to the caller.

More advancing, if libpq could store the result directly into
user-allocated memory space using tuplestore-like interface, it
is better on performance if the final storage is a tuplestore
itself.

I will be happy with the part-by-part passing of result. So I
will think about this as the next issue.

> So I'd feel happier about the patch if it came along with a
> patch to make dblink.c use it to prevent memory leaks.

I take it is about my original patch.

Mmm, I heard that dblink copies received data in PGResult to
tuple store not only because of the memory leak, but less memory
usage (after the copy is finished). I think I could show you the
patch ignoring the latter, but it might take some time for me to
start from understand dblink and tuplestore closely...

If I find RegisterResourceReleaseCallback short for our
requirement, I will show it. If not, I withdraw this patch for
ongoing CF and propose another patch based on the discussion
above at another time.

Please let me have a little more time.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To:	heikki(dot)linnakangas(at)enterprisedb(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-12-01 10:18:04
Message-ID:	20111201.191804.237535571.horiguchi.kyotaro@oss.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hello,

me> I'll put further thought into dblink-plus taking it into
me> account.
..
me> Please let me have a little more time.

I've inquired the developer of dblink-plus about
RegisterResourceReleaseCallback(). He said that the function is
in bad compatibility with current implementation. In addition to
this, storing into tuplestore directly seems to me a good idea
than palloc'ed PGresult.

So I tried to make libpq/PGresult be able to handle alternative
tuple store by hinting to PGconn, and modify dblink to use the
mechanism as the first sample code.

I will show it as a series of patches in next message.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-12-01 10:24:19
Message-ID:	20111201.192419.103527179.horiguchi.kyotaro@oss.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hello, This is the next version of Allow substitute allocators
for PGresult.

Totally chaning the concept from the previous one, this patch
allows libpq to handle alternative tuple store for received
tuples.

Design guidelines are shown below.

- No need to modify existing client code of libpq.

- Existing libpq client runs with roughly same performance, and
dblink with modification runs faster to some extent and
requires less memory.

I have measured roughly of run time and memory requirement for
three configurations on CentOS6 on Vbox with 2GB mem 4 cores
running on Win7-Corei7, transferring (30 bytes * 2 cols) *
2000000 tuples (120MB net) within this virutal machine. The
results are below.

xfer time Peak RSS
Original : 6.02s 850MB
libpq patch + Original dblink : 6.11s 850MB
full patch : 4.44s 643MB

xfer time here is the mean of five 'real time's measured by
running sql script like this after warmup run.

=== test.sql
select dblink_connect('c', 'host=localhost port=5432 dbname=test');
select * from dblink('c', 'select a,c from foo limit 2000000') as (a text, b bytea) limit 1;

select dblink_disconnect('c');
===
$ for i in $(seq 1 10); do time psql test -f t.sql; done
===

Peak RSS is measured by picking up heap Rss in /proc/[pid]/smaps.

It seems somewhat slow using patched libpq and original dblink,
but it seems within error range too. If this amount of slowdown
is not permissible, it might be improved by restoring the static
call route before for extra redundancy of the code.

On the other hand, full patch version seems obviously fast and
requires less memory. Isn't it nice?

This patch consists of two sub patches.

The first is a patch for libpq to allow rewiring tuple storage
mechanism. But default behavior is not changed. Existing libpq
client should run with it.

The second is modify dblink to storing received tuples into
tuplestore directly using the mechanism above.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment	Content-Type	Size
libpq_subst_storage_v1.patch	text/x-patch	15.8 KB
dblink_direct_tuplestore_v1.patch	text/x-patch	11.4 KB

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-12-01 10:48:03
Message-ID:	20111201.194803.07040624.horiguchi.kyotaro@oss.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Ouch! I'm sorry for making a reverse patch for the first modification.

This is an amendment of the message below. The body text is
copied into this message.

http://archives.postgresql.org/message-id/20111201.192419.103527179.horiguchi.kyotaro@oss.ntt.co.jp

=======
Hello, This is the next version of Allow substitute allocators
for PGresult.

Totally chaning the concept from the previous one, this patch
allows libpq to handle alternative tuple store for received
tuples.

Design guidelines are shown below.

- No need to modify existing client code of libpq.

- Existing libpq client runs with roughly same performance, and
dblink with modification runs faster to some extent and
requires less memory.

xfer time Peak RSS
Original : 6.02s 850MB
libpq patch + Original dblink : 6.11s 850MB
full patch : 4.44s 643MB

xfer time here is the mean of five 'real time's measured by
running sql script like this after warmup run.

=== test.sql
select dblink_connect('c', 'host=localhost port=5432 dbname=test');
select * from dblink('c', 'select a,c from foo limit 2000000') as (a text, b bytea) limit 1;

select dblink_disconnect('c');
===
$ for i in $(seq 1 10); do time psql test -f t.sql; done
===

Peak RSS is measured by picking up heap Rss in /proc/[pid]/smaps.

On the other hand, full patch version seems obviously fast and
requires less memory. Isn't it nice?

This patch consists of two sub patches.

The first is a patch for libpq to allow rewiring tuple storage
mechanism. But default behavior is not changed. Existing libpq
client should run with it.

The second is modify dblink to storing received tuples into
tuplestore directly using the mechanism above.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment	Content-Type	Size
libpq_subst_storage_v1.patch	text/x-patch	15.8 KB
dblink_direct_tuplestore_v1.patch	text/x-patch	11.4 KB

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-12-08 10:41:56
Message-ID:	20111208.194156.192895812.horiguchi.kyotaro@oss.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hello,

The documentation had slipped my mind.

This is the patch to add the documentation of PGresult custom
storage. It shows in section '31.19. Alternative result
storage'.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment	Content-Type	Size
libpq_subst_storage_doc_v1.patch	text/x-patch	9.9 KB

From:	Greg Smith <greg(at)2ndQuadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-12-16 19:25:16
Message-ID:	4EEB9B1C.1020300@2ndQuadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On 12/01/2011 05:48 AM, Kyotaro HORIGUCHI wrote:
> xfer time Peak RSS
> Original : 6.02s 850MB
> libpq patch + Original dblink : 6.11s 850MB
> full patch : 4.44s 643MB
>

These look like interesting results. Currently Tom is listed as the
reviewer on this patch, based on comments made before the CF really
started. And the patch has been incorrectly been sitting in "Waiting
for author" for the last week; oops. I'm not sure what to do with this
one now except raise a general call to see if anyone wants to take a
look at it, now that it seems to be in good enough shape to deliver
measurable results.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Greg Smith <greg(at)2ndQuadrant(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-12-18 04:09:39
Message-ID:	21132.1324181379@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Greg Smith <greg(at)2ndQuadrant(dot)com> writes:
> On 12/01/2011 05:48 AM, Kyotaro HORIGUCHI wrote:
>> xfer time Peak RSS
>> Original : 6.02s 850MB
>> libpq patch + Original dblink : 6.11s 850MB
>> full patch : 4.44s 643MB

> These look like interesting results. Currently Tom is listed as the
> reviewer on this patch, based on comments made before the CF really
> started. And the patch has been incorrectly been sitting in "Waiting
> for author" for the last week; oops. I'm not sure what to do with this
> one now except raise a general call to see if anyone wants to take a
> look at it, now that it seems to be in good enough shape to deliver
> measurable results.

I did list myself as reviewer some time ago, but if anyone else wants to
take it I won't be offended ;-)

regards, tom lane

From:	Robert Haas <robertmhaas(at)gmail(dot)com>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-12-21 16:09:59
Message-ID:	CA+TgmoagHm_06sG6AhZRC8Oe+OHLhCdhAphZ+sn-cV5GQi24uQ@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Thu, Dec 8, 2011 at 5:41 AM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
> This is the patch to add the documentation of PGresult custom
> storage. It shows in section '31.19. Alternative result
> storage'.

It would be good to consolidate this into the main patch.

I find the names of the functions added here to be quite confusing and
would suggest renaming them. I expected PQgetAsCstring to do
something similar to PQgetvalue, but the code is completely different,
and even after reading the documentation I still don't understand what
that function is supposed to be used for. Why "as cstring"? What
would the other option be?

I also don't think the "add tuple" terminology is particularly good.
It's not obvious from the name that what you're doing is overriding
the way memory is allocated and results are stored.

Also, what about the problem Tom mentioned here?

http://archives.postgresql.org/message-id/1042.1321123761@sss.pgh.pa.us

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To:	robertmhaas(at)gmail(dot)com
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Allow substitute allocators for PGresult.
Date:	2011-12-23 07:38:28
Message-ID:	20111223.163828.75337757.horiguchi.kyotaro@oss.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hello, thank you for taking the time for comment.

At Wed, 21 Dec 2011 11:09:59 -0500, Robert Haas <robertmhaas(at)gmail(dot)com> wrote
> I find the names of the functions added here to be quite
> confusing and would suggest renaming them. I expected
> PQgetAsCstring to do something similar to PQgetvalue, but the
> code is completely different,

To be honest, I've also felt that kind of perplexity. If the
problem is simply of the naming, I can propose the another name
"PQreadAttValue"... This is not so good too...

But...

> and even after reading the documentation I still don't
> understand what that function is supposed to be used for. Why
> "as cstring"? What would the other option be?

Is it a problem of the poor description? Or about the raison
d'être of the function?

The immediate cause of the existence of the function is that
getAnotherTuple internally stores the field values of the tuples
sent from the server, in the form of PGresAttValue, and I have
found only one route to store a tuple into TupleStore is
BuildeTupleFromCStrings() to tupelstore_puttuple() which is
dblink does in materializeResult(), and of cource C-string is the
most natural format in C program, and I have hesitated to modify
execTuples.c, and I wanted to hide the details of PGresAttValue.

Assuming that the values are passed as PGresAttValue* is given
(for the reasons of performance and the extent of the
modification), the "adding tuple" functions should get the value
from the struct. This can be done in two ways from the view of
authority (`encapsulation', in other words) and convenience, one
is with the knowledge of the structure, and the other is without
it. PQgetAsCstring is the latter approach. (And it is
inconsistent with the fact that the definition of PGresAttValue
is moved into lipq-fe.h from libpq-int.h. The details of the
structure should be hidden like PGresult in this approach).

But it is not obvious that the choice is better than the another
one. If we consider that PGresAttValue is too simple and stable
to hide the details, PQgetAsCString will be taken off and the
problem will go out. PGresAttValue needs documentation in this
case.

I prefer to handle PGresAttValue directly if no problem.

> I also don't think the "add tuple" terminology is particularly good.
> It's not obvious from the name that what you're doing is overriding
> the way memory is allocated and results are stored.

This phrase is taken from pqAddTuple() in fe-exec.c at first and
have not been changed after the function is integrated with other
functions.

I propose "tuple storage handler" for the alternative.

- typedef void *(*addTupleFunction)(...);
+ typedef void *(*tupleStorageHandler)(...);

- typedef enum { ADDTUP_*, } AddTupFunc;
+ typedef enum { TSHANDLER_*, } TSHandlerCommand;

- void *PQgetAddTupleParam(...);
+ void *PQgetTupleStrageHandlerContext(...);

- void PQregisterTupleAdder(...);
+ void PQregisterTupleStoreHandler(...);

- addTupleFunction PGresult.addTupleFunc;
+ tupleStorageHandler PGresult.tupleStorageHandlerFunc;

- void *PGresult.addTuleFuncParam;
+ void *PGresult.tupleStorageHandlerContext;

- char *PGresult.addTuleFuncErrMes;
+ void *PGresult.tupelStrageHandlerErrMes;

> Also, what about the problem Tom mentioned here?
>
> http://archives.postgresql.org/message-id/1042.1321123761@sss.pgh.pa.us

The plan that simply replace malloc's with something like
palloc's is abandoned for the narrow scope.

dblink-plus copies whole PGresult into TupleStore in order to
avoid making orphaned memory on SIGINT. The resource owner
mechanism is principally applicable to that but practically hard
for the reason that current implementation without radically
modification couldn't accept it.. In addition to that, dblink
also does same thing for maybe the same reason with dblink-plus
and another reason as far as I heard.

Whatever the reason is, both dblink and dblink-plus do the same
thing that could lower the performance than expected.

If TupleStore(TupleDesc) is preferable to PGresult for in-backend
use and oridinary(client-use) libpq users can handle only
PGresult, the mechanism like this patch would be reuired to
maintain the compatibility, I think. To the contrary, if there is
no significant reason to use TupleStore in backend use - it
implies that existing mechanisms like resource owner can save the
backend inexpensively from possible inconvenience caused by using
PGresult storage in backends - PGresult should be used as it is.

I think TupleStore prepared to be used in backend is preferable
for the usage and don't want to get data making detour via
PGresult.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

From:	Greg Smith <greg(at)2ndQuadrant(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Speed dblink using alternate libpq tuple storage
Date:	2012-01-14 04:55:47
Message-ID:	4F110AD3.2090607@2ndQuadrant.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

One patch that fell off the truck during a turn in the November
CommitFest was "Allow substitute allocators for PGresult". Re-reading
the whole thing again, it actually turned into a rather different
submission in the middle, and I know I didn't follow that shift
correctly then. I'm replying to its thread but have changed the subject
to reflect that change. From a procedural point of view, I don't feel
right kicking this back to its author on a Friday night when the
deadline for resubmitting it would be Sunday. Instead I've refreshed
the patch myself and am adding it to the January CommitFest. The new
patch is a single file; it's easy enough to split out the dblink changes
if someone wants to work with the pieces separately.

After my meta-review I think we should get another reviewer familiar
with using dblink to look at this next. This is fundamentally a
performance patch now. Some results and benchmarking code were
submitted along with it; the other issues are moot if those aren't
reproducible. The secondary goal for a new review here is to provide
another opinion on the naming issues and abstraction concerns raised so far.

To clear out the original line of thinking, this is not a replacement
low-level storage allocator anymore. The idea of using such a mechanism
to help catch memory leaks has also been dropped.

Instead this adds and documents a new path for libpq callers to more
directly receive tuples, for both improved speed and lower memory
usage. dblink has been modified to use this new mechanism.
Benchmarking by the author suggests no significant change in libpq speed
when only that change was made, while the modified dblink using the new
mechanism was significantly faster. It jumped from 332K tuples/sec to
450K, a 35% gain, and had a lower memory footprint too. Test
methodology and those results are at
http://archives.postgresql.org/pgsql-hackers/2011-12/msg00008.php

Robert Haas did a quick code review of this already, it along with
author response mixed in are at
http://archives.postgresql.org/pgsql-hackers/2011-12/msg01149.php I see
two areas of contention there:

-There are several naming bits no one is happy with yet. Robert didn't
like some of them, but neither did Kyotaro. I don't have an opinion
myself. Is it the case that some changes to the existing code's
terminology are what's actually needed to make this all better? Or is
this just fundamentally warty and there's nothing to be done about it.
Dunno.

-There is an abstraction wrapper vs. coding convenience trade-off
centering around PGresAttValue. It sounded to me like it raised always
fun questions like "where's the right place for the line between
lipq-fe.h and libpq-int.h to be?"

dblink is pretty popular, and this is a big performance win for it. If
naming and API boundary issues are the worst problems here, this sounds
like something well worth pursuing as part of 9.2's still advancing
performance theme.

--
Greg Smith 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com

Attachment	Content-Type	Size
alt-result-storage-v1.patch	text/x-patch	42.1 KB

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To:	pgsql-hackers(at)postgresql(dot)org, greg(at)2ndQuadrant(dot)com
Subject:	Re: Speed dblink using alternate libpq tuple storage
Date:	2012-01-17 08:53:33
Message-ID:	20120117.175333.162644966.horiguchi.kyotaro@oss.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hello, This is revised and rebased version of the patch.

a. Old term `Add Tuple Function' is changed to 'Store
Handler'. The reason why not `storage' is simply length of the
symbols.

b. I couldn't find the place to settle PGgetAsCString() in. It is
removed and storeHandler()@dblink.c touches PGresAttValue
directly in this new patch. Definition of PGresAttValue stays
in lipq-fe.h and provided with comment.

c. Refine error handling of dblink.c. I think it preserves the
previous behavior for column number mismatch and type
conversion exception.

d. Document is revised.

> It jumped from 332K tuples/sec to 450K, a 35% gain, and had a
> lower memory footprint too. Test methodology and those results
> are at
> http://archives.postgresql.org/pgsql-hackers/2011-12/msg00008.php

It is a disappointment that I found that the gain had become
lower than that according to the re-measuring.

For CentOS6.2 and other conditions are the same to the previous
testing, the overall performance became hihger and the loss of
libpq patch was 1.8% and the gain of full patch had been fallen
to 5.6%. But the reduction of the memory usage was not changed.

Original : 3.96s 100.0%
w/libpq patch : 4.03s 101.8%
w/libpq+dblink patch : 3.74s 94.4%

The attachments are listed below.

libpq_altstore_20120117.patch
- Allow alternative storage for libpql.

dblink_perf_20120117.patch
- Modify dblink to use alternative storage mechanism.

libpq_altstore_doc_20120117.patch
- Document for libpq_altstore. Shows in "31.19. Alternatie result storage"

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment	Content-Type	Size
libpq_altstore_20120117.patch	text/x-patch	15.6 KB
dblink_perf_20120117.patch	text/x-patch	11.8 KB
libpq_altstore_doc_20120117.patch	text/x-patch	9.2 KB

From:	Marko Kreen <markokr(at)gmail(dot)com>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org, greg(at)2ndQuadrant(dot)com
Subject:	Re: Speed dblink using alternate libpq tuple storage
Date:	2012-01-20 14:49:45
Message-ID:	20120120144945.GA4863@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Tue, Jan 17, 2012 at 05:53:33PM +0900, Kyotaro HORIGUCHI wrote:
> Hello, This is revised and rebased version of the patch.
>
> a. Old term `Add Tuple Function' is changed to 'Store
> Handler'. The reason why not `storage' is simply length of the
> symbols.
>
> b. I couldn't find the place to settle PGgetAsCString() in. It is
> removed and storeHandler()@dblink.c touches PGresAttValue
> directly in this new patch. Definition of PGresAttValue stays
> in lipq-fe.h and provided with comment.
>
> c. Refine error handling of dblink.c. I think it preserves the
> previous behavior for column number mismatch and type
> conversion exception.
>
> d. Document is revised.

First, my priority is one-the-fly result processing,
not the allocation optimizing. And this patch seems to make
it possible, I can process results row-by-row, without the
need to buffer all of them in PQresult. Which is great!

But the current API seems clumsy, I guess its because the
patch grew from trying to replace the low-level allocator.

I would like to propose better one-shot API with:

void *(*RowStoreHandler)(PGresult *res, PGresAttValue *columns);

where the PGresAttValue * is allocated once, inside PQresult.
And the pointers inside point directly to network buffer.
Ofcourse this requires replacing the current per-column malloc+copy
pattern with per-row parse+handle pattern, but I think resulting
API will be better:

1) Pass-through processing do not need to care about unnecessary
per-row allocations.

2) Handlers that want to copy of the row (like regular libpq),
can optimize allocations by having "global" view of the row.
(Eg. One allocation for row header + data).

This also optimizes call patterns - first libpq parses packet,
then row handler processes row, no unnecessary back-and-forth.

Summary - current API has various assumptions how the row is
processed, let's remove those.

--
marko

From:	"Marc Mamin" <M(dot)Mamin(at)intershop(dot)de>
To:	"Marko Kreen" <markokr(at)gmail(dot)com>, "Kyotaro HORIGUCHI" <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	<pgsql-hackers(at)postgresql(dot)org>, <greg(at)2ndQuadrant(dot)com>
Subject:	Re: Speed dblink using alternate libpq tuple storage
Date:	2012-01-21 11:52:34
Message-ID:	C4DAC901169B624F933534A26ED7DF310861B301@JENMAIL01.ad.intershop.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

> >
> > c. Refine error handling of dblink.c. I think it preserves the
> > previous behavior for column number mismatch and type
> > conversion exception.

Hello,

I don't know if this cover following issue.
I just mention it for the case you didn't notice it and would like to
handle this rather cosmetic issue as well.

http://archives.postgresql.org/pgsql-bugs/2011-08/msg00113.php

best regards,

Marc Mamin

From:	Marko Kreen <markokr(at)gmail(dot)com>
To:	Marc Mamin <M(dot)Mamin(at)intershop(dot)de>
Cc:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org, greg(at)2ndquadrant(dot)com
Subject:	Re: Speed dblink using alternate libpq tuple storage
Date:	2012-01-23 11:17:49
Message-ID:	CACMqXCJ9DTHvQ2xrc8PKrgvP8SrryNk4XS0BJ1jcSdHPpsG2Ew@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Sat, Jan 21, 2012 at 1:52 PM, Marc Mamin <M(dot)Mamin(at)intershop(dot)de> wrote:
>> >
>> > c. Refine error handling of dblink.c. I think it preserves the
>> > previous behavior for column number mismatch and type
>> > conversion exception.
>
> Hello,
>
> I don't know if this cover following issue.
> I just mention it for the case you didn't notice it and would like to
> handle this rather cosmetic issue as well.
>
> http://archives.postgresql.org/pgsql-bugs/2011-08/msg00113.php

It is not relevant to this thread, but seems good idea to implement indeed.
It should be simple matter of creating handler that uses dblink_res_error()
to report the notice.

Perhaps you could create and submit the patch by yourself?

For reference, here it the full flow in PL/Proxy:

1) PQsetNoticeReceiver:
https://github.com/markokr/plproxy-dev/blob/master/src/execute.c#L422
2) handle_notice:
https://github.com/markokr/plproxy-dev/blob/master/src/execute.c#L370
3) plproxy_remote_error:
https://github.com/markokr/plproxy-dev/blob/master/src/main.c#L82

--
marko

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To:	markokr(at)gmail(dot)com
Cc:	pgsql-hackers(at)postgresql(dot)org, greg(at)2ndQuadrant(dot)com
Subject:	Re: Speed dblink using alternate libpq tuple storage
Date:	2012-01-27 04:45:15
Message-ID:	20120127.134515.205871834.horiguchi.kyotaro@oss.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Thank you for the comment,

> First, my priority is one-the-fly result processing,
> not the allocation optimizing. And this patch seems to make
> it possible, I can process results row-by-row, without the
> need to buffer all of them in PQresult. Which is great!
>
> But the current API seems clumsy, I guess its because the
> patch grew from trying to replace the low-level allocator.

Exactly.

> I would like to propose better one-shot API with:
>
> void *(*RowStoreHandler)(PGresult *res, PGresAttValue *columns);
>
> where the PGresAttValue * is allocated once, inside PQresult.
> And the pointers inside point directly to network buffer.

Good catch, thank you. The patch is dragging too much from the
old implementation. It is no need to copy the data inside
getAnotherTuple to do it, as you say.

> Ofcourse this requires replacing the current per-column malloc+copy
> pattern with per-row parse+handle pattern, but I think resulting
> API will be better:
>
> 1) Pass-through processing do not need to care about unnecessary
> per-row allocations.
>
> 2) Handlers that want to copy of the row (like regular libpq),
> can optimize allocations by having "global" view of the row.
> (Eg. One allocation for row header + data).
>
> This also optimizes call patterns - first libpq parses packet,
> then row handler processes row, no unnecessary back-and-forth.
>
>
> Summary - current API has various assumptions how the row is
> processed, let's remove those.

Thank you, I rewrite the patch to make it realize.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To:	markokr(at)gmail(dot)com
Cc:	pgsql-hackers(at)postgresql(dot)org, greg(at)2ndQuadrant(dot)com
Subject:	Re: Speed dblink using alternate libpq tuple storage
Date:	2012-01-27 08:57:01
Message-ID:	20120127.175701.135957224.horiguchi.kyotaro@oss.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Hello, This is a new version of the patch formerly known as
'alternative storage for libpq'.

- Changed the concept to 'Alternative Row Processor' from
'Storage handler'. Symbol names are also changed.

- Callback function is modified following to the comment.

- From the restriction of time, I did minimum check for this
patch. The purpose of this patch is to show the new implement.

- Proformance is not measured for this patch for the same
reason. I will do that on next monday.

- The meaning of PGresAttValue is changed. The field 'value' now
contains a value withOUT terminating zero. This change seems to
have no effect on any other portion within the whole source
tree of postgresql from what I've seen.

> > I would like to propose better one-shot API with:
> >
> > void *(*RowStoreHandler)(PGresult *res, PGresAttValue *columns);
...
> > 1) Pass-through processing do not need to care about unnecessary
> > per-row allocations.
> >
> > 2) Handlers that want to copy of the row (like regular libpq),
> > can optimize allocations by having "global" view of the row.
> > (Eg. One allocation for row header + data).

I expect the new implementation is far more better than the
orignal.

regargs,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment	Content-Type	Size
libpq_rowproc_20120127.patch	text/x-patch	18.2 KB
libpq_rowproc_doc_20120127.patch	text/x-patch	6.0 KB
dblink_use_rowproc_20120127.patch	text/x-patch	11.5 KB

From:	Merlin Moncure <mmoncure(at)gmail(dot)com>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	markokr(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, greg(at)2ndquadrant(dot)com
Subject:	Re: Speed dblink using alternate libpq tuple storage
Date:	2012-01-27 15:35:04
Message-ID:	CAHyXU0wYA6A0+K=iH2-oX+6Q20TFqhk6o7dOrurEEOY=5iftLg@mail.gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 27, 2012 at 2:57 AM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp> wrote:
> Hello, This is a new version of the patch formerly known as
> 'alternative storage for libpq'.

I took a quick look at the patch and the docs. Looks good and agree
with rationale and implementation. I see you covered the pqsetvalue
case which is nice. I expect libpq C api clients coded for
performance will immediately gravitate to this api.

> - The meaning of PGresAttValue is changed. The field 'value' now
> contains a value withOUT terminating zero. This change seems to
> have no effect on any other portion within the whole source
> tree of postgresql from what I've seen.

This is a minor point of concern. This function was exposed to
support libpqtypes (which your stuff compliments very nicely by the
way) and I quickly confirmed removal of the null terminator didn't
cause any problems there. I doubt anyone else is inspecting the
structure directly (also searched the archives and didn't find
anything).

This needs to be advertised very loudly in the docs -- I understand
why this was done but it's a pretty big change in the way the api
works.

merlin

From:	Marko Kreen <markokr(at)gmail(dot)com>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	pgsql-hackers(at)postgresql(dot)org, greg(at)2ndQuadrant(dot)com
Subject:	Re: Speed dblink using alternate libpq tuple storage
Date:	2012-01-27 15:42:14
Message-ID:	20120127154214.GB4107@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 27, 2012 at 05:57:01PM +0900, Kyotaro HORIGUCHI wrote:
> Hello, This is a new version of the patch formerly known as
> 'alternative storage for libpq'.
>
> - Changed the concept to 'Alternative Row Processor' from
> 'Storage handler'. Symbol names are also changed.
>
> - Callback function is modified following to the comment.
>
> - From the restriction of time, I did minimum check for this
> patch. The purpose of this patch is to show the new implement.
>
> - Proformance is not measured for this patch for the same
> reason. I will do that on next monday.
>
> - The meaning of PGresAttValue is changed. The field 'value' now
> contains a value withOUT terminating zero. This change seems to
> have no effect on any other portion within the whole source
> tree of postgresql from what I've seen.

I think we have general structure in place. Good.

Minor notes:

= rowhandler api =

* It returns bool, so void* is wrong. Instead libpq style is to use int,
with 1=OK, 0=Failure. Seems that was also old pqAddTuple() convention.

* Drop PQgetRowProcessorParam(), instead give param as argument.

* PQsetRowProcessorErrMes() should strdup() the message. That gets rid of
allocator requirements in API. This also makes safe to pass static
strings there. If strdup() fails, fall back to generic no-mem message.

* Create new struct to replace PGresAttValue for rowhandler usage.
RowHandler API is pretty unique and self-contained. It should have
it's own struct. Main reason is that it allows to properly document it.
Otherwise the minor details get lost as they are different from
libpq-internal usage. Also this allows two structs to be
improved separately. (PGresRawValue?)

* Stop storing null_value into ->value. It's libpq internal detail.
Instead the ->value should always point into buffer where the value
info is located, even for NULL. This makes safe to simply subtract
pointers to get row size estimate. Seems pqAddTuple() already
does null_value logic, so no need to do it in rowhandler api.

= libpq =

Currently its confusing whether rowProcessor can be NULL, and what
should be done if so. I think its better to fix usage so that
it is always set.

* PQregisterRowProcessor() should use default func if func==NULL.
and set default handler if so.
* Never set rowProcessor directly, always via PQregisterRowProcessor()
* Drop all if(rowProcessor) checks.

= dblink =

* There are malloc failure checks missing in initStoreInfo() & storeHandler().

--
marko

PS. You did not hear it from me, but most raw values are actually
nul-terminated in protocol. Think big-endian. And those which
are not, you can make so, as the data is not touched anymore.
You cannot do it for last value, as next byte may not be allocated.
But you could memmove() it lower address so you can null-terminate.

I'm not suggesting it for official patch, but it would be fun to know
if such hack is benchmarkable, and benchmarkable on realistic load.

From:	Marko Kreen <markokr(at)gmail(dot)com>
To:	Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>, pgsql-hackers(at)postgresql(dot)org, greg(at)2ndquadrant(dot)com
Subject:	Re: Speed dblink using alternate libpq tuple storage
Date:	2012-01-27 15:48:11
Message-ID:	20120127154811.GC4107@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Fri, Jan 27, 2012 at 09:35:04AM -0600, Merlin Moncure wrote:
> On Fri, Jan 27, 2012 at 2:57 AM, Kyotaro HORIGUCHI
> > - The meaning of PGresAttValue is changed. The field 'value' now
> > contains a value withOUT terminating zero. This change seems to
> > have no effect on any other portion within the whole source
> > tree of postgresql from what I've seen.
>
> This is a minor point of concern. This function was exposed to
> support libpqtypes (which your stuff compliments very nicely by the
> way) and I quickly confirmed removal of the null terminator didn't
> cause any problems there. I doubt anyone else is inspecting the
> structure directly (also searched the archives and didn't find
> anything).
>
> This needs to be advertised very loudly in the docs -- I understand
> why this was done but it's a pretty big change in the way the api
> works.

Note that the non-NUL-terminated PGresAttValue is only used for row
handler. So no existing usage is affected.

But I agree using same struct in different situations is confusing,
thus the request for separate struct for row handler usage.

--
marko

From:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
To:	markokr(at)gmail(dot)com
Cc:	mmoncure(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, greg(at)2ndquadrant(dot)com
Subject:	Re: Speed dblink using alternate libpq tuple storage
Date:	2012-01-30 09:06:57
Message-ID:	20120130.180657.220412574.horiguchi.kyotaro@oss.ntt.co.jp
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

Thank you for comments, this is revised version of the patch.

The gain of performance is more than expected. Measure script now
does query via dblink ten times for stability of measuring, so
the figures become about ten times longer than the previous ones.

sec % to Original
Original : 31.5 100.0%
RowProcessor patch : 31.3 99.4%
dblink patch : 24.6 78.1%

RowProcessor patch alone makes no loss or very-little gain, and
full patch gives us 22% gain for the benchmark(*1).

The modifications are listed below.

- No more use of PGresAttValue for this mechanism, and added
PGrowValue instead. PGresAttValue has been put back to
libpq-int.h

- pqAddTuple() is restored as original and new function
paAddRow() to use as RowProcessor. (Previous pqAddTuple
implement had been buggily mixed the two usage of
PGresAttValue)

- PQgetRowProcessorParam has been dropped. Contextual parameter
is passed as one of the parameters of RowProcessor().

- RowProcessor() returns int (as bool, is that libpq convension?)
instead of void *. (Actually, void * had already become useless
as of previous patch)

- PQsetRowProcessorErrMes() is changed to do strdup internally.

- The callers of RowProcessor() no more set null_field to
PGrowValue.value. Plus, the PGrowValue[] which RowProcessor()
receives has nfields + 1 elements to be able to make rough
estimate by cols->value[nfields].value - cols->value[0].value -
something. The somthing here is 4 * nfields for protocol3 and
4 * (non-null fields) for protocol2. I fear that this applies
only for textual transfer usage...

- PQregisterRowProcessor() sets the default handler when given
NULL. (pg_conn|pg_result).rowProcessor cannot be NULL for its
lifetime.

- initStoreInfo() and storeHandler() has been provided with
malloc error handling.

And more..

- getAnotherTuple()@fe-protocol2.c is not tested utterly.

- The uniformity of the size of columns in the test data prevents
realloc from execution in dblink... More test should be done.

regards,

=====
(*1) The benchmark is done as follows,

==test.sql
select dblink_connect('c', 'host=localhost dbname=test');
select * from dblink('c', 'select a,c from foo limit 2000000') as (a text b bytea) limit 1;
...(repeat 9 times more)
select dblink_disconnect('c');
==

$ for i in $(seq 1 10); do time psql test -f t.sql; done

test=# select count(*),
min(length(a)) as a_min, max(length(a)) as a_max,
min(length(c)) as c_min, max(length(c)) as c_max from foo;

count | a_min | a_max | c_min | c_max
---------+-------+-------+-------+-------
2000000 | 29 | 29 | 29 | 29
(1 row)

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment	Content-Type	Size
libpq_rowproc_20120130.patch	text/x-patch	19.1 KB
libpq_rowproc_doc_20120130.patch	text/x-patch	5.5 KB
dblink_use_rowproc_20120130.patch	text/x-patch	11.7 KB

From:	Marko Kreen <markokr(at)gmail(dot)com>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)oss(dot)ntt(dot)co(dot)jp>
Cc:	mmoncure(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, greg(at)2ndquadrant(dot)com
Subject:	Re: Speed dblink using alternate libpq tuple storage
Date:	2012-01-30 18:15:39
Message-ID:	20120130181539.GA1041@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers

On Mon, Jan 30, 2012 at 06:06:57PM +0900, Kyotaro HORIGUCHI wrote:
> The gain of performance is more than expected. Measure script now
> does query via dblink ten times for stability of measuring, so
> the figures become about ten times longer than the previous ones.
>
> sec % to Original
> Original : 31.5 100.0%
> RowProcessor patch : 31.3 99.4%
> dblink patch : 24.6 78.1%
>
> RowProcessor patch alone makes no loss or very-little gain, and
> full patch gives us 22% gain for the benchmark(*1).

Excellent!

> - The callers of RowProcessor() no more set null_field to
> PGrowValue.value. Plus, the PGrowValue[] which RowProcessor()
> receives has nfields + 1 elements to be able to make rough
> estimate by cols->value[nfields].value - cols->value[0].value -
> something. The somthing here is 4 * nfields for protocol3 and
> 4 * (non-null fields) for protocol2. I fear that this applies
> only for textual transfer usage...

Excact estimate is not important here. And (nfields + 1) elem
feels bit too much magic, considering that most users probably
do not need it. Without it, the logic would be:

total = last.value - first.value + ((last.len > 0) ? last.len : 0)

which isn't too complex. So I think we can remove it.

= Problems =

* Remove the dubious memcpy() in pqAddRow()

* I think the dynamic arrays in getAnotherTuple() are not portable enough,
please do proper allocation for array. I guess in PQsetResultAttrs()?

= Minor notes =

These can be argued either way, if you don't like some
suggestion, you can drop it.

* Move PQregisterRowProcessor() into fe-exec.c, then we can make
pqAddRow static.

* Should PQclear() free RowProcessor error msg? It seems
it should not get outside from getAnotherTuple(), but
thats not certain. Perhaps it would be clearer to free
it here too.

* Remove the part of comment in getAnotherTuple():
* Buffer content may be shifted on reloading additional
* data. So we must set all pointers on every scan.

It's confusing why it needs to clarify that, as there
is nobody expecting it.

* PGrowValue documentation should mention that ->value pointer
is always valid.

* dblink: Perhaps some of those mallocs() could be replaced
with pallocs() or even StringInfo, which already does
the realloc dance? I'm not familiar with dblink, and
various struct lifetimes there so I don't know it that
actually makes sense or not.

It seems this patch is getting ReadyForCommitter soon...

--
marko