Re: profiling connection overhead

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: profiling connection overhead
Date: 2010-11-24 05:07:27
Message-ID: AANLkTikX872517kk-SkLTgZ=o8rewHGqJQEeb2eH_TPh@mail.gmail.com
Lists: pgsql-hackers

Per previous threats, I spent some time tonight running oprofile
(using the directions Tom Lane was foolish enough to provide me back
in May). I took testlibpq.c and hacked it up to just connect to the
server and then disconnect in a tight loop without doing anything
useful, hoping to measure the overhead of starting up a new
connection. Ha, ha, funny about that:

120899 18.0616 postgres AtProcExit_Buffers
56891 8.4992 libc-2.11.2.so memset
30987 4.6293 libc-2.11.2.so memcpy
26944 4.0253 postgres hash_search_with_hash_value
26554 3.9670 postgres AllocSetAlloc
20407 3.0487 libc-2.11.2.so _int_malloc
17269 2.5799 libc-2.11.2.so fread
13005 1.9429 ld-2.11.2.so do_lookup_x
11850 1.7703 ld-2.11.2.so _dl_fixup
10194 1.5229 libc-2.11.2.so _IO_file_xsgetn

In English: the #1 overhead here is actually something that happens
when processes EXIT, not when they start. Essentially all the time is
in two lines:

56920 6.6006 : for (i = 0; i < NBuffers; i++)
: {
98745 11.4507 : if (PrivateRefCount[i] != 0)

Non-default configs:

max_connections = 100
shared_buffers = 480MB
work_mem = 4MB
maintenance_work_mem = 128MB
checkpoint_segments = 30
random_page_cost = 2.0

Anything we can do about this? That's a lot of overhead, and it'd be
a lot worse on a big machine with 8GB of shared_buffers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-11-24 05:08:37
Message-ID: AANLkTim166HefX0V=59=COk2==73Bz=whiUxqJjmGGLk@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 12:07 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Per previous threats, I spent some time tonight running oprofile
> (using the directions Tom Lane was foolish enough to provide me back
> in May).  I took testlibpq.c and hacked it up to just connect to the
> server and then disconnect in a tight loop without doing anything
> useful, hoping to measure the overhead of starting up a new
> connection.

Oh, right: attachments.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
reconnect-opreport.txt text/plain 38.0 KB
reconnect.c text/x-csrc 997 bytes

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-11-24 07:10:37
Message-ID: 4CECBA6D.4020802@enterprisedb.com
Lists: pgsql-hackers

On 24.11.2010 07:07, Robert Haas wrote:
> Per previous threats, I spent some time tonight running oprofile
> (using the directions Tom Lane was foolish enough to provide me back
> in May). I took testlibpq.c and hacked it up to just connect to the
> server and then disconnect in a tight loop without doing anything
> useful, hoping to measure the overhead of starting up a new
> connection. Ha, ha, funny about that:
>
> 120899 18.0616 postgres AtProcExit_Buffers
> 56891 8.4992 libc-2.11.2.so memset
> 30987 4.6293 libc-2.11.2.so memcpy
> 26944 4.0253 postgres hash_search_with_hash_value
> 26554 3.9670 postgres AllocSetAlloc
> 20407 3.0487 libc-2.11.2.so _int_malloc
> 17269 2.5799 libc-2.11.2.so fread
> 13005 1.9429 ld-2.11.2.so do_lookup_x
> 11850 1.7703 ld-2.11.2.so _dl_fixup
> 10194 1.5229 libc-2.11.2.so _IO_file_xsgetn
>
> In English: the #1 overhead here is actually something that happens
> when processes EXIT, not when they start. Essentially all the time is
> in two lines:
>
> 56920 6.6006 : for (i = 0; i < NBuffers; i++)
> : {
> 98745 11.4507 : if (PrivateRefCount[i] != 0)

Oh, that's quite surprising.

> Anything we can do about this? That's a lot of overhead, and it'd be
> a lot worse on a big machine with 8GB of shared_buffers.

Micro-optimizing that search for the non-zero value helps a little bit
(attached). Reduces the percentage shown by oprofile from about 16% to
12% on my laptop.
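
To spell out the kind of micro-optimization meant here (a standalone
sketch for illustration only, not the attached patch): OR several
entries together per iteration, and fall back to a per-element scan
only when a group is nonzero.

#include <stdint.h>

static int
first_nonzero(const int32_t *counts, int n)
{
    int     i;

    /* OR four entries at a time; the result is nonzero iff any entry is */
    for (i = 0; i + 4 <= n; i += 4)
    {
        if (counts[i] | counts[i + 1] | counts[i + 2] | counts[i + 3])
            break;
    }
    /* rescan the offending group, plus any tail, one entry at a time */
    for (; i < n; i++)
    {
        if (counts[i] != 0)
            return i;
    }
    return -1;                  /* all zero */
}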

For bigger gains, I think you need to somehow make the PrivateRefCount
smaller. Perhaps only use one byte for each buffer instead of int32, and
use some sort of an overflow list for the rare case that a buffer is
pinned more than 255 times. Or make it a hash table instead of a simple
lookup array. But whatever you do, you have to be very careful not to
add overhead to PinBuffer/UnPinBuffer; those can already be quite high
in oprofile reports of real applications. It might be worth
experimenting a bit; at the moment, PrivateRefCount takes up 512kB of
memory per 1GB of shared_buffers (one int32 per 8kB buffer). Machines
with a high shared_buffers
setting have no shortage of memory, but a large array like that might
waste a lot of precious CPU cache.
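
A very rough sketch of the one-byte-per-buffer idea (hypothetical names
and sizes throughout, just to illustrate the shape, not a worked-out
design):

#include <stdint.h>

#define N_SMALL_BUFFERS 131072          /* e.g. 1GB of 8kB buffers */
#define MAX_OVERFLOW    8               /* >255 pins should be very rare */

static uint8_t PrivateRefCountByte[N_SMALL_BUFFERS];
static struct
{
    int         buf_id;
    uint32_t    extra;                  /* pins beyond 255 for this buffer */
} PinOverflow[MAX_OVERFLOW];

static void
PinBufferLocal(int buf_id)
{
    int     i;

    if (PrivateRefCountByte[buf_id] < UINT8_MAX)
    {
        PrivateRefCountByte[buf_id]++;
        return;
    }
    /* slow path: the 256th and later pins go to the overflow table */
    for (i = 0; i < MAX_OVERFLOW; i++)
    {
        if (PinOverflow[i].extra != 0 && PinOverflow[i].buf_id == buf_id)
        {
            PinOverflow[i].extra++;
            return;
        }
    }
    for (i = 0; i < MAX_OVERFLOW; i++)
    {
        if (PinOverflow[i].extra == 0)
        {
            PinOverflow[i].buf_id = buf_id;
            PinOverflow[i].extra = 1;
            return;
        }
    }
    /* a real implementation would grow the overflow table here */
}

The unpin side would mirror this, draining a buffer's overflow entry
before letting its byte drop below 255; the exit-time scan then only
touches one byte per buffer.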

Now, the other question is whether this really matters. Even if we eliminate
that loop in AtProcExit_Buffers altogether, would connect/disconnect still
be so slow that you have to use a connection pooler if you do that a lot?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
optimize-AtProcExit_Buffers-1.patch text/x-diff 701 bytes

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-11-24 12:02:53
Message-ID: AANLkTinCxJ=KigstdmD58TQ9o-kPN0H+ZQ2MEFb77Xn-@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 2:10 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Anything we can do about this?  That's a lot of overhead, and it'd be
>> a lot worse on a big machine with 8GB of shared_buffers.
>
> Micro-optimizing that search for the non-zero value helps a little bit
> (attached). Reduces the percentage shown by oprofile from about 16% to 12%
> on my laptop.
>
> For bigger gains,

The first optimization that occurred to me was "remove the loop
altogether". I could maybe see needing to do something like this if
we're recovering from an error, but why do we need to do this (except
perhaps to fail an assertion) if we're exiting cleanly? Even a
session-lifetime buffer-pin leak would be quite disastrous, one would
think.

> Now, the other question is whether this really matters. Even if we eliminate
> that loop in AtProcExit_Buffers altogether, would connect/disconnect still be
> so slow that you have to use a connection pooler if you do that a lot?

Oh, I'm sure this isn't going to be nearly enough to fix that problem,
but every little bit helps; and if we never do the first optimization,
we'll never get to #30 or wherever it is that it really starts to move
the needle.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-11-24 15:25:42
Message-ID: 18594.1290612342@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Nov 24, 2010 at 2:10 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> Micro-optimizing that search for the non-zero value helps a little bit
>> (attached). Reduces the percentage shown by oprofile from about 16% to 12%
>> on my laptop.

That "micro-optimization" looks to me like your compiler leaves
something to be desired.

> The first optimization that occurred to me was "remove the loop
> altogether".

Or make it execute only in assert-enabled mode, perhaps.

This check had some use back in the bad old days, but the ResourceOwner
mechanism has probably removed a lot of the argument for it.

The counter-argument might be that failing to remove a buffer pin would
be disastrous; but I can't see that it'd be worse than failing to remove
an LWLock, and we have no belt-and-suspenders-too loop for those.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-11-24 16:20:38
Message-ID: AANLkTinLXOrut35=uNA1fUD7Oo1mVTbO1apCKgJXwhpd@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 10:25 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> The first optimization that occurred to me was "remove the loop
>> altogether".
>
> Or make it execute only in assert-enabled mode, perhaps.
>
> This check had some use back in the bad old days, but the ResourceOwner
> mechanism has probably removed a lot of the argument for it.

Yeah, that's what I was thinking - this code would have been a good
backstop when our cleanup mechanisms were not as robust as they seem
to be today. But making the check execute only in assert-enabled mode
doesn't seem right, since the check actually acts to mask other coding
errors, rather than reveal them. Maybe we could replace the check with one
that only occurs in an Assert-enabled build and just loops through and
does Assert(PrivateRefCount[i] == 0). I'm not sure exactly where this
gets called in the shutdown sequence, though - is it sensible to
Assert() here?
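
Roughly like this (a sketch of the shape being discussed, not the patch
that was eventually posted; USE_ASSERT_CHECKING is the existing
assert-build macro):

#ifdef USE_ASSERT_CHECKING
    {
        int     i;

        /* Assert that we released all buffer pins. */
        for (i = 0; i < NBuffers; i++)
            Assert(PrivateRefCount[i] == 0);
    }
#endif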

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-11-24 16:33:47
Message-ID: 20294.1290616427@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Nov 24, 2010 at 10:25 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Or make it execute only in assert-enabled mode, perhaps.

> But making the check execute only in assert-enabled mode
> doesn't seem right, since the check actually acts to mask other coding
> errors, rather than reveal them. Maybe we could replace the check with one
> that only occurs in an Assert-enabled build and just loops through and
> does Assert(PrivateRefCount[i] == 0).

Yeah, that would be sensible. There is precedent for this elsewhere
too; I think there's a similar setup for checking buffer refcounts
during transaction cleanup.

> I'm not sure exactly where this
> gets called in the shutdown sequence, though - is it sensible to
> Assert() here?

Assert is sensible anywhere.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-11-24 18:01:32
Message-ID: AANLkTim8JcxWWF3e=azuk4EdYnmc8hgp-bNGAOcpiUaZ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 11:33 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Wed, Nov 24, 2010 at 10:25 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Or make it execute only in assert-enabled mode, perhaps.
>
>> But making the check execute only in assert-enabled mode
>> doesn't seem right, since the check actually acts to mask other coding
>> errors, rather than reveal them.  Maybe we could replace the check with one
>> that only occurs in an Assert-enabled build and just loops through and
>> does Assert(PrivateRefCount[i] == 0).
>
> Yeah, that would be sensible.  There is precedent for this elsewhere
> too; I think there's a similar setup for checking buffer refcounts
> during transaction cleanup.
>
>> I'm not sure exactly where this
>> gets called in the shutdown sequence, though - is it sensible to
>> Assert() here?
>
> Assert is sensible anywhere.

OK, patch attached. Here's what oprofile output looks like with this applied:

3505 10.4396 libc-2.11.2.so memset
2051 6.1089 libc-2.11.2.so memcpy
1686 5.0217 postgres AllocSetAlloc
1642 4.8907 postgres hash_search_with_hash_value
1247 3.7142 libc-2.11.2.so _int_malloc
1096 3.2644 libc-2.11.2.so fread
855 2.5466 ld-2.11.2.so do_lookup_x
723 2.1535 ld-2.11.2.so _dl_fixup
645 1.9211 ld-2.11.2.so strcmp
620 1.8467 postgres MemoryContextAllocZero

Somehow I don't think I'm going to get much further with this without
figuring out how to get oprofile to cough up a call graph.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
AtProcExit_Buffers.patch application/octet-stream 1.1 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-11-24 18:06:58
Message-ID: 29281.1290622018@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> OK, patch attached.

Two comments:

1. A comment would help, something like "Assert we released all buffer pins".

2. AtProcExit_LocalBuffers should be redone the same way, for
consistency (it likely won't make any performance difference).
Note the comment for AtProcExit_LocalBuffers, too; that probably
needs to be changed along the lines of "If we missed any, and
assertions aren't enabled, we'll fail later in DropRelFileNodeBuffers
while trying to drop the temp rels".

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-11-24 18:20:36
Message-ID: AANLkTimzMGWrAs2DTMA4n7tEt9mRHOOjDY_Rrg4WpB6F@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 1:06 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> OK, patch attached.
>
> Two comments:

Revised patch attached.

I tried configuring oprofile with --callgraph=10 and then running
oprofile with -c, but it gives kooky looking output I can't interpret.
For example:

6 42.8571 postgres record_in
8 57.1429 postgres pg_perm_setlocale
17035 5.7219 libc-2.11.2.so memcpy
17035 100.000 libc-2.11.2.so memcpy [self]

Not that helpful. :-(

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
AtProcExit_Buffers-v2.patch application/octet-stream 2.3 KB

From: Gerhard Heift <ml-postgresql-20081012-3518(at)gheift(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-11-24 18:47:13
Message-ID: 20101124184713.GA1698@gheift
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 01:20:36PM -0500, Robert Haas wrote:
> On Wed, Nov 24, 2010 at 1:06 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> OK, patch attached.
> >
> > Two comments:
>
> Revised patch attached.
>
> I tried configuring oprofile with --callgraph=10 and then running
> oprofile with -c, but it gives kooky looking output I can't interpret.
> For example:
>
> 6 42.8571 postgres record_in
> 8 57.1429 postgres pg_perm_setlocale
> 17035 5.7219 libc-2.11.2.so memcpy
> 17035 100.000 libc-2.11.2.so memcpy [self]
>
> Not that helpful. :-(

Have a look at the wiki:
http://wiki.postgresql.org/wiki/Profiling_with_OProfile#Additional_analysis

> Robert Haas

Regards,
Gerhard Heift


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-11-24 19:30:47
Message-ID: AANLkTi=8EWfpTpoo7ZHnw1_1-KwrfyWmM1DO6txq3yTp@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 1:20 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I tried configuring oprofile with --callgraph=10 and then running
> oprofile with -c, but it gives kooky looking output I can't interpret.

It looks like the trick is to compile with -fno-omit-frame-pointer.
New profiling run:

27563 10.3470 libc-2.11.2.so memset
15162 5.6917 libc-2.11.2.so memcpy
13471 5.0569 postgres hash_search_with_hash_value
13465 5.0547 postgres AllocSetAlloc
9513 3.5711 libc-2.11.2.so _int_malloc
8729 3.2768 libc-2.11.2.so fread
6336 2.3785 ld-2.11.2.so do_lookup_x
5788 2.1728 ld-2.11.2.so _dl_fixup
4995 1.8751 postgres MemoryContextAllocZero
4978 1.8687 ld-2.11.2.so strcmp

Full results, and call graph, attached. The first obvious fact is
that most of the memset overhead appears to be coming from
InitCatCache.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
reconnect-callgraph.txt.bz2 application/x-bzip2 34.9 KB
reconnect-opreport.txt.bz2 application/x-bzip2 6.8 KB

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-24 19:32:20
Message-ID: 201011242032.20883.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday 24 November 2010 19:01:32 Robert Haas wrote:
> Somehow I don't think I'm going to get much further with this without
> figuring out how to get oprofile to cough up a call graph.
I think to do that sensibly you need CFLAGS="-O2 -fno-omit-frame-pointer"...
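
Putting the pieces from this thread together, the whole workflow would
look roughly like this (a sketch from memory; see the wiki page Gerhard
linked for the authoritative steps, and adjust paths to taste):

./configure CFLAGS="-O2 -fno-omit-frame-pointer"
make && make install
opcontrol --init
opcontrol --callgraph=10          # keep up to 10 frames of call stack
opcontrol --start
# ... run the reconnect loop against the server ...
opcontrol --dump
opcontrol --shutdown
opreport -c /usr/local/pgsql/bin/postgres    # -c prints the call graph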


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gerhard Heift <ml-postgresql-20081012-3518(at)gheift(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-11-24 19:59:27
Message-ID: 17754.1290628767@sss.pgh.pa.us
Lists: pgsql-hackers

Gerhard Heift <ml-postgresql-20081012-3518(at)gheift(dot)de> writes:
> On Wed, Nov 24, 2010 at 01:20:36PM -0500, Robert Haas wrote:
>> I tried configuring oprofile with --callgraph=10 and then running
>> oprofile with -c, but it gives kooky looking output I can't interpret.

> Have a look at the wiki:
> http://wiki.postgresql.org/wiki/Profiling_with_OProfile#Additional_analysis

The critical piece of information is there now, but it wasn't a minute
ago.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-11-24 20:06:47
Message-ID: 17937.1290629207@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Revised patch attached.

The asserts in AtProcExit_LocalBuffers are a bit pointless since
you forgot to remove the code that forcibly zeroes LocalRefCount[]...
otherwise +1.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-11-24 20:14:22
Message-ID: 18141.1290629662@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Full results, and call graph, attached. The first obvious fact is
> that most of the memset overhead appears to be coming from
> InitCatCache.

AFAICT that must be the palloc0 calls that are zeroing out (mostly)
the hash bucket headers. I don't see any real way to make that cheaper
other than to cut the initial sizes of the hash tables (and add support
for expanding them later, which is lacking in catcache ATM). Not
convinced that that creates any net savings --- it might just save
some cycles at startup in exchange for more cycles later, in typical
backend usage.

Making those hashtables expansible wouldn't be a bad thing in itself,
mind you.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-11-24 20:47:32
Message-ID: AANLkTikOt8aFyufvgBGgFsVZ0HT8ghRY7uyEvQUJ-E+B@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Full results, and call graph, attached.  The first obvious fact is
>> that most of the memset overhead appears to be coming from
>> InitCatCache.
>
> AFAICT that must be the palloc0 calls that are zeroing out (mostly)
> the hash bucket headers.  I don't see any real way to make that cheaper
> other than to cut the initial sizes of the hash tables (and add support
> for expanding them later, which is lacking in catcache ATM).  Not
> convinced that that creates any net savings --- it might just save
> some cycles at startup in exchange for more cycles later, in typical
> backend usage.
>
> Making those hashtables expansible wouldn't be a bad thing in itself,
> mind you.

The idea I had was to go the other way and say, hey, if these hash
tables can't be expanded anyway, let's put them on the BSS instead of
heap-allocating them. Any new pages we request from the OS will be
zeroed anyway, but with palloc we then have to re-zero the allocated
block because palloc can return memory that's been used, freed, and
reused. However, for anything that only needs to be
allocated once and never freed, and whose size can be known at compile
time, that's not an issue.

In fact, it wouldn't be that hard to relax the "known at compile time"
constraint either. We could just declare:

char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];

...and then peel off chunks.
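
For instance (hypothetical names, and assuming everything handed out
lives for the life of the backend and is never freed):

#include <stddef.h>

#define ZERO_ARENA_SIZE (1024 * 1024)

/* lives in BSS, so the OS hands it to us already zero-filled */
static char lotsa_zero_bytes[ZERO_ARENA_SIZE];
static size_t arena_used = 0;

static void *
peel_off_chunk(size_t size)
{
    char   *chunk;

    /* round up so the next caller stays aligned */
    size = (size + sizeof(long) - 1) & ~(sizeof(long) - 1);
    if (size > ZERO_ARENA_SIZE - arena_used)
        return NULL;            /* arena exhausted */
    chunk = lotsa_zero_bytes + arena_used;
    arena_used += size;
    return chunk;               /* already zeroed; no memset required */
}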

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-24 20:53:20
Message-ID: 201011242153.21097.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday 24 November 2010 21:47:32 Robert Haas wrote:
> On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> Full results, and call graph, attached. The first obvious fact is
> >> that most of the memset overhead appears to be coming from
> >> InitCatCache.
> >
> > AFAICT that must be the palloc0 calls that are zeroing out (mostly)
> > the hash bucket headers. I don't see any real way to make that cheaper
> > other than to cut the initial sizes of the hash tables (and add support
> > for expanding them later, which is lacking in catcache ATM). Not
> > convinced that that creates any net savings --- it might just save
> > some cycles at startup in exchange for more cycles later, in typical
> > backend usage.
> >
> > Making those hashtables expansible wouldn't be a bad thing in itself,
> > mind you.
>
> The idea I had was to go the other way and say, hey, if these hash
> tables can't be expanded anyway, let's put them on the BSS instead of
> heap-allocating them. Any new pages we request from the OS will be
> zeroed anyway, but with palloc we then have to re-zero the allocated
> block because palloc can return memory that's been used, freed, and
> reused. However, for anything that only needs to be
> allocated once and never freed, and whose size can be known at compile
> time, that's not an issue.
>
> In fact, it wouldn't be that hard to relax the "known at compile time"
> constraint either. We could just declare:
>
> char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
>
> ...and then peel off chunks.
Won't this just cause loads of additional pagefaults after fork() when those
pages are used the first time and then a second time when first written to (to
copy it)?

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-24 20:54:53
Message-ID: AANLkTikXMdR9-YsBq5oJkSk2Ua-t-78_E_CmBs-R=v0K@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 3:53 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Wednesday 24 November 2010 21:47:32 Robert Haas wrote:
>> On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> >> Full results, and call graph, attached.  The first obvious fact is
>> >> that most of the memset overhead appears to be coming from
>> >> InitCatCache.
>> >
>> > AFAICT that must be the palloc0 calls that are zeroing out (mostly)
>> > the hash bucket headers.  I don't see any real way to make that cheaper
>> > other than to cut the initial sizes of the hash tables (and add support
>> > for expanding them later, which is lacking in catcache ATM).  Not
>> > convinced that that creates any net savings --- it might just save
>> > some cycles at startup in exchange for more cycles later, in typical
>> > backend usage.
>> >
>> > Making those hashtables expansible wouldn't be a bad thing in itself,
>> > mind you.
>>
>> The idea I had was to go the other way and say, hey, if these hash
>> tables can't be expanded anyway, let's put them on the BSS instead of
>> heap-allocating them.  Any new pages we request from the OS will be
>> zeroed anyway, but with palloc we then have to re-zero the allocated
>> block because palloc can return memory that's been used, freed, and
>> reused.  However, for anything that only needs to be
>> allocated once and never freed, and whose size can be known at compile
>> time, that's not an issue.
>>
>> In fact, it wouldn't be that hard to relax the "known at compile time"
>> constraint either.  We could just declare:
>>
>> char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
>>
>> ...and then peel off chunks.
> Won't this just cause loads of additional pagefaults after fork() when those
> pages are used the first time and then a second time when first written to (to
> copy it)?

Aren't we incurring those page faults anyway, for whatever memory
palloc is handing out? The heap is no different from bss; we just
move the pointer with sbrk().

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-24 21:05:13
Message-ID: 20931.1290632713@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Nov 24, 2010 at 3:53 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>> The idea I had was to go the other way and say, hey, if these hash
>>> tables can't be expanded anyway, let's put them on the BSS instead of
>>> heap-allocating them.

>> Won't this just cause loads of additional pagefaults after fork() when those
>> pages are used the first time and then a second time when first written to (to
>> copy it)?

> Aren't we incurring those page faults anyway, for whatever memory
> palloc is handing out? The heap is no different from bss; we just
> move the pointer with sbrk().

I think you're missing the real point, which is that the cost you're
measuring here probably isn't so much memset() as faulting in large
chunks of address space. Avoiding the explicit memset() likely will
save little in real runtime --- it'll just make sure the initial-touch
costs are more distributed and harder to measure. But in any case I
think this idea is a nonstarter because it gets in the way of making
those hashtables expansible, which we *do* need to do eventually.

(You might be able to confirm or disprove this theory if you ask
oprofile to count memory access stalls instead of CPU clock cycles...)

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-24 21:05:48
Message-ID: 201011242206.12964.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday 24 November 2010 21:54:53 Robert Haas wrote:
> On Wed, Nov 24, 2010 at 3:53 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On Wednesday 24 November 2010 21:47:32 Robert Haas wrote:
> >> On Wed, Nov 24, 2010 at 3:14 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> >> Full results, and call graph, attached. The first obvious fact is
> >> >> that most of the memset overhead appears to be coming from
> >> >> InitCatCache.
> >> >
> >> > AFAICT that must be the palloc0 calls that are zeroing out (mostly)
> >> > the hash bucket headers. I don't see any real way to make that
> >> > cheaper other than to cut the initial sizes of the hash tables (and
> >> > add support for expanding them later, which is lacking in catcache
> >> > ATM). Not convinced that that creates any net savings --- it might
> >> > just save some cycles at startup in exchange for more cycles later,
> >> > in typical backend usage.
> >> >
> >> > Making those hashtables expansible wouldn't be a bad thing in itself,
> >> > mind you.
> >>
> >> The idea I had was to go the other way and say, hey, if these hash
> >> tables can't be expanded anyway, let's put them on the BSS instead of
> >> heap-allocating them. Any new pages we request from the OS will be
> >> zeroed anyway, but with palloc we then have to re-zero the allocated
> >> block because palloc can return memory that's been used, freed, and
> >> reused. However, for anything that only needs to be
> >> allocated once and never freed, and whose size can be known at compile
> >> time, that's not an issue.
> >>
> >> In fact, it wouldn't be that hard to relax the "known at compile time"
> >> constraint either. We could just declare:
> >>
> >> char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
> >>
> >> ...and then peel off chunks.
> >
> > Won't this just cause loads of additional pagefaults after fork() when
> > those pages are used the first time and then a second time when first
> > written to (to copy it)?
>
> Aren't we incurring those page faults anyway, for whatever memory
> palloc is handing out? The heap is no different from bss; we just
> move the pointer with sbrk().
Yes, but only once. Also scrubbing a page is faster than copying it... (and
there were patches floating around to do that in advance, not sure if they got
integrated into mainline linux)

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, HeikkiLinnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-24 21:18:08
Message-ID: 55CFD67C-7D8E-44A0-970D-5635CBF6A2BE@gmail.com
Lists: pgsql-hackers

On Nov 24, 2010, at 4:05 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>>
>>> Won't this just cause loads of additional pagefaults after fork() when
>>> those pages are used the first time and then a second time when first
>>> written to (to copy it)?
>>
>> Aren't we incurring those page faults anyway, for whatever memory
>> palloc is handing out? The heap is no different from bss; we just
>> move the pointer with sbrk().
> Yes, but only once. Also scrubbing a page is faster than copying it... (and
> there were patches floating around to do that in advance, not sure if they got
> integrated into mainline linux)

I'm not following - can you elaborate?

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-24 21:25:45
Message-ID: 21943.1290633945@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Nov 24, 2010, at 4:05 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> Yes, but only once. Also scrubbing a page is faster than copying it... (and
>> there were patches floating around to do that in advance, not sure if they got
>> integrated into mainline linux)

> I'm not following - can you elaborate?

I think Andres is saying that bss space isn't optimized during a fork
operation: it'll be propagated to the child as copy-on-write pages.
Dunno if that's true or not, but if it is, it'd be a good reason to
avoid the scheme you're suggesting.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-24 21:28:43
Message-ID: 201011242228.44011.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday 24 November 2010 22:18:08 Robert Haas wrote:
> On Nov 24, 2010, at 4:05 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >>> Won't this just cause loads of additional pagefaults after fork() when
> >>> those pages are used the first time and then a second time when first
> >>> written to (to copy it)?
> >>
> >> Aren't we incurring those page faults anyway, for whatever memory
> >> palloc is handing out? The heap is no different from bss; we just
> >> move the pointer with sbrk().
> >
> > Yes, but only once. Also scrubbing a page is faster than copying it...
> > (and there were patches floating around to do that in advance, not sure
> > if they got integrated into mainline linux)
> I'm not following - can you elaborate?
When forking, the memory mapping of the process is copied - the actual pages
are not. When a page is first accessed, the page fault handler will set up a
mapping to the "old" page and mark it as shared. When it is then written to,
it will fault again and copy the page.

In contrast, if you access a page for the first time after an sbrk (or mmap,
it doesn't matter), a new page will get scrubbed and a mapping will get set up.
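
A toy program shows the difference (Linux-assumed, purely illustrative;
error checks omitted):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

#define LEN (64 * 1024 * 1024)

static double
ms_to_write(char *p)            /* time the first write to every page */
{
    struct timeval a, b;

    gettimeofday(&a, NULL);
    memset(p, 1, LEN);
    gettimeofday(&b, NULL);
    return (b.tv_sec - a.tv_sec) * 1000.0 +
           (b.tv_usec - a.tv_usec) / 1000.0;
}

int
main(void)
{
    char   *inherited = malloc(LEN);

    memset(inherited, 0, LEN);  /* parent faults the pages in pre-fork */
    if (fork() == 0)
    {
        char   *fresh = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* writing inherited pages has to copy them... */
        printf("COW write:       %.1f ms\n", ms_to_write(inherited));
        /* ...writing fresh pages only has to zero-fill them */
        printf("zero-fill write: %.1f ms\n", ms_to_write(fresh));
        _exit(0);
    }
    wait(NULL);
    return 0;
}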

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-24 21:30:07
Message-ID: 201011242230.07515.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday 24 November 2010 22:25:45 Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Nov 24, 2010, at 4:05 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> Yes, but only once. Also scrubbing a page is faster than copying it...
> >> (and there were patches floating around to do that in advance, not sure
> >> if they got integrated into mainline linux)
> >
> > I'm not following - can you elaborate?
>
> I think Andres is saying that bss space isn't optimized during a fork
> operation: it'll be propagated to the child as copy-on-write pages.
> Dunno if that's true or not, but if it is, it'd be a good reason to
> avoid the scheme you're suggesting.
Afair nearly all pages are propagated with copy-on-write semantics.

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-24 21:56:47
Message-ID: AANLkTimoZyFQ6CHxoGTnGfRgHTZv-UyLw5GJ04Zx-vzb@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 4:05 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> (You might be able to confirm or disprove this theory if you ask
> oprofile to count memory access stalls instead of CPU clock cycles...)

I don't see an event for that.

# opcontrol --list-events | grep STALL
INSTRUCTION_FETCH_STALL: (counter: all)
DISPATCH_STALLS: (counter: all)
DISPATCH_STALL_FOR_BRANCH_ABORT: (counter: all)
DISPATCH_STALL_FOR_SERIALIZATION: (counter: all)
DISPATCH_STALL_FOR_SEGMENT_LOAD: (counter: all)
DISPATCH_STALL_FOR_REORDER_BUFFER_FULL: (counter: all)
DISPATCH_STALL_FOR_RESERVATION_STATION_FULL: (counter: all)
DISPATCH_STALL_FOR_FPU_FULL: (counter: all)
DISPATCH_STALL_FOR_LS_FULL: (counter: all)
DISPATCH_STALL_WAITING_FOR_ALL_QUIET: (counter: all)
DISPATCH_STALL_FOR_FAR_TRANSFER_OR_RESYNC: (counter: all)

# opcontrol --list-events | grep MEMORY
MEMORY_REQUESTS: (counter: all)
MEMORY_CONTROLLER_PAGE_TABLE_OVERFLOWS: (counter: all)
MEMORY_CONTROLLER_SLOT_MISSED: (counter: all)
MEMORY_CONTROLLER_TURNAROUNDS: (counter: all)
MEMORY_CONTROLLER_BYPASS_COUNTER_SATURATION: (counter: all)
CPU_IO_REQUESTS_TO_MEMORY_IO: (counter: all)
MEMORY_CONTROLLER_REQUESTS: (counter: all)

Ideas?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-24 22:03:48
Message-ID: 22686.1290636228@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Nov 24, 2010 at 4:05 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> (You might be able to confirm or disprove this theory if you ask
>> oprofile to count memory access stalls instead of CPU clock cycles...)

> I don't see an event for that.

You probably want something involving cache misses. The event names
vary depending on just which CPU you've got.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-24 22:15:13
Message-ID: 201011242315.14297.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday 24 November 2010 23:03:48 Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Wed, Nov 24, 2010 at 4:05 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> (You might be able to confirm or disprove this theory if you ask
> >> oprofile to count memory access stalls instead of CPU clock cycles...)
> >
> > I don't see an event for that.
>
> You probably want something involving cache misses. The event names
> vary depending on just which CPU you've got.
Or some BUS OUTSTANDING event.

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-24 22:30:49
Message-ID: AANLkTimPMraEvO9uJSjeOWUPBv_-iaCzVu=ubyjYqC5M@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 5:15 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Wednesday 24 November 2010 23:03:48 Tom Lane wrote:
>> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> > On Wed, Nov 24, 2010 at 4:05 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> >> (You might be able to confirm or disprove this theory if you ask
>> >> oprofile to count memory access stalls instead of CPU clock cycles...)
>> >
>> > I don't see an event for that.
>>
>> You probably want something involving cache misses.  The event names
>> vary depending on just which CPU you've got.
> Or some BUS OUTSTANDING event.

I don't see anything for BUS OUTSTANDING. For CACHE and MISS I have
some options:

# opcontrol --list-events | grep CACHE
DATA_CACHE_ACCESSES: (counter: all)
DATA_CACHE_MISSES: (counter: all)
DATA_CACHE_REFILLS_FROM_L2_OR_NORTHBRIDGE: (counter: all)
DATA_CACHE_REFILLS_FROM_NORTHBRIDGE: (counter: all)
DATA_CACHE_LINES_EVICTED: (counter: all)
LOCKED_INSTRUCTIONS_DCACHE_MISSES: (counter: all)
L2_CACHE_MISS: (counter: all)
L2_CACHE_FILL_WRITEBACK: (counter: all)
INSTRUCTION_CACHE_FETCHES: (counter: all)
INSTRUCTION_CACHE_MISSES: (counter: all)
INSTRUCTION_CACHE_REFILLS_FROM_L2: (counter: all)
INSTRUCTION_CACHE_REFILLS_FROM_SYSTEM: (counter: all)
INSTRUCTION_CACHE_VICTIMS: (counter: all)
INSTRUCTION_CACHE_INVALIDATED: (counter: all)
CACHE_BLOCK_COMMANDS: (counter: all)
READ_REQUEST_L3_CACHE: (counter: all)
L3_CACHE_MISSES: (counter: all)
IBS_FETCH_ICACHE_MISSES: (ext: ibs_fetch)
IBS_FETCH_ICACHE_HITS: (ext: ibs_fetch)
IBS_OP_DATA_CACHE_MISS: (ext: ibs_op)
IBS_OP_NB_LOCAL_CACHE: (ext: ibs_op)
IBS_OP_NB_REMOTE_CACHE: (ext: ibs_op)
IBS_OP_NB_CACHE_MODIFIED: (ext: ibs_op)
IBS_OP_NB_CACHE_OWNED: (ext: ibs_op)
IBS_OP_NB_LOCAL_CACHE_LAT: (ext: ibs_op)
IBS_OP_NB_REMOTE_CACHE_LAT: (ext: ibs_op)

# opcontrol --list-events | grep MISS | grep -v CACHE
L1_DTLB_MISS_AND_L2_DTLB_HIT: (counter: all)
L1_DTLB_AND_L2_DTLB_MISS: (counter: all)
L1_ITLB_MISS_AND_L2_ITLB_HIT: (counter: all)
L1_ITLB_MISS_AND_L2_ITLB_MISS: (counter: all)
MEMORY_CONTROLLER_SLOT_MISSED: (counter: all)
IBS_FETCH_L1_ITLB_MISSES_L2_ITLB_HITS: (ext: ibs_fetch)
IBS_FETCH_L1_ITLB_MISSES_L2_ITLB_MISSES: (ext: ibs_fetch)
IBS_OP_L1_DTLB_MISS_L2_DTLB_HIT: (ext: ibs_op)
IBS_OP_L1_L2_DTLB_MISS: (ext: ibs_op)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-24 22:42:58
Message-ID: 23684.1290638578@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I don't see anything for BUS OUTSTANDING. For CACHE and MISS I have
> some options:

> DATA_CACHE_MISSES: (counter: all)
> L3_CACHE_MISSES: (counter: all)

Those two look promising, though I can't claim to be an expert.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-24 22:44:20
Message-ID: AANLkTikK4mkM49z-Uj4GQiuzEUCOifBjWM-95Fr4N-1S@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 5:42 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I don't see anything for BUS OUTSTANDING.  For CACHE and MISS I have
>> some options:
>
>> DATA_CACHE_MISSES: (counter: all)
>> L3_CACHE_MISSES: (counter: all)
>
> Those two look promising, though I can't claim to be an expert.

OK. Thanksgiving is about to interfere with my access to this
machine, but I'll pick this up next week.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-28 04:18:58
Message-ID: 201011280418.oAS4IwX08421@momjian.us
Lists: pgsql-hackers

Robert Haas wrote:
> >> In fact, it wouldn't be that hard to relax the "known at compile time"
> >> constraint either. We could just declare:
> >>
> >> char lotsa_zero_bytes[NUM_ZERO_BYTES_WE_NEED];
> >>
> >> ...and then peel off chunks.
> > Won't this just cause loads of additional pagefaults after fork() when those
> > pages are used the first time and then a second time when first written to (to
> > copy it)?
>
> Aren't we incurring those page faults anyway, for whatever memory
> palloc is handing out? The heap is no different from bss; we just
> move the pointer with sbrk().

Here is perhaps more detail than you wanted, but ...

Basically in a forked process, the text/program is fixed, and the
initialized data and stack are copy on write (COW). Allocating a big
block of zero memory in the data segment is uninitialized data, and the behavior there
differs depending on whether the parent process faulted in those pages.
If it did, then they are COW, but if it did not, odds are the OS just
gives them to you clean and not shared. The pages have to be empty
because if it gave you anything else it could be giving you data from
another process. For details, see
http://docs.hp.com/en/5965-4641/ch01s11.html, Faulting In a Page of
Stack or Uninitialized Data.

As far as sbrk(), those pages are zero-filled also, again for security
reasons. You have to clear malloc()'ed memory (or call calloc()) not
because the OS gave you dirty pages but because you might be using
memory that you previously freed. If you have never freed memory (and
the postmaster/parent has not either), I bet all malloc'ed memory would
be zero-filled.

Not sure that information moves us forward. If the postmaster cleared
the memory, we would have COW in the child and probably be even slower.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-28 05:01:59
Message-ID: AANLkTinXCmWWqBGxe2Z7Ub9j4orRs-9N-+RSK=BXJ674@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 27, 2010 at 11:18 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Not sure that information moves us forward.  If the postmaster cleared
> the memory, we would have COW in the child and probably be even slower.

Well, we can determine the answers to these questions empirically. I
think some more scrutiny of the code with the points you and Andres
and Tom have raised is probably in order, and probably some more
benchmarking, too. I haven't had a chance to do that yet, however.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-28 16:41:46
Message-ID: 21964.1290962506@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sat, Nov 27, 2010 at 11:18 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>> Not sure that information moves us forward. If the postmaster cleared
>> the memory, we would have COW in the child and probably be even slower.

> Well, we can determine the answers to these questions empirically.

Not really. Per Bruce's description, a page would become COW the moment
the postmaster touched (either write or read) any variable on it. Since
we have no control over how the loader lays out static variables, the
actual behavior of a particular build would be pretty random and subject
to unexpected changes caused by seemingly unrelated edits.

Also, the referenced URL only purports to describe the behavior of
HPUX, which is not exactly a mainstream OS. I think it requires a
considerable leap of faith to assume that all or even most platforms
work the way this suggests, and not in the dumber fashion Andres
suggested. Has anybody here actually looked at the relevant Linux
or BSD kernel code?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-28 19:38:07
Message-ID: AANLkTi=OBqvJtrvbZu7=aSu-GwpBRBGOYnVRLRAQhSDN@mail.gmail.com
Lists: pgsql-hackers

On Sun, Nov 28, 2010 at 11:41 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Sat, Nov 27, 2010 at 11:18 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
>>> Not sure that information moves us forward.  If the postmaster cleared
>>> the memory, we would have COW in the child and probably be even slower.
>
>> Well, we can determine the answers to these questions empirically.
>
> Not really.  Per Bruce's description, a page would become COW the moment
> the postmaster touched (either write or read) any variable on it.  Since
> we have no control over how the loader lays out static variables, the
> actual behavior of a particular build would be pretty random and subject
> to unexpected changes caused by seemingly unrelated edits.

Well, one big character array pretty much has to be laid out
contiguously, and it would be pretty surprising (but not entirely
impossible) to find that the linker randomly sprinkles symbols from
other files in between consecutive definitions in the same source
file. I think the next question to answer is how to allocate blame
for the memset/memcpy overhead between page faults and the zeroing
itself. That seems like something we can easily measure by writing a
test program that zeroes the same region twice and kicks out timing
numbers. If, as you and Andres are arguing, the actual zeroing is
minor, then we can forget this whole line of discussion and move on to
other possible optimizations. If that turns out not to be true then
we can worry about how best to avoid the zeroing. I have to believe
that's a solvable problem; the question is whether there's a benefit.
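
A minimal version of that test program might look like this (a sketch
only; first pass pays page faults plus zeroing, second pass pays
zeroing alone):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

static double
ms_between(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) * 1000.0 +
           (b.tv_usec - a.tv_usec) / 1000.0;
}

int
main(void)
{
    size_t          len = 256 * 1024 * 1024;
    char           *region = malloc(len);
    struct timeval  t0, t1, t2;

    if (region == NULL)
        return 1;
    gettimeofday(&t0, NULL);
    memset(region, 0, len);     /* first pass: page faults + zeroing */
    gettimeofday(&t1, NULL);
    memset(region, 0, len);     /* second pass: zeroing only */
    gettimeofday(&t2, NULL);
    printf("first pass: %.1f ms, second pass: %.1f ms\n",
           ms_between(t0, t1), ms_between(t1, t2));
    return 0;
}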

In a close race, I don't think we should get bogged down in
micro-optimization here, both because micro-optimizations may not gain
much and because what works well on one platform may not do much at
all on another. The more general issue here is what to do about our
high backend startup costs. Beyond trying to recycle backends for new
connections, as I've previously proposed and with all the problems it
entails, the only thing that looks promising here is to try to somehow
cut down on the cost of populating the catcache and relcache, not that
I have a very clear idea how to do that. This has to be a soluble
problem because other people have solved it. To some degree we're a
victim of our own flexible and extensible architecture here, but I
find it pretty unsatisfying to just say, OK, well, we're slow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-28 20:45:18
Message-ID: 201011282045.oASKjIq04185@momjian.us
Lists: pgsql-hackers

Robert Haas wrote:
> On Sat, Nov 27, 2010 at 11:18 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> > Not sure that information moves us forward. If the postmaster cleared
> > the memory, we would have COW in the child and probably be even slower.
>
> Well, we can determine the answers to these questions empirically. I
> think some more scrutiny of the code with the points you and Andres
> and Tom have raised is probably in order, and probably some more
> benchmarking, too. I haven't had a chance to do that yet, however.

Basically, my bet is if you allocated a large zero-data variable in the
postmaster but never accessed it from the postmaster, at most you would
copy-on-write (COW) fault in two pages, one at the beginning that is
shared by accessed variables, and one at the end. The remaining pages
(4k default for x86) would be zero-filled and not COW shared.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-28 20:53:00
Message-ID: 19514.1290977580@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> The more general issue here is what to do about our
> high backend startup costs. Beyond trying to recycle backends for new
> connections, as I've previous proposed and with all the problems it
> entails, the only thing that looks promising here is to try to somehow
> cut down on the cost of populating the catcache and relcache, not that
> I have a very clear idea how to do that.

One comment to make here is that it would be a serious error to focus on
the costs of just starting and stopping a backend; you have to think
about cases where the backend does at least some useful work in between,
and that means actually *populating* those caches (to some extent) not
just initializing them. Maybe your wording above was chosen with that
in mind, but I think onlookers might easily overlook the point.

FWIW, today I've been looking into getting rid of the silliness in
build_index_pathkeys whereby it reconstructs pathkey opfamily OIDs
from sortops instead of just using the index opfamilies directly.
It turns out that once you fix that, there is no need at all for
relcache to cache per-index operator data (the rd_operator arrays)
because that's the only code that uses 'em. I don't see any particular
change in the runtime of the regression tests from ripping out that
part of the cached data, but it ought to have at least some beneficial
effect on real startup time.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-28 23:23:09
Message-ID: AANLkTinkK-X5mCmW+MJzxGiJ9MO8EEc5GgmYBPHnOUJJ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 28, 2010 at 3:53 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> The more general issue here is what to do about our
>> high backend startup costs.  Beyond trying to recycle backends for new
>> connections, as I've previous proposed and with all the problems it
>> entails, the only thing that looks promising here is to try to somehow
>> cut down on the cost of populating the catcache and relcache, not that
>> I have a very clear idea how to do that.
>
> One comment to make here is that it would be a serious error to focus on
> the costs of just starting and stopping a backend; you have to think
> about cases where the backend does at least some useful work in between,
> and that means actually *populating* those caches (to some extent) not
> just initializing them.  Maybe your wording above was chosen with that
> in mind, but I think onlookers might easily overlook the point.

I did have that in mind, but I agree the point is worth mentioning.
So, for example, it wouldn't gain anything meaningful for us to
postpone catcache initialization until someone executes a query. It
would improve the synthetic benchmark, but that's it.

> FWIW, today I've been looking into getting rid of the silliness in
> build_index_pathkeys whereby it reconstructs pathkey opfamily OIDs
> from sortops instead of just using the index opfamilies directly.
> It turns out that once you fix that, there is no need at all for
> relcache to cache per-index operator data (the rd_operator arrays)
> because that's the only code that uses 'em.  I don't see any particular
> change in the runtime of the regression tests from ripping out that
> part of the cached data, but it ought to have at least some beneficial
> effect on real startup time.

Wow, that's great. The fact that it simplifies the code is probably
the main point, but obviously whatever cycles we can save during
startup (and ongoing operation) are all to the good.

One possible way to get a real speedup here would be to look for ways
to trim the number of catcaches. But I'm not too convinced there's
much water to squeeze out of that rock. After our recent conversation
about KNNGIST, it occurred to me to wonder whether there's really any
point in pretending that a user can usefully add an AM, both due to
hard-wired planner knowledge and due to lack of any sort of extensible
XLOG support. If not, we could potentially turn pg_am into a
hardcoded lookup table rather than a modifiable catalog, which would
also likely be faster; and perhaps reference AMs elsewhere with
characters rather than OIDs. But even if this were judged a sensible
thing to do I'm not very sure that even a purpose-built synthetic
benchmark would be able to measure the speedup.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-28 23:41:58
Message-ID: 26799.1290987718@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> After our recent conversation
> about KNNGIST, it occurred to me to wonder whether there's really any
> point in pretending that a user can usefully add an AM, both due to
> hard-wired planner knowledge and due to lack of any sort of extensible
> XLOG support. If not, we could potentially turn pg_am into a
> hardcoded lookup table rather than a modifiable catalog, which would
> also likely be faster; and perhaps reference AMs elsewhere with
> characters rather than OIDs. But even if this were judged a sensible
> thing to do I'm not very sure that even a purpose-built synthetic
> benchmark would be able to measure the speedup.

Well, the lack of extensible XLOG support is definitely a big handicap
to building a *production* index AM as an add-on. But it's not such a
handicap for development. And I don't believe that the planner is
hardwired in any way that doesn't allow new index types. GIST and GIN
have both been added successfully without kluging the planner. It does
know a lot more about btree than other index types, but that doesn't
mean you can't add a new index type that doesn't behave like btree;
that's more reflective of where development effort has been spent.

So I would consider the above idea a step backwards, and I doubt it
would save anything meaningful anyway.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 00:08:10
Message-ID: AANLkTikASxu1gkSGOedyVbAMR1Va-5wED8YuHnZDHQ7a@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 28, 2010 at 6:41 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> After our recent conversation
>> about KNNGIST, it occurred to me to wonder whether there's really any
>> point in pretending that a user can usefully add an AM, both due to
>> hard-wired planner knowledge and due to lack of any sort of extensible
>> XLOG support.  If not, we could potentially turn pg_am into a
>> hardcoded lookup table rather than a modifiable catalog, which would
>> also likely be faster; and perhaps reference AMs elsewhere with
>> characters rather than OIDs.  But even if this were judged a sensible
>> thing to do I'm not very sure that even a purpose-built synthetic
>> benchmark would be able to measure the speedup.
>
> Well, the lack of extensible XLOG support is definitely a big handicap
> to building a *production* index AM as an add-on.  But it's not such a
> handicap for development.

Realistically, it's hard for me to imagine that anyone would go to the
trouble of building it as a loadable module first and then converting
it to part of core later on. That'd hardly be less work.

> And I don't believe that the planner is
> hardwired in any way that doesn't allow new index types.  GIST and GIN
> have both been added successfully without kluging the planner.

We have 9 boolean flags to indicate the capabilities (or lack thereof)
of AMs, and we only have 4 AMs. It seems altogether plausible to
assume that the next AM we add could require flags 10 and 11. Heck, I
think KNNGIST is going to require another flag... which will likely
never be set for any AM other than GIST.

> It does
> know a lot more about btree than other index types, but that doesn't
> mean you can't add a new index type that doesn't behave like btree;
> that's more reflective of where development effort has been spent.
>
> So I would consider the above idea a step backwards, and I doubt it
> would save anything meaningful anyway.

That latter point, as far as I'm concerned, is the real nail in the coffin.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 00:15:43
Message-ID: 27442.1290989743@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> One possible way to get a real speedup here would be to look for ways
> to trim the number of catcaches.

BTW, it's not going to help to remove catcaches that have a small
initial size, as the pg_am cache certainly does. If the bucket zeroing
cost is really something to minimize, it's only the caches with the
largest nbuckets counts that are worth considering --- and we certainly
can't remove those without penalty.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 00:33:01
Message-ID: 27759.1290990781@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

BTW, this might be premature to mention pending some tests about mapping
versus zeroing overhead, but it strikes me that there's more than one
way to skin a cat. I still think the idea of statically allocated space
sucks. But what if we rearranged things so that palloc0 doesn't consist
of palloc-then-memset, but rather push the zeroing responsibility down
into the allocator? In particular, I'm imagining that palloc0 with a
sufficiently large space request --- more than a couple pages --- could
somehow arrange to get space that's guaranteed zero already. And if the
request isn't large, zeroing it isn't where our problem is anyhow.

The most portable way to do that would be to use calloc instead of malloc,
and hope that libc is smart enough to provide freshly-mapped space.
It would be good to look and see whether glibc actually does so,
of course. If not we might end up having to mess with sbrk for
ourselves, and I'm not sure how pleasantly that interacts with malloc.
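
In sketch form (illustrative only, not actual allocator code; the real
thing would have to go through the memory-context machinery and report
errors properly):

#include <stdlib.h>
#include <string.h>

#define LARGE_REQUEST   (8 * 4096)      /* "more than a couple pages" */

void *
palloc0_sketch(size_t size)
{
    if (size >= LARGE_REQUEST)
        return calloc(1, size);   /* hope for freshly-mapped zero pages */

    void *ptr = malloc(size);     /* small request: zeroing is cheap anyway */
    if (ptr != NULL)
        memset(ptr, 0, size);
    return ptr;
}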

Another question that would be worth asking here is whether the
hand-baked MemSet macro still outruns memset on modern architectures.
I think it's been quite a few years since that was last tested.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 00:52:09
Message-ID: AANLkTi=G8yC+JarrLT7y3NBF2DcKGm9FbAXW8rQjABMZ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 29, 2010 at 12:33 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The most portable way to do that would be to use calloc insted of malloc,
> and hope that libc is smart enough to provide freshly-mapped space.
> It would be good to look and see whether glibc actually does so,
> of course.  If not we might end up having to mess with sbrk for
> ourselves, and I'm not sure how pleasantly that interacts with malloc.

It's *supposed* to interact fine. The only thing I wonder about is that
I think malloc intentionally uses mmap for larger allocations, but I'm
not clear what the advantages are. Is it because it's a cheaper way to
get zeroed bytes? Or just so that free has a hope of returning the
allocations to the OS?

> Another question that would be worth asking here is whether the
> hand-baked MemSet macro still outruns memset on modern architectures.
> I think it's been quite a few years since that was last tested.

I know glibc has some sexy memset macros for cases where the size is a
constant. I'm not sure there's been much of an advance in the general
case though. This would tend to imply we should consider going the
other direction of having the caller of palloc0 do the zeroing
instead. Or making palloc0 a macro which expands to include calling
memset with the parameter inlined.
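
Something like this, for illustration (a GCC statement expression, not a
concrete proposal):

#include <string.h>

extern void *palloc(size_t size);   /* PostgreSQL's allocator, assumed */

/* With palloc0 as a macro, a compile-time-constant size is visible to
 * memset at each call site, so glibc can expand the memset inline. */
#define palloc0_inline(sz) \
    ({ void *tmp_ = palloc(sz); memset(tmp_, 0, (sz)); tmp_; })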

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 01:01:35
Message-ID: 28452.1290992495@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> On Mon, Nov 29, 2010 at 12:33 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Another question that would be worth asking here is whether the
>> hand-baked MemSet macro still outruns memset on modern architectures.
>> I think it's been quite a few years since that was last tested.

> I know glibc has some sexy memset macros for cases where the size is a
> constant. I'm not sure there's been much of an advance in the general
> case though. This would tend to imply we should consider going the
> other direction of having the caller of palloc0 do the zeroing
> instead. Or making palloc0 a macro which expands to include calling
> memset with the parameter inlined.

Well, that was exactly the reason why we did it the way we do it.
However, I think it's probably only node allocations where the size
is likely to be constant and hence result in a win. Perhaps we should
implement makeNode() differently from the general case.
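
Along the lines of (illustrative only; the real makeNode() also sets the
node tag, and error handling is omitted):

#include <string.h>

extern void *palloc(size_t size);   /* PostgreSQL's allocator, assumed */

#define makeNode_sketch(_type_) \
    ((_type_ *) memset(palloc(sizeof(_type_)), 0, sizeof(_type_)))

Here sizeof(_type_) is a compile-time constant at every call site, so the
memset can be expanded inline.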

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 01:34:09
Message-ID: AANLkTik3h4VZYibcX8=jw49ahbnKeDzQyRjS-9GsjhNo@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 28, 2010 at 7:15 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> One possible way to get a real speedup here would be to look for ways
>> to trim the number of catcaches.
>
> BTW, it's not going to help to remove catcaches that have a small
> initial size, as the pg_am cache certainly does.  If the bucket zeroing
> cost is really something to minimize, it's only the caches with the
> largest nbuckets counts that are worth considering --- and we certainly
> can't remove those without penalty.

Yeah, very true. What's a bit frustrating about the whole thing is
that we spend a lot of time pulling data into the caches that's
basically static and never likely to change anywhere, ever. I bet the
number of people for whom <(int4, int4) has any non-standard
properties is somewhere between slim and none; and it might well be
the case that formrdesc() is faster than reading the relcache init
file, if we didn't need to worry about deviation from canonical. This
is even more frustrating in the hypothetical situation where a backend
can switch databases, because we have to blow away all of these cache
entries that are 99.9% likely to be basically identical in the old and
new databases.

The relation descriptors for pg_class and pg_attribute are examples of
things it would be nice to hardwire and never need to update. We are
really pretty much screwed if there is any meaningful deviation from
what is expected, but relpages, reltuples, and relfrozenxid - and
maybe relacl or reloptions - can legitimately vary between databases.

Maybe we could speed things up a bit if we got rid of the pg_attribute
entries for the system attributes (except OID).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 04:51:43
Message-ID: 2532.1291006303@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> Yeah, very true. What's a bit frustrating about the whole thing is
> that we spend a lot of time pulling data into the caches that's
> basically static and never likely to change anywhere, ever.

True. I wonder if we could do something like the relcache init file
for the catcaches.

> Maybe we could speed things up a bit if we got rid of the pg_attribute
> entries for the system attributes (except OID).

I used to have high hopes for that idea, but the column privileges
patch broke it permanently.

regards, tom lane


From: Dimitri Fontaine <dimitri(at)2ndQuadrant(dot)fr>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 10:04:03
Message-ID: 87mxoszbsc.fsf@hi-media-techno.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Well, the lack of extensible XLOG support is definitely a big handicap
>> to building a *production* index AM as an add-on.  But it's not such a
>> handicap for development.
>
> Realistically, it's hard for me to imagine that anyone would go to the
> trouble of building it as a loadable module first and then converting
> it to part of core later on. That'd hardly be less work.

Well, it depends a lot on external factors, like wanting to use the code
before spending the QA time necessary for it to land in core. Two
particular examples come to mind: tsearch and KNN GiST. The main
problems with integrating into core, AFAIUI, are related to code
maintenance, not at all to the code stability and quality of the add-on
itself.

It's just so much easier to develop an external module…

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 16:57:51
Message-ID: AANLkTimE1cot6_8fyPhc+ZK=gZsKwvkWVTY+ESp_6gty@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Nov 28, 2010 at 11:51 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> Yeah, very true.  What's a bit frustrating about the whole thing is
>> that we spend a lot of time pulling data into the caches that's
>> basically static and never likely to change anywhere, ever.
>
> True.  I wonder if we could do something like the relcache init file
> for the catcaches.

Maybe. It's hard to know exactly what to pull in, though, nor is it
clear to me how much it would really save. You've got to keep the
thing up to date somehow, too.

I finally got around to doing some testing of
page-faults-versus-actually-memory-initialization, using the attached
test program, compiled with warnings, but without optimization.
Typical results on MacOS X:

first run: 297299
second run: 99653

And on Fedora 12 (2.6.32.23-170.fc12.x86_64):

first run: 509309
second run: 114721

I guess the word "run" is misleading (I wrote the program in 5
minutes); it's just zeroing the same chunk twice and measuring the
times. The difference is presumably the page fault overhead, which
implies that faulting is two-thirds of the overhead on MacOS X and
three-quarters of the overhead on Linux. This makes me pretty
pessimistic about the chances of a meaningful speedup here.

>> Maybe we could speed things up a bit if we got rid of the pg_attribute
>> entries for the system attributes (except OID).
>
> I used to have high hopes for that idea, but the column privileges
> patch broke it permanently.

http://archives.postgresql.org/pgsql-hackers/2010-07/msg00151.php

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
zero.c text/x-csrc 700 bytes

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 17:24:54
Message-ID: 201011291824.54762.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Monday 29 November 2010 17:57:51 Robert Haas wrote:
> On Sun, Nov 28, 2010 at 11:51 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> Yeah, very true. What's a bit frustrating about the whole thing is
> >> that we spend a lot of time pulling data into the caches that's
> >> basically static and never likely to change anywhere, ever.
> >
> > True. I wonder if we could do something like the relcache init file
> > for the catcaches.
>
> Maybe. It's hard to know exactly what to pull in, though, nor is it
> clear to me how much it would really save. You've got to keep the
> thing up to date somehow, too.
>
> I finally got around to doing some testing of
> page-faults-versus-actually-memory-initialization, using the attached
> test program, compiled with warnings, but without optimization.
> Typical results on MacOS X:
>
> first run: 297299
> second run: 99653
>
> And on Fedora 12 (2.6.32.23-170.fc12.x86_64):
>
> first run: 509309
> second run: 114721
Hm. A quick test shows that it's quite a bit faster if you allocate memory
with:

size_t s = 512*1024*1024;
char *bss = mmap(0, s, PROT_READ|PROT_WRITE,
                 MAP_PRIVATE|MAP_POPULATE|MAP_ANONYMOUS, -1, 0);

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 17:34:02
Message-ID: AANLkTi=0FxbLHzb2bRH8nDE+K0k-Sd1MLufUTysDYKpA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 29, 2010 at 12:24 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Hm. A quick test shows that its quite a bit faster if you allocate memory
> with:
> size_t s = 512*1024*1024;
> char *bss = mmap(0, s, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_POPULATE|
> MAP_ANONYMOUS, -1, 0);

Numbers?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 17:49:12
Message-ID: 201011291849.12406.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Monday 29 November 2010 18:34:02 Robert Haas wrote:
> On Mon, Nov 29, 2010 at 12:24 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Hm. A quick test shows that its quite a bit faster if you allocate memory
> > with:
> > size_t s = 512*1024*1024;
> > char *bss = mmap(0, s, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_POPULATE|
> > MAP_ANONYMOUS, -1, 0);
>
> Numbers?
malloc alloc: 43
malloc memset1: 438763
malloc memset2: 98764
total: 537570

mmap alloc: 296065
mmap memset1: 99203
mmap memset2: 100608
total: 495876

But you don't actually need the memset1 in the mmap case as MAP_ANONYMOUS
memory is already zeroed. We could actually use that knowledge even without
MAP_POPULATE if we somehow keep track of whether an allocated memory region is
still zeroed.

Taking that into account its:

malloc alloc: 47
malloc memset1: 437819
malloc memset2: 98317
total: 536183
mmap alloc: 292904
mmap memset1: 1
mmap memset2: 99284
total: 392189

I am somewhat reluctant to believe that's the way to go.
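
For reference, the shape of the comparison (error checking and the timing
calls omitted here; the attached program has the details):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define S (512 * 1024 * 1024)

int
main(void)
{
    /* malloc path: the first memset faults the pages in, the second
     * only rewrites already-resident pages */
    char *m = malloc(S);
    memset(m, 0, S);        /* "memset1" */
    memset(m, 0, S);        /* "memset2" */
    free(m);

    /* mmap path: MAP_POPULATE prefaults the pages, and MAP_ANONYMOUS
     * memory is already zeroed, so "memset1" is redundant in principle */
    char *b = mmap(0, S, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_POPULATE | MAP_ANONYMOUS, -1, 0);
    memset(b, 0, S);        /* "memset1" */
    memset(b, 0, S);        /* "memset2" */
    munmap(b, S);
    return 0;
}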

Andres

Attachment Content-Type Size
zero.c text/x-csrc 1.5 KB

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 17:50:08
Message-ID: AANLkTimdgS0nJ4wEBKkX2AZJDejLtQYOCHa7_69jeHrG@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 29, 2010 at 9:24 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Monday 29 November 2010 17:57:51 Robert Haas wrote:
>> On Sun, Nov 28, 2010 at 11:51 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> >> Yeah, very true.  What's a bit frustrating about the whole thing is
>> >> that we spend a lot of time pulling data into the caches that's
>> >> basically static and never likely to change anywhere, ever.
>> >
>> > True.  I wonder if we could do something like the relcache init file
>> > for the catcaches.
>>
>> Maybe.  It's hard to know exactly what to pull in, though, nor is it
>> clear to me how much it would really save.  You've got to keep the
>> thing up to date somehow, too.
>>
>> I finally got around to doing some testing of
>> page-faults-versus-actually-memory-initialization, using the attached
>> test program, compiled with warnings, but without optimization.
>> Typical results on MacOS X:
>>
>> first run: 297299
>> second run: 99653
>>
>> And on Fedora 12 (2.6.32.23-170.fc12.x86_64):
>>
>> first run: 509309
>> second run: 114721
> Hm. A quick test shows that its quite a bit faster if you allocate memory
> with:
> size_t s = 512*1024*1024;
> char *bss = mmap(0, s, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_POPULATE|
> MAP_ANONYMOUS, -1, 0);

Could you post the program?

Are you sure you haven't just moved the page-fault time to a part of
the code where it still exists, but just isn't being captured and
reported?

Cheers,

Jeff


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 17:50:12
Message-ID: 189.1291053012@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I guess the word "run" is misleading (I wrote the program in 5
> minutes); it's just zeroing the same chunk twice and measuring the
> times. The difference is presumably the page fault overhead, which
> implies that faulting is two-thirds of the overhead on MacOS X and
> three-quarters of the overhead on Linux.

Ah, cute solution to the measurement problem. I replicated the
experiment just as a cross-check:

Fedora 13 on x86_64 (recent Nehalem):
first run: 346767
second run: 103143

Darwin on x86_64 (not-so-recent Penryn):
first run: 341289
second run: 64535

HPUX on HPPA:
first run: 2191136
second run: 1199879

(On the last two machines I had to cut the array size to 256MB to avoid
swapping.) All builds with "gcc -O2".

> This makes me pretty
> pessimistic about the chances of a meaningful speedup here.

Yeah, this is confirmation that what you are seeing in the original test
is mostly about faulting pages in, not about the zeroing. I think it
would still be interesting to revisit the micro-optimization of
MemSet(), but it doesn't look like massive restructuring to avoid it
altogether is going to be worthwhile.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 18:10:07
Message-ID: 689.1291054207@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> Are you sure you haven't just moved the page-fault time to a part of
> the code where it still exists, but just isn't being captured and
> reported?

I'm a bit suspicious about that too. Another thing to keep in mind
is that Robert's original program doesn't guarantee that the char
array is maxaligned; though reasonable implementations of memset
should be able to use the same inner loop anyway for most of the
array.

I did some experimentation here and couldn't find any real difference in
runtime between the original program and substituting a malloc() call
for the static array allocation. Rolling in calloc in place of
malloc/memset made no particular difference either, which says that
Fedora 13's glibc does not have any optimization for that case as I'd
hoped.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-29 18:10:45
Message-ID: AANLkTimfnzvahHYCVBNt4efDG0PZ+Feb5JUcuchG2mDt@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 29, 2010 at 12:50 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> (On the last two machines I had to cut the array size to 256MB to avoid
> swapping.)

You weren't kidding about that "not so recent" part. :-)

>> This makes me pretty
>> pessimistic about the chances of a meaningful speedup here.
>
> Yeah, this is confirmation that what you are seeing in the original test
> is mostly about faulting pages in, not about the zeroing.  I think it
> would still be interesting to revisit the micro-optimization of
> MemSet(), but it doesn't look like massive restructuring to avoid it
> altogether is going to be worthwhile.

Yep. I think that what we've established here is that starting new
processes all the time is just plain expensive, and we're going to
have to start fewer of them if we want to make a meaningful
improvement.

My impression is that the process startup overhead is even higher on
Windows, although I am not now nor have I ever been a Windows
programmer.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-30 01:09:39
Message-ID: 201011300109.oAU19eG05156@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Sat, Nov 27, 2010 at 11:18 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> >> Not sure that information moves us forward. If the postmaster cleared
> >> the memory, we would have COW in the child and probably be even slower.
>
> > Well, we can determine the answers to these questions empirically.
>
> Not really. Per Bruce's description, a page would become COW the moment
> the postmaster touched (either write or read) any variable on it. Since
> we have no control over how the loader lays out static variables, the
> actual behavior of a particular build would be pretty random and subject
> to unexpected changes caused by seemingly unrelated edits.

I believe all linkers will put initialized data ("data" segment) before
uninitialized data ("bss" segment):

http://en.wikipedia.org/wiki/Data_segment

The only question is whether the linker has data and bss sharing the
same VM page (4k), or whether a new VM page is used when starting the
bss segment.

> Also, the referenced URL only purports to describe the behavior of
> HPUX, which is not exactly a mainstream OS. I think it requires a
> considerable leap of faith to assume that all or even most platforms
> work the way this suggests, and not in the dumber fashion Andres
> suggested. Has anybody here actually looked at the relevant Linux
> or BSD kernel code?

I have, years ago, but not recently. You can see the sections on Linux
via objdump:

$ objdump --headers /bin/ls

/bin/ls: file format elf32-i386

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
...
 24 .data         0000012c  080611a0  080611a0  000191a0  2**5
                  CONTENTS, ALLOC, LOAD, DATA
 25 .bss          00000c40  080612e0  080612e0  000192cc  2**5
                  ALLOC

Based on this output, a new 4k page is not started for the 'bss'
segment. It basically uses 32-byte alignment.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-30 01:13:26
Message-ID: 201011300113.oAU1DQ305543@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:
> In a close race, I don't think we should get bogged down in
> micro-optimization here, both because micro-optimizations may not gain
> much and because what works well on one platform may not do much at
> all on another. The more general issue here is what to do about our
> high backend startup costs. Beyond trying to recycle backends for new
> connections, as I've previous proposed and with all the problems it
> entails, the only thing that looks promising here is to try to somehow
> cut down on the cost of populating the catcache and relcache, not that
> I have a very clear idea how to do that. This has to be a soluble
> problem because other people have solved it. To some degree we're a
> victim of our own flexible and extensible architecture here, but I
> find it pretty unsatisfying to just say, OK, well, we're slow.

Combining your last two sentences, I am not sure anyone with the
flexibility we have has solved the "cache populating" problem.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-30 01:28:56
Message-ID: 201011300128.oAU1SuG07096@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> BTW, this might be premature to mention pending some tests about mapping
> versus zeroing overhead, but it strikes me that there's more than one
> way to skin a cat. I still think the idea of statically allocated space
> sucks. But what if we rearranged things so that palloc0 doesn't consist
> of palloc-then-memset, but rather push the zeroing responsibility down
> into the allocator? In particular, I'm imagining that palloc0 with a
> sufficiently large space request --- more than a couple pages --- could
> somehow arrange to get space that's guaranteed zero already. And if the
> request isn't large, zeroing it isn't where our problem is anyhow.

> The most portable way to do that would be to use calloc insted of malloc,
> and hope that libc is smart enough to provide freshly-mapped space.
> It would be good to look and see whether glibc actually does so,
> of course. If not we might end up having to mess with sbrk for
> ourselves, and I'm not sure how pleasantly that interacts with malloc.

Yes, I was going to suggest trying calloc(), either because we can get
already-zeroed sbrk() memory, or because libc uses assembly language for
zeroing memory, as some good libc's do. I know most kernels also use
assembly for zeroing memory.

> Another question that would be worth asking here is whether the
> hand-baked MemSet macro still outruns memset on modern architectures.
> I think it's been quite a few years since that was last tested.

Yes, MemSet was found to be faster than calling a C function, but new
testing is certainly warranted.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-30 01:34:06
Message-ID: 201011300134.oAU1Y6J08070@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:
> On Sun, Nov 28, 2010 at 7:15 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> One possible way to get a real speedup here would be to look for ways
> >> to trim the number of catcaches.
> >
> > BTW, it's not going to help to remove catcaches that have a small
> > initial size, as the pg_am cache certainly does. If the bucket zeroing
> > cost is really something to minimize, it's only the caches with the
> > largest nbuckets counts that are worth considering --- and we certainly
> > can't remove those without penalty.
>
> Yeah, very true. What's a bit frustrating about the whole thing is
> that we spend a lot of time pulling data into the caches that's
> basically static and never likely to change anywhere, ever. I bet the
> number of people for whom <(int4, int4) has any non-standard
> properties is somewhere between slim and none; and it might well be
> the case that formrdesc() is faster than reading the relcache init
> file, if we didn't need to worry about deviation from canonical. This
> is even more frustrating in the hypothetical situation where a backend
> can switch databases, because we have to blow away all of these cache
> entries that are 99.9% likely to be basically identical in the old and
> new databases.

It is very tempting to look at optimizations here, but I am worried we
might head down the flat-files path that caused continual problems
in the past.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-30 03:35:52
Message-ID: 201011300335.oAU3Zqd24335@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Greg Stark wrote:
> On Mon, Nov 29, 2010 at 12:33 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > The most portable way to do that would be to use calloc insted of malloc,
> > and hope that libc is smart enough to provide freshly-mapped space.
> > It would be good to look and see whether glibc actually does so,
> > of course. If not we might end up having to mess with sbrk for
> > ourselves, and I'm not sure how pleasantly that interacts with malloc.
>
> It's *supposed* to interact fine. The only thing I wonder is that I
> think malloc intentionally uses mmap for larger allocations but I'm
> not clear what the advantages are. Is it because it's a cheaper way to
> get zeroed bytes? Or just so that free has a hope of returning the
> allocations to the OS?

Using mmap() so you can return large allocations to the OS is a neat
trick, certainly. I am not sure who implements that.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-30 07:21:29
Message-ID: 201011300821.30383.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Monday 29 November 2010 19:10:07 Tom Lane wrote:
> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> > Are you sure you haven't just moved the page-fault time to a part of
> > the code where it still exists, but just isn't being captured and
> > reported?
>
> I'm a bit suspicious about that too. Another thing to keep in mind
> is that Robert's original program doesn't guarantee that the char
> array is maxaligned; though reasonable implementations of memset
> should be able to use the same inner loop anyway for most of the
> array.
Yes, I measured the time including mmap itself. I don't find it surprising
that it's measurably faster, as it can just set up the mappings without
explicitly faulting in each and every page. The benefit is too small to
worry about, though, so ...

My reply to Robert includes the timings + test program.

Andres


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-30 19:37:13
Message-ID: 1291145833.13957.12.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On mån, 2010-11-29 at 13:10 -0500, Tom Lane wrote:
> Rolling in calloc in place of
> malloc/memset made no particular difference either, which says that
> Fedora 13's glibc does not have any optimization for that case as I'd
> hoped.

glibc's calloc is either mmap of /dev/zero or malloc followed by memset.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Peter Eisentraut <peter_e(at)gmx(dot)net>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-11-30 20:49:07
Message-ID: 3599.1291150147@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> On mån, 2010-11-29 at 13:10 -0500, Tom Lane wrote:
>> Rolling in calloc in place of
>> malloc/memset made no particular difference either, which says that
>> Fedora 13's glibc does not have any optimization for that case as I'd
>> hoped.

> glibc's calloc is either mmap of /dev/zero or malloc followed by memset.

Hmm. I would have expected to see a difference then. Do you know what
conditions are needed to cause the mmap to be used?

regards, tom lane


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-12-01 04:32:23
Message-ID: AANLkTik2Y-03aZ-wxNJ3k=dVirCPvcce9-7DFFrFSPXw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/28/10, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> In a close race, I don't think we should get bogged down in
> micro-optimization here, both because micro-optimizations may not gain
> much and because what works well on one platform may not do much at
> all on another. The more general issue here is what to do about our
> high backend startup costs. Beyond trying to recycle backends for new
> connections, as I've previous proposed and with all the problems it
> entails,

Is there a particular discussion of that matter you could point me to?

> the only thing that looks promising here is to try to somehow
> cut down on the cost of populating the catcache and relcache, not that
> I have a very clear idea how to do that. This has to be a soluble
> problem because other people have solved it.

Oracle's backend startup time seems to be way higher than PG's.
Their main solution is fundamentally a built-in connection pooler
with some bells and whistles. I'm not sure which "other people" you
had in mind--Oracle is generally the one that pops to my mind.

> To some degree we're a
> victim of our own flexible and extensible architecture here, but I
> find it pretty unsatisfying to just say, OK, well, we're slow.

What about "well OK, we have PGbouncer"? Are there fixable
short-comings that it has which could make the issue less of an issue?

Cheers,

Jeff


From: Peter Eisentraut <peter_e(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-12-01 05:14:56
Message-ID: 1291180496.601.1.camel@vanquo.pezone.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On tis, 2010-11-30 at 15:49 -0500, Tom Lane wrote:
> Peter Eisentraut <peter_e(at)gmx(dot)net> writes:
> > On mån, 2010-11-29 at 13:10 -0500, Tom Lane wrote:
> >> Rolling in calloc in place of
> >> malloc/memset made no particular difference either, which says that
> >> Fedora 13's glibc does not have any optimization for that case as I'd
> >> hoped.
>
> > glibc's calloc is either mmap of /dev/zero or malloc followed by memset.
>
> Hmm. I would have expected to see a difference then. Do you know what
> conditions are needed to cause the mmap to be used?

Check out the mallopt(3) man page. It contains a few tunable malloc
options that may be useful for your investigation.
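
For instance (glibc-specific, and the value here is just an example):

#include <malloc.h>

int
main(void)
{
    /* Requests at or above this size are served by mmap(), which
     * returns freshly mapped -- hence already zeroed -- pages. */
    mallopt(M_MMAP_THRESHOLD, 64 * 1024);
    return 0;
}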


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-12-01 14:20:32
Message-ID: AANLkTimDjra5vhyszUe=i9iqQf_6ZGFTB9e+b98ExJ9w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 30, 2010 at 11:32 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On 11/28/10, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>
>> In a close race, I don't think we should get bogged down in
>> micro-optimization here, both because micro-optimizations may not gain
>> much and because what works well on one platform may not do much at
>> all on another.  The more general issue here is what to do about our
>> high backend startup costs.  Beyond trying to recycle backends for new
>> connections, as I've previous proposed and with all the problems it
>> entails,
>
> Is there a particular discussion of that matter you could point me to?
>
>> the only thing that looks promising here is to try to somehow
>> cut down on the cost of populating the catcache and relcache, not that
>> I have a very clear idea how to do that.  This has to be a soluble
>> problem because other people have solved it.
>
> Oracle's backend start up time seems to be way higher than PG's.
> Their main solution is something that is fundamentally a built in
> connection pooler with some bells and whistles built in.   I'm not
> sure "other people" you had in mind--Oracle is generally the one that
> pops to my mind.

Interesting. How about MySQL and SQL Server?

>> To some degree we're a
>> victim of our own flexible and extensible architecture here, but I
>> find it pretty unsatisfying to just say, OK, well, we're slow.
>
> What about "well OK, we have PGbouncer"?  Are there fixable
> short-comings that it has which could make the issue less of an issue?

We do have pgbouncer, and pgpool-II, and that's a good thing. But it
also requires proxying every interaction with the database through an
intermediate piece of software, which is not free. An in-core
solution ought to be able to arrange for each new connection to be
directly attached to an existing backend, using file-descriptor
passing. Tom has previously complained that this isn't portable, but
a little research suggests that it is supported on at least Linux, Mac
OS X, FreeBSD, OpenBSD, Solaris, and Windows, so in practice the
percentage of our user base who could benefit seems like it would
likely be very high.
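
On the Unix-like systems this would be the usual sendmsg()/SCM_RIGHTS
dance over a Unix-domain socket, something like the sketch below (error
paths and the receiving side omitted; Windows needs a different
mechanism, such as WSADuplicateSocket):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Pass fd_to_pass to the process at the other end of the Unix-domain
 * socket "channel".  Returns 0 on success, -1 on failure. */
static int
send_fd(int channel, int fd_to_pass)
{
    struct msghdr msg;
    struct iovec iov;
    char dummy = 'x';                   /* must transfer at least one byte */
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    iov.iov_base = &dummy;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return sendmsg(channel, &msg, 0) < 0 ? -1 : 0;
}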

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-12-01 14:24:07
Message-ID: 201012011524.08672.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wednesday 01 December 2010 15:20:32 Robert Haas wrote:
> On Tue, Nov 30, 2010 at 11:32 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> > On 11/28/10, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >> To some degree we're a
> >> victim of our own flexible and extensible architecture here, but I
> >> find it pretty unsatisfying to just say, OK, well, we're slow.
> >
> > What about "well OK, we have PGbouncer"? Are there fixable
> > short-comings that it has which could make the issue less of an issue?
>
> We do have pgbouncer, and pgpool-II, and that's a good thing. But it
> also requires proxying every interaction with the database through an
> intermediate piece of software, which is not free. An in-core
> solution ought to be able to arrange for each new connection to be
> directly attached to an existing backend, using file-descriptor
> passing. Tom has previously complained that this isn't portable, but
> a little research suggests that it is supported on at least Linux, Mac
> OS X, FreeBSD, OpenBSD, Solaris, and Windows, so in practice the
> percentage of our user base who could benefit seems like it would
> likely be very high.
HPUX and AIX allow fd transfer as well. I still don't see which even
remotely relevant platform would be a problem.

Andres


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>
Cc: "Andres Freund" <andres(at)anarazel(dot)de>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Bruce Momjian" <bruce(at)momjian(dot)us>, <pgsql-hackers(at)postgresql(dot)org>,"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: profiling connection overhead
Date: 2010-12-01 15:26:24
Message-ID: 4CF614C00200002500038000@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

>> Oracle's backend start up time seems to be way higher than PG's.

> Interesting. How about MySQL and SQL Server?

My recollection of Sybase ASE is that establishing a connection
doesn't start a backend or even a thread. It establishes a network
connection and associates network queues and a connection context
structure with it. "Engine" threads with CPU affinity (and a few
miscellaneous "worker" threads, too, if I remember right) do all the
work in a queue-based fashion.

Last I worked with MS SQL Server it was based on the Sybase code and
therefore worked the same way. I know they've made a lot of changes
in the last five years (including switching to MVCC and adding
snapshot isolation in addition to the already-existing serializable
isolation), so I don't know whether connection startup cost has
changed along the way.

-Kevin


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-12-05 01:04:43
Message-ID: AANLkTimP0XOVPR5-xGr1=m40Mri6NqGUXgJUQCDEwuiO@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Dec 1, 2010 at 6:20 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Tue, Nov 30, 2010 at 11:32 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>> On 11/28/10, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>>
>>> In a close race, I don't think we should get bogged down in
>>> micro-optimization here, both because micro-optimizations may not gain
>>> much and because what works well on one platform may not do much at
>>> all on another.  The more general issue here is what to do about our
>>> high backend startup costs.  Beyond trying to recycle backends for new
>>> connections, as I've previously proposed and with all the problems it
>>> entails,
>>
>> Is there a particular discussion of that matter you could point me to?
>>
>>> the only thing that looks promising here is to try to somehow
>>> cut down on the cost of populating the catcache and relcache, not that
>>> I have a very clear idea how to do that.  This has to be a soluble
>>> problem because other people have solved it.
>>
>> Oracle's backend start up time seems to be way higher than PG's.
>> Their main solution is something that is fundamentally a built-in
>> connection pooler with some bells and whistles.  I'm not sure which
>> "other people" you had in mind--Oracle is generally the one that
>> pops to my mind.
>
> Interesting.  How about MySQL and SQL Server?

I don't have experience with MS SQL Server, and don't know how it
performs on that front. I haven't really considered MySQL to be a
"real" RDBMS, more of just an indexing system, although I guess it is
steadily becoming more featureful. It is indisputably faster at
making connections than PG, but still much slower than a connection
pooler.

>
>>> To some degree we're a
>>> victim of our own flexible and extensible architecture here, but I
>>> find it pretty unsatisfying to just say, OK, well, we're slow.
>>
>> What about "well OK, we have PGbouncer"?  Are there fixable
>> short-comings that it has which could make the issue less of an issue?
>
> We do have pgbouncer, and pgpool-II, and that's a good thing.  But it
> also requires proxying every interaction with the database through an
> intermediate piece of software, which is not free.

True, a simple in-memory benchmark with pgbench -S -c1 showed 10,000
tps connecting straight, and 7000 tps through pgbouncer. But if
people want to make and break 100s of connections per second, they
must not be doing very many queries per connection, so I don't know how
relevant that per-query slowdown is.
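
Roughly, the shape of that comparison was (the duration flag here is
illustrative; 6432 is pgbouncer's default port):

pgbench -S -c 1 -T 60 pgbench           # connecting straight
pgbench -S -c 1 -T 60 -p 6432 pgbench   # through pgbouncer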

> An in-core
> solution ought to be able to arrange for each new connection to be
> directly attached to an existing backend, using file-descriptor
> passing.

But who would be doing the passing? For the postmaster to be doing
that would probably go against the minimalist design. It would have
to keep track of which backend is available, and which db and user it
is primed for. Perhaps a feature could be added to the backend to
allow it to get passed a FD from pgbouncer or pgpool-II and then hand
control back to the pooler upon "close" of the connection, as they
already have the infrastructure to keep pools around while the
postmaster does not. Are pgbouncer and pgpool close enough to "core"
to make such intimate collaboration with the backend OK?
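
For what it's worth, the receiving half of such a handoff (what a
backend would run when the pooler passes it a client socket) is the
mirror image of the SCM_RIGHTS send. A sketch only, with error and
EINTR handling omitted:

#include <string.h>
#include <sys/socket.h>

/* Receive one descriptor over the Unix-domain socket "sock", or -1. */
static int
recv_fd(int sock)
{
    struct msghdr   msg;
    struct iovec    iov;
    struct cmsghdr *cmsg;
    char            payload;
    char            cbuf[CMSG_SPACE(sizeof(int))];
    int             fd = -1;

    memset(&msg, 0, sizeof(msg));
    iov.iov_base = &payload;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg != NULL &&
        cmsg->cmsg_level == SOL_SOCKET &&
        cmsg->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}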

Cheers,

Jeff


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-12-05 09:47:37
Message-ID: 4CFB5FB9.9000203@kaltenbrunner.cc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/01/2010 05:32 AM, Jeff Janes wrote:
> On 11/28/10, Robert Haas<robertmhaas(at)gmail(dot)com> wrote:
>>
>> In a close race, I don't think we should get bogged down in
>> micro-optimization here, both because micro-optimizations may not gain
>> much and because what works well on one platform may not do much at
>> all on another. The more general issue here is what to do about our
>> high backend startup costs. Beyond trying to recycle backends for new
>> connections, as I've previously proposed and with all the problems it
>> entails,
>
> Is there a particular discussion of that matter you could point me to?
>
>> the only thing that looks promising here is to try to somehow
>> cut down on the cost of populating the catcache and relcache, not that
>> I have a very clear idea how to do that. This has to be a soluble
>> problem because other people have solved it.
>
> Oracle's backend start up time seems to be way higher than PG's.
> Their main solution is something that is fundamentally a built-in
> connection pooler with some bells and whistles. I'm not sure which
> "other people" you had in mind--Oracle is generally the one that
> pops to my mind.
>
>> To some degree we're a
>> victim of our own flexible and extensible architecture here, but I
>> find it pretty unsatisfying to just say, OK, well, we're slow.
>
>
> What about "well OK, we have PGbouncer"? Are there fixable
> short-comings that it has which could make the issue less of an issue?

well I would very much like to see an integrated pooler in postgresql -
pgbouncer is a very nice piece of software (and might even be a base for
an integrated bouncer), but because it is not closely tied to the
backend you are losing a lot.
One of the more obvious examples is that now that we have no flatfile
copy of pg_authid you have to use cruel hacks like:
http://www.depesz.com/index.php/2010/12/04/auto-refreshing-password-file-for-pgbouncer/

to get "automatic" management of roles. There are some other drawbacks
as well:

* no coordination of restarts/configuration changes between the cluster
and the pooler
* you have two separate config files to configure your pooling settings
(having all that available say in a catalog in pg would be awesome)
* you lose all of the advanced authentication features of pg (because
all connections need to go through the pooler) and also ip-based stuff
* no SSL support (in the case of pgbouncer)
* complexity in resetting backend state (we added some support for that
through explicit SQL-level commands in recent releases but it still is a
hazard)

Stefan


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-12-05 18:59:13
Message-ID: 4CFBE101.3030809@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> * no coordination of restarts/configuration changes between the cluster
> and the pooler
> * you have two separate config files to configure your pooling settings
> (having all that available say in a catalog in pg would be awesome)
> * you lose all of the advanced authentication features of pg (because
> all connections need to go through the pooler) and also ip-based stuff
> * no SSL support (in the case of pgbouncer)
> * complexity in resetting backend state (we added some support for that
> through explicit SQL-level commands in recent releases but it still is a
> hazard)

More:

* pooler logs to a separate file, for which there are (currently) no
analysis tools
* pooling is incompatible with the use of ROLES for data security

The last is a major issue, and not one I think we can easily resolve.
MySQL has a pooling-friendly user system, because when you connect to
MySQL you basically always connect as the superuser and on connection it
switches you to your chosen login role. This, per Rob Wultsch, is one of
the things at the heart of allowing MySQL to support 100,000 low
frequency users per cheap hosting system.

As you might imagine, this behavior is also the source of a lot of
MySQL's security bugs. I don't see how we could imitate it without
getting the bugs as well.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Rob Wultsch <wultsch(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-12-05 19:45:17
Message-ID: AANLkTinSC2sO-GSxAHW8X9dAekL-1PvP6JdXzQKyshRP@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Dec 5, 2010 at 11:59 AM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
>> * no coordination of restarts/configuration changes between the cluster
>> and the pooler
>> * you have two separate config files to configure your pooling settings
>> (having all that available say in a catalog in pg would be awesome)
>> * you lose all of the advanced authentication features of pg (because
>> all connections need to go through the pooler) and also ip-based stuff
>> * no SSL support (in the case of pgbouncer)
>> * complexity in resetting backend state (we added some support for that
>> through explicit SQL-level commands in recent releases but it still is a
>> hazard)
>
> More:
>
> * pooler logs to a separate file, for which there are (currently) no
> analysis tools
> * pooling is incompatible with the use of ROLES for data security
>
> The last is a major issue, and not one I think we can easily resolve. MySQL
> has a pooling-friendly user system, because when you connect to MySQL you
> basically always connect as the superuser and on connection it switches you
> to your chosen login role.  This, per Rob Wultsch, is one of the things at
> the heart of allowing MySQL to support 100,000 low frequency users per cheap
> hosting system.
>
> As you might imagine, this behavior is also the source of a lot of MySQL's
> security bugs.  I don't see how we could imitate it without getting the bugs
> as well.
>
>

I think you have read a bit more into what I have said than is
correct. MySQL can deal with thousands of users and separate schemas
on commodity hardware. There are many design decisions (some
questionable) that have made MySQL much better in a shared hosting
environment than pg and I don't know where the grants system falls
into that.

MySQL does not have that many security problems because of how grants
are stored. Most MySQL security issues are DoS-type problems based on
an authenticated user being able to cause a crash. The decoupled
backend storage and a less-than-awesome parser share most of the
blame for these issues.

One thing I would suggest that the PG community keeps in mind while
talking about built in connection process caching, is that it is very
nice feature for memory leaks caused by a connection to not exist for
and continue growing forever.

NOTE: 100k is not a number that I would put much stock in. I don't
recall ever mentioning that number and it is not a number that would
be truthful for me to throw out.

--
Rob Wultsch
wultsch(at)gmail(dot)com


From: Rob Wultsch <wultsch(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-12-05 20:17:25
Message-ID: AANLkTi=htYCG=ExWXoWFrCqYs7D6pu9Lq9hiEL5F4KSr@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Dec 5, 2010 at 12:45 PM, Rob Wultsch <wultsch(at)gmail(dot)com> wrote:
> One thing I would suggest that the PG community keeps in mind while
> talking about built in connection process caching, is that it is very
> nice feature for memory leaks caused by a connection to not exist for
> and continue growing forever.

s/not exist for/not exist/

I have had issues with very slow leaks in MySQL building up over
months. It really sucks to have to go to management to ask for
downtime because of a slow memory leak.

--
Rob Wultsch
wultsch(at)gmail(dot)com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rob Wultsch <wultsch(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-12-06 01:58:52
Message-ID: AANLkTi=6Zq-yjPZ6vbB-ptBE-EgWDyhRL9mPkhWyW+es@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Dec 5, 2010 at 3:17 PM, Rob Wultsch <wultsch(at)gmail(dot)com> wrote:
> On Sun, Dec 5, 2010 at 12:45 PM, Rob Wultsch <wultsch(at)gmail(dot)com> wrote:
>> One thing I would suggest that the PG community keeps in mind while
>> talking about built in connection process caching, is that it is very
>> nice feature for memory leaks caused by a connection to not exist for
>> and continue growing forever.
>
> s/not exist for/not exist/
>
> I have had issues with very slow leaks in MySQL building up over
> months. It really sucks to have to go to management to ask for
> downtime because of a slow memory leak.

Apache has a very simple and effective solution to this problem - they
have a configuration option controlling the number of connections a
child process handles before it dies and a new one is spawned. I've
found that setting this to 1000 works excellently. Process startup
overhead decreases by three orders of magnitude, and only egregiously
bad leaks add up to enough to matter.
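
(The directive being described is presumably MaxRequestsPerChild; e.g.
"MaxRequestsPerChild 1000" in httpd.conf retires each child after it
has served that many requests.)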

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rob Wultsch <wultsch(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-12-06 01:59:41
Message-ID: AANLkTimqp83FcqXG3pj3Y4H7sYF_8tKxjhsf2-9+so0O@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Dec 5, 2010 at 2:45 PM, Rob Wultsch <wultsch(at)gmail(dot)com> wrote:
> I think you have read a bit more into what I have said than is
> correct.  MySQL can deal with thousands of users and separate schemas
> on commodity hardware. There are many design decisions (some
> questionable) that have made MySQL much better in a shared hosting
> environment than pg and I don't know where the grants system falls
> into that.

Objection: Vague.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-12-06 02:18:16
Message-ID: AANLkTimNNPe95Oc3TGo2k9h6AtiQKhK-fffBs4GBjjbn@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Dec 4, 2010 at 8:04 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> But who would be doing the passing?  For the postmaster to be doing
> that would probably go against the minimalist design.  It would have
> to keep track of which backend is available, and which db and user it
> is primed for.  Perhaps a feature could be added to the backend to
> allow it to get passed a FD from pgbouncer or pgpool-II and then hand
> control back to the pooler upon "close" of the connection, as they
> already have the infrastructure to keep pools around while the
> postmaster does not.  Are pgbouncer and pgpool close enough to "core"
> to make such intimate collaboration with the backend OK?

I am not sure. I'm afraid that might be adding complexity without
really solving anything, but maybe I'm a pessimist.

One possible way to make an improvement in this area would be to
move the responsibility for accepting connections out of the
postmaster. Instead, you'd have a group of children that would all
call accept() on the socket, and the OS would arbitrarily pick one to
receive each new incoming connection. The postmaster would just be
responsible for making sure that there were enough children hanging
around. You could in fact make this change without doing anything
else, in which case it wouldn't save any work but would possibly
reduce connection latency a bit since more of the work could be done
before the connection actually arrived.
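
In skeleton form that's just N children blocking in accept() on the
same listening socket. Purely a sketch; handle_connection() here is a
stand-in for the real session logic:

#include <sys/socket.h>
#include <unistd.h>

#define NCHILDREN 8

extern void handle_connection(int conn);    /* hypothetical */

static void
spawn_acceptors(int listen_sock)
{
    int     i;

    for (i = 0; i < NCHILDREN; i++)
    {
        if (fork() == 0)
        {
            /* child: kernel gives each new connection to one waiter */
            int     conn = accept(listen_sock, NULL, NULL);

            if (conn >= 0)
                handle_connection(conn);
            _exit(0);
        }
    }
    /* parent: would reap children and keep the pool topped up */
}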

From there, you could go two ways.

One option would be to have backends that would otherwise terminate
normally instead do the equivalent of DISCARD ALL and then go back
around and try to accept() another incoming connection. If they get a
guy who wants the database to which they previously connected, profit.
If not, laboriously flush every cache in sight and rebind to the new
database.

Another option would be to have backends that would otherwise
terminate normally instead do the equivalent of DISCARD ALL and then
mark themselves as able to accept a new connection to the same
database to which they are already connected (but not any other
database). Following authentication, a backend that accepted a new
incoming connection looks through the pool of such backends and, if it
finds one, hands off the connection using file-descriptor passing and
then loops back around to accept() again. Otherwise it handles the
connection itself. This wouldn't offer much of an advantage over the
first option for a cluster that basically has just one database, or
for a cluster that has 1000 actively used databases. But it would be
much better for a system with three databases.
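
Written as a loop, the second option might look roughly like this;
every helper name (and MyDatabaseName) below is hypothetical:

#include <string.h>
#include <sys/socket.h>

extern const char *MyDatabaseName;      /* database we're bound to */
extern const char *authenticate_and_get_dbname(int conn);
extern int  pass_fd_to_pooled_backend(const char *db, int conn);
extern void rebind_to_database(const char *db);
extern void run_session(int conn);
extern void discard_all_state(void);

static void
recycled_backend_main(int listen_sock)
{
    for (;;)
    {
        int         conn = accept(listen_sock, NULL, NULL);
        const char *db = authenticate_and_get_dbname(conn);

        if (strcmp(db, MyDatabaseName) != 0)
        {
            /* try to hand the socket to a backend already on "db" */
            if (pass_fd_to_pooled_backend(db, conn) == 0)
                continue;
            /* nobody available: flush every cache and rebind */
            rebind_to_database(db);
        }
        run_session(conn);          /* serve queries, caches warm */
        discard_all_state();        /* the DISCARD ALL equivalent */
    }
}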

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Rob Wultsch <wultsch(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-12-06 02:35:25
Message-ID: AANLkTikqvpwp4FXsLU4xR93DFgV0ArtZd_fm=yAhn5jG@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Dec 5, 2010 at 6:59 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sun, Dec 5, 2010 at 2:45 PM, Rob Wultsch <wultsch(at)gmail(dot)com> wrote:
>> I think you have read a bit more into what I have said than is
>> correct.  MySQL can deal with thousands of users and separate schemas
>> on commodity hardware. There are many design decisions (some
>> questionable) that have made MySQL much better in a shared hosting
>> environment than pg and I don't know where the grants system falls
>> into that.
>
> Objection: Vague.
>

I retract the remark, your honor.

At some point Hackers should look at pg vs MySQL multi-tenantry but it
is way tangential today.

--
Rob Wultsch
wultsch(at)gmail(dot)com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Rob Wultsch <wultsch(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-12-06 02:44:02
Message-ID: AANLkTi=BjRmBhAuSpPz-ATah9xNqjTrRE4ikL4TyV71Y@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Dec 5, 2010 at 9:35 PM, Rob Wultsch <wultsch(at)gmail(dot)com> wrote:
> On Sun, Dec 5, 2010 at 6:59 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Sun, Dec 5, 2010 at 2:45 PM, Rob Wultsch <wultsch(at)gmail(dot)com> wrote:
>>> I think you have read a bit more into what I have said than is
>>> correct.  MySQL can deal with thousands of users and separate schemas
>>> on commodity hardware. There are many design decisions (some
>>> questionable) that have made MySQL much better in a shared hosting
>>> environment than pg and I don't know where the grants system falls
>>> into that.
>>
>> Objection: Vague.
>
> I retract the remark, your honor.

Clarifying it would be fine, too... :-)

> At some point Hackers should look at pg vs MySQL multi-tenantry but it
> is way tangential today.

My understanding is that our schemas work like MySQL databases; and
our databases are an even higher level of isolation. No?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-12-06 17:38:42
Message-ID: 28327.1291657122@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> One possible way to make an improvement in this area would be to
> move the responsibility for accepting connections out of the
> postmaster. Instead, you'd have a group of children that would all
> call accept() on the socket, and the OS would arbitrarily pick one to
> receive each new incoming connection. The postmaster would just be
> responsible for making sure that there were enough children hanging
> around. You could in fact make this change without doing anything
> else, in which case it wouldn't save any work but would possibly
> reduce connection latency a bit since more of the work could be done
> before the connection actually arrived.

This seems like potentially a good idea independent of anything else,
just to reduce connection latency: fork() (not to mention exec() on
Windows) now happens before, not after, receipt of the connection request.
However, I see a couple of stumbling blocks:

1. Does accept() work that way everywhere? (Windows, I'm looking at you.)

2. What do you do when max_connections is exceeded, and you don't have
anybody at all listening on the socket? Right now we are at least able
to send back an error message explaining the problem.

Another issue that would require some thought is what algorithm the
postmaster uses for deciding to spawn new children. But that doesn't
sound like a potential showstopper.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-12-06 17:49:16
Message-ID: 4CFD221C.3000302@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/06/2010 09:38 AM, Tom Lane wrote:
> Another issue that would require some thought is what algorithm the
> postmaster uses for deciding to spawn new children. But that doesn't
> sound like a potential showstopper.

We'd probably want a couple of different ones, optimized for different
connection patterns. Realistically.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Subject: Re: profiling connection overhead
Date: 2010-12-06 17:55:22
Message-ID: AANLkTi=stHiv3Lcnqb9_xseZ+=SX3amUEhdYYUz4mnEA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 6, 2010 at 12:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> One possible way to make an improvement in this area would be to
>> move the responsibility for accepting connections out of the
>> postmaster.  Instead, you'd have a group of children that would all
>> call accept() on the socket, and the OS would arbitrarily pick one to
>> receive each new incoming connection.  The postmaster would just be
>> responsible for making sure that there were enough children hanging
>> around.  You could in fact make this change without doing anything
>> else, in which case it wouldn't save any work but would possibly
>> reduce connection latency a bit since more of the work could be done
>> before the connection actually arrived.
>
> This seems like potentially a good idea independent of anything else,
> just to reduce connection latency: fork() (not to mention exec() on
> Windows) now happens before, not after, receipt of the connection request.
> However, I see a couple of stumbling blocks:
>
> 1. Does accept() work that way everywhere? (Windows, I'm looking at you.)

Not sure. It might be useful to look at what Apache does, but I don't
have time to do that ATM.

> 2. What do you do when max_connections is exceeded, and you don't have
> anybody at all listening on the socket?  Right now we are at least able
> to send back an error message explaining the problem.

Sending back an error message explaining the problem seems like a
non-negotiable requirement. I'm not quite sure how to dance around
this. Perhaps if max_connections is exhausted, the postmaster itself
joins the accept() queue and launches a dead-end backend for each new
connection. Or perhaps we reserve one extra backend slot for a
probably-dead-end backend that will just sit there and mail rejection
notices; except that if it sees that a regular backend slot has opened
up it grabs it and turns itself into a regular backend.

> Another issue that would require some thought is what algorithm the
> postmaster uses for deciding to spawn new children.  But that doesn't
> sound like a potential showstopper.

The obvious algorithm would be to try to keep N spare workers around.
Any time the number of unconnected backends drops below N the
postmaster starts spawning new ones until it gets back up to N. I
think the trick may not be the algorithm so much as finding a way to
make the signaling sufficiently robust and lightweight. For example,
I bet having each child that gets a new connection signal() the
postmaster is a bad plan.
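
One lightweight alternative to signals, sketched with illustrative
names: give the children the write end of a pipe, have each one write
a single byte when it takes a connection, and let the postmaster fork
one replacement per byte it reads:

#include <sys/types.h>
#include <unistd.h>

extern void spawn_one_acceptor(int listen_sock);    /* hypothetical */

static void
postmaster_loop(int notify_read_fd, int listen_sock)
{
    char    buf[64];

    for (;;)
    {
        /* blocks until at least one child has consumed a connection */
        ssize_t n = read(notify_read_fd, buf, sizeof(buf));
        ssize_t i;

        for (i = 0; i < n; i++)
            spawn_one_acceptor(listen_sock);
    }
}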

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Rob Wultsch <wultsch(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-12-06 17:57:44
Message-ID: 4CFD2418.9050002@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


>> At some point Hackers should look at pg vs MySQL multi-tenantry but it
>> is way tangential today.
>
> My understanding is that our schemas work like MySQL databases; and
> our databases are an even higher level of isolation. No?

That's correct. Drizzle is looking at implementing a feature like our
databases called "catalogs" (per the SQL spec).

Let me stress that not everyone is happy with the MySQL multi-tenantry
approach. But it does make possible multi-tenancy on a scale you
seldom see with PG, even if it has problems. It's worth seeing
whether we can steal any of their optimization ideas without breaking PG.

I was specifically looking at the login model, which works around the
issue that we have: namely that different login ROLEs can't share a
connection pool. In MySQL, they can share the built-in connection
"pool" because role-switching effectively is a session variable.
AFAICT, anyway.

For that matter, if anyone knows any other DB which does multi-tenant
well/better, we should be looking at them too.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Rob Wultsch <wultsch(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-12-06 18:05:22
Message-ID: AANLkTik_XfouJhjpSFBV29JvtT2NF4ONudt2rgdREONi@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 6, 2010 at 12:57 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> At some point Hackers should look at pg vs MySQL multi-tenantry but it
>>> is way tangential today.
>>
>> My understanding is that our schemas work like MySQL databases; and
>> our databases are an even higher level of isolation.  No?
>
> That's correct.  Drizzle is looking at implementing a feature like our
> databases called "catalogs" (per the SQL spec).
>
> Let me stress that not everyone is happy with the MySQL multi-tenantry
> approach.  But it does make possible multi-tenancy on a scale you seldom
> see with PG, even if it has problems.  It's worth seeing whether we can
> steal any of their optimization ideas without breaking PG.

Please make sure to articulate what you think is wrong with our existing model.

> I was specifically looking at the login model, which works around the issue
> that we have: namely that different login ROLEs can't share a connection
> pool.  In MySQL, they can share the built-in connection "pool" because
> role-switching effectively is a session variable. AFAICT, anyway.

Please explain more precisely what is wrong with SET SESSION
AUTHORIZATION / SET ROLE.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-12-06 19:47:00
Message-ID: 4CFD3DB4.1060800@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> Please explain more precisely what is wrong with SET SESSION
> AUTHORIZATION / SET ROLE.

1) Session GUCs do not change with a SET ROLE (this is a TODO I haven't
had any time to work on)

2) Users can always issue their own SET ROLE and then "hack into" other
users' data.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-12-07 02:09:56
Message-ID: AANLkTik-uSgR9mfUbkiV6vNtUG4MsqTvVrf1EXDJNYCe@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 6, 2010 at 2:47 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
>> Please explain more precisely what is wrong with SET SESSION
>> AUTHORIZATION / SET ROLE.
>
> 1) Session GUCs do not change with a SET ROLE (this is a TODO I haven't
> had any time to work on)
>
> 2) Users can always issue their own SET ROLE and then "hack into" other
> users' data.

Makes sense. It would be nice to fix those issues, independent of
anything else.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: profiling connection overhead
Date: 2010-12-07 02:37:01
Message-ID: 1291689244-sup-7391@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Excerpts from Robert Haas's message of lun dic 06 23:09:56 -0300 2010:
> On Mon, Dec 6, 2010 at 2:47 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> >
> >> Please explain more precisely what is wrong with SET SESSION
> >> AUTHORIZATION / SET ROLE.
> >
> > 1) Session GUCs do not change with a SET ROLE (this is a TODO I haven't
> > had any time to work on)
> >
> > 2) Users can always issue their own SET ROLE and then "hack into" other
> > users' data.
>
> Makes sense. It would be nice to fix those issues, independent of
> anything else.

It seems plausible to fix the first one, but how would you fix the
second one? You either allow SET ROLE (which you need, to support the
pooler changing authorization), or you don't. There doesn't seem to be
a usable middleground.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-12-07 02:48:55
Message-ID: 4CFDA097.8070702@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> It seems plausible to fix the first one, but how would you fix the
> second one? You either allow SET ROLE (which you need, to support the
> pooler changing authorization), or you don't. There doesn't seem to be
> a usable middleground.

Well, this is why such a pooler would *have* to be built into the
backend. It would need to be able to SET ROLE even though SET ROLE
would not be accepted over the client connection. We'd also need
bookkeeping to track the ROLE (and other GUCs) of each client connection
and reset them whenever that client connection switches back.

Mind you, I'm not entirely convinced that the end result of this would
be performant. And it would certainly be complicated. I think that
we should start by dealing with the simplest situation, ignoring SET
ROLE and GUC issues for now.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: profiling connection overhead
Date: 2010-12-07 02:55:11
Message-ID: AANLkTi=b6dPiLTSpdvqVRDx0ZTUOob7BCNuAc4y2wXoB@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 6, 2010 at 9:37 PM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Excerpts from Robert Haas's message of lun dic 06 23:09:56 -0300 2010:
>> On Mon, Dec 6, 2010 at 2:47 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> >
>> >> Please explain more precisely what is wrong with SET SESSION
>> >> AUTHORIZATION / SET ROLE.
>> >
>> > 1) Session GUCs do not change with a SET ROLE (this is a TODO I haven't
>> > had any time to work on)
>> >
>> > 2) Users can always issue their own SET ROLE and then "hack into" other
>> > users' data.
>>
>> Makes sense.  It would be nice to fix those issues, independent of
>> anything else.
>
> It seems plausible to fix the first one, but how would you fix the
> second one?  You either allow SET ROLE (which you need, to support the
> pooler changing authorization), or you don't.  There doesn't seem to be
> a usable middleground.

You could add a protocol message that does a "permanent" role switch
in a way that can't be undone except by another such protocol message.
Then connection poolers could simply refuse to proxy that particular
message.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling connection overhead
Date: 2010-12-07 06:41:07
Message-ID: 4CFDD703.1060307@postnewspapers.com.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 07/12/10 10:48, Josh Berkus wrote:
>
>> It seems plausible to fix the first one, but how would you fix the
>> second one? You either allow SET ROLE (which you need, to support the
>> pooler changing authorization), or you don't. There doesn't seem to be
>> a usable middleground.
>
> Well, this is why such a pooler would *have* to be built into the
> backend. It would need to be able to SET ROLE even though SET ROLE
> would not be accepted over the client connection.

There's actually another way to do it that could be retrofitted onto
an existing external pooler. It's not lovely, but if the approach above
proved too hard...

SET ROLE could accept a cookie / one-time password that had to be passed
to RESET ROLE in order for RESET ROLE to accept the command.

SET ROLE fred WITH COOKIE 'goqu8Mi6choht8ie';
-- hand to the user
-- blah blah user work blah
-- returned by the user
RESET ROLE WITH COOKIE 'goqu8Mi6choht8ie';

The tricky bit might be that the user should still be permitted to SET
ROLE, but only to roles that the role the pooler switched them to
("fred") has rights to SET ROLE to, not to roles that the pooler user
itself has rights to switch to.

> We'd also need
> bookkeeping to track the ROLE (and other GUCs) of each client connection
> and reset them whenever that client connection switches back.

I'm really interested in this direction. Taken just a little further, it
could bring Pg to the point where query executors (backends) are
separated from connection state, so a given backend could pick up and
work on queries by several different connections in rapid succession.
The advantage there is that idle connections would become cheap,
low-overhead affairs.

As I (poorly) understand how Pg is designed, it'd only be possible for a
backend to work on queries that act on the same database; it couldn't
really switch databases. That'd still be a real bonus, especially for
newer users who don't realize they *need* a connection pool.

--
System & Network Administrator
POST Newspapers