profiling pgbench

Lists: pgsql-hackers
From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: profiling pgbench
Date: 2010-11-24 20:24:43
Message-ID: AANLkTin33G-ukWo4qy8cw=s7DBUVR9jESy3GDVCdFvVD@mail.gmail.com

I did some profiling of pgbench -j 36 -c 36 -T 500 banging on my
two-core desktop box - with synchronous_commit turned off to keep the
fsyncs from dominating the profile - and got these results:

29634 4.7124 postgres base_yyparse
27906 4.4377 postgres AllocSetAlloc
17751 2.8228 postgres SearchCatCache
13029 2.0719 postgres hash_search_with_hash_value
11253 1.7895 postgres core_yylex
9957 1.5834 libc-2.11.2.so memcmp
9237 1.4689 libc-2.11.2.so __strcmp_sse2
8628 1.3720 postgres ScanKeywordLookup
7984 1.2696 postgres GetSnapshotData
7144 1.1361 postgres MemoryContextAllocZeroAligned
6898 1.0969 postgres XLogInsert
6440 1.0241 postgres LWLockAcquire

Full report and call graph attached. The opannotate results are 2MB
bzip'd, so I'm not attaching those; send me an email off-list if you
want them.

I'd like to get access to a box with (a lot) more cores, to see
whether the lock stuff moves up in the profile. A big chunk of that
hash_search_with_hash_value overhead is coming from
LockAcquireExtended. The __strcmp_sse2 is almost entirely parsing
overhead. In general, I'm not sure there's much hope for reducing the
parsing overhead, although ScanKeywordLookup() can certainly be done
better. XLogInsert() is spending a lot of time doing CRC's.
LWLockAcquire() is dropping cycles in many different places.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
pgbench36-callgraph.txt.bz2 application/x-bzip2 75.9 KB
pgbench36-opreport.txt.bz2 application/x-bzip2 11.2 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling pgbench
Date: 2010-11-24 21:11:55
Message-ID: 21620.1290633115@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I did some profiling of pgbench -j 36 -c 36 -T 500 banging on my
> two-core desktop box - with synchronous_commit turned off to keep the
> fsyncs from dominating the profile - and got these results:

> 29634 4.7124 postgres base_yyparse

Seems like pgbench is a poster child for the value of prepared
statements. Have you tried it with "-M prepared"?

I'd take this with a grain of salt as to whether it's representative of
real applications, of course.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: profiling pgbench
Date: 2010-11-24 21:14:04
Message-ID: 201011242214.04773.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday 24 November 2010 21:24:43 Robert Haas wrote:
> I'd like to get access to a box with (a lot) more cores, to see
> whether the lock stuff moves up in the profile. A big chunk of that
> hash_search_with_hash_value overhead is coming from
> LockAcquireExtended. The __strcmp_sse2 is almost entirely parsing
> overhead. In general, I'm not sure there's much hope for reducing the
> parsing overhead, although ScanKeywordLookup() can certainly be done
> better. XLogInsert() is spending a lot of time doing CRC's.
> LWLockAcquire() is dropping cycles in many different places.
I can get you profiles of machines with up to 24 real cores; unfortunately I
can't give access away.

Regarding CRCs:
I spent some time optimizing these, as you might remember. The wall I hit
benefit-wise is that the individual CRC calls (4 for a non-indexed
single-row insert on a table with 1 column inside a transaction) are just too
damn small to compute more efficiently. It's causing pipeline stalls all over...
(21, 5, 1, 28 bytes).

I have a very preliminary patch calculating the CRC over the whole thing in
one go when it can do so (no switch, no xlog buffer wraparound), but it's highly
ugly, as it needs to read from the xlog insert buffers and then reinsert the CRC
at the correct position.
While it shows a noticeable improvement, that doesn't seem to be a good way to
go. It could be made to work properly, though.

I played around with some ideas to do that more nicely, but none were
gratifying.

Regarding LWLockAcquire costs:
Yes, it's pretty noticeable - under lots of different workloads. On a bunch of
production machines it's the second most expensive function (behind XLogInsert),
on some the most expensive. Most of the time

All of those machines are Nehalems though, so the picture may be a bit
distorted.

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: profiling pgbench
Date: 2010-11-24 21:19:09
Message-ID: 201011242219.09433.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday 24 November 2010 22:14:04 Andres Freund wrote:
> On Wednesday 24 November 2010 21:24:43 Robert Haas wrote:
> > I'd like to get access to a box with (a lot) more cores, to see
> > whether the lock stuff moves up in the profile. A big chunk of that
> > hash_search_with_hash_value overhead is coming from
> > LockAcquireExtended. The __strcmp_sse2 is almost entirely parsing
> > overhead. In general, I'm not sure there's much hope for reducing the
> > parsing overhead, although ScanKeywordLookup() can certainly be done
> > better. XLogInsert() is spending a lot of time doing CRC's.
> > LWLockAcquire() is dropping cycles in many different places.
>
> I can get you profiles of machines with up to 24 real cores; unfortunately
> I can't give access away.
>
> Regarding CRCs:
> I spent some time optimizing these, as you might remember. The wall I hit
> benefit-wise is that the individual CRC calls (4 for a
> non-indexed single-row insert on a table with 1 column inside a
> transaction) are just too damn small to compute more efficiently. It's
> causing pipeline stalls all over... (21, 5, 1, 28 bytes).
>
> I have a very preliminary patch calculating the CRC over the whole thing in
> one go when it can do so (no switch, no xlog buffer wraparound), but it's
> highly ugly, as it needs to read from the xlog insert buffers and then
> reinsert the CRC at the correct position.
> While it shows a noticeable improvement, that doesn't seem to be a good way
> to go. It could be made to work properly, though.
>
> I played around with some ideas to do that more nicely, but none were
> gratifying.
>
> Regarding LWLockAcquire costs:
> Yes, it's pretty noticeable - under lots of different workloads. On a bunch of
> production machines it's the second most expensive function (behind
> XLogInsert), on some the most expensive. Most of the time
AllocSetAlloc is the third, battling with hash_search_with_hash_value. To
complete that sentence...

Andres


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: profiling pgbench
Date: 2010-11-24 22:33:22
Message-ID: AANLkTimioMewG=fyNdfWOU=VTN+qPuw5LemgxDPBCNZ8@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 1:19 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Wednesday 24 November 2010 22:14:04 Andres Freund wrote:
>> On Wednesday 24 November 2010 21:24:43 Robert Haas wrote:
>>
>> Regarding LWLockAcquire costs:
>> Yes, it's pretty noticeable - under lots of different workloads. On a bunch of
>> production machines it's the second most expensive function (behind
>> XLogInsert), on some the most expensive. Most of the time

> AllocSetAlloc is the third, battling with hash_search_with_hash_value. To
> complete that sentence...

I've played a bit with hash_search_with_hash_value and found that most
of the time is spent on shared hash tables, not private ones. And the
time attributed to it for the shared hash tables mostly seems to be
due to the time it takes to fight cache lines away from other CPUs. I
suspect the same thing is true of LWLockAcquire.

Cheers,

Jeff


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling pgbench
Date: 2010-11-24 22:34:46
Message-ID: AANLkTikK=ygsba=tP_eEi75Hdc1rp21Ko8KYiG_EuUT=@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 5:33 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> On Wed, Nov 24, 2010 at 1:19 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> On Wednesday 24 November 2010 22:14:04 Andres Freund wrote:
>>> On Wednesday 24 November 2010 21:24:43 Robert Haas wrote:
>>>
>>> Regarding LWLockAcquire costs:
>>> Yes, it's pretty noticeable - under lots of different workloads. On a bunch
>>> of production machines it's the second most expensive function (behind
>>> XLogInsert), on some the most expensive. Most of the time
>
>> AllocSetAlloc is the third, battling with hash_search_with_hash_value. To
>> complete that sentence...
>
> I've played a bit with hash_search_with_hash_value and found that most
> of the time is spent on shared hash tables, not private ones.  And the
> time attributed to it for the shared hash tables mostly seems to be
> due to the time it takes to fight cache lines away from other CPUs.  I
> suspect the same thing is true of LWLockAcquire.

How do you get stats on that?

How big is a typical cache line on modern CPUs?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: profiling pgbench
Date: 2010-11-24 22:38:27
Message-ID: 201011242338.27584.andres@anarazel.de
Lists: pgsql-hackers

On Wednesday 24 November 2010 23:34:46 Robert Haas wrote:
> On Wed, Nov 24, 2010 at 5:33 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> > On Wed, Nov 24, 2010 at 1:19 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> >> On Wednesday 24 November 2010 22:14:04 Andres Freund wrote:
> >>> On Wednesday 24 November 2010 21:24:43 Robert Haas wrote:
> >>>
> >>> Regarding LWLockAcquire costs:
> >>> Yes, it's pretty noticeable - under lots of different workloads. On a
> >>> bunch of production machines it's the second most expensive function
> >>> (behind XLogInsert), on some the most expensive. Most of the time
> >>
> >> AllocSetAlloc is the third, battling with hash_search_with_hash_value.
> >> To complete that sentence...
> >
> > I've played a bit with hash_search_with_hash_value and found that most
> > of the time is spent on shared hash tables, not private ones. And the
> > time attributed to it for the shared hash tables mostly seems to be
> > due to the time it takes to fight cache lines away from other CPUs. I
> > suspect the same thing is true of LWLockAcquire.
>
> How do you get stats on that?
Btw, if you have some recent kernel I would try to use perf - I find the event
mappings easier to understand there, but maybe that's just me jumping onto
bandwagons.

> How big is a typical cache line on modern CPUs?
For the line size:
cat /sys/devices/system/cpu/cpu0/cache/index*/coherency_line_size

For the overall size:
cat /sys/devices/system/cpu/cpu0/cache/index*/size

Andres


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: profiling pgbench
Date: 2010-11-24 22:49:42
Message-ID: 23876.1290638982@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> I've played a bit with hash_search_with_hash_value and found that most
> of the time is spent on shared hash tables, not private ones. And the
> time attributed to it for the shared hash tables mostly seems to be
> due to the time it takes to fight cache lines away from other CPUs. I
> suspect the same thing is true of LWLockAcquire.

That squares with some behavior I've seen. If you run opannotate
you often see ridiculously high time percentages attributed to extremely
trivial C statements. The explanation seems to be that those places are
where chunks of memory are first touched, and have to be pulled into the
CPU's cache (and, if in shared memory, pulled away from some other CPU).

regards, tom lane


From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: profiling pgbench
Date: 2010-11-24 23:14:10
Message-ID: AANLkTikS5KrZStzvZzzN4z5nyAMkFYffffkjwty_kq6L@mail.gmail.com
Lists: pgsql-hackers

On Wed, Nov 24, 2010 at 2:34 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Nov 24, 2010 at 5:33 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>
>> I've played a bit with hash_search_with_hash_value and found that most
>> of the time is spent on shared hash tables, not private ones.  And the
>> time attributed to it for the shared hash tables mostly seems to be
>> due to the time it takes to fight cache lines away from other CPUs.  I
>> suspect the same thing is true of LWLockAcquire.
>
> How do you get stats on that?

I replicated the hash_search_with_hash_value function definition many
times under different names, and changed different parts of the code to
invoke different function names. (I don't trust profilers to
correctly distribute times over call graphs. I think most of them
just assume all function calls take the same time, and don't
separately measure the time of calls coming from different parts of
the code.)

Then I took one of them that is heavily used and leaves the hash table
unchanged, and had it invoke the same call several times in a row.
The profiling confirmed that it was called 3 times more often, but the
time spent in it increased by far less than 3 times; I think the
increase in time in that function was only 10% or so (and in
non-profiling code, the total run time did not increase by a
noticeable amount).

The only way I could explain this, other than redundant calls being
optimized away (which the profiler call counts dispute), is caching
effects.

--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -95,7 +95,21 @@ BufTableLookup(BufferTag *tagPtr, uint32 hashcode)
 	BufferLookupEnt *result;
 
 	result = (BufferLookupEnt *)
-		hash_search_with_hash_value(SharedBufHash,
+		hash_search_with_hash_value5(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
 									HASH_FIND,
 									NULL);
+
+	result = (BufferLookupEnt *)
+		hash_search_with_hash_value5(SharedBufHash,
+									(void *) tagPtr,
+									hashcode,
+									HASH_FIND,
+									NULL);
+
+	result = (BufferLookupEnt *)
+		hash_search_with_hash_value5(SharedBufHash,
+									(void *) tagPtr,
+									hashcode,
+									HASH_FIND,
+									NULL);
> How big is a typical cache line on modern CPUs?

That I don't know. I'm more of an experimentalist.

Cheers,

Jeff


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: profiling pgbench
Date: 2010-11-25 02:26:56
Message-ID: 53F8B80B-BD14-461F-A119-6AE74911651B@gmail.com
Lists: pgsql-hackers

On Nov 24, 2010, at 5:49 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
>> I've played a bit with hash_search_with_hash_value and found that most
>> of the time is spent on shared hash tables, not private ones. And the
>> time attributed to it for the shared hash tables mostly seems to be
>> due to the time it takes to fight cache lines away from other CPUs. I
>> suspect the same thing is true of LWLockAcquire.
>
> That squares with some behavior I've seen. If you run opannotate
> you often see ridiculously high time percentages attributed to extremely
> trivial C statements. The explanation seems to be that those places are
> where chunks of memory are first touched, and have to be pulled into the
> CPU's cache (and, if in shared memory, pulled away from some other CPU).

Does it hurt that, for example, the BufMappingLocks are consecutive in memory? They appear to be among the more heavily contended locks even on my 2-core box.

...Robert