Re: testing HS/SR - 1 vs 2 performance

From: "Erik Rijkers" <er(at)xs4all(dot)nl>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-12 14:00:51
Message-ID: fc4665550ee82d7c7412e4ed325d37c8.squirrel@webmail.xs4all.nl
Lists: pgsql-hackers

resending this message, as it seems to have bounced.

(below, I did fix the typo in the pseudocode loop)

---------------------------------------- Original Message ----------------------------------------
Subject: Re: [HACKERS] testing HS/SR - 1 vs 2 performance
From: "Erik Rijkers" <er(at)xs4all(dot)nl>
Date: Mon, April 12, 2010 14:22
To: pgsql-hackers(at)postgresql(dot)org
--------------------------------------------------------------------------------------------------

On Sat, April 10, 2010 01:23, Erik Rijkers wrote:
> Using 9.0devel cvs HEAD, 2010.04.08.
>
> I am trying to understand the performance difference
> between primary and standby under a standard pgbench
> read-only test.
>
> server has 32 GB, 2 quadcores.
>
> primary:
> tps = 34606.747930 (including connections establishing)
> tps = 34527.078068 (including connections establishing)
> tps = 34654.297319 (including connections establishing)
>
> standby:
> tps = 700.346283 (including connections establishing)
> tps = 717.576886 (including connections establishing)
> tps = 740.522472 (including connections establishing)
>
> transaction type: SELECT only
> scaling factor: 1000
> query mode: simple
> number of clients: 20
> number of threads: 1
> duration: 900 s
>
> both instances have
> max_connections = 100
> shared_buffers = 256MB
> checkpoint_segments = 50
> effective_cache_size= 16GB
>
> See also:
>
> http://archives.postgresql.org/pgsql-testers/2010-04/msg00005.php
> (differences with scale 10_000)
>

To my surprise, I have since seen the opposite behaviour, with the standby
giving fast runs and the primary slow ones.

FWIW, I've run a larger set of tests overnight (against the same 9.0devel
instances as the ones from the earlier email).

These results are generally more balanced.

for scale in 10 100 500 1000; do
    pgbench ...                                  # initialise at this scale
    sleep $(( (scale / 10) * 60 ))
    for clients in 1 5 10 20; do
        for port in 6565 6566; do                # 6565 = primary, 6566 = standby
            for run in `seq 1 3`; do
                pgbench -h /tmp -p $port -n -S -c $clients -T 900 -j 1
                sleep $(( (scale / 10) * 60 ))
            done
        done
    done
done

(so below: alternating groups of 3 primary runs followed by 3 standby runs)

scale: 10 clients: 1 tps = 15219.019272 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 15301.847615 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 15238.907436 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 12129.928289 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 12151.711589 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 12203.494512 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 5 tps = 60248.120599 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 5 tps = 60827.949875 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 5 tps = 61167.447476 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 5 tps = 50750.385403 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 5 tps = 50600.891436 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 5 tps = 50486.857610 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 10 clients: 10 tps = 60307.739327 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 60264.230349 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 60146.370598 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 50455.537671 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 49877.000813 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 50097.949766 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 20 tps = 43355.220657 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 43352.725422 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 43496.085623 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 37169.126299 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 37100.260450 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 37342.758507 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 1 tps = 12514.185089 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 12542.842198 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 12595.688640 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 10435.681851 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 10456.983353 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 10434.213044 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 5 tps = 48682.166988 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 5 tps = 48656.883485 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 5 tps = 48687.894655 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 5 tps = 41901.629933 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 5 tps = 41953.386791 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 5 tps = 41787.962712 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 100 clients: 10 tps = 48704.247239 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 48941.190050 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 48603.077936 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 42948.666272 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 42767.793899 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 42612.670983 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 20 tps = 36350.454258 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 36373.088111 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 36490.886781 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 32235.811228 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 32253.837906 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 32144.189047 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 1 tps = 11733.254970 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 11726.665739 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 11617.622548 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 9769.861175 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 9878.465752 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 9808.236216 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 5 tps = 45185.900553 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 5 tps = 45170.334037 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 5 tps = 45136.596374 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 5 tps = 39231.863815 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 5 tps = 39336.889619 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 5 tps = 39269.483772 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 500 clients: 10 tps = 45468.080680 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 45727.159963 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 45399.241367 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 40759.108042 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 40783.287718 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 40858.007847 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 20 tps = 34729.742313 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 34705.119029 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 34617.517224 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 31252.355034 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 31234.885791 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 31273.307637 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 1 tps = 220.024691 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 294.855794 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 375.152757 pgbench -h /tmp -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 295.965959 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 1036.517110 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 9167.012603 pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 5 tps = 1241.224282 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 5 tps = 1894.806301 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 5 tps = 18532.885549 pgbench -h /tmp -p 6565 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 5 tps = 1497.491279 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 5 tps = 1480.164166 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 5 tps = 3470.769236 pgbench -h /tmp -p 6566 -n -S -c 5 -T 900 -j 1
scale: 1000 clients: 10 tps = 2414.552333 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 19248.609443 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 45059.231609 pgbench -h /tmp -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 1648.526373 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 3659.800008 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 35900.769857 pgbench -h /tmp -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 20 tps = 2462.855864 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 27168.407568 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 34438.802096 pgbench -h /tmp -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 2933.220489 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 25586.972428 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 30926.189621 pgbench -h /tmp -p 6566 -n -S -c 20 -T 900 -j 1


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-13 05:44:19
Message-ID: 4BC404B3.2080300@enterprisedb.com
Lists: pgsql-hackers

I could reproduce this on my laptop; the standby is about 20% slower. I ran
oprofile, and what stands out as the difference between the master and
standby is that on the standby about 20% of the CPU time is spent in
hash_seq_search(). The call path is GetSnapshotData() ->
KnownAssignedXidsGetAndSetXmin() -> hash_seq_search(). That explains the
difference in performance.
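
For context, a simplified sketch of the code path the profile points at
(function names are from procarray.c, but the body is illustrative, not
the verbatim 9.0 source). hash_seq_search() has to visit every bucket, so
each snapshot pays a cost proportional to the hash table's capacity, not
to the number of xids actually present:

/* Illustrative sketch only -- not the verbatim 9.0 source. */
static int
KnownAssignedXidsGetAndSetXmin(TransactionId *xarray, TransactionId *xmin)
{
    HASH_SEQ_STATUS status;
    TransactionId *knownXid;
    int         count = 0;

    hash_seq_init(&status, KnownAssignedXidsHash);
    while ((knownXid = (TransactionId *) hash_seq_search(&status)) != NULL)
    {
        /* every bucket in the table is visited, full or empty */
        xarray[count++] = *knownXid;
        if (TransactionIdPrecedes(*knownXid, *xmin))
            *xmin = *knownXid;
    }
    return count;
}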

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-13 18:09:46
Message-ID: 4BC4B36A.3090803@enterprisedb.com
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> I could reproduce this on my laptop, standby is about 20% slower. I ran
> oprofile, and what stands out as the difference between the master and
> standby is that on standby about 20% of the CPU time is spent in
> hash_seq_search(). The call path is GetSnapshotData() ->
> KnownAssignedXidsGetAndSetXmin() -> hash_seq_search(). That explains the
> difference in performance.

The slowdown is proportional to the max_connections setting in the
standby. A 20% slowdown might still be acceptable, but if you increase
max_connections to, say, 1000, things get really slow. I wouldn't
recommend max_connections=1000, of course, but I think we need to do
something about this: change the KnownAssignedXids data structure from a
hash table into something that's quicker to scan. Preferably something
with O(N) cost, where N is the number of entries in the data structure,
not the maximum number of entries it can hold, as it is with the hash
table currently.

A quick fix would be to check whether there are any entries in the hash
table before scanning it. That would eliminate the overhead when there
are no in-progress transactions in the master. But as soon as there's
even one, the overhead comes back.
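
A minimal sketch of that quick fix (hash_get_num_entries() is an existing
dynahash call; the placement shown is hypothetical):

    /* At the top of the snapshot scan: skip the full-table scan
     * when nothing is being tracked. */
    if (hash_get_num_entries(KnownAssignedXidsHash) == 0)
        return 0;       /* no known-assigned xids to report */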

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Dimitri Fontaine <dfontaine(at)hi-media(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-14 08:24:45
Message-ID: 87fx2ywieq.fsf@hi-media-techno.com
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Changing the KnownAssignedXids data structure from
> hash table into something that's quicker to scan. Preferably something
> with O(N), where N is the number of entries in the data structure, not
> the maximum number of entries it can hold as it is with the hash table
> currently.

So it's pretty good news that red-black trees made it into 9.0, isn't it? :)

> A quick fix would be to check if there's any entries in the hash table
> before scanning it. That would eliminate the overhead when there's no
> in-progress transactions in the master. But as soon as there's even one,
> the overhead comes back.

Does not sound like the typical case, does it?
--
dim


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-14 12:08:13
Message-ID: 1271246893.8305.1294.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-04-13 at 21:09 +0300, Heikki Linnakangas wrote:
> Heikki Linnakangas wrote:
> > I could reproduce this on my laptop, standby is about 20% slower. I ran
> > oprofile, and what stands out as the difference between the master and
> > standby is that on standby about 20% of the CPU time is spent in
> > hash_seq_search(). The call path is GetSnapshotData() ->
> > KnownAssignedXidsGetAndSetXmin() -> hash_seq_search(). That explains the
> > difference in performance.
>
> The slowdown is proportional to the max_connections setting in the
> standby. 20% slowdown might still be acceptable, but if you increase
> max_connections to say 1000, things get really slow. I wouldn't
> recommend max_connections=1000, of course, but I think we need to do
> something about this. Changing the KnownAssignedXids data structure from
> hash table into something that's quicker to scan. Preferably something
> with O(N), where N is the number of entries in the data structure, not
> the maximum number of entries it can hold as it is with the hash table
> currently.

There's a tradeoff here to consider. KnownAssignedXids faces two
workloads: one for each WAL record, where we check whether the xid is
already known assigned, and one for snapshots. The current implementation
is optimised towards recovery performance, not snapshot performance.

> A quick fix would be to check if there's any entries in the hash table
> before scanning it. That would eliminate the overhead when there's no
> in-progress transactions in the master. But as soon as there's even one,
> the overhead comes back.

Any fix should be fairly quick because of the way it's modularised - with
something like this in mind.

I'll try a circular buffer implementation, with a fastpath.

I'll have something in a few hours.

--
Simon Riggs www.2ndQuadrant.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-16 08:29:54
Message-ID: 4BC82002.3010307@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Tue, 2010-04-13 at 21:09 +0300, Heikki Linnakangas wrote:
>> A quick fix would be to check if there's any entries in the hash table
>> before scanning it. That would eliminate the overhead when there's no
>> in-progress transactions in the master. But as soon as there's even one,
>> the overhead comes back.
>
> Any fix should be fairly quick because of the way its modularised - with
> something like this in mind.
>
> I'll try a circular buffer implementation, with fastpath.

I started experimenting with a sorted array based implementation on
Tuesday but got carried away with other stuff. I now got back to that
and cleaned it up.

How does the attached patch look? It's probably similar to what you
had in mind.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
knownassignedxids-array-2.patch text/x-diff 14.6 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-16 09:10:01
Message-ID: 1271409001.8305.7773.camel@ebony
Lists: pgsql-hackers

On Fri, 2010-04-16 at 11:29 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Tue, 2010-04-13 at 21:09 +0300, Heikki Linnakangas wrote:
> >> A quick fix would be to check if there's any entries in the hash table
> >> before scanning it. That would eliminate the overhead when there's no
> >> in-progress transactions in the master. But as soon as there's even one,
> >> the overhead comes back.
> >
> > Any fix should be fairly quick because of the way its modularised - with
> > something like this in mind.
> >
> > I'll try a circular buffer implementation, with fastpath.
>
> I started experimenting with a sorted array based implementation on
> Tuesday but got carried away with other stuff. I now got back to that
> and cleaned it up.
>
> How does the attached patch look like? It's probably similar to what you
> had in mind.

It looks like a second version of what I'm working on and about to
publish. I'll take that as a compliment!

My patch is attached here also, for discussion.

The two patches look the same in their main parts, though I have quite a
few extra tweaks in there, which you can read about in the comments. One
tweak I don't have is the use of the presence array that allows a sensible
bsearch, so I'll alter my patch to use that idea but keep the rest of
my code.

--
Simon Riggs www.2ndQuadrant.com

Attachment Content-Type Size
circular_knownassigned.patch text/x-patch 30.1 KB

From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-16 11:47:42
Message-ID: 4BC84E5E.2060006@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Fri, 2010-04-16 at 11:29 +0300, Heikki Linnakangas wrote:
>> How does the attached patch look like? It's probably similar to what you
>> had in mind.
>
> It looks like a second version of what I'm working on and about to
> publish. I'll take that as a compliment!
>
> My patch is attached here also, for discussion.
>
> The two patches look same in their main parts, though I have quite a few
> extra tweaks in there, which you can read about in comments.

Yeah. Yours seems a lot more complex with all those extra tweaks; I
would suggest keeping it simple. I did realize one bug in my patch: I
didn't handle xid wraparound correctly in the binary search; I need to use
TransactionIdFollows instead of plain >.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-16 12:00:22
Message-ID: 1271419222.8305.8131.camel@ebony
Lists: pgsql-hackers

On Fri, 2010-04-16 at 14:47 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Fri, 2010-04-16 at 11:29 +0300, Heikki Linnakangas wrote:
> >> How does the attached patch look like? It's probably similar to what you
> >> had in mind.
> >
> > It looks like a second version of what I'm working on and about to
> > publish. I'll take that as a compliment!
> >
> > My patch is attached here also, for discussion.
> >
> > The two patches look same in their main parts, though I have quite a few
> > extra tweaks in there, which you can read about in comments.
>
> Yeah. Yours seems a lot more complex with all those extra tweaks, I
> would suggest to keep it simple. I did realize one bug in my patch: I
> didn't handle xid wraparound correctly in the binary search, need to use
> TransactionIdFollows instead of plain >.

Almost done, yes, much simpler. I wrote a lot of that in the wee small
hours last night, so the difference is amusing.

And I spotted that bug, plus the off-by-one error too. I've just rewritten
all the other parts, so no worries.

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-16 14:39:52
Message-ID: 8919.1271428792@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> I didn't handle xid wraparound correctly in the binary search, need to use
> TransactionIdFollows instead of plain >.

I think you're outsmarting yourself there. A binary search will in fact
*not work* with circular xid comparison (this is exactly why there's no
btree opclass for XID). You need to use plain >, and make sure the
array you're searching is ordered that way too. The other way might
accidentally fail to malfunction if you only tested ranges of XIDs that
weren't long enough to wrap around, but that doesn't make it right.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-16 14:52:49
Message-ID: 1271429569.8305.8477.camel@ebony
Lists: pgsql-hackers

On Fri, 2010-04-16 at 10:39 -0400, Tom Lane wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> > I didn't handle xid wraparound correctly in the binary search, need to use
> > TransactionIdFollows instead of plain >.
>
> I think you're outsmarting yourself there. A binary search will in fact
> *not work* with circular xid comparison (this is exactly why there's no
> btree opclass for XID). You need to use plain >, and make sure the
> array you're searching is ordered that way too. The other way might
> accidentally fail to malfunction if you only tested ranges of XIDs that
> weren't long enough to wrap around, but that doesn't make it right.

I don't understand the exact problem; please explain more.

I'm not using bsearch(), just a quick chop based upon xid comparison,
which looks to me like it will work.

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-16 15:10:45
Message-ID: 9479.1271430645@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Fri, 2010-04-16 at 10:39 -0400, Tom Lane wrote:
>> I think you're outsmarting yourself there. A binary search will in fact
>> *not work* with circular xid comparison (this is exactly why there's no
>> btree opclass for XID).

> I don't understand the exact, please explain more.
> I'm not using bsearch() just a quick chop based upon xid comparison,
> which looks to me like it will work.

Implementing it yourself doesn't get you out of the fact that it won't
work. Consider

1 2 3 ... 3000000000 ... 3999999999

and suppose that 3000000000 is the array's middle element. If you
search for 100, your first probe will conclude that it is to the right
of 3000000000, which is the wrong direction. Binary search, or indeed
the entire concept that the array is "sorted" in the first place,
depends on a transitive comparison operator. TransactionIdFollows does
not satisfy the transitive law.
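
To make the non-transitivity concrete, here is a small standalone
demonstration (hypothetical test program) built on the same
signed-difference comparison that TransactionIdFollows uses for normal
xids:

#include <stdint.h>
#include <stdio.h>

/* The circular comparison: id1 "follows" id2 iff the signed 32-bit
 * difference is positive, i.e. id1 lies in the half of the xid space
 * "after" id2. */
static int
xid_follows(uint32_t id1, uint32_t id2)
{
    return (int32_t) (id1 - id2) > 0;
}

int
main(void)
{
    uint32_t    a = 1;
    uint32_t    b = 2000000001u;
    uint32_t    c = 4000000001u;

    printf("b follows a: %d\n", xid_follows(b, a)); /* prints 1 */
    printf("c follows b: %d\n", xid_follows(c, b)); /* prints 1 */
    printf("c follows a: %d\n", xid_follows(c, a)); /* prints 0: not transitive */
    return 0;
}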

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-16 21:00:09
Message-ID: 1271451609.8305.9155.camel@ebony
Lists: pgsql-hackers

On Fri, 2010-04-16 at 13:00 +0100, Simon Riggs wrote:
> Almost done

Here's the finished patch.

Internal changes only, no functional or user visible changes, so no
docs, just detailed explanatory comments.

Expectation is
* performance no longer depends upon max_connections
* recovery will be about same or slightly faster
* snapshots should be about equivalent primary/standby

Main changes are
* lock free add to sorted array (equivalent to GetNewTransactionId)
* bsearch for xid removal at commit/abort/TransactionIdIsInProgress
* faster, more compact approach to snapshots, with self-trimming
* some minor API changes to facilitate above
* new code to ignore deletion failures only when !consistent
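
As a rough illustration of the sorted-array idea above (hypothetical and
heavily simplified -- sizing, buffer wraparound, locking and trimming are
all omitted; see the attached diff for the real thing):

/* Hypothetical sketch, not the patch itself. */
static TransactionId KnownAssignedXids[MAX_TRACKED_XIDS];
static bool KnownAssignedXidsValid[MAX_TRACKED_XIDS];   /* presence flags */
static int  kaxHead = 0;        /* next free slot */

/* Adds are cheap: the startup process only ever appends xids in
 * ascending order, so no sorting step is needed. */
static void
KAX_Add(TransactionId xid)
{
    KnownAssignedXids[kaxHead] = xid;
    KnownAssignedXidsValid[kaxHead] = true;
    kaxHead++;
}

/* Removal binary-searches the ordered array and leaves a "hole"; the
 * presence flags are what make a sensible bsearch possible. (Plain
 * integer < is used here; how to compare xids correctly across
 * wraparound is discussed elsewhere in the thread.) */
static bool
KAX_Remove(TransactionId xid)
{
    int         lo = 0;
    int         hi = kaxHead - 1;

    while (lo <= hi)
    {
        int         mid = lo + (hi - lo) / 2;

        if (KnownAssignedXids[mid] == xid)
        {
            KnownAssignedXidsValid[mid] = false;
            return true;
        }
        if (KnownAssignedXids[mid] < xid)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}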

Erik,

Could you retest please?

--
Simon Riggs www.2ndQuadrant.com

Attachment Content-Type Size
knownassigned_sortedarray.diff text/x-patch 32.0 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-17 14:16:55
Message-ID: 1271513815.8305.11249.camel@ebony
Lists: pgsql-hackers

On Fri, 2010-04-16 at 11:10 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > On Fri, 2010-04-16 at 10:39 -0400, Tom Lane wrote:
> >> I think you're outsmarting yourself there. A binary search will in fact
> >> *not work* with circular xid comparison (this is exactly why there's no
> >> btree opclass for XID).
>
> > I don't understand the exact, please explain more.
> > I'm not using bsearch() just a quick chop based upon xid comparison,
> > which looks to me like it will work.
>
> Implementing it yourself doesn't get you out of the fact that it won't
> work. Consider
>
> 1 2 3 ... 3000000000 ... 3999999999
>
> and suppose that 3000000000 is the array's middle element. If you
> search for 100, your first probe will conclude that it is to the right
> of 3000000000, which is the wrong direction. Binary search, or indeed
> the entire concept that the array is "sorted" in the first place,
> depends on a transitive comparison operator. TransactionIdFollows does
> not satisfy the transitive law.

AFAICS the example you give isn't correct.

We would lay out the values like this

W-3 W-2 W-1 W 3 4 5

where W is the wrap value and in this example we have 7 values in the
current array, with tail at W-3 and head at 5. Note the gap between W
and 3 where we would skip the values 1 and 2 because they are special.
Each element's xid is TransactionIdAdvanced(previous element).

So when we search for value 3 we would start from W, then decide it is
to the right, which is correct and continue from there.

The values are laid out in TransactionIdFollows order, not in numeric
order, hence we need to use TransactionIdFollows to decide which way to
branch.

As long as it works I'm not worried if the array is not technically
"sorted".

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-17 15:13:46
Message-ID: 29074.1271517226@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> AFAICS the example you give isn't correct.

> We would lay out the values like this

> W-3 W-2 W-1 W 3 4 5

> where W is the wrap value

Stop right there, you're already failing to think clearly. There is no
unique "wrap value", all values act the same in circular XID space.

The fundamental problem here is that if the set of XIDs involved spans a
sufficiently long range, your array will have a[0] < a[1] < ... < a[n]
but it will fail to be true that a[0] < a[n]. If that's also true for,
say, a[0] vs the midpoint element, then a binary search for a[0] will
fail because it will make the wrong decision while probing the midpoint.

> The values are laid out in TransactionIdFollows order,

They *cannot* be "laid out in TransactionIdFollows order". It's not
possible, because that relationship isn't a total ordering.

Now it might be the case that this is OK for HS purposes because the set
of XIDs that are relevant at any one instant should never span more than
half of the XID space. But I'd just as soon not use that assumption
here if it's unnecessary. It'd be cheaper anyway to sort and search the
array using plain <, so why are you so eager to use
TransactionIdFollows?

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-17 15:25:46
Message-ID: y2u603c8f071004170825zc8ae5f0fv19f999fe72e6f666@mail.gmail.com
Lists: pgsql-hackers

On Sat, Apr 17, 2010 at 11:13 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> AFAICS the example you give isn't correct.
>
>> We would lay out the values like this
>
>> W-3 W-2 W-1 W 3 4 5
>
>> where W is the wrap value
>
> Stop right there, you're already failing to think clearly.  There is no
> unique "wrap value", all values act the same in circular XID space.

Or to put it a different way, what happens when the "wrap value"
changes? An array that was in order under one wrap value can cease to
be in order under another.

...Robert


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-17 19:20:00
Message-ID: 1271532000.8305.11925.camel@ebony
Lists: pgsql-hackers

On Sat, 2010-04-17 at 11:13 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > AFAICS the example you give isn't correct.
>
> > We would lay out the values like this
>
> > W-3 W-2 W-1 W 3 4 5
>
> > where W is the wrap value
>
> Stop right there, you're already failing to think clearly. There is no
> unique "wrap value", all values act the same in circular XID space.
>
> The fundamental problem here is that if the set of XIDs involved spans a
> sufficiently long range, your array will have a[0] < a[1] < ... < a[n]
> but it will fail to be true that a[0] < a[n]. If that's also true for,
> say, a[0] vs the midpoint element, then a binary search for a[0] will
> fail because it will make the wrong decision while probing the midpoint.

I understand that the xids are circular, but there is only one point
where the next xid has a lower integer value but is still "later" from
the perspective of TransactionIdFollows. Apart from that single
discontinuity all other values are monotonic. From your insistence I
presume I must be missing something, but I just don't see it. Perhaps we
are just misunderstanding each other about the "sufficiently long
range".

> > The values are laid out in TransactionIdFollows order,
>
> They *cannot* be "laid out in TransactionIdFollows order". It's not
> possible, because that relationship isn't a total ordering.
>
> Now it might be the case that this is OK for HS purposes because the set
> of XIDs that are relevant at any one instant should never span more than
> half of the XID space.

Agree that is true.

> But I'd just as soon not use that assumption
> here if it's unnecessary.

I understand what you're saying. I think it is necessary here.

> It'd be cheaper anyway to sort and search the
> array using plain <, so why are you so eager to use
> TransactionIdFollows?

The array grows to the right and is laid out one xid per element,
resulting in a sequence of values that are transactionid-monotonic. As
the values increase they will eventually reach the discontinuity where
they cease being normally monotonic. Doing it this way means that we can
add rows past the head of the array and then move the head atomically,
so that we can make adding xids lock-free.
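
As I read it, the lock-free add amounts to something like this
(hypothetical two-store sketch; snapshots only ever scan up to the head
value they fetched):

    KnownAssignedXids[kaxHead] = xid;   /* not yet visible to snapshots */
    kaxHead = kaxHead + 1;              /* single integer store "publishes" it */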

If we try to actually sort the values then the algorithm is both more
complex and requires locking. It would be easier to just remember where
the discontinuity is and act accordingly.

So I'm not eager to use either way, but I only currently see one way
that would work.

If there's a different way, that gives the same or better algorithmic
characteristics, I will be happy to code it.

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-17 19:45:47
Message-ID: 5747.1271533547@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Sat, 2010-04-17 at 11:13 -0400, Tom Lane wrote:
>> It'd be cheaper anyway to sort and search the
>> array using plain <, so why are you so eager to use
>> TransactionIdFollows?

> The array grows to the right and is laid out one xid per element,
> resulting in a sequence of values that are transactionid-monotonic.

How do you know that just adding items at the right will produce a
sorted array? It seems quite unlikely to me that transactions can be
guaranteed to arrive at this code in XID order. I think you need to do
an explicitly sorted insertion.

> ... Doing it this way means that we can
> add rows past the head of the array and then move the head atomically,
> so that we can make adding xids lock-free.

... and even without that issue, this seems like utter fantasy. How
are you going to do that "atomically"? Have you considered what will
happen on weak-memory-ordering machines like PPC, in particular?

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-17 20:14:33
Message-ID: 1271535273.8305.12088.camel@ebony
Lists: pgsql-hackers

On Sat, 2010-04-17 at 15:45 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > On Sat, 2010-04-17 at 11:13 -0400, Tom Lane wrote:
> >> It'd be cheaper anyway to sort and search the
> >> array using plain <, so why are you so eager to use
> >> TransactionIdFollows?
>
> > The array grows to the right and is laid out one xid per element,
> > resulting in a sequence of values that are transactionid-monotonic.
>
> How do you know that just adding items at the right will produce a
> sorted array? It seems quite unlikely to me that transactions can be
> guaranteed to arrive at this code in XID order. I think you need to do
> an explicitly sorted insertion.

Xids don't arrive in sequence, but "known assigned xids" are added in
sequence because we infer the existence of the intermediate xids and
assume they are running for the snapshot.
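
A sketch of why the additions arrive in order (hypothetical, loosely
modelled on the known-assigned bookkeeping; KAX_Add stands in for the
real add routine):

/* On seeing a WAL record for an xid beyond the latest observed one, the
 * standby also adds every intermediate xid, assuming each is running --
 * so additions are strictly ascending. Illustrative only. */
static void
RecordKnownAssignedXids(TransactionId xid)
{
    TransactionId next = latestObservedXid;

    TransactionIdAdvance(next);
    while (TransactionIdPrecedesOrEquals(next, xid))
    {
        KAX_Add(next);              /* appended at the head, in xid order */
        TransactionIdAdvance(next);
    }
    latestObservedXid = xid;
}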

> > ... Doing it this way means that we can
> > add rows past the head of the array and then move the head atomically,
> > so that we can make adding xids lock-free.
>
> ... and even without that issue, this seems like utter fantasy. How
> are you going to do that "atomically"? Have you considered what will
> happen on weak-memory-ordering machines like PPC, in particular?

We search the array between tail and head. If the head moves by integer
overwrite just as already happens for xid assignment, then we would use
the new head for the search. The code is careful to fetch only once.

I would freely admit I know absolutely nothing about details of
weak-memory-ordering machines and have not considered them at all. How
would what I have proposed fail to work, yet what we already rely on
work correctly? Do the circumstances differ?

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-17 20:48:21
Message-ID: 7200.1271537301@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Sat, 2010-04-17 at 15:45 -0400, Tom Lane wrote:
>> How do you know that just adding items at the right will produce a
>> sorted array?

> Xids don't arrive in sequence, but "known assigned xids" are added in
> sequence because we infer the existence of the intermediate xids and
> assuming they are running for the snapshot.

Hm. Okay, maybe that will work.

>> ... and even without that issue, this seems like utter fantasy. How
>> are you going to do that "atomically"? Have you considered what will
>> happen on weak-memory-ordering machines like PPC, in particular?

> We search the array between tail and head. If the head moves by integer
> overwrite just as already happens for xid assignment, then we would use
> the new head for the search. The code is careful to fetch only once.

... but this will not. You need to use a lock, because there is
otherwise no guarantee that other processors see the write into the
array element before they see the change in the head pointer.

> I would freely admit I know absolutely nothing about details of
> weak-memory-ordering machines and have not considered them at all. How
> would what I have proposed fail to work, yet what we already rely on
> work correctly? Do the circumstances differ?

Yes. We have memory ordering instructions inserted in the lock
acquisition/release code. Trying to access and modify a shared-memory
data structure without any locking will not work.

There are some places where we suppose that a *single* write into shared
memory can safely be done without a lock, if we're not too concerned
about how soon other transactions will see the effects. But what you
are proposing here requires more than one related write.

I've been burnt by this myself:
http://archives.postgresql.org/pgsql-committers/2008-06/msg00228.php
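
In other words, the two related stores need a barrier between them, which
a lock provides (a hypothetical sketch; SpinLockAcquire/SpinLockRelease
are the real spin.h primitives, kaxHeadLock is invented for illustration):

    /* Unsafe on weak-memory-ordering hardware: another CPU may observe
     * the new head value before it observes the store into the array
     * element, and so read a stale or garbage entry:
     *
     *     KnownAssignedXids[kaxHead] = xid;
     *     kaxHead = kaxHead + 1;
     */

    /* Safe: the lock's acquire/release include memory-ordering
     * instructions, so readers taking the same lock see both stores. */
    SpinLockAcquire(&kaxHeadLock);
    KnownAssignedXids[kaxHead] = xid;
    kaxHead = kaxHead + 1;
    SpinLockRelease(&kaxHeadLock);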

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-17 22:46:06
Message-ID: 1271544366.8305.12491.camel@ebony
Lists: pgsql-hackers

On Sat, 2010-04-17 at 16:48 -0400, Tom Lane wrote:

> > We search the array between tail and head. If the head moves by integer
> > overwrite just as already happens for xid assignment, then we would use
> > the new head for the search. The code is careful to fetch only once.
>
> ... but this will not. You need to use a lock, because there is
> otherwise no guarantee that other processors see the write into the
> array element before they see the change in the head pointer.
>
> > I would freely admit I know absolutely nothing about details of
> > weak-memory-ordering machines and have not considered them at all. How
> > would what I have proposed fail to work, yet what we already rely on
> > work correctly? Do the circumstances differ?
>
> Yes. We have memory ordering instructions inserted in the lock
> acquisition/release code. Trying to access and modify a shared-memory
> data structure without any locking will not work.
>
> There are some places where we suppose that a *single* write into shared
> memory can safely be done without a lock, if we're not too concerned
> about how soon other transactions will see the effects. But what you
> are proposing here requires more than one related write.
>
> I've been burnt by this myself:
> http://archives.postgresql.org/pgsql-committers/2008-06/msg00228.php

W O W - thank you for sharing.

What I'm not clear on is why you've used a spinlock everywhere when only
weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
macro that does nada when the hardware already protects us? (i.e. a
spinlock only for the hardware that needs it).

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-17 22:52:26
Message-ID: 9502.1271544746@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> What I'm not clear on is why you've used a spinlock everywhere when only
> weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
> macro that does does nada when the hardware already protects us? (i.e. a
> spinlock only for the hardware that needs it).

Well, we could certainly consider that, if we had enough places where
there was a demonstrable benefit from it. I couldn't measure any real
slowdown from adding a spinlock in that sinval code, so I didn't propose
doing so at the time --- and I'm pretty dubious that this code is
sufficiently performance-critical to justify the work, either.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-18 07:24:36
Message-ID: 1271575476.8305.13528.camel@ebony
Lists: pgsql-hackers

On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > What I'm not clear on is why you've used a spinlock everywhere when only
> > weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
> > macro that does does nada when the hardware already protects us? (i.e. a
> > spinlock only for the hardware that needs it).
>
> Well, we could certainly consider that, if we had enough places where
> there was a demonstrable benefit from it. I couldn't measure any real
> slowdown from adding a spinlock in that sinval code, so I didn't propose
> doing so at the time --- and I'm pretty dubious that this code is
> sufficiently performance-critical to justify the work, either.

OK, I'll put a spinlock around access to the head of the array.

Thanks for your input.

--
Simon Riggs www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-18 11:01:05
Message-ID: 1271588465.8305.13998.camel@ebony
Lists: pgsql-hackers

On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
> On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
> > Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > > What I'm not clear on is why you've used a spinlock everywhere when only
> > > weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
> > > macro that does does nada when the hardware already protects us? (i.e. a
> > > spinlock only for the hardware that needs it).
> >
> > Well, we could certainly consider that, if we had enough places where
> > there was a demonstrable benefit from it. I couldn't measure any real
> > slowdown from adding a spinlock in that sinval code, so I didn't propose
> > doing so at the time --- and I'm pretty dubious that this code is
> > sufficiently performance-critical to justify the work, either.
>
> OK, I'll put a spinlock around access to the head of the array.

v2 patch attached

--
Simon Riggs www.2ndQuadrant.com

Attachment Content-Type Size
knownassigned_sortedarray.v2.diff text/x-patch 35.0 KB

From: David Fetter <david(at)fetter(dot)org>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-18 20:16:11
Message-ID: 20100418201611.GD17710@fetter.org
Lists: pgsql-hackers

On Sun, Apr 18, 2010 at 12:01:05PM +0100, Simon Riggs wrote:
> On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
> > On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
> > > Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > > > What I'm not clear on is why you've used a spinlock everywhere
> > > > when only weak-memory thang CPUs are a problem. Why not have a
> > > > weak-memory-protect macro that does does nada when the
> > > > hardware already protects us? (i.e. a spinlock only for the
> > > > hardware that needs it).
> > >
> > > Well, we could certainly consider that, if we had enough places
> > > where there was a demonstrable benefit from it. I couldn't
> > > measure any real slowdown from adding a spinlock in that sinval
> > > code, so I didn't propose doing so at the time --- and I'm
> > > pretty dubious that this code is sufficiently
> > > performance-critical to justify the work, either.
> >
> > OK, I'll put a spinlock around access to the head of the array.
>
> v2 patch attached

If you've committed this, or any other patch you've sent here,
*please* mention so on the same thread.

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: David Fetter <david(at)fetter(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-18 20:22:21
Message-ID: 1271622141.8305.15238.camel@ebony
Lists: pgsql-hackers

On Sun, 2010-04-18 at 13:16 -0700, David Fetter wrote:
> On Sun, Apr 18, 2010 at 12:01:05PM +0100, Simon Riggs wrote:
> > On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
> > > On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
> > > > Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > > > > What I'm not clear on is why you've used a spinlock everywhere
> > > > > when only weak-memory thang CPUs are a problem. Why not have a
> > > > > weak-memory-protect macro that does does nada when the
> > > > > hardware already protects us? (i.e. a spinlock only for the
> > > > > hardware that needs it).
> > > >
> > > > Well, we could certainly consider that, if we had enough places
> > > > where there was a demonstrable benefit from it. I couldn't
> > > > measure any real slowdown from adding a spinlock in that sinval
> > > > code, so I didn't propose doing so at the time --- and I'm
> > > > pretty dubious that this code is sufficiently
> > > > performance-critical to justify the work, either.
> > >
> > > OK, I'll put a spinlock around access to the head of the array.
> >
> > v2 patch attached
>
> If you've committed this, or any other patch you've sent here,
> *please* mention so on the same thread.

I haven't yet. I've written two patches - this is a major module rewrite
and is still under discussion. The other patch has nothing to do with
this (though I did accidentally include a couple of changes from this
patch and immediately reverted them).

I will wait awhile to see if anybody has some independent test results.

--
Simon Riggs www.2ndQuadrant.com


From: David Fetter <david(at)fetter(dot)org>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-18 20:23:21
Message-ID: 20100418202321.GF17710@fetter.org
Lists: pgsql-hackers

On Sun, Apr 18, 2010 at 09:22:21PM +0100, Simon Riggs wrote:
> On Sun, 2010-04-18 at 13:16 -0700, David Fetter wrote:
> > On Sun, Apr 18, 2010 at 12:01:05PM +0100, Simon Riggs wrote:
> > > On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
> > > > On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
> > > > > Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > > > > > What I'm not clear on is why you've used a spinlock
> > > > > > everywhere when only weak-memory thang CPUs are a problem.
> > > > > > Why not have a weak-memory-protect macro that does does
> > > > > > nada when the hardware already protects us? (i.e. a
> > > > > > spinlock only for the hardware that needs it).
> > > > >
> > > > > Well, we could certainly consider that, if we had enough
> > > > > places where there was a demonstrable benefit from it. I
> > > > > couldn't measure any real slowdown from adding a spinlock in
> > > > > that sinval code, so I didn't propose doing so at the time
> > > > > --- and I'm pretty dubious that this code is sufficiently
> > > > > performance-critical to justify the work, either.
> > > >
> > > > OK, I'll put a spinlock around access to the head of the
> > > > array.
> > >
> > > v2 patch attached
> >
> > If you've committed this, or any other patch you've sent here,
> > *please* mention so on the same thread.
>
> I haven't yet. I've written two patches - this is a major module
> rewrite and is still under discussion. The other patch has nothing
> to do with this (though I did accidentally include a couple of
> changes from this patch and immediately revoked them).
>
> I will wait awhile to see if anybody has some independent test
> results.

Thanks :)

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: David Fetter <david(at)fetter(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-20 11:50:23
Message-ID: o2g603c8f071004200450qcfd9cb0l981203c8d16cc907@mail.gmail.com
Lists: pgsql-hackers

On Sun, Apr 18, 2010 at 4:22 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Sun, 2010-04-18 at 13:16 -0700, David Fetter wrote:
>> On Sun, Apr 18, 2010 at 12:01:05PM +0100, Simon Riggs wrote:
>> > On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
>> > > On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
>> > > > Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> > > > > What I'm not clear on is why you've used a spinlock everywhere
>> > > > > when only weak-memory thang CPUs are a problem. Why not have a
>> > > > > weak-memory-protect macro that does does nada when the
>> > > > > hardware already protects us? (i.e. a spinlock only for the
>> > > > > hardware that needs it).
>> > > >
>> > > > Well, we could certainly consider that, if we had enough places
>> > > > where there was a demonstrable benefit from it.  I couldn't
>> > > > measure any real slowdown from adding a spinlock in that sinval
>> > > > code, so I didn't propose doing so at the time --- and I'm
>> > > > pretty dubious that this code is sufficiently
>> > > > performance-critical to justify the work, either.
>> > >
>> > > OK, I'll put a spinlock around access to the head of the array.
>> >
>> > v2 patch attached
>>
>> If you've committed this, or any other patch you've sent here,
>> *please* mention so on the same thread.
>
> I haven't yet. I've written two patches - this is a major module rewrite
> and is still under discussion. The other patch has nothing to do with
> this (though I did accidentally include a couple of changes from this
> patch and immediately revoked them).
>
> I will wait awhile to see if anybody has some independent test results.

So, does anyone have a few cycles to test this out? We are down to a
handful of remaining open items, so getting this tested and committed
sooner = beta sooner.

...Robert


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, David Fetter <david(at)fetter(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-21 03:09:43
Message-ID: 4BCE6C77.6060909@catalyst.net.nz
Lists: pgsql-hackers

Robert Haas wrote:
>
> So, does anyone have a few cycles to test this out? We are down to
> a handful of remaining open items, so getting this tested and committed
> sooner = beta sooner.
>
>
>

I did some testing of this patch (v2). Unfortunately I don't have access
to hardware capable of doing tests at the same scale as Erik used.
However I was still able to show a consistent difference (I think)
between standby performance with and without the patch applied.

Setup:

host: 2.7 GHz dual-core amd64 with 4G RAM and 1 SATA drive,
code: CVS HEAD from 2010-04-14.
pgbench: scale=100, 4 clients, 10000 (select) transactions each.

Results:

Master performance (with and without patch applied):
tps = 10903.612340 - 14070.109951 (including connections establishing)

Standby performance without patch:
tps = 8288.119913 - 9722.245178 (including connections establishing)

Standby performance with patch applied:
tps = 11592.922637 - 14065.553214 (including connections establishing)

I performed 8 runs of each, and results would start at the low range and
climb up to the high one, where they would stabilize. In between runs I
cleared the OS buffer cache and (partially) reloaded it by selecting
counts from the pgbench tables (i.e. I was trying to ensure each run had
the same or similar OS cache contents).
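
(On Linux the usual mechanism for clearing the OS buffer cache is
"sync; echo 3 > /proc/sys/vm/drop_caches" as root, for anyone wanting
to reproduce the setup.)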

Overall it looks like the patch gets standby read-only performance close to
the master - at least in the case where there are minimal master
transactions being tracked by the standby (I had to leave the master
idle whilst running the standby case, as they shared the machine). Hope
this info is useful.

regards

Mark


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, David Fetter <david(at)fetter(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-21 03:46:06
Message-ID: z2s603c8f071004202046tbc68b77dle3a617f1567ee41c@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 20, 2010 at 11:09 PM, Mark Kirkwood
<mark(dot)kirkwood(at)catalyst(dot)net(dot)nz> wrote:
> Robert Haas wrote:
>>
>> So, does anyone have a few cycles to test this out?  We are down to
>> a handful of remaining open items, so getting this tested and committed
>> sooner = beta sooner.
>>
>>
>>
>
> I did some testing of this patch (v2). Unfortunately I don't have access to
> hardware capable of doing tests at the same scale as Erik used. However I
> was still able to show a consistent difference (I think) between standby
> performance with and without the patch applied.
>
> Setup:
>
> host: 2.7 Ghz dual core amd64 with 4G ram and 1 sata drive,
> code: cvs head from 2010-04-14.
> pgbench:  scale=100, 4 clients, 10000 (select) transactions each.
>
> Results:
>
> Master performance (with and without patch applied):
> tps = 10903.612340 - 14070.109951 (including connections establishing)
>
> Standby performance without patch:
> tps = 8288.119913 - 9722.245178 (including connections establishing)
>
> Standby performance with patch applied:
> tps = 11592.922637 - 14065.553214 (including connections establishing)
>
> I performed 8 runs of each, and results would start at the low range and
> climb up to the high one, where they would stabilize. In between runs I
> cleared the os buffer cache and (partially) reloaded it by selecting counts
> from the pgbench tables (i.e I was trying to ensure each run had the same or
> similar os cache contents).
>
> Overall looks like the patch gets standby read only performance close to the
> master - at least in the case where there are minimal master transactions
> being tracked by the standby (I had to leave the master idle whilst running
> the standby case, as they shared the machine). Hope this info is useful.

Thanks, that sounds promising.

...Robert


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, David Fetter <david(at)fetter(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-21 06:39:16
Message-ID: 1271831956.8305.26815.camel@ebony
Lists: pgsql-hackers

On Wed, 2010-04-21 at 15:09 +1200, Mark Kirkwood wrote:

> I did some testing of this patch (v2). Unfortunately I don't have access
> to hardware capable of doing tests at the same scale as Erik used.
> However I was still able to show a consistent difference (I think)
> between standby performance with and without the patch applied.

...

> Overall looks like the patch gets standby read only performance close to
> the master - at least in the case where there are minimal master
> transactions being tracked by the standby (I had to leave the master
> idle whilst running the standby case, as they shared the machine). Hope
> this info is useful.

Thanks very much for the report; always good to get confirmation.

--
Simon Riggs www.2ndQuadrant.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-21 12:20:32
Message-ID: 4BCEED90.4060806@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
>> On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
>>> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>>>> What I'm not clear on is why you've used a spinlock everywhere when only
>>>> weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
>>>> macro that does nada when the hardware already protects us? (i.e. a
>>>> spinlock only for the hardware that needs it).
>>> Well, we could certainly consider that, if we had enough places where
>>> there was a demonstrable benefit from it. I couldn't measure any real
>>> slowdown from adding a spinlock in that sinval code, so I didn't propose
>>> doing so at the time --- and I'm pretty dubious that this code is
>>> sufficiently performance-critical to justify the work, either.
>> OK, I'll put a spinlock around access to the head of the array.
>
> v2 patch attached

The locking seems overly complex to me. Looking at
KnownAssignedXidsAdd(), isn't it possible for two processes to call it
concurrently in exclusive_lock==false mode and get the same 'head'
value, and step on each other's toes? I guess KnownAssignedXidsAdd() is
only called by the startup process, but it seems like an accident
waiting to happen.

Spinlocks are fast; if you have to add an if-branch to decide whether to
acquire it, I suspect you've lost any performance gain to be had
already. Let's keep it simple. And acquiring ProcArrayLock in exclusive
mode while adding entries to the array seems OK to me as well. It only
needs to be held very briefly, and getting this correct and keeping it
simple is much more important at this point than squeezing out every
possible CPU cycle from the system.

Just require that the caller holds ProcArrayLock in exclusive mode when
calling an operation that modifies the array, and in shared mode when
reading it.
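
In other words, just the plain pattern (a sketch only; KnownAssignedXidsAdd()
is the function discussed above, the read-side name here is just a
placeholder):

  /* any operation that modifies the array */
  LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
  KnownAssignedXidsAdd(xid);
  LWLockRelease(ProcArrayLock);

  /* any operation that only reads the array */
  LWLockAcquire(ProcArrayLock, LW_SHARED);
  (void) KnownAssignedXidsSearch(xid);
  LWLockRelease(ProcArrayLock);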

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-21 12:27:12
Message-ID: 4BCEEF20.1000602@enterprisedb.com
Lists: pgsql-hackers

Simon Riggs wrote:
> On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
>> On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
>>> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>>>> What I'm not clear on is why you've used a spinlock everywhere when only
>>>> weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
>>>> macro that does nada when the hardware already protects us? (i.e. a
>>>> spinlock only for the hardware that needs it).
>>> Well, we could certainly consider that, if we had enough places where
>>> there was a demonstrable benefit from it. I couldn't measure any real
>>> slowdown from adding a spinlock in that sinval code, so I didn't propose
>>> doing so at the time --- and I'm pretty dubious that this code is
>>> sufficiently performance-critical to justify the work, either.
>> OK, I'll put a spinlock around access to the head of the array.
>
> v2 patch attached

Given the discussion about the cyclic nature of XIDs, it would be good
to add an assertion that when a new XID is added to the array, it is

a) larger than the biggest value already in the array
(TransactionIdFollows(new, head)), and
b) not too far from the smallest value in the array to confuse binary
search (TransactionIdFollows(new, tail)).
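
In code form, roughly (with 'head' and 'tail' standing for whatever the
patch uses to track the newest and oldest entries):

  Assert(TransactionIdFollows(new_xid, head));  /* (a) */
  Assert(TransactionIdFollows(new_xid, tail));  /* (b) */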

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-21 13:31:28
Message-ID: 1271856688.8305.27942.camel@ebony
Lists: pgsql-hackers

On Wed, 2010-04-21 at 15:20 +0300, Heikki Linnakangas wrote:
> Simon Riggs wrote:
> > On Sun, 2010-04-18 at 08:24 +0100, Simon Riggs wrote:
> >> On Sat, 2010-04-17 at 18:52 -0400, Tom Lane wrote:
> >>> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> >>>> What I'm not clear on is why you've used a spinlock everywhere when only
> >>>> weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
> >>>> macro that does nada when the hardware already protects us? (i.e. a
> >>>> spinlock only for the hardware that needs it).
> >>> Well, we could certainly consider that, if we had enough places where
> >>> there was a demonstrable benefit from it. I couldn't measure any real
> >>> slowdown from adding a spinlock in that sinval code, so I didn't propose
> >>> doing so at the time --- and I'm pretty dubious that this code is
> >>> sufficiently performance-critical to justify the work, either.
> >> OK, I'll put a spinlock around access to the head of the array.
> >
> > v2 patch attached
>
> The locking seems overly complex to me.

> Looking at
> KnownAssignedXidsAdd(), isn't it possible for two processes to call it
> concurrently in exclusive_lock==false mode and get the same 'head'
> value, and step on each other's toes? I guess KnownAssignedXidsAdd() is
> only called by the startup process, but it seems like an accident
> waiting to happen.

Not at all. That assumption is also used elsewhere, so it is safe to use
here.

> Spinlocks are fast; if you have to add an if-branch to decide whether to
> acquire it, I suspect you've lost any performance gain to be had
> already.

I think you're misreading the code. One caller already holds exclusive
lock, one does not. The if test is to determine whether to acquire the
lock or not.

> Let's keep it simple. And acquiring ProcArrayLock in exclusive
> mode while adding entries to the array seems OK to me as well. It only
> needs to be held very briefly, and getting this correct and keeping it
> simple is much more important at this point than squeezing out every
> possible CPU cycle from the system.

I don't understand what you're saying: you say I'm wasting a few cycles
on an if test and should change that, but at the same time you say I
shouldn't worry about a few cycles.

> Just require that the caller holds ProcArrayLock in exclusive mode when
> calling an operation that modifies the array, and in shared mode when
> reading it.

The point of the code you're discussing is to remove the exclusive lock
requests, not to save a few cycles. Spinlocks are fast, as you say.

Exclusive lock requests block under a heavy load of shared lock holders;
we know that already. It is worth removing the contention that can occur
by minimising the number of exclusive locks required. This patch shows
how, and I don't see any reason from what you have said to avoid
committing it. I'm willing to hear some sound reasons, if any exist.

--
Simon Riggs www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-21 13:37:46
Message-ID: 1271857066.8305.27964.camel@ebony
Lists: pgsql-hackers

On Wed, 2010-04-21 at 15:27 +0300, Heikki Linnakangas wrote:

> Given the discussion about the cyclic nature of XIDs, it would be good
> to add an assertion that when a new XID is added to the array, it is
>
> a) larger than the biggest value already in the array
> (TransactionIdFollows(new, head)), and
> b) not too far from the smallest value in the array to confuse binary
> search (TransactionIdFollows(new, tail)).

We discussed this in November. You convinced me it isn't possible for
older xids to stay in the standby because anti-wraparound vacuums would
conflict and kick them out. The primary can't run with wrapped xids and
neither can the standby. I think that is correct.

Adding an assertion isn't going to do much because it's unlikely anybody
is going to be running for 2^31 transactions with asserts enabled.

Worrying about things like this seems strange when real and negative
behaviours are right in our faces elsewhere. Performance and scalability
are real world concerns.

--
Simon Riggs www.2ndQuadrant.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-21 13:51:00
Message-ID: v2w603c8f071004210651q9620ae52tcce7e34a18c111e0@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 21, 2010 at 9:37 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Wed, 2010-04-21 at 15:27 +0300, Heikki Linnakangas wrote:
>
>> Given the discussion about the cyclic nature of XIDs, it would be good
>> to add an assertion that when a new XID is added to the array, it is
>>
>> a) larger than the biggest value already in the array
>> (TransactionIdFollows(new, head)), and
>> b) not too far from the smallest value in the array to confuse binary
>> search (TransactionIdFollows(new, tail)).
>
> We discussed this in November. You convinced me it isn't possible for
> older xids to stay in the standby because anti-wraparound vacuums would
> conflict and kick them out. The primary can't run with wrapped xids and
> neither can the standby. I think that is correct.
>
> Adding an assertion isn't going to do much because it's unlikely anybody
> is going to be running for 2^31 transactions with asserts enabled.
>
> Worrying about things like this seems strange when real and negative
> behaviours are right in our faces elsewhere. Performance and scalability
> are real world concerns.

I think the assert is a good idea. If there's no real problem here,
the assert won't trip. It's just a safety precaution.

...Robert


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-21 14:11:00
Message-ID: s2v603c8f071004210711p6223333fx4c069983a991e0f6@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 21, 2010 at 8:20 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> The locking seems overly complex to me.

I tend to agree.

! /*
! * Callers must hold either ProcArrayLock in Exclusive mode or
! * ProcArrayLock in Shared mode *and* known_assigned_xids_lck
! * to update these values.
! */

I'm not convinced that this is either (a) correct or (b) performant.
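
As I read that comment, the update path has to look roughly like this
(a sketch, using the names from the patch as far as I understand them):

  LWLockAcquire(ProcArrayLock, LW_SHARED);
  SpinLockAcquire(&known_assigned_xids_lck);
  /* ... update the head/tail values ... */
  SpinLockRelease(&known_assigned_xids_lck);
  LWLockRelease(ProcArrayLock);

which implies that a reader holding only ProcArrayLock in shared mode
must also take the spinlock, or it can see a half-updated head/tail
pair.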

...Robert


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-21 14:12:43
Message-ID: 1271859163.8305.28038.camel@ebony
Lists: pgsql-hackers

On Wed, 2010-04-21 at 09:51 -0400, Robert Haas wrote:
> >
> > Adding an assertion isn't going to do much because it's unlikely anybody
> > is going to be running for 2^31 transactions with asserts enabled.
> >

> I think the assert is a good idea. If there's no real problem here,
> the assert won't trip. It's just a safety precaution.

If you believe that, then I think you should add this to all the other
places in the current server where that assumption is made without an
assertion being added. As a safety precaution.

--
Simon Riggs www.2ndQuadrant.com


From: marcin mank <marcin(dot)mank(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-21 14:22:50
Message-ID: x2lb1b9fac61004210722j5bce99at4d50b1a2a7e2b5d5@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 21, 2010 at 4:12 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Wed, 2010-04-21 at 09:51 -0400, Robert Haas wrote:
>> >
>> > Adding an assertion isn't going to do much because it's unlikely anybody
>> > is going to be running for 2^31 transactions with asserts enabled.
>> >
>
>> I think the assert is a good idea.  If there's no real problem here,
>> the assert won't trip.  It's just a safety precaution.
>
> If you believe that, then I think you should add this to all the other
> places in the current server where that assumption is made without
> assertion being added. As a safety precaution.
>

Is it not a good idea that (at least for dev builds, like with
--enable-cassert) the xid counter start at something like 2^31 - 1000? It
could help catch some bugs.
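
(On a throwaway cluster one could approximate this today with something
like "pg_resetxlog -x 2147482648 $PGDATA", i.e. 2^31 - 1000, though the
matching pg_clog segment may then need to be created by hand.)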

Greetings
Marcin Mańk


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: marcin mank <marcin(dot)mank(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-21 14:49:06
Message-ID: 1271861346.8305.28130.camel@ebony
Lists: pgsql-hackers

On Wed, 2010-04-21 at 16:22 +0200, marcin mank wrote:

> Is that not a good idea that (at least for dev-builds, like with
> enable-cassert) the xid counter start at like 2^31 - 1000 ? It could
> help catch some bugs.

It is a good idea, I'm sure that would help catch bugs.

It wouldn't help here because the case in doubt is whether it's possible
to have an xid still showing in memory arrays from the last time the
cycle wrapped. It isn't. These things aren't random. These numbers are
extracted directly from activity that was occurring on the primary and
regularly checked and cleaned as the standby runs.

So you'll need to do 2^31 transactions to prove this isn't true, which
isn't ever going to happen in testing with an assert build, and nobody
with that many transactions would run an assert build anyway.

--
Simon Riggs www.2ndQuadrant.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-21 14:53:05
Message-ID: x2u603c8f071004210753gd23ac37hd61e27915c6e0b1a@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 21, 2010 at 10:12 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Wed, 2010-04-21 at 09:51 -0400, Robert Haas wrote:
>> >
>> > Adding an assertion isn't going to do much because it's unlikely anybody
>> > is going to be running for 2^31 transactions with asserts enabled.
>> >
>
>> I think the assert is a good idea.  If there's no real problem here,
>> the assert won't trip.  It's just a safety precaution.
>
> If you believe that, then I think you should add this to all the other
> places in the current server where that assumption is made without
> assertion being added. As a safety precaution.

I feel like this conversation is getting a little heated. We are just
trying to solve a technical problem here. Perhaps I am misreading -
tone doesn't come through very well in email.

I think the assumptions that are being made in this particular case
are different from the ones made elsewhere in the server. Generally,
we don't assume transaction IDs are arriving in ascending order - in
fact, we usually explicitly have to deal with the fact that they might
not be. So if we have a situation where we ARE relying on them
arriving in order, because we have extrinsic reasons why we know it
has to happen that way, adding an assertion to make sure that things
are happening the way we expect doesn't seem out of line. This code
is fairly complex.

There is arguably less value in asserting that the newly added xid
follows the tail as well as the head, but I still like the idea. Not
sure whether that's rational or not.

...Robert


From: Florian Pflug <fgp(at)phlo(dot)org>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: marcin mank <marcin(dot)mank(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-21 15:13:29
Message-ID: A19FEFA9-6AE5-429C-A177-1BE5928C1914@phlo.org
Lists: pgsql-hackers

On Apr 21, 2010, at 16:49 , Simon Riggs wrote:
> On Wed, 2010-04-21 at 16:22 +0200, marcin mank wrote:
>
>> Is that not a good idea that (at least for dev-builds, like with
>> enable-cassert) the xid counter start at like 2^31 - 1000 ? It could
>> help catch some bugs.
>
> It is a good idea, I'm sure that would help catch bugs.
>
> It wouldn't help here because the case in doubt is whether it's possible
> to have an xid still showing in memory arrays from the last time the
> cycle wrapped. It isn't. These things aren't random. These numbers are
> extracted directly from activity that was occurring on the primary and
> regularly checked and cleaned as the standby runs.
>
> So you'll need to do 2^31 transactions to prove this isn't true, which
> isn't ever going to happen in testing with an assert build and nobody
> with that many transactions would run an assert build anyway.

ISTM that there's no need to actually execute 2^31 transactions to trigger this bug (if it actually exists); it'd be sufficient to increment the xid counter by more than one each time an xid is assigned, no?

Or would that trip snapshot creation on the standby?
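
Something like this in GetNewTransactionId(), say (a sketch only:
XID_TEST_STRIDE is made up, and real code would also have to extend
clog and subtrans across the skipped range):

  #ifdef XID_TEST_STRIDE
  {
      int         i;

      /* burn extra xids so the counter wraps around much sooner */
      for (i = 1; i < XID_TEST_STRIDE; i++)
          TransactionIdAdvance(ShmemVariableCache->nextXid);
  }
  #endif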

best regards,
Florian Pflug


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-22 05:57:43
Message-ID: 4BCFE557.8040802@enterprisedb.com
Lists: pgsql-hackers

Robert Haas wrote:
> On Wed, Apr 21, 2010 at 9:37 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On Wed, 2010-04-21 at 15:27 +0300, Heikki Linnakangas wrote:
>>
>>> Given the discussion about the cyclic nature of XIDs, it would be good
>>> to add an assertion that when a new XID is added to the array, it is
>>>
>>> a) larger than the biggest value already in the array
>>> (TransactionIdFollows(new, head)), and
>>> b) not too far from the smallest value in the array to confuse binary
>>> search (TransactionIdFollows(new, tail)).
>> We discussed this in November. You convinced me it isn't possible for
>> older xids to stay in the standby because anti-wraparound vacuums would
>> conflict and kick them out. The primary can't run with wrapped xids and
>> neither can the standby. I think that is correct.
>>
>> Adding an assertion isn't going to do much because it's unlikely anybody
>> is going to be running for 2^31 transactions with asserts enabled.
>>
>> Worrying about things like this seems strange when real and negative
>> behaviours are right in our faces elsewhere. Performance and scalability
>> are real world concerns.
>
> I think the assert is a good idea. If there's no real problem here,
> the assert won't trip. It's just a safety precaution.

Right. And assertions also act as documentation: they are a precise and
compact way to document invariants we assume to hold. A comment
explaining why the cyclic nature of XIDs is not a problem would be nice
too, in addition to or instead of the assertions.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-22 06:56:39
Message-ID: 1271919399.8305.30561.camel@ebony
Lists: pgsql-hackers

On Thu, 2010-04-22 at 08:57 +0300, Heikki Linnakangas wrote:
> >
> > I think the assert is a good idea. If there's no real problem here,
> > the assert won't trip. It's just a safety precaution.
>
> Right. And assertions also act as documentation, they are a precise and
> compact way to document invariants we assume to hold. A comment
> explaining why the cyclic nature of XIDs is not a problem would be nice
> too, in addition or instead of the assertions.

Agreed. I was going to reply with just that earlier but have been
distracted by other things.

--
Simon Riggs www.2ndQuadrant.com


From: "Erik Rijkers" <er(at)xs4all(dot)nl>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-22 18:39:14
Message-ID: d5a7014dc00b609f19804608d6f070b2.squirrel@webmail.xs4all.nl
Lists: pgsql-hackers

On Sun, April 18, 2010 13:01, Simon Riggs wrote:
>>
>> OK, I'll put a spinlock around access to the head of the array.
>
> v2 patch attached
>

knownassigned_sortedarray.v2.diff applied to cvs HEAD (2010.04.21 22:36)

I have done a few smaller tests (scale 500, clients 1, 20):

init:
pgbench -h /tmp -p 6565 -U rijkers -i -s 500 replicas

4x primary, clients 1:
scale: 500 clients: 1 tps = 11496.372655 pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 11580.141685 pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 11478.294747 pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 11741.432016 pgbench -p 6565 -n -S -c 1 -T 900 -j 1

4x standby, clients 1:
scale: 500 clients: 1 tps = 727.217672 pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 785.431011 pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 825.291817 pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 868.107638 pgbench -p 6566 -n -S -c 1 -T 900 -j 1

4x primary, clients 20:
scale: 500 clients: 20 tps = 34963.054102 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 34818.985407 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 34964.545013 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 34959.210687 pgbench -p 6565 -n -S -c 20 -T 900 -j 1

4x standby, clients 20:
scale: 500 clients: 20 tps = 1099.808192 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 905.926703 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 943.531989 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 1082.215913 pgbench -p 6566 -n -S -c 20 -T 900 -j 1

This is the same behaviour (i.e. extremely slow standby) that I saw earlier (and which caused the
original post, btw). In that earlier instance, the extreme slowness disappeared later, after many
hours, maybe even days (without bouncing either primary or standby).

I have no idea what could cause this; is no one else seeing this?

(if I have time I'll repeat on other hardware over the weekend)

any comment is welcome...

Erik Rijkers


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-22 21:41:29
Message-ID: 4BD0C289.9090208@2ndquadrant.com
Lists: pgsql-hackers

Erik Rijkers wrote:
> This is the same behaviour (i.e. extreme slow standby) that I saw earlier (and which caused the
> original post, btw). In that earlier instance, the extreme slowness disappeared later, after many
> hours maybe even days (without bouncing either primary or standby).
>

Any possibility the standby is built with assertions turned on? That's
often the cause of this type of difference between pgbench results on
two systems, which is easy to introduce when everyone is building from
source. You should try this on both systems:

psql -c "show debug_assertions"

just to rule that out.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.us


From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-22 21:54:55
Message-ID: 4BD0C5AF.6070206@catalyst.net.nz
Lists: pgsql-hackers

Greg Smith wrote:
> Erik Rijkers wrote:
>> This is the same behaviour (i.e. extreme slow standby) that I saw
>> earlier (and which caused the
>> original post, btw). In that earlier instance, the extreme slowness
>> disappeared later, after many
>> hours maybe even days (without bouncing either primary or standby).
>>
>
> Any possibility the standby is built with assertions turned on?
> That's often the cause of this type of difference between pgbench
> results on two systems, which is easy to introduce when everyone is
> building from source. You should try this on both systems:
>
> psql -c "show debug_assertions"
>
>
>
Or even:

pg_config --configure

on both systems might be worth checking.

regards

Mark


From: "Erik Rijkers" <er(at)xs4all(dot)nl>
To: "Mark Kirkwood" <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Cc: "Greg Smith" <greg(at)2ndquadrant(dot)com>, "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-22 22:32:44
Message-ID: 6fea28b53bef59d729042bfb4fc43fbf.squirrel@webmail.xs4all.nl
Lists: pgsql-hackers

On Thu, April 22, 2010 23:54, Mark Kirkwood wrote:
> Greg Smith wrote:
>> Erik Rijkers wrote:
>>> This is the same behaviour (i.e. extreme slow standby) that I saw
>>> earlier (and which caused the
>>> original post, btw). In that earlier instance, the extreme slowness
>>> disappeared later, after many
>>> hours maybe even days (without bouncing either primary or standby).
>>>
>>
>> Any possibility the standby is built with assertions turned on?
>> That's often the cause of this type of difference between pgbench
>> results on two systems, which is easy to introduce when everyone is
>> building from source. You should try this on both systems:
>>
>> psql -c "show debug_assertions"
>>
>>
>>
> Or even:
>
> pg_config --configure
>
> on both systems might be worth checking.

(these instances are on a single server, btw)

primary:

$ pg_config
BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/bin
DOCDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/doc
HTMLDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/doc
INCLUDEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include
PKGINCLUDEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include
INCLUDEDIR-SERVER = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include/server
LIBDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib
PKGLIBDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib
LOCALEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/locale
MANDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/man
SHAREDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share
SYSCONFDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/etc
PGXS = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/var/data1/pg_stuff/pg_installations/pgsql.sr_primary' '--with-pgport=6565'
'--enable-depend' '--with-openssl' '--with-perl' '--with-libxml' '--with-libxslt'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement
-Wendif-labels -fno-strict-aliasing -fwrapv
CFLAGS_SL = -fpic
LDFLAGS = -Wl,-rpath,'/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib'
LDFLAGS_SL =
LIBS = -lpgport -lxslt -lxml2 -lssl -lcrypto -lz -lreadline -ltermcap -lcrypt -ldl -lm
VERSION = PostgreSQL 9.0devel-sr_primary
[data:port:db PGPORT=6565 PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/data
PGDATABASE=replicas]
2010.04.22 20:55:28 rijkers(at)denkraam:~/src/perl/85devel [0]
$ time ./run_test_suite.sh
[data:port:db PGPORT=6565 PGDATA=/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/data
PGDATABASE=replicas]
2010.04.22 21:00:26 rijkers(at)denkraam:~/src/perl/85devel [1]

standby:

$ pg_config
BINDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/bin
DOCDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/doc
HTMLDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/doc
INCLUDEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include
PKGINCLUDEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include
INCLUDEDIR-SERVER = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/include/server
LIBDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib
PKGLIBDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib
LOCALEDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/locale
MANDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share/man
SHAREDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/share
SYSCONFDIR = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/etc
PGXS = /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib/pgxs/src/makefiles/pgxs.mk
CONFIGURE = '--prefix=/var/data1/pg_stuff/pg_installations/pgsql.sr_primary' '--with-pgport=6565'
'--enable-depend' '--with-openssl' '--with-perl' '--with-libxml' '--with-libxslt'
CC = gcc
CPPFLAGS = -D_GNU_SOURCE -I/usr/include/libxml2
CFLAGS = -O2 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement
-Wendif-labels -fno-strict-aliasing -fwrapv
CFLAGS_SL = -fpic
LDFLAGS = -Wl,-rpath,'/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/lib'
LDFLAGS_SL =
LIBS = -lpgport -lxslt -lxml2 -lssl -lcrypto -lz -lreadline -ltermcap -lcrypt -ldl -lm
VERSION = PostgreSQL 9.0devel-sr_primary

$ grep -Ev '(^[[:space:]]*#)|(^$)' pgsql.sr_*ry/data/postgresql.conf
pgsql.sr_primary/data/postgresql.conf:data_directory =
'/var/data1/pg_stuff/pg_installations/pgsql.sr_primary/data'
pgsql.sr_primary/data/postgresql.conf:port = 6565
pgsql.sr_primary/data/postgresql.conf:max_connections = 100
pgsql.sr_primary/data/postgresql.conf:shared_buffers = 256MB
pgsql.sr_primary/data/postgresql.conf:checkpoint_segments = 50
pgsql.sr_primary/data/postgresql.conf:archive_mode = 'on'
pgsql.sr_primary/data/postgresql.conf:archive_command= 'cp %p
/var/data1/pg_stuff/dump/replication_archive/%f'
pgsql.sr_primary/data/postgresql.conf:max_wal_senders = 5
pgsql.sr_primary/data/postgresql.conf:effective_cache_size= 16GB
pgsql.sr_primary/data/postgresql.conf:datestyle = 'iso, mdy'
pgsql.sr_primary/data/postgresql.conf:lc_messages = 'en_US.UTF-8'
pgsql.sr_primary/data/postgresql.conf:lc_monetary = 'en_US.UTF-8'
pgsql.sr_primary/data/postgresql.conf:lc_numeric = 'en_US.UTF-8'
pgsql.sr_primary/data/postgresql.conf:lc_time = 'en_US.UTF-8'
pgsql.sr_primary/data/postgresql.conf:default_text_search_config = 'pg_catalog.english'

pgsql.sr_slavery/data/postgresql.conf:data_directory =
'/var/data1/pg_stuff/pg_installations/pgsql.sr_slavery/data'
pgsql.sr_slavery/data/postgresql.conf:port = 6566
pgsql.sr_slavery/data/postgresql.conf:max_connections = 100
pgsql.sr_slavery/data/postgresql.conf:shared_buffers = 256MB
pgsql.sr_slavery/data/postgresql.conf:checkpoint_segments = 50
pgsql.sr_slavery/data/postgresql.conf:archive_mode = 'on'
pgsql.sr_slavery/data/postgresql.conf:archive_command= 'cp %p
/var/data1/pg_stuff/dump/replication_archive/%f'
pgsql.sr_slavery/data/postgresql.conf:max_wal_senders = 5
pgsql.sr_slavery/data/postgresql.conf:effective_cache_size= 16GB
pgsql.sr_slavery/data/postgresql.conf:datestyle = 'iso, mdy'
pgsql.sr_slavery/data/postgresql.conf:lc_messages = 'en_US.UTF-8'
pgsql.sr_slavery/data/postgresql.conf:lc_monetary = 'en_US.UTF-8'
pgsql.sr_slavery/data/postgresql.conf:lc_numeric = 'en_US.UTF-8'
pgsql.sr_slavery/data/postgresql.conf:lc_time = 'en_US.UTF-8'
pgsql.sr_slavery/data/postgresql.conf:default_text_search_config = 'pg_catalog.english'

I only now notice archive_mode = 'on' on the standby. That doesn't make sense of course - I'll
remove that.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-22 22:45:47
Message-ID: 1271976347.4161.61.camel@ebony
Lists: pgsql-hackers

On Thu, 2010-04-22 at 20:39 +0200, Erik Rijkers wrote:
> On Sun, April 18, 2010 13:01, Simon Riggs wrote:

> any comment is welcome...

Please can you re-run with -l and post me the file of times.
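
For instance, re-running one of the single-client cases as

  pgbench -h /tmp -p 6566 -n -S -c 1 -T 900 -j 1 -l

should leave the per-transaction latencies in a pgbench_log.<pid> file.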

Please also rebuild using --enable-profile so we can see what's
happening.

Can you also try the enclosed patch, which implements prefetching during
replay of btree delete records? (You'll need to set effective_io_concurrency.)

Thanks for your further help.

--
Simon Riggs www.2ndQuadrant.com

Attachment Content-Type Size
xlog_prefetch.patch text/x-patch 5.1 KB

From: Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-23 00:44:11
Message-ID: 4BD0ED5B.8040104@catalyst.net.nz
Lists: pgsql-hackers

Erik Rijkers wrote:
>
> This is the same behaviour (i.e. extreme slow standby) that I saw earlier (and which caused the
> original post, btw). In that earlier instance, the extreme slowness disappeared later, after many
> hours maybe even days (without bouncing either primary or standby).
>
> I have no idea what could cause this; is no one else is seeing this ?
>
> (if I have time I'll repeat on other hardware in the weekend)
>
> any comment is welcome...
>
>
>

I wonder if what you are seeing is perhaps due to the tables on the
primary being almost completely cached (from the initial create) and
those on the standby being at best partially so. That would explain why
the standby performance catches up after a while (when its tables are
equivalently cached).

One way to test this is to 'pre-cache' the standby by selecting every
row from its tables before running the pgbench test.
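
For example (assuming the standard pgbench table names):

  psql -p 6566 -c "select count(*) from pgbench_accounts"

pgbench_accounts is the big one; the other tables are tiny by
comparison.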

regards

Mark


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-23 15:14:41
Message-ID: 1272035681.4161.700.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2010-04-22 at 23:45 +0100, Simon Riggs wrote:
> On Thu, 2010-04-22 at 20:39 +0200, Erik Rijkers wrote:
> > On Sun, April 18, 2010 13:01, Simon Riggs wrote:
>
> > any comment is welcome...
>
> Please can you re-run with -l and post me the file of times

Erik has sent me details of a test run. My analysis of that is:

I'm seeing the response time profile on the standby as
99% <110us
99.9% <639us
99.99% <615ms

0.052% (52 samples) are >5ms elapsed and account for 24 s, which is
about 45% of elapsed time.

Of the 52 samples >5ms, 50 of them are >100ms and 2 >1s.

99% of transactions happen in similar times between primary and standby;
everything is dragged down by rare but severe spikes.

We're looking for something that would delay something that normally
takes <0.1ms into something that takes >100ms, yet does eventually
return. That looks like a severe resource contention issue.

This effect happens when running just a single read-only session on the
standby from pgbench. No confirmation as yet as to whether recovery is
active or dormant, or what other activity, if any, occurs on the standby
server at the same time. So no other clues as yet as to what the contention
might be, except that we note the standby is writing data and the
database is large.

> Please also rebuild using --enable-profile so we can see what's
> happening.
>
> Can you also try the enclosed patch which implements prefetching during
> replay of btree delete records. (Need to set effective_io_concurrency)

As yet, no confirmation that the attached patch is even relevant. It was
just a wild guess at some tuning, while we wait for further info.

> Thanks for your further help.

"Some kind of contention" is best we can say at present.

--
Simon Riggs www.2ndQuadrant.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-23 15:32:19
Message-ID: j2q603c8f071004230832rdd1791c8pfedb6e2fd5c0468@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 23, 2010 at 11:14 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Thu, 2010-04-22 at 23:45 +0100, Simon Riggs wrote:
>> On Thu, 2010-04-22 at 20:39 +0200, Erik Rijkers wrote:
>> > On Sun, April 18, 2010 13:01, Simon Riggs wrote:
>>
>> > any comment is welcome...
>>
>> Please can you re-run with -l and post me the file of times
>
> Erik has sent me details of a test run. My analysis of that is:
>
> I'm seeing the response time profile on the standby as
> 99% <110us
> 99.9% <639us
> 99.99% <615ms
>
> 0.052% (52 samples) are >5ms elapsed and account for 24 s, which is
> about 45% of elapsed time.
>
> Of the 52 samples >5ms, 50 of them are >100ms and 2 >1s.
>
> 99% of transactions happen in similar times between primary and standby,
> everything dragged down by rare but severe spikes.
>
> We're looking for something that would delay something that normally
> takes <0.1ms into something that takes >100ms, yet does eventually
> return. That looks like a severe resource contention issue.

Wow. Good detective work.

...Robert


From: Marko Kreen <markokr(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-23 17:16:09
Message-ID: s2je51f66da1004231016pc823d9b6yc18a76b32bb3c70e@mail.gmail.com
Lists: pgsql-hackers

On 4/18/10, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Sat, 2010-04-17 at 16:48 -0400, Tom Lane wrote:
> > There are some places where we suppose that a *single* write into shared
> > memory can safely be done without a lock, if we're not too concerned
> > about how soon other transactions will see the effects. But what you
> > are proposing here requires more than one related write.
> >
> > I've been burnt by this myself:
> > http://archives.postgresql.org/pgsql-committers/2008-06/msg00228.php
>
>
> W O W - thank you for sharing.
>
> What I'm not clear on is why you've used a spinlock everywhere when only
> weak-memory thang CPUs are a problem. Why not have a weak-memory-protect
> macro that does does nada when the hardware already protects us? (i.e. a
> spinlock only for the hardware that needs it).

Um, you have been burned by exactly this on x86 also:

http://archives.postgresql.org/pgsql-hackers/2009-03/msg01265.php

--
marko


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marko Kreen <markokr(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-23 17:38:48
Message-ID: 12382.1272044328@sss.pgh.pa.us
Lists: pgsql-hackers

Marko Kreen <markokr(at)gmail(dot)com> writes:
> Um, you have been burned by exactly this on x86 also:
> http://archives.postgresql.org/pgsql-hackers/2009-03/msg01265.php

Yeah, we never did figure out exactly how come you were observing that
failure on Intel-ish hardware. I was under the impression that Intel
machines didn't have weak-memory-ordering behavior.

I wonder whether your compiler had rearranged the code in ProcArrayAdd
so that the increment happened before the array element store at the
machine-code level. I think it would be entitled to do that under
standard C semantics, since that ProcArrayStruct pointer isn't marked
volatile.
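
Schematically, the pattern at stake is (not the exact code):

  arrayP->procs[arrayP->numProcs] = proc;   /* store the new element */
  arrayP->numProcs++;                       /* then bump the count */

Without volatile (or an explicit barrier) the compiler is entitled to
emit the increment first, and a concurrent reader could then see a
count that includes a slot whose contents haven't been stored yet.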

regards, tom lane


From: Marko Kreen <markokr(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-23 18:22:55
Message-ID: r2we51f66da1004231122zc9789db6td117e913ea72ca76@mail.gmail.com
Lists: pgsql-hackers

On 4/23/10, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Marko Kreen <markokr(at)gmail(dot)com> writes:
> > Um, you have been burned by exactly this on x86 also:
> > http://archives.postgresql.org/pgsql-hackers/2009-03/msg01265.php
>
>
> Yeah, we never did figure out exactly how come you were observing that
> failure on Intel-ish hardware. I was under the impression that Intel
> machines didn't have weak-memory-ordering behavior.
>
> I wonder whether your compiler had rearranged the code in ProcArrayAdd
> so that the increment happened before the array element store at the
> machine-code level. I think it would be entitled to do that under
> standard C semantics, since that ProcArrayStruct pointer isn't marked
> volatile.

Sounds likely.

Which seems to hint it's better to handle all processors as weak-ordered
and then work with explicit locks/memory barriers, than to sprinkle
code with 'volatile' to suppress optimizations on Intel and then still
fail on non-Intel.

--
marko


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-23 22:39:48
Message-ID: 1272062388.4161.848.camel@ebony
Lists: pgsql-hackers

On Fri, 2010-04-23 at 11:32 -0400, Robert Haas wrote:
> >
> > 99% of transactions happen in similar times between primary and standby,
> > everything dragged down by rare but severe spikes.
> >
> > We're looking for something that would delay something that normally
> > takes <0.1ms into something that takes >100ms, yet does eventually
> > return. That looks like a severe resource contention issue.
>
> Wow. Good detective work.

While we haven't fully established the source of those problems, I am
now happy that these test results don't present any reason to avoid
committing the main patch tested by Erik (not the smaller additional one
I sent). I expect to commit that on Sunday.

--
Simon Riggs www.2ndQuadrant.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-23 23:07:15
Message-ID: s2y603c8f071004231607jf4468c06y185817dc274e0c9e@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 23, 2010 at 6:39 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Fri, 2010-04-23 at 11:32 -0400, Robert Haas wrote:
>> >
>> > 99% of transactions happen in similar times between primary and standby,
>> > everything dragged down by rare but severe spikes.
>> >
>> > We're looking for something that would delay something that normally
>> > takes <0.1ms into something that takes >100ms, yet does eventually
>> > return. That looks like a severe resource contention issue.
>>
>> Wow.  Good detective work.
>
> While we haven't fully established the source of those problems, I am
> now happy that these test results don't present any reason to avoid
> committing the main patch tested by Erik (not the smaller additional one
> I sent). I expect to commit that on Sunday.

Both Heikki and I objected to that patch. And apparently it doesn't
fix the problem, either. So, -1 from me.

...Robert


From: "Erik Rijkers" <er(at)xs4all(dot)nl>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-23 23:17:22
Message-ID: 2335e90195b505710d1154b5cac73cfb.squirrel@webmail.xs4all.nl
Lists: pgsql-hackers

On Sat, April 24, 2010 00:39, Simon Riggs wrote:
> On Fri, 2010-04-23 at 11:32 -0400, Robert Haas wrote:
>> >
>> > 99% of transactions happen in similar times between primary and standby,
>> > everything dragged down by rare but severe spikes.
>> >
>> > We're looking for something that would delay something that normally
>> > takes <0.1ms into something that takes >100ms, yet does eventually
>> > return. That looks like a severe resource contention issue.
>>
>> Wow. Good detective work.
>
> While we haven't fully established the source of those problems, I am
> now happy that these test results don't present any reason to avoid
> committing the main patch tested by Erik (not the smaller additional one
> I sent). I expect to commit that on Sunday.
>

yes, that (main) patch seems to have largely
closed the gap between primary and standby; here
are some results from a lower scale (10):

scale: 10
clients: 10, 20, 40, 60, 90
for each: 4x primary, 4x standby: (6565=primary, 6566=standby)
-----
scale: 10 clients: 10 tps = 27624.339871 pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 27604.261750 pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 28015.093466 pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 28422.561280 pgbench -p 6565 -n -S -c 10 -T 900 -j 1

scale: 10 clients: 10 tps = 27254.806526 pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 27686.470866 pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 28078.904035 pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 27101.622337 pgbench -p 6566 -n -S -c 10 -T 900 -j 1

-----
scale: 10 clients: 20 tps = 23106.795587 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 23101.681155 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 22893.364004 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 23038.577109 pgbench -p 6565 -n -S -c 20 -T 900 -j 1

scale: 10 clients: 20 tps = 22903.578552 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 22970.691946 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 22999.473318 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 22884.854749 pgbench -p 6566 -n -S -c 20 -T 900 -j 1

-----
scale: 10 clients: 40 tps = 23522.499429 pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 10 clients: 40 tps = 23611.319191 pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 10 clients: 40 tps = 23616.905302 pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 10 clients: 40 tps = 23572.213990 pgbench -p 6565 -n -S -c 40 -T 900 -j 1

scale: 10 clients: 40 tps = 23714.721220 pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 10 clients: 40 tps = 23711.781175 pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 10 clients: 40 tps = 23691.867023 pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 10 clients: 40 tps = 23691.699231 pgbench -p 6566 -n -S -c 40 -T 900 -j 1

-----
scale: 10 clients: 60 tps = 21987.497095 pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 10 clients: 60 tps = 21950.344204 pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 10 clients: 60 tps = 22006.461447 pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 10 clients: 60 tps = 21824.071303 pgbench -p 6565 -n -S -c 60 -T 900 -j 1

scale: 10 clients: 60 tps = 22149.415231 pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 10 clients: 60 tps = 22211.064402 pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 10 clients: 60 tps = 22164.238081 pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 10 clients: 60 tps = 22174.585736 pgbench -p 6566 -n -S -c 60 -T 900 -j 1

-----
scale: 10 clients: 90 tps = 18751.213002 pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 10 clients: 90 tps = 18757.115811 pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 10 clients: 90 tps = 18692.942329 pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 10 clients: 90 tps = 18765.390154 pgbench -p 6565 -n -S -c 90 -T 900 -j 1

scale: 10 clients: 90 tps = 18929.462104 pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 10 clients: 90 tps = 18999.851184 pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 10 clients: 90 tps = 18972.321607 pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 10 clients: 90 tps = 18924.058827 pgbench -p 6566 -n -S -c 90 -T 900 -j 1

The higher scales still show that other standby slowness. It may be a
caching effect (as Mark Kirkwood suggested): the idea being that the
primary's data is pre-cached by the initial create, while the standby's
data must be read from disk for the first time.

Does that make sense?

I will try to confirm this.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-24 09:17:44
Message-ID: 1272100664.4161.1173.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2010-04-23 at 19:07 -0400, Robert Haas wrote:
> On Fri, Apr 23, 2010 at 6:39 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > On Fri, 2010-04-23 at 11:32 -0400, Robert Haas wrote:
> >> >
> >> > 99% of transactions happen in similar times between primary and standby,
> >> > everything dragged down by rare but severe spikes.
> >> >
> >> > We're looking for something that would delay something that normally
> >> > takes <0.1ms into something that takes >100ms, yet does eventually
> >> > return. That looks like a severe resource contention issue.
> >>
> >> Wow. Good detective work.
> >
> > While we haven't fully established the source of those problems, I am
> > now happy that these test results don't present any reason to avoid
> > committing the main patch tested by Erik (not the smaller additional one
> > I sent). I expect to commit that on Sunday.
>
> Both Heikki and I objected to that patch.

Please explain your objection, based upon the patch and my explanations.

> And apparently it doesn't
> fix the problem, either. So, -1 from me.

There is an issue observed in Erik's later tests, but my interpretation
of the results so far is that the sorted array patch successfully
removes the initially reported loss of performance.

--
Simon Riggs www.2ndQuadrant.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 10:53:04
Message-ID: l2p603c8f071004250353o86f37300xadfa3331602549f8@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Apr 24, 2010 at 5:17 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Both Heikki and I objected to that patch.
>
> Please explain your objection, based upon the patch and my explanations.

Well, we objected to the locking. Having reread the patch a few times
though, I think I'm starting to wrap my head around it so, I don't
know, maybe it's OK. Have you tested grabbing the ProcArrayLock in
exclusive mode instead of having a separate spinlock, to see how that
performs?

>> And apparently it doesn't
>> fix the problem, either.  So, -1 from me.
>
> There is an issue observed in Erik's later tests, but my interpretation
> of the results so far is that the sorted array patch successfully
> removes the initially reported loss of performance.

Is it possible the remaining spikes are due to fights over the spinlock?

...Robert


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 12:50:00
Message-ID: 1272199800.4161.1662.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2010-04-25 at 06:53 -0400, Robert Haas wrote:
> On Sat, Apr 24, 2010 at 5:17 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> >> Both Heikki and I objected to that patch.
> >
> > Please explain your objection, based upon the patch and my explanations.
>
> Well, we objected to the locking. Having reread the patch a few times
> though, I think I'm starting to wrap my head around it so, I don't
> know, maybe it's OK. Have you tested grabbing the ProcArrayLock in
> exclusive mode instead of having a separate spinlock, to see how that
> performs?

They both perform well. Heikki and I agree that spinlocks perform well.

The point of taking the ProcArrayLock in Shared rather than Exclusive
mode is the reduction in contention that gives. Adding a
KnownAssignedXid is exactly analogous to starting a transaction. We go
to some trouble to avoid taking a lock when we start a transaction and I
want to do a similar thing on the standby. I think it's even more
important we reduce contention on the standby than it is on the primary
because if the Startup process is delayed then it slows down recovery.
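
As a sketch of that pattern (my illustration, not the patch text; the
procArray field names are assumptions):

/*
 * Startup process adds one xid.  Caller already holds ProcArrayLock in
 * LW_SHARED mode, which keeps snapshot takers from acquiring it
 * exclusively; the spinlock serialises movement of the head pointer,
 * which readers fetch under the same spinlock.
 */
static void
KnownAssignedXidsAddOne(TransactionId xid)
{
    int     head = procArray->headKnownAssignedXids;

    KnownAssignedXids[head] = xid;          /* fill the slot first... */
    KnownAssignedXidsValid[head] = true;

    SpinLockAcquire(&procArray->known_assigned_xids_lck);
    procArray->headKnownAssignedXids = head + 1;    /* ...then publish it */
    SpinLockRelease(&procArray->known_assigned_xids_lck);
}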

> >> And apparently it doesn't
> >> fix the problem, either. So, -1 from me.
> >
> > There is an issue observed in Erik's later tests, but my interpretation
> > of the results so far is that the sorted array patch successfully
> > removes the initially reported loss of performance.
>
> Is it possible the remaining spikes are due to fights over the spinlock?

That one specific spinlock? I considered it. I think it's doubtful.

The spikes all involved a jump from <5ms to much more than that, often
as long as a second. That looks like a pure I/O problem or an LWlock
held across a slow I/O or platform specific issues of some kind.

Erik's tests were with 1 backend, so that would mean that at most 2
processes might want the spinlock you mention. The spinlock is only
taken at snapshot time, so once per transaction in Erik's tests. So if
the spinlock were the problem, then one or the other process would need
to grab and hold the spinlock for a very long time and then release it
sometime later. Since the spinlock code has been with us for
some time, I doubt there is anything wrong there.

I don't have any evidence as to whether the problem is solely on Erik's
system, nor whether it is an issue already in the codebase or introduced
by this patch. It seems unlikely to me that this patch introduces the
problem, and even more unlikely that the issue is caused by two
processes fighting over that particular spinlock.

The patch is very effective at reducing overall cost of taking snapshots
and so I think it needs to be applied to allow more performance data to
be collected. I don't suppose this is the last performance tuning we
will need to do.

--
Simon Riggs www.2ndQuadrant.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 15:37:04
Message-ID: w2n603c8f071004250837hb6b318cez229e9c6a48a90d51@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, Apr 25, 2010 at 8:50 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Sun, 2010-04-25 at 06:53 -0400, Robert Haas wrote:
>> On Sat, Apr 24, 2010 at 5:17 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> >> Both Heikki and I objected to that patch.
>> >
>> > Please explain your objection, based upon the patch and my explanations.
>>
>> Well, we objected to the locking.  Having reread the patch a few times
>> though, I think I'm starting to wrap my head around it so, I don't
>> know, maybe it's OK.  Have you tested grabbing the ProcArrayLock in
>> exclusive mode instead of having a separate spinlock, to see how that
>> performs?
>
> They both perform well. Heikki and I agree that spinlocks perform well.
>
> The point of taking the ProcArrayLock in Shared rather than Exclusive
> mode is the reduction in contention that gives. Adding a
> KnownAssignedXid is exactly analogous to starting a transaction. We go
> to some trouble to avoid taking a lock when we start a transaction and I
> want to do a similar thing on the standby. I think it's even more
> important we reduce contention on the standby than it is on the primary
> because if the Startup process is delayed then it slows down recovery.
>
>> >> And apparently it doesn't
>> >> fix the problem, either.  So, -1 from me.
>> >
>> > There is an issue observed in Erik's later tests, but my interpretation
>> > of the results so far is that the sorted array patch successfully
>> > removes the initially reported loss of performance.
>>
>> Is it possible the remaining spikes are due to fights over the spinlock?
>
> That one specific spinlock? I considered it. I think it's doubtful.
>
> The spikes all involved a jump from <5ms to much more than that, often
> as long as a second. That looks like a pure I/O problem or an LWlock
> held across a slow I/O or platform specific issues of some kind.
>
> Erik's tests were with 1 backend, so that would mean that at most 2
> processes might want the spinlock you mention. The spinlock is only
> taken at snapshot time, so once per transaction in Erik's tests. So if
> the spinlock were the problem, then one or the other process would need
> to grab and hold the spinlock for a very long time and then release it
> sometime later. Since the spinlock code has been with us for
> some time, I doubt there is anything wrong there.
>
> I don't have any evidence as to whether the problem is solely on Erik's
> system, nor whether it is an issue already in the codebase or introduced
> by this patch. It seems unlikely to me that this patch introduces the
> problem, and even more unlikely that the issue is caused by two
> processes fighting over that particular spinlock.
>
> The patch is very effective at reducing overall cost of taking snapshots
> and so I think it needs to be applied to allow more performance data to
> be collected. I don't suppose this is the last performance tuning we
> will need to do.

OK, fair enough. I withdraw my objections.

...Robert


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 15:50:17
Message-ID: 2867.1272210617@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sat, Apr 24, 2010 at 5:17 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> Both Heikki and I objected to that patch.
>>
>> Please explain your objection, based upon the patch and my explanations.

> Well, we objected to the locking.

Not only is the locking overly complex, but it's outright wrong;
or at least the comments are. KnownAssignedXidsAdd in particular
has a comment that is 100% misleading, and an actual behavior that
I find extremely likely to provoke bugs (eg, consider a caller calling
it twice under the illusion it still has only shared lock). I'm not
even convinced that it works correctly, since it will happily write
into KnownAssignedXids[] while holding only shared ProcArrayLock.
What happens when two processes are doing that concurrently?

This needs a redesign before it can be considered committable. I don't
really care whether it makes things faster; it's too complicated and too
poorly documented to be maintainable.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 16:43:57
Message-ID: 3963.1272213837@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> [ v2 patch ]

BTW, while I'm looking at this, why bother with the separate
KnownAssignedXidsValid[] array? Wouldn't it be cheaper
(certainly so in storage, probably so in access/update times)
to have just the KnownAssignedXids[] array and store
InvalidTransactionId in unused entries?

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 16:46:38
Message-ID: 1272213998.4161.1791.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2010-04-25 at 11:50 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Sat, Apr 24, 2010 at 5:17 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> > Both Heikki and I objected to that patch.
> >>
> >> Please explain your objection, based upon the patch and my explanations.
>
> > Well, we objected to the locking.
>
> Not only is the locking overly complex, but it's outright wrong;
> or at least the comments are. KnownAssignedXidsAdd in particular
> has a comment that is 100% misleading, and an actual behavior that
> I find extremely likely to provoke bugs (eg, consider a caller calling
> it twice under the illusion it still has only shared lock).

I will reword that comment.

> I'm not
> even convinced that it works correctly, since it will happily write
> into KnownAssignedXids[] while holding only shared ProcArrayLock.
> What happens when two processes are doing that concurrently?

Heikki and I discussed this upthread. Only the Startup process ever
writes to KnownAssignedXids, so the situation you describe does not happen.
We currently rely on the fact that we have only a single writer in other
places in Hot Standby, so this is not a restriction we are imposing on
the system just to support this change.

Both you and Heikki have previously spoken against having recovery run
in parallel, so the likelihood of having multiple readers in the future
seems low.

If we were to allow multiple callers of the routine in future, the use
of spinlocks to protect the head of the queue means we can safely
reserve slots and so KnownAssignedXidsAdd can easily be made to accept
multiple writers in the future. I think it's possible to future-proof
that now without significant change or performance drop, so I'll do
that.

> This needs a redesign before it can be considered committable. I don't
> really care whether it makes things faster; it's too complicated and too
> poorly documented to be maintainable.

It is sufficiently complex to achieve its objective, which is keeping
track of xids during recovery, using the same locking as we do during
normal running. This is the third re-write of the code, developed over
more than 20 months and specifically modularised by me to make future
changes drop-in replacements. Initially we had an insert-sorted array,
then a hash table and now this current proposal. Recently, Heikki and I
independently came up with almost exactly the same design, with the
algorithmic characteristics we need for performance. The call points,
frequency and workload of calls are documented and well understood by
multiple people.

The only remaining point of discussion has been the code that takes
ProcArrayLock in shared mode. The difference being discussed here is
whether we access the head and tail of the array with Shared mode plus
the spinlock, or just Exclusive. The patch would be almost identical with
that change, but would not meet its locking objectives. So making the
locking stricter isn't going to make it more maintainable by a
measurable amount.

There are more than 60 lines of header comment explaining in detail how
this works, with a full algorithmic analysis. The remaining code is
commented to project standards, with easily more than 100 lines of
comments.

I will recheck every comment and see if other comments can be usefully
added, and review the code again to see if anything more can be
improved. I was going to do this anyway, since I will be adding an
agreed Assert() into the patch.

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 16:51:58
Message-ID: 4123.1272214318@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Sun, 2010-04-25 at 11:50 -0400, Tom Lane wrote:
>> This needs a redesign before it can be considered committable. I don't
>> really care whether it makes things faster; it's too complicated and too
>> poorly documented to be maintainable.

> There are more than 60 lines of header comment explaining in detail how
> this works, with a full algorithmic analysis. The remaining code is
> commented to project standards, with easily more than 100 lines of
> comments.

If the comments were correct, I wouldn't be complaining. They're
misleading or outright wrong on many points. In particular, I don't
think you actually understand the weak-memory-ordering issue, because
the comments about that are entirely wrong. I don't think they are
adequate on other points either --- for example, I now understand why my
complaint of a few minutes ago about KnownAssignedXidsValid is wrong,
but the comments certainly didn't help me on that.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 16:54:59
Message-ID: 1272214499.4161.1808.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2010-04-25 at 12:43 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > [ v2 patch ]
>
> BTW, while I'm looking at this, why bother with the separate
> KnownAssignedXidsValid[] array? Wouldn't it be cheaper
> (certainly so in storage, probably so in access/update times)
> to have just the KnownAssignedXids[] array and store
> InvalidTransactionId in unused entries?

Well, that was exactly my first design.

Heikki came up with the additional array and I think it is an inspired
suggestion, because it allows the search() to use a simple binary
search. I attempted to write the binary-search-with-holes and it was
going to be very ugly code, so I went for another algorithm, which was
also quite cool, but not as cool as Heikki's idea - for which I was
going to give credit.

The array adds 520*max_connections bytes of shared memory, but it seems
a good tradeoff for the additional performance it allows. I guess we
could use a bit array, but that's slower to access. The bit array would
be 9*max_connections bytes of shared memory, so a 40kB saving.
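
To illustrate why the parallel array keeps the search simple: entries
are never moved on removal, only flagged invalid, so the xids stay
physically sorted and a plain binary search still works. A sketch (my
names, ignoring circular-buffer wraparound for brevity):

/*
 * Sketch only: find an xid in the sorted KnownAssignedXids[] window.
 * Removal just clears the Valid flag, so ordering is preserved and no
 * binary-search-with-holes logic is needed.
 */
static int
KnownAssignedXidsFind(TransactionId xid, int tail, int head)
{
    int     low = tail;
    int     high = head - 1;

    while (low <= high)
    {
        int     mid = low + (high - low) / 2;

        if (TransactionIdEquals(KnownAssignedXids[mid], xid))
            return KnownAssignedXidsValid[mid] ? mid : -1;
        if (TransactionIdPrecedes(KnownAssignedXids[mid], xid))
            low = mid + 1;
        else
            high = mid - 1;
    }
    return -1;                  /* not present */
}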

--
Simon Riggs www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 17:00:58
Message-ID: 1272214858.4161.1819.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2010-04-25 at 12:51 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > On Sun, 2010-04-25 at 11:50 -0400, Tom Lane wrote:
> >> This needs a redesign before it can be considered committable. I don't
> >> really care whether it makes things faster; it's too complicated and too
> >> poorly documented to be maintainable.
>
> > There are more than 60 lines of header comment explaining in detail how
> > this works, with a full algorithmic analysis. The remaining code is
> > commented to project standards, with easily more than 100 lines of
> > comments.
>
> If the comments were correct, I wouldn't be complaining. They're
> misleading or outright wrong on many points. In particular, I don't
> think you actually understand the weak-memory-ordering issue, because
> the comments about that are entirely wrong.

The comments says "on CPUs with
+ * weak-memory ordering we can't reliably move pointers atomically, so
the
+ * rule is that updates of head and tail of the array require
ProcArrayLock
+ * in exclusive mode or (shared mode and known_assigned_xids_lck
spinlock)"

I will reword this, so it is clear that I'm talking about the head and
tail of the array, not pointers in general.

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 17:33:24
Message-ID: 4927.1272216804@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Sun, 2010-04-25 at 12:51 -0400, Tom Lane wrote:
>> If the comments were correct, I wouldn't be complaining. They're
>> misleading or outright wrong on many points. In particular, I don't
>> think you actually understand the weak-memory-ordering issue, because
>> the comments about that are entirely wrong.

> The comments says "on CPUs with
> + * weak-memory ordering we can't reliably move pointers atomically, so
> the
> + * rule is that updates of head and tail of the array require
> ProcArrayLock
> + * in exclusive mode or (shared mode and known_assigned_xids_lck
> spinlock)"

> I will reword this, so it is clear that I'm talking about the head and
> tail of the array, not pointers in general.

It's not about whether the pointers can be assigned atomically; on most
hardware they can. It's about whether other processors will see that
happen in the correct sequence relative to the changes in the array
elements.
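
Concretely (an illustrative fragment, not code from the patch;
pg_write_barrier() stands here for whatever ordering primitive the
platform needs -- in the patch the spinlock's acquire/release semantics
play that role):

KnownAssignedXids[head] = xid;      /* (1) store the array element */
pg_write_barrier();                 /* force (1) visible before (2) */
procArray->headKnownAssignedXids = head + 1;    /* (2) publish */

Without the barrier, a weak-memory-ordering CPU may let another
processor observe (2) before (1), and a reader that sees the new head
can then fetch a stale, garbage value from KnownAssignedXids[head].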

If you like I'll have a go at rewriting the comments for this patch,
because I am currently thinking that the problem is not so much with
the code as with the poor explanation of what it's doing. Sometimes
the author is too close to the code to understand why other people
have a hard time understanding it.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 17:45:02
Message-ID: 1272217502.4161.1865.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2010-04-25 at 13:33 -0400, Tom Lane wrote:

> If you like I'll have a go at rewriting the comments for this patch,
> because I am currently thinking that the problem is not so much with
> the code as with the poor explanation of what it's doing. Sometimes
> the author is too close to the code to understand why other people
> have a hard time understanding it.

That would help me, thank you.

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 17:51:00
Message-ID: 5699.1272217860@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Sun, 2010-04-25 at 13:33 -0400, Tom Lane wrote:
>> If you like I'll have a go at rewriting the comments for this patch,
>> because I am currently thinking that the problem is not so much with
>> the code as with the poor explanation of what it's doing. Sometimes
>> the author is too close to the code to understand why other people
>> have a hard time understanding it.

> That would help me, thank you.

OK. You said you were currently working some more on the patch, so
I'll wait for v3 and then work on it.

regards, tom lane


From: "Erik Rijkers" <er(at)xs4all(dot)nl>
To: "Erik Rijkers" <er(at)xs4all(dot)nl>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 18:25:16
Message-ID: ead881fc14bfcfb0d5a344458f2a2afb.squirrel@webmail.xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, April 24, 2010 01:17, Erik Rijkers wrote:
> On Sat, April 24, 2010 00:39, Simon Riggs wrote:
>> On Fri, 2010-04-23 at 11:32 -0400, Robert Haas wrote:
>>> >
>>> > 99% of transactions happen in similar times between primary and standby,
>>> > everything dragged down by rare but severe spikes.
>>> >
>>> > We're looking for something that would delay something that normally
>>> > takes <0.1ms into something that takes >100ms, yet does eventually
>>> > return. That looks like a severe resource contention issue.
>>>
>>> Wow. Good detective work.
>>
>> While we haven't fully established the source of those problems, I am
>> now happy that these test results don't present any reason to avoid
>> committing the main patch tested by Erik (not the smaller additional one
>> I sent). I expect to commit that on Sunday.
>>
>
> yes, that (main) patch seems to have largely
> closed the gap between primary and standby; here
> are some results from a lower scale (10):

FWIW, here are some more results from pgbench comparing
primary and standby (both with Simon's patch).

Sorry if it's too much data, but to me at least it was illuminating;
I now understand the effects of the different parameters better.

pgbench results, all -S (SELECT-only).

scale: 10, 100, 500, 1000
clients: 1, 10, 20, 40, 60, 90

for each: 4x primary, 4x standby, for 15 minutes:

(port 6565=primary, 6566=standby)

scale: 10 clients: 1 tps = 6404.768132 pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 6435.072983 pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 6458.958549 pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 6432.088099 pgbench -p 6565 -n -S -c 1 -T 900 -j 1

scale: 10 clients: 1 tps = 6442.519035 pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 6449.837931 pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 6419.879026 pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 10 clients: 1 tps = 6428.290175 pgbench -p 6566 -n -S -c 1 -T 900 -j 1

---
scale: 10 clients: 10 tps = 27624.339871 pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 27604.261750 pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 28015.093466 pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 28422.561280 pgbench -p 6565 -n -S -c 10 -T 900 -j 1

scale: 10 clients: 10 tps = 27254.806526 pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 27686.470866 pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 28078.904035 pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 10 clients: 10 tps = 27101.622337 pgbench -p 6566 -n -S -c 10 -T 900 -j 1

----
scale: 10 clients: 20 tps = 23106.795587 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 23101.681155 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 22893.364004 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 23038.577109 pgbench -p 6565 -n -S -c 20 -T 900 -j 1

scale: 10 clients: 20 tps = 22903.578552 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 22970.691946 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 22999.473318 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 10 clients: 20 tps = 22884.854749 pgbench -p 6566 -n -S -c 20 -T 900 -j 1

----
scale: 10 clients: 40 tps = 23522.499429 pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 10 clients: 40 tps = 23611.319191 pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 10 clients: 40 tps = 23616.905302 pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 10 clients: 40 tps = 23572.213990 pgbench -p 6565 -n -S -c 40 -T 900 -j 1

scale: 10 clients: 40 tps = 23714.721220 pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 10 clients: 40 tps = 23711.781175 pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 10 clients: 40 tps = 23691.867023 pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 10 clients: 40 tps = 23691.699231 pgbench -p 6566 -n -S -c 40 -T 900 -j 1

----
scale: 10 clients: 60 tps = 21987.497095 pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 10 clients: 60 tps = 21950.344204 pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 10 clients: 60 tps = 22006.461447 pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 10 clients: 60 tps = 21824.071303 pgbench -p 6565 -n -S -c 60 -T 900 -j 1

scale: 10 clients: 60 tps = 22149.415231 pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 10 clients: 60 tps = 22211.064402 pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 10 clients: 60 tps = 22164.238081 pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 10 clients: 60 tps = 22174.585736 pgbench -p 6566 -n -S -c 60 -T 900 -j 1

----
scale: 10 clients: 90 tps = 18751.213002 pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 10 clients: 90 tps = 18757.115811 pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 10 clients: 90 tps = 18692.942329 pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 10 clients: 90 tps = 18765.390154 pgbench -p 6565 -n -S -c 90 -T 900 -j 1

scale: 10 clients: 90 tps = 18929.462104 pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 10 clients: 90 tps = 18999.851184 pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 10 clients: 90 tps = 18972.321607 pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 10 clients: 90 tps = 18924.058827 pgbench -p 6566 -n -S -c 90 -T 900 -j 1

----
scale: 100 clients: 1 tps = 5854.489684 pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 5852.679267 pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 5844.005437 pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 5867.593004 pgbench -p 6565 -n -S -c 1 -T 900 -j 1

scale: 100 clients: 1 tps = 1111.368215 pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 1187.678163 pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 1332.544440 pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 100 clients: 1 tps = 1488.574446 pgbench -p 6566 -n -S -c 1 -T 900 -j 1

----
scale: 100 clients: 10 tps = 28948.718439 pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 28865.280357 pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 28819.426232 pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 28868.070716 pgbench -p 6565 -n -S -c 10 -T 900 -j 1

scale: 100 clients: 10 tps = 1839.477133 pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 2056.574852 pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 2501.715270 pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 100 clients: 10 tps = 3114.288871 pgbench -p 6566 -n -S -c 10 -T 900 -j 1

----
scale: 100 clients: 20 tps = 21561.395664 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 21482.860950 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 21447.896216 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 21450.846824 pgbench -p 6565 -n -S -c 20 -T 900 -j 1

scale: 100 clients: 20 tps = 4309.312692 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 6786.460948 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 16552.535443 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 100 clients: 20 tps = 21513.519274 pgbench -p 6566 -n -S -c 20 -T 900 -j 1

----
scale: 100 clients: 40 tps = 20812.709326 pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 100 clients: 40 tps = 20750.911233 pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 100 clients: 40 tps = 20764.921432 pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 100 clients: 40 tps = 20877.705726 pgbench -p 6565 -n -S -c 40 -T 900 -j 1

scale: 100 clients: 40 tps = 20818.028682 pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 100 clients: 40 tps = 20876.921640 pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 100 clients: 40 tps = 20912.855749 pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 100 clients: 40 tps = 20821.271974 pgbench -p 6566 -n -S -c 40 -T 900 -j 1

----
scale: 100 clients: 60 tps = 18746.870683 pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 100 clients: 60 tps = 18752.631011 pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 100 clients: 60 tps = 18717.634243 pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 100 clients: 60 tps = 18776.138286 pgbench -p 6565 -n -S -c 60 -T 900 -j 1

scale: 100 clients: 60 tps = 18850.133061 pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 100 clients: 60 tps = 18893.463427 pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 100 clients: 60 tps = 18918.599124 pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 100 clients: 60 tps = 18903.389147 pgbench -p 6566 -n -S -c 60 -T 900 -j 1

----
scale: 100 clients: 90 tps = 15915.637099 pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 100 clients: 90 tps = 15945.810305 pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 100 clients: 90 tps = 15894.238941 pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 100 clients: 90 tps = 15926.422587 pgbench -p 6565 -n -S -c 90 -T 900 -j 1

scale: 100 clients: 90 tps = 16025.802748 pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 100 clients: 90 tps = 16049.977788 pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 100 clients: 90 tps = 16053.612396 pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 100 clients: 90 tps = 16040.376994 pgbench -p 6566 -n -S -c 90 -T 900 -j 1

----
scale: 500 clients: 1 tps = 5646.892935 pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 5642.544001 pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 5633.282536 pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 5627.330477 pgbench -p 6565 -n -S -c 1 -T 900 -j 1

scale: 500 clients: 1 tps = 5625.659921 pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 5645.898120 pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 5644.897739 pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 500 clients: 1 tps = 5640.990172 pgbench -p 6566 -n -S -c 1 -T 900 -j 1

----
scale: 500 clients: 10 tps = 28123.467099 pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 28230.031309 pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 28054.072374 pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 28129.229553 pgbench -p 6565 -n -S -c 10 -T 900 -j 1

scale: 500 clients: 10 tps = 28187.938585 pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 28372.950289 pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 28161.909825 pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 500 clients: 10 tps = 28224.336096 pgbench -p 6566 -n -S -c 10 -T 900 -j 1

----
scale: 500 clients: 20 tps = 20932.340860 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 20990.560386 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 20926.001588 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 20903.012936 pgbench -p 6565 -n -S -c 20 -T 900 -j 1

scale: 500 clients: 20 tps = 20956.917213 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 21007.634021 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 20982.386759 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 500 clients: 20 tps = 21002.399320 pgbench -p 6566 -n -S -c 20 -T 900 -j 1

----
scale: 500 clients: 40 tps = 19829.483373 pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 500 clients: 40 tps = 19830.187082 pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 500 clients: 40 tps = 19757.072979 pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 500 clients: 40 tps = 19734.710111 pgbench -p 6565 -n -S -c 40 -T 900 -j 1

scale: 500 clients: 40 tps = 19862.051804 pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 500 clients: 40 tps = 19834.071543 pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 500 clients: 40 tps = 19885.788466 pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 500 clients: 40 tps = 19961.407411 pgbench -p 6566 -n -S -c 40 -T 900 -j 1

----
scale: 500 clients: 60 tps = 17829.256242 pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 500 clients: 60 tps = 17816.152137 pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 500 clients: 60 tps = 17828.589859 pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 500 clients: 60 tps = 17748.044912 pgbench -p 6565 -n -S -c 60 -T 900 -j 1

scale: 500 clients: 60 tps = 17926.708964 pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 500 clients: 60 tps = 17943.758910 pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 500 clients: 60 tps = 17917.713556 pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 500 clients: 60 tps = 17932.597396 pgbench -p 6566 -n -S -c 60 -T 900 -j 1

----
scale: 500 clients: 90 tps = 15044.660043 pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 500 clients: 90 tps = 15010.685976 pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 500 clients: 90 tps = 15045.828142 pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 500 clients: 90 tps = 14976.636159 pgbench -p 6565 -n -S -c 90 -T 900 -j 1

scale: 500 clients: 90 tps = 15154.402300 pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 500 clients: 90 tps = 15131.532084 pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 500 clients: 90 tps = 15154.148091 pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 500 clients: 90 tps = 15130.337148 pgbench -p 6566 -n -S -c 90 -T 900 -j 1

----
scale: 1000 clients: 1 tps = 222.307686 pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 261.163445 pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 307.343862 pgbench -p 6565 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 367.285821 pgbench -p 6565 -n -S -c 1 -T 900 -j 1

scale: 1000 clients: 1 tps = 550.457516 pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 4347.697078 pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 5601.036491 pgbench -p 6566 -n -S -c 1 -T 900 -j 1
scale: 1000 clients: 1 tps = 5604.021799 pgbench -p 6566 -n -S -c 1 -T 900 -j 1

----
scale: 1000 clients: 10 tps = 3370.534977 pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 23156.038162 pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 28029.755911 pgbench -p 6565 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 28179.609982 pgbench -p 6565 -n -S -c 10 -T 900 -j 1

scale: 1000 clients: 10 tps = 2880.404087 pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 13593.595875 pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 28033.649640 pgbench -p 6566 -n -S -c 10 -T 900 -j 1
scale: 1000 clients: 10 tps = 28115.126500 pgbench -p 6566 -n -S -c 10 -T 900 -j 1

----
scale: 1000 clients: 20 tps = 4876.882880 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 21262.273826 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 20403.533099 pgbench -p 6565 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 20753.634216 pgbench -p 6565 -n -S -c 20 -T 900 -j 1

scale: 1000 clients: 20 tps = 3024.129307 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 18786.862049 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 20807.372965 pgbench -p 6566 -n -S -c 20 -T 900 -j 1
scale: 1000 clients: 20 tps = 20896.373925 pgbench -p 6566 -n -S -c 20 -T 900 -j 1

----
scale: 1000 clients: 40 tps = 5910.319107 pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 1000 clients: 40 tps = 19705.088545 pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 1000 clients: 40 tps = 19658.237802 pgbench -p 6565 -n -S -c 40 -T 900 -j 1
scale: 1000 clients: 40 tps = 19593.955166 pgbench -p 6565 -n -S -c 40 -T 900 -j 1

scale: 1000 clients: 40 tps = 6555.937102 pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 1000 clients: 40 tps = 19845.691237 pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 1000 clients: 40 tps = 19755.112514 pgbench -p 6566 -n -S -c 40 -T 900 -j 1
scale: 1000 clients: 40 tps = 19765.648969 pgbench -p 6566 -n -S -c 40 -T 900 -j 1

----
scale: 1000 clients: 60 tps = 7104.451848 pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 1000 clients: 60 tps = 17638.393625 pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 1000 clients: 60 tps = 17621.464080 pgbench -p 6565 -n -S -c 60 -T 900 -j 1
scale: 1000 clients: 60 tps = 17538.189171 pgbench -p 6565 -n -S -c 60 -T 900 -j 1

scale: 1000 clients: 60 tps = 6721.909857 pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 1000 clients: 60 tps = 17798.542731 pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 1000 clients: 60 tps = 17797.145061 pgbench -p 6566 -n -S -c 60 -T 900 -j 1
scale: 1000 clients: 60 tps = 17739.437864 pgbench -p 6566 -n -S -c 60 -T 900 -j 1

----
scale: 1000 clients: 90 tps = 6682.605788 pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 1000 clients: 90 tps = 14853.076617 pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 1000 clients: 90 tps = 14851.040207 pgbench -p 6565 -n -S -c 90 -T 900 -j 1
scale: 1000 clients: 90 tps = 14859.575164 pgbench -p 6565 -n -S -c 90 -T 900 -j 1

scale: 1000 clients: 90 tps = 6208.586575 pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 1000 clients: 90 tps = 15065.536476 pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 1000 clients: 90 tps = 15033.388676 pgbench -p 6566 -n -S -c 90 -T 900 -j 1
scale: 1000 clients: 90 tps = 15030.808207 pgbench -p 6566 -n -S -c 90 -T 900 -j 1

Hardware:
2x Xeon (quadcore) 5482; 32 GB; Areca 1280ML
CentOS release 5.4 (Final), x86_64 Linux, 2.6.18-164.el5

Both instances are on a RAID volume of 12 disks (SATA, 7200 rpm)
(and of course, the server was otherwise idle)

hth,

Erik Rijkers


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Erik Rijkers" <er(at)xs4all(dot)nl>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 18:55:02
Message-ID: 6906.1272221702@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Erik Rijkers" <er(at)xs4all(dot)nl> writes:
> FWIW, here are some more results from pgbench comparing
> primary and standby (both with Simon's patch).

That seems weird. Why do most of the runs show primary and standby
as having comparable speed, but a few show the standby as much slower?
The parameters for those runs don't seem obviously different from cases
where it's fast. I think there might have been something else going on
on the standby during those runs. Or do you think those represent
cases where the mystery slowdown event happened?

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 19:07:28
Message-ID: 1272222448.4161.1959.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2010-04-25 at 20:25 +0200, Erik Rijkers wrote:

> Sorry if it's too much data, but to me at least it was illuminating;
> I now understand the effects of the different parameters better.

That's great, many thanks.

A few observations

* Standby performance is actually slightly above normal running. This is
credible because of the way snapshots are now taken. We don't need to
scan the procarray looking for write transactions, since we know
everything is read-only. So we scan just the KnownAssignedXids array,
which, if there is no activity from the primary, will be zero-length, so
snapshots will actually get taken much faster in this case on the
standby. The snapshot cost on the standby is O(n), where n is the number
of write transactions "currently" on the primary (transfer delays blur
the word "currently"); see the sketch after these points.

* The results for scale factor < 100 are fine, and the results for > 100
with few connections get thrown off by long transaction times. With
larger numbers of connections the wait problems seem to go away. It
looks like Erik (and possibly Hot Standby in general) has an I/O
problem, though "from what" is not yet determined. It could be just
hardware, or might be hardware plus other factors.
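
On the first point, the standby's snapshot loop reduces to a single pass
over the KnownAssignedXids window -- roughly this (an illustrative
sketch, not the patch text):

int     count = 0;
int     i;

for (i = tail; i < head; i++)
{
    if (KnownAssignedXidsValid[i])
        snapshot->xip[count++] = KnownAssignedXids[i];
}
snapshot->xcnt = count;

With a quiet primary, head == tail and the loop body never executes,
which is why the standby can take snapshots slightly faster than the
primary does.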

--
Simon Riggs www.2ndQuadrant.com


From: "Erik Rijkers" <er(at)xs4all(dot)nl>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 21:52:19
Message-ID: 32176e37d35c69ee4a2295a2a4b08812.squirrel@webmail.xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, April 25, 2010 20:55, Tom Lane wrote:
>
> That seems weird. Why do most of the runs show primary and standby
> as having comparable speed, but a few show the standby as much slower?
> The parameters for those runs don't seem obviously different from cases
> where it's fast. I think there might have been something else going on
> on the standby during those runs. Or do you think those represent
> cases where the mystery slowdown event happened?
>

The strange case is the scale-100 standby's slow start, followed by
a steady increase during -c 1, then -c 10, finally getting up to speed
with -c 20 (and up). And these slow-but-growing standby series are
interspersed with normal (high-speed) primary series.

I'll try to repeat this pattern on other hardware; although if my tests
were run on faulty hardware, I wouldn't know how or why that would give
the above effect (such a 'regular aberration').

testing is more difficult than I thought...

Erik Rijkers


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-25 23:18:48
Message-ID: 20633.1272237528@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> [ v2 patch ]

I've been studying this some more while making notes for improved
comments, and I've about come to the conclusion that having readers
move the tail pointer (at the end of KnownAssignedXidsGetAndSetXmin)
is overly tricky and probably not a performance improvement anyway.
The code is in fact wrong as it stands: it's off-by-one about setting
the new tail value. And there's potential for contention with multiple
readers all wanting to move the tail pointer at once. And most
importantly, KnownAssignedXidsSearch can't move the tail pointer so
we might expend many inefficient searches while never moving the tail
pointer.

I think we should get rid of that and just have the two functions that
can mark entries invalid (which they must do with exclusive lock)
advance the tail pointer when they invalidate the current tail element.
Then we have the very simple rule that only the startup process ever
changes this data structure.
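
In sketch form (my illustration of the proposal, not committed code):

/*
 * Only the startup process runs this, holding ProcArrayLock in
 * exclusive mode: flag the entry invalid, then slide the tail past
 * any leading invalid slots so no reader ever needs to move it.
 */
KnownAssignedXidsValid[slot] = false;       /* invalidate the entry */

while (tail < head && !KnownAssignedXidsValid[tail])
    tail++;                                 /* skip over dead slots */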

regards, tom lane


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-26 06:52:51
Message-ID: x2n3f0b79eb1004252352pe7892a9eo2f0111eee55dfe3@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Apr 26, 2010 at 3:25 AM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:
> FWIW, here are some more results from pgbench comparing
> primary and standby (both with Simon's patch).

Was there a difference in CPU utilization between the primary
and standby?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-26 07:21:35
Message-ID: 1272266495.4161.2852.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2010-04-25 at 19:18 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > [ v2 patch ]
>
> I've been studying this some more while making notes for improved
> comments, and I've about come to the conclusion that having readers
> move the tail pointer (at the end of KnownAssignedXidsGetAndSetXmin)
> is overly tricky and probably not a performance improvement anyway.
> The code is in fact wrong as it stands: it's off-by-one about setting
> the new tail value. And there's potential for contention with multiple
> readers all wanting to move the tail pointer at once.

OK, since contention was my concern, I want to avoid that.

> And most
> importantly, KnownAssignedXidsSearch can't move the tail pointer so
> we might expend many inefficient searches while never moving the tail
> pointer.

> I think we should get rid of that and just have the two functions that
> can mark entries invalid (which they must do with exclusive lock)
> advance the tail pointer when they invalidate the current tail element.

OK

> Then we have the very simple rule that only the startup process ever
> changes this data structure.

--
Simon Riggs www.2ndQuadrant.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-26 07:43:09
Message-ID: 1272267789.4161.2919.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2010-04-25 at 23:52 +0200, Erik Rijkers wrote:

> I'll try to repeat this pattern on other hardware; although
> if my tests were run with faulty hardware I wouldn't know how/why
> that would give the above effect (such a 'regular aberration').

> testing is more difficult than I thought...

Thanks again for your help.

Please can you confirm:
* Are the standby tests run while the primary is completely quiet?
* What OS is this? Can we use dtrace scripts?

Can anyone else confirm these test results: large scale factor and small
number of sessions?

--
Simon Riggs www.2ndQuadrant.com


From: "Erik Rijkers" <er(at)xs4all(dot)nl>
To: "Fujii Masao" <masao(dot)fujii(at)gmail(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-26 15:01:41
Message-ID: d1edfc5f47b094247325a14d08bc60a0.squirrel@webmail.xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, April 26, 2010 08:52, Fujii Masao wrote:
> On Mon, Apr 26, 2010 at 3:25 AM, Erik Rijkers <er(at)xs4all(dot)nl> wrote:
>> FWIW, here are some more results from pgbench comparing
>> primary and standby (both with Simon's patch).
>
> Was there a difference in CPU utilization between the primary
> and standby?
>

I haven't monitored it.


From: "Erik Rijkers" <er(at)xs4all(dot)nl>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-26 15:04:09
Message-ID: 59f35a8ab4d22cb011d2fe8acdee7637.squirrel@webmail.xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, April 26, 2010 09:43, Simon Riggs wrote:
> On Sun, 2010-04-25 at 23:52 +0200, Erik Rijkers wrote:
>
>> I'll try to repeat this pattern on other hardware; although
>> if my tests were run with faulty hardware I wouldn't know how/why
>> that would give the above effect (such a 'regular aberration').
>
>> testing is more difficult than I thought...
>
> Thanks again for your help.
>
> Please can you confirm:
> * Are the standby tests run while the primary is completely quiet?

Autovacuum was on, which is probably not a good idea - I'll try a few runs without it.

> * What OS is this? Can we use dtrace scripts?

Centos 5.4.


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-27 02:35:22
Message-ID: 1272335722.4161.5971.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sun, 2010-04-25 at 13:51 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > On Sun, 2010-04-25 at 13:33 -0400, Tom Lane wrote:
> >> If you like I'll have a go at rewriting the comments for this patch,
> >> because I am currently thinking that the problem is not so much with
> >> the code as with the poor explanation of what it's doing. Sometimes
> >> the author is too close to the code to understand why other people
> >> have a hard time understanding it.
>
> > That would help me, thank you.
>
> OK. You said you were currently working some more on the patch, so
> I'll wait for v3 and then work on it.

v3 attached

Changes:
* Strange locking in KnownAssignedXidsAdd() moved to RecordKnown...
* KnownAssignedXidsAdd() reordered, assert-ish code added
* Tail movement during snapshots no longer possible
* Tail movement during xid removal added to KnownAssignedXidsSearch()
* Major comment hacking

It's a little rough and definitely needs a re-read of all the comments, so
it's a good time to send it over.

--
Simon Riggs www.2ndQuadrant.com

Attachment Content-Type Size
knownassigned_sortedarray.v3.patch text/x-patch 35.5 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-27 02:37:18
Message-ID: 20626.1272335838@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> v3 attached

Thanks, will work on this tomorrow.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-27 17:52:50
Message-ID: 20854.1272390770@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> v3 attached

This patch changes KnownAssignedXidsRemove() so that failure to find
the target XID is elog(ERROR) (ie, a PANIC, since this is in the
startup process). However, this comment is still there:

/*
* We can fail to find an xid if the xid came from a subtransaction that
* aborts, though the xid hadn't yet been reported and no WAL records have
* been written using the subxid. In that case the abort record will
* contain that subxid and we haven't seen it before.
*/

WTF? Either the comment is wrong or this should not be an elog
condition.
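
For concreteness, here is roughly the shape of the code as I read it. This
is a sketch only, not the patch text; I'm assuming KnownAssignedXidsSearch(xid, true)
reports whether the xid was found (and removed):

static void
KnownAssignedXidsRemove(TransactionId xid)
{
    if (!KnownAssignedXidsSearch(xid, true))
    {
        /*
         * In the startup process elog(ERROR) is promoted to PANIC, so a
         * legitimately-missing subxid (per the comment above) would abort
         * recovery right here.
         */
        elog(ERROR, "cannot remove KnownAssignedXid %u", xid);
    }
}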

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-27 18:13:33
Message-ID: 1272392013.4161.7882.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-04-27 at 13:52 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > v3 attached
>
> This patch changes KnownAssignedXidsRemove() so that failure to find
> the target XID is elog(ERROR) (ie, a PANIC, since this is in the
> startup process).

Not in all cases. The code is correct, as far as I am aware from
testing.

> However, this comment is still there:
> /*
> * We can fail to find an xid if the xid came from a subtransaction that
> * aborts, though the xid hadn't yet been reported and no WAL records have
> * been written using the subxid. In that case the abort record will
> * contain that subxid and we haven't seen it before.
> */
>
> WTF? Either the comment is wrong or this should not be an elog
> condition.

That section of code has been rewritten many times. I think it is now
inaccurate and should be removed. I left it there because the
unfortunate history of the project has been the removal of comments and
then later rediscovery of the truth, sometimes more than once. I could
no longer reproduce that error; someone else may know differently.

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-27 18:53:16
Message-ID: 21841.1272394396@sss.pgh.pa.us
Lists: pgsql-hackers

Hmm ... there's another point here, which is that the array size creates
a hard maximum on the number of entries, whereas the hash table was a
bit more forgiving. What is the proof that the array won't overflow?
The fact that the equivalent data structure on the master can't hold
more than this many entries doesn't seem to me to prove that, because
we will add intermediate not-observed XIDs to the array.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-27 19:29:13
Message-ID: 1272396553.4161.8043.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-04-27 at 14:53 -0400, Tom Lane wrote:
> Hmm ... there's another point here, which is that the array size
> creates
> a hard maximum on the number of entries, whereas the hash table was a
> bit more forgiving. What is the proof that the array won't overflow?
> The fact that the equivalent data structure on the master can't hold
> more than this many entries doesn't seem to me to prove that, because
> we will add intermediate not-observed XIDs to the array.

We know that not-observed xids have actually been allocated on the
primary. We log an assignment record every 64 subtransactions, so that
the peak size of the array is 65 xids per connection.

It's possible for xids to stay in the array for longer, in the event of
a FATAL error that doesn't log an abort record. We clean those up every
checkpoint, if they exist. The potential number of them is unbounded, so
making special allowance for them doesn't remove the theoretical risk.
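
To make the arithmetic concrete, a standalone toy calculation (the constant
name echoes the procarray code, but this program is an illustration, not the
server source):

#include <stdio.h>

#define PGPROC_MAX_CACHED_SUBXIDS 64   /* an assignment record is logged past this */

int
main(void)
{
    int max_connections = 100;          /* as in the tests in this thread */

    /* at most 64 unreported subxids plus the top-level xid per session */
    int per_session = PGPROC_MAX_CACHED_SUBXIDS + 1;

    printf("peak KnownAssignedXids entries: %d\n",
           per_session * max_connections);      /* prints 6500 */
    return 0;
}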

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-27 20:18:53
Message-ID: 23207.1272399533@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Tue, 2010-04-27 at 13:52 -0400, Tom Lane wrote:
>> WTF? Either the comment is wrong or this should not be an elog
>> condition.

> That section of code has been rewritten many times. I think it is now
> inaccurate and should be removed. I left it there because the
> unfortunate history of the project has been the removal of comments and
> then later rediscovery of the truth, sometimes more than once. I could
> no longer reproduce that error; someone else may know differently.

I haven't tested this, but it appears to me that the failure would occur
in overflow situations. If we have too many subxacts, we'll generate
XLOG_XACT_ASSIGNMENT, which will cause the subxids to be removed from
KnownAssignedXids[]. Then later when the top-level xact commits or
aborts we'll try to remove them again as a consequence of processing
the top-level's commit/abort record. No?
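
Here's a standalone toy model of that sequence (just the mechanism, not the
server's sorted-array code), showing why a strict removal can't be right:

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Toy stand-in for KnownAssignedXids[]; ordering doesn't matter here. */
static TransactionId known[256];
static int nknown = 0;

static void
add_xid(TransactionId xid)
{
    known[nknown++] = xid;
}

/* Returns false when the xid isn't present, like a search miss. */
static bool
remove_xid(TransactionId xid)
{
    for (int i = 0; i < nknown; i++)
    {
        if (known[i] == xid)
        {
            known[i] = known[--nknown];
            return true;
        }
    }
    return false;
}

int
main(void)
{
    /* Top-level xid 100 with subxids 101..164 observed during redo. */
    for (TransactionId x = 100; x <= 164; x++)
        add_xid(x);

    /* Redo of XLOG_XACT_ASSIGNMENT strips the 64 subxids. */
    for (TransactionId x = 101; x <= 164; x++)
        remove_xid(x);

    /* The commit/abort record lists those subxids again ... */
    int misses = 0;
    for (TransactionId x = 101; x <= 164; x++)
        if (!remove_xid(x))
            misses++;

    remove_xid(100);            /* the top-level xid is still found */
    printf("%d second removals missed\n", misses);      /* prints 64 */
    return 0;
}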

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-27 20:44:32
Message-ID: 1272401072.4161.8226.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-04-27 at 16:18 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > On Tue, 2010-04-27 at 13:52 -0400, Tom Lane wrote:
> >> WTF? Either the comment is wrong or this should not be an elog
> >> condition.
>
> > That section of code has been rewritten many times. I think it is now
> > inaccurate and should be removed. I left it there because the
> > unfortunate history of the project has been the removal of comments and
> > then later rediscovery of the truth, sometimes more than once. I could
> > no longer reproduce that error; someone else may know differently.
>
> I haven't tested this, but it appears to me that the failure would occur
> in overflow situations. If we have too many subxacts, we'll generate
> XLOG_XACT_ASSIGNMENT, which will cause the subxids to be removed from
> KnownAssignedXids[]. Then later when the top-level xact commits or
> aborts we'll try to remove them again as a consequence of processing
> the top-level's commit/abort record. No?

Yes, thank you for clear thinking. Anyway, looks like the comment was
right after all and the new code to throw an error is wrong in some
cases. It was useful for testing, at least. The comment was slightly
misleading, which is a good reason to rewrite it.

It seems like it might be possible to identify which xids could cause an
error and which won't, but it's much harder than that. We still have the
possible case where we have >64 subtransactions allocated but many of them
abort, and we are left with a final commit of <64 subtransactions.

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-27 21:24:48
Message-ID: 24325.1272403488@sss.pgh.pa.us
Lists: pgsql-hackers

Isn't the snapshotOldestActiveXid filter in
RecordKnownAssignedTransactionIds completely wrong/useless/bogus?

AFAICS, snapshotOldestActiveXid is only set once at the start of
recovery. This means it will soon be too old to provide any useful
filtering. But what's far worse is that the XID space will eventually
wrap around, and that test will start filtering *everything*.

I think we should just lose that test, as well as the variable.
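
The wraparound hazard is easy to demonstrate with the usual circular XID
comparison (simplified here; the real TransactionIdPrecedes also
special-cases permanent XIDs):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;       /* modulo-2^32 circular compare */
}

int
main(void)
{
    TransactionId cutoff = 1000;        /* set once at start of recovery */

    /* Early on, the filter behaves sanely: newer xids pass. */
    printf("%d\n", xid_precedes(5000, cutoff));         /* prints 0 */

    /* Once the XID counter advances ~2^31 past the stale cutoff, the
     * modular arithmetic flips and every new xid "precedes" it, so the
     * filter starts rejecting everything. */
    printf("%d\n", xid_precedes(cutoff + 0x80000001U, cutoff)); /* prints 1 */
    return 0;
}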

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-27 21:45:53
Message-ID: 1272404753.4161.8355.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-04-27 at 17:24 -0400, Tom Lane wrote:

> Isn't the snapshotOldestActiveXid filter in
> RecordKnownAssignedTransactionIds completely wrong/useless/bogus?
>
> AFAICS, snapshotOldestActiveXid is only set once at the start of
> recovery. This means it will soon be too old to provide any useful
> filtering. But what's far worse is that the XID space will eventually
> wrap around, and that test will start filtering *everything*.
>
> I think we should just lose that test, as well as the variable.

Yes, though it looks like it is still necessary in creating a valid
initial state because otherwise we may have xids in KnownAssigned array
that are already complete. The comment there talks about wasting memory,
though it appears to be a correctness issue.

So perhaps a similar test is required in ProcArrayApplyRecoveryInfo()
but not in RecordKnownAssignedTransactionIds(). That way it is applied,
but only once at initialisation.

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-27 22:08:29
Message-ID: 25104.1272406109@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Tue, 2010-04-27 at 17:24 -0400, Tom Lane wrote:
>> I think we should just lose that test, as well as the variable.

> Yes, though it looks like it is still necessary in creating a valid
> initial state because otherwise we may have xids in KnownAssigned array
> that are already complete.

Huh? How is a filter as coarse as an oldest-running-XID filter going
to prevent that? And aren't we initializing from trustworthy data in
ProcArrayApplyRecoveryInfo, anyway?

I still say it's useless.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-27 22:45:54
Message-ID: 1272408354.4161.8477.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-04-27 at 18:08 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > On Tue, 2010-04-27 at 17:24 -0400, Tom Lane wrote:
> >> I think we should just lose that test, as well as the variable.
>
> > Yes, though it looks like it is still necessary in creating a valid
> > initial state because otherwise we may have xids in KnownAssigned array
> > that are already complete.
>
> Huh? How is a filter as coarse as an oldest-running-XID filter going
> to prevent that? And aren't we initializing from trustworthy data in
> ProcArrayApplyRecoveryInfo, anyway?
>
> I still say it's useless.

Quite possibly. You're looking at other code outside of this patch. I'm
happy that you do so, but is it immediately related? I can have another
look when we finish this.

--
Simon Riggs www.2ndQuadrant.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-28 00:13:09
Message-ID: 29401.1272413589@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> On Tue, 2010-04-27 at 18:08 -0400, Tom Lane wrote:
>> Huh? How is a filter as coarse as an oldest-running-XID filter going
>> to prevent that? And aren't we initializing from trustworthy data in
>> ProcArrayApplyRecoveryInfo, anyway?
>>
>> I still say it's useless.

> Quite possibly. You're looking at other code outside of this patch. I'm
> happy that you do so, but is it immediately related? I can have another
> look when we finish this.

Well, it's nearby anyway. I've committed the present patch (with a
number of fixes). While I was looking at it I came across several
things in the existing code that I think are either wrong or at least
inadequately documented --- the above complaint is just the tip of the
iceberg. I'm going to make another pass over it to see if I'm just
missing things, and then report back.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Erik Rijkers <er(at)xs4all(dot)nl>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-04-28 09:37:38
Message-ID: 1272447458.4161.10020.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-04-27 at 20:13 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> > On Tue, 2010-04-27 at 18:08 -0400, Tom Lane wrote:
> >> Huh? How is a filter as coarse as an oldest-running-XID filter going
> >> to prevent that? And aren't we initializing from trustworthy data in
> >> ProcArrayApplyRecoveryInfo, anyway?
> >>
> >> I still say it's useless.
>
> > Quite possibly. You're looking at other code outside of this patch. I'm
> > happy that you do so, but is it immediately related? I can have another
> > look when we finish this.
>
> Well, it's nearby anyway. I've committed the present patch (with a
> number of fixes). While I was looking at it I came across several
> things in the existing code that I think are either wrong or at least
> inadequately documented --- the above complaint is just the tip of the
> iceberg. I'm going to make another pass over it to see if I'm just
> missing things, and then report back.

Thank you for your input and changes.

You're welcome to share my iceberg.

--
Simon Riggs www.2ndQuadrant.com


From: "Erik Rijkers" <er(at)xs4all(dot)nl>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-05-04 16:10:53
Message-ID: b0a9589dd3cb6d34b5cf74771d664880.squirrel@webmail.xs4all.nl
Lists: pgsql-hackers

Hi Simon,

In another thread you mentioned you were lacking information from me:

On Tue, May 4, 2010 17:10, Simon Riggs wrote:
>
> There is no evidence that Erik's strange performance has anything to do
> with HS; it hasn't been seen elsewhere and he didn't respond to
> questions about the test setup to provide background. The profile didn't
> fit any software problem I can see.
>

I'm sorry if I missed requests for things that were not already mentioned.

Let me repeat:
OS: Centos 5.4
2 quadcores: Intel(R) Xeon(R) CPU X5482 @ 3.20GHz
Areca 1280ML
primary and standby db both on a 12 disk array (sata 7200rpm, Seagate Barracuda ES.2)

It goes without saying (I hope) that apart from the pgbench tests
and a few ssh sessions (myself), the machine was idle.

It would be interesting if anyone repeated these simple tests and produced
evidence that these effects are not HS-related.

(Unfortunately, I have at the moment not much time for more testing)

thanks,

Erik Rijkers

On Sun, April 25, 2010 21:07, Simon Riggs wrote:
> On Sun, 2010-04-25 at 20:25 +0200, Erik Rijkers wrote:
>
>> Sorry if it's too much data, but to me at least it was illuminating;
>> I now understand the effects of the different parameters better.
>
> That's great, many thanks.
>
> A few observations
>
> * Standby performance is actually slightly above normal running. This is
> credible because of the way snapshots are now taken. We don't need to
> scan the procarray looking for write transactions, since we know
> everything is read only. So we scan just the knownassignedxids, which if
> no activity from primary will be zero-length, so snapshots will actually
> get taken much faster in this case on standby. The snapshot performance
> on standby is O(n) where n is the number of write transactions
> "currently" on primary (transfer delays blur the word "currently").
>
> * The results for scale factor < 100 are fine, and the results for >100
> with few connections get thrown out by long transaction times. With
> larger numbers of connections the wait problems seem to go away. Looks
> like Erik (and possibly Hot Standby in general) has an I/O problem,
> though "from what" is not yet determined. It could be just hardware, or
> might be hardware plus other factors.
>
> --
> Simon Riggs www.2ndQuadrant.com
>
>
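
For reference, the snapshot point in the quote above, restated as a
standalone toy model (the shapes and names are illustrative only, not the
server code):

#include <stdio.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Write xids reported from the primary; empty while the primary is idle. */
static TransactionId known_assigned[6500];
static int n_known = 0;

/* O(n), where n = write transactions currently open on the primary. */
static int
standby_snapshot(TransactionId *xip)
{
    int count = 0;

    for (int i = 0; i < n_known; i++)
        xip[count++] = known_assigned[i];
    return count;
}

int
main(void)
{
    TransactionId xip[6500];

    printf("snapshot xid count: %d\n", standby_snapshot(xip)); /* prints 0 */
    return 0;
}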


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-05-04 16:19:47
Message-ID: 1272989987.4535.2521.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-05-04 at 18:10 +0200, Erik Rijkers wrote:
> It would be interesting if anyone repeated these simple tests and produced
> evidence that these effects are not HS-related.
>
> (Unfortunately, I have at the moment not much time for more testing)

Would you be able to make those systems available for further testing?

First, I'd perform the same test with the systems swapped, so we know
more about the symmetry of the issue. After that, would like to look
more into internals.

Is it possible to set up SystemTap and dtrace on these systems?

Thanks, either way.

--
Simon Riggs www.2ndQuadrant.com


From: "Erik Rijkers" <er(at)xs4all(dot)nl>
To: "Simon Riggs" <simon(at)2ndQuadrant(dot)com>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-05-04 16:33:26
Message-ID: 447e467376088e0e1cf10aee567dd541.squirrel@webmail.xs4all.nl
Lists: pgsql-hackers

On Tue, May 4, 2010 18:19, Simon Riggs wrote:
> On Tue, 2010-05-04 at 18:10 +0200, Erik Rijkers wrote:
>> It would be interesting if anyone repeated these simple tests and
>> produced evidence that these effects are not HS-related.
>>
>> (Unfortunately, I have at the moment not much time for more testing)
>
> Would you be able to make those systems available for further testing?

No, sorry.

> First, I'd perform the same test with the systems swapped, so we know
> more about the symmetry of the issue. After that, would like to look
> more into internals.

you mean "systems swapped", primary and standby? primary and standby were on the same machine in
these tests (even the same raid).

I can eventually move the standby (the 'slow' side, as it stands) to another, quite similar
machine. Not in the coming days though...

> Is it possible to set up SystemTap and dtrace on these systems?

I did install SystemTap last week. dtrace is not installed, I think (I've never used either).

Erik Rijkers


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-05-04 18:26:58
Message-ID: 4BE066F2.805@2ndquadrant.com
Lists: pgsql-hackers

Erik Rijkers wrote:
> OS: Centos 5.4
> 2 quadcores: Intel(R) Xeon(R) CPU X5482 @ 3.20GHz
> Areca 1280ML
> primary and standby db both on a 12 disk array (sata 7200rpm, Seagate Barracuda ES.2)
>

To fill in from data you already mentioned upthread:
32 GB RAM
CentOS release 5.4 (Final), x86_64 Linux, 2.6.18-164.el5

Thanks for all the reporting you've done here, really helpful.
Questions to make sure I'm trying to duplicate the right thing here:

Is your disk array all configured as one big RAID10 volume, so
essentially a 6-disk stripe with redundancy, or something else? In
particular I want to know whether the WAL/database/archives are split onto
separate volumes or all on one big one when you were testing.

Is this on ext3 with standard mount parameters?

Also, can you confirm that every test you ran only had a single pgbench
worker thread (-j 1 or not specified)? That looked to be the case from
the ones I saw where you posted the whole command used. It would not
surprise me to find that the CPU usage profile of a standby is just
different enough from the primary that it results in the pgbench program
not being scheduled enough time, due to the known Linux issues in that
area. Not going to assume that, of course, just one thing I want to
check when trying to replicate what you've run into.

I didn't see any glaring HS performance issues like you've been
reporting on, the last time I tried performance testing in this area, just a
small percentage drop. But I didn't specifically go looking for it
either. With your testing rig out of service, we're going to try and
replicate that on a system here. My home server is like a scaled down
version of yours (single quad-core, 8GB RAM, smaller Areca controller, 5
disks instead of 12) and it's running the same CentOS version. If the
problem is really universal, I should see it here too.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.us


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-05-04 19:34:53
Message-ID: 4BE076DD.4040103@kaltenbrunner.cc
Lists: pgsql-hackers

Erik Rijkers wrote:
> Hi Simon,
>
> In another thread you mentioned you were lacking information from me:
>
> On Tue, May 4, 2010 17:10, Simon Riggs wrote:
>> There is no evidence that Erik's strange performance has anything to do
>> with HS; it hasn't been seen elsewhere and he didn't respond to
>> questions about the test setup to provide background. The profile didn't
>> fit any software problem I can see.
>>
>
> I'm sorry if I missed requests for things that were not already mentioned.
>
> Let me repeat:
> OS: Centos 5.4
> 2 quadcores: Intel(R) Xeon(R) CPU X5482 @ 3.20GHz
> Areca 1280ML
> primary and standby db both on a 12 disk array (sata 7200rpm, Seagate Barracuda ES.2)
>
> It goes without saying (I hope) that apart from the pgbench tests
> and a few ssh sessions (myself), the machine was idle.
>
> It would be interesting if anyone repeated these simple tests and produced
> evidence that these effects are not HS-related.
>
> (Unfortunately, I have at the moment not much time for more testing)

FWIW - I'm seeing a behaviour here under pgbench -S workloads that looks
kinda related.

Using -j 16 -c 16 -T 120 I get either 100000 tps and around 660000
context switches per second, or on some runs I end up with 150000 tps and
around 1M context switches/s sustained. I mostly get the 100k result, but
once in a while I get the 150k one. And one can even anticipate the
final transaction rate from watching "vmstat 1"...

I'm not sure yet what is causing that behaviour, but that is with
9.0B1 on a dual quad-core Nehalem box with 16 CPU threads (8+HT) on a
pure in-memory workload (scale = 20 with 48GB RAM).

Stefan


From: "Erik Rijkers" <er(at)xs4all(dot)nl>
To: "Greg Smith" <greg(at)2ndquadrant(dot)com>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-05-04 19:40:12
Message-ID: 64e9143b80c9c450464f6a6edf461032.squirrel@webmail.xs4all.nl
Lists: pgsql-hackers

On Tue, May 4, 2010 20:26, Greg Smith wrote:
> Erik Rijkers wrote:
>> OS: Centos 5.4
>> 2 quadcores: Intel(R) Xeon(R) CPU X5482 @ 3.20GHz
>> Areca 1280ML
>> primary and standby db both on a 12 disk array (sata 7200rpm, Seagate Barracuda ES.2)
>>
>
> To fill in from data you already mentioned upthread:
> 32 GB RAM
> CentOS release 5.4 (Final), x86_64 Linux, 2.6.18-164.el5
>
> Thanks for all the reporting you've done here, really helpful.
> Questions to make sure I'm trying to duplicate the right thing here:
>
> Is your disk array all configured as one big RAID10 volume, so
> essentially a 6-disk stripe with redundancy, or something else? In
> particular I want to know whether the WAL/database/archives are split onto
> separate volumes or all on one big one when you were testing.

Everything together: the raid is what Areca calls 'raid10(1E)'.
(To be honest, I don't remember what that 1E means exactly -
extra flexibility in the number of disks, I think.)

Btw, some of my emails contained the postgresql.conf of both instances.

>
> Is this on ext3 with standard mount parameters?

ext3 noatime

> Also, can you confirm that every test you ran only had a single pgbench
> worker thread (-j 1 or not specified)? That looked to be the case from
> the ones I saw where you posted the whole command used. It would not

yes; the literal cmd:
time /var/data1/pg_stuff/pg_installations/pgsql.sr_primary/bin/pgbench -h /tmp -p 6565 -U rijkers
-n -S -c 20 -T 900 -j 1 replicas

To avoid wrapping in the emails I just removed '-h /tmp', '-U rijkers', and 'replicas'.

(I may have run the primary's pgbench binary also against the slave - don't think
that should make any difference)

> surprise me to find that the CPU usage profile of a standby is just
> different enough from the primary that it results in the pgbench program
> not being scheduled enough time, due to the known Linux issues in that
> area. Not going to assume that, of course, just one thing I want to
> check when trying to replicate what you've run into.
>
> I didn't see any glaring HS performance issues like you've been
> reporting on, the last time I tried performance testing in this area, just a
> small percentage drop. But I didn't specifically go looking for it

Here, it seems repeatable, but does not occur with all scales.

Hm, maybe I should just dump *all* of my results on the wiki for reference. (I'll look at that
later).

> either. With your testing rig out of service, we're going to try and
> replicate that on a system here. My home server is like a scaled down
> version of yours (single quad-core, 8GB RAM, smaller Areca controller, 5
> disks instead of 12) and it's running the same CentOS version. If the
> problem is really universal, I should see it here too.
>

Thanks,

Erik Rijkers


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-05-04 20:48:07
Message-ID: 1273006088.4535.2932.camel@ebony
Lists: pgsql-hackers

On Tue, 2010-05-04 at 21:34 +0200, Stefan Kaltenbrunner wrote:

> FWIW - I'm seeing a behaviour here under pgbench -S workloads that looks
> kinda related.
>
> Using -j 16 -c 16 -T 120 I get either 100000 tps and around 660000
> context switches per second, or on some runs I end up with 150000 tps and
> around 1M context switches/s sustained. I mostly get the 100k result, but
> once in a while I get the 150k one. And one can even anticipate the
> final transaction rate from watching "vmstat 1"...
>
> I'm not sure yet on what is causing that behaviour but that is with
> 9.0B1 on a Dual Quadcore Nehalem box with 16 cpu threads (8+HT) on a
> pure in-memory workload (scale = 20 with 48GB RAM).

Educated guess at a fix: please test this patch. It's good for
performance testing, but doesn't work correctly at failover, which would
obviously be addressed prior to any commit.

--
Simon Riggs www.2ndQuadrant.com

Attachment Content-Type Size
knownrecoverystate.patch text/x-patch 1.4 KB

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: testing HS/SR - 1 vs 2 performance
Date: 2010-05-07 12:04:02
Message-ID: 4BE401B2.3090509@2ndquadrant.com
Lists: pgsql-hackers

Erik Rijkers wrote:
> Everything together: the raid is what Areca call 'raid10(1E)'.
> (to be honest I don't remember what that 1E exactly means -
> extra flexibility in the number of disks, I think).
>

Standard RAID10 only supports an even number of disks. The 1E variants
also allow putting an odd number in. If you're using an even number,
like in your case, the results are the same until you start losing
drives, at which point the performance degradation pattern changes a bit
due to differences in how things are striped. See these for more info:

http://bytepile.com/raid_class.php and
http://en.wikipedia.org/wiki/Non-standard_RAID_levels#IBM_ServeRAID_1E
http://publib.boulder.ibm.com/infocenter/eserver/v1r2/index.jsp?topic=/diricinfo/fqy0_craid1e.html

I don't think using RAID1E has anything to do with your results, but it
is possible given your test configuration that part of the difference
you're seeing relates to where on the disk the blocks are stored. If you take a
hard drive and write two copies of something onto it, the second copy
will be a little slower than the first. That's just because how drive
speed drops over the surface as you move further along it. There's some
examples of what that looks like at
http://projects.2ndquadrant.it/sites/default/files/pg-hw-bench-2010.pdf
on pages 21-23.

Returning to your higher level results, one of the variables you weren't
sure how to account for was caching effects on the standby--the
possibility that it just didn't have the data cached the same as the
master, impacting the results. The usual way I look for that is by graphing
the pgbench TPS over time. In that situation, you can see it shoot
upwards over time - a very obvious pattern. Example at
http://projects.2ndquadrant.it/sites/default/files/pgbench-intro.pdf on
pages 36,37.

--
Greg Smith 2ndQuadrant US Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.us