Re: WIP patch for latestCompletedXid method of computing snapshot xmax

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-patches(at)postgreSQL(dot)org
Subject: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-08 01:26:51
Message-ID: 3018.1189214811@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

This patch implements Florian's idea about how to manage snapshot xmax
without the ugly and performance-losing tactic of taking XidGenLock and
ProcArrayLock at the same time. I had to do a couple of slightly klugy
things to get bootstrap and prepared transactions to work, but on the
whole it seems at least as clean as the code we have now. Comments?

regards, tom lane

Attachment Content-Type Size
latestCompletedXid.patch.gz application/octet-stream 11.0 KB
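
A minimal, self-contained C sketch of the idea under discussion may help readers
skimming the archive. This is not the patch and not PostgreSQL's actual code;
names such as latest_completed_xid, running_xids and get_snapshot are
illustrative stand-ins. The point is only that the snapshot's xmax can be
derived from a counter advanced when transactions end, so a snapshot can be
built under a single (shared) lock acquisition instead of additionally taking
the next-XID counter's lock.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t xid_t;

/* Illustrative shared state, protected by proc_array_lock. */
static pthread_rwlock_t proc_array_lock = PTHREAD_RWLOCK_INITIALIZER;
static xid_t latest_completed_xid = 100;      /* advanced at top-level commit/abort */
static xid_t running_xids[8] = {101, 103};    /* stand-in for the proc array */
static int   n_running = 2;

typedef struct { xid_t xmin; xid_t xmax; } snapshot_t;

/* Build a snapshot under a single shared lock acquisition: xmax is one past
 * the newest completed xid, instead of a copy of the next-XID counter (which
 * would require holding the XID-generation lock at the same time). */
static snapshot_t get_snapshot(void)
{
    snapshot_t snap;
    pthread_rwlock_rdlock(&proc_array_lock);
    snap.xmax = latest_completed_xid + 1;
    snap.xmin = snap.xmax;
    for (int i = 0; i < n_running; i++)
        if (running_xids[i] < snap.xmin)
            snap.xmin = running_xids[i];
    pthread_rwlock_unlock(&proc_array_lock);
    return snap;
}

/* At top-level commit or abort, remove the xid from the running set and
 * advance the counter; doing both under the exclusive lock keeps the two
 * changes atomic with respect to concurrent snapshot takers. */
static void transaction_done(xid_t xid)
{
    pthread_rwlock_wrlock(&proc_array_lock);
    for (int i = 0; i < n_running; i++)
        if (running_xids[i] == xid) {
            running_xids[i] = running_xids[--n_running];
            break;
        }
    if (xid > latest_completed_xid)
        latest_completed_xid = xid;
    pthread_rwlock_unlock(&proc_array_lock);
}

int main(void)
{
    snapshot_t s = get_snapshot();
    printf("xmin=%u xmax=%u\n", (unsigned) s.xmin, (unsigned) s.xmax);  /* xmin=101 xmax=101 */
    transaction_done(101);
    s = get_snapshot();
    printf("xmin=%u xmax=%u\n", (unsigned) s.xmin, (unsigned) s.xmax);  /* xmin=102 xmax=102 */
    return 0;
}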

From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-patches(at)postgreSQL(dot)org>
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-08 01:48:26
Message-ID: 878x7i0vx1.fsf@oxford.xeocode.com
Lists: pgsql-hackers pgsql-patches

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> This patch implements Florian's idea about how to manage snapshot xmax
> without the ugly and performance-losing tactic of taking XidGenLock and
> ProcArrayLock at the same time. I had to do a couple of slightly klugy
> things to get bootstrap and prepared transactions to work, but on the
> whole it seems at least as clean as the code we have now. Comments?

Just that it will be fascinating to see what effects this has on the
benchmarks.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-08 02:25:43
Message-ID: 4084.1189218343@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> This patch implements Florian's idea about how to manage snapshot xmax
>> without the ugly and performance-losing tactic of taking XidGenLock and
>> ProcArrayLock at the same time. I had to do a couple of slightly klugy
>> things to get bootstrap and prepared transactions to work, but on the
>> whole it seems at least as clean as the code we have now. Comments?

> Just that it will be fascinating to see what effects this has on the
> benchmarks.

Yeah, I was hoping to get some benchmarks before deciding whether it's
worth the risk of pushing this into 8.3. I'm off trying pgbench now,
but if anyone wants to try something more serious like DBT2 ...

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-08 20:21:57
Message-ID: 28198.1189282917@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

I wrote:
> This patch implements Florian's idea about how to manage snapshot xmax
> without the ugly and performance-losing tactic of taking XidGenLock and
> ProcArrayLock at the same time. I had to do a couple of slightly klugy
> things to get bootstrap and prepared transactions to work, but on the
> whole it seems at least as clean as the code we have now. Comments?

I spent a fair amount of time this afternoon trying to measure the
performance impact of this patch using pgbench, without a lot of success
--- as far as average transaction rates go, it's a wash compared to CVS
HEAD. However, it eventually struck me to look at the distribution of
transaction times, using pgbench's -l logging output, and on that basis
it is clear that getting rid of the XidGenLock interaction has a good
deal of use in eliminating outlier times. My desktop machine has a
single consumer-grade IDE drive, and even with fsync off and
synchronous_commit off, it can barely make 190 tps sustained pgbench
throughput --- it's just disk write bound all the time. On a run with 8
clients, 10000 transactions per client, DB scale factor 25, I get this
distribution of transaction times from CVS HEAD:

postgres=# select usec/1000000, count(*) from plhead group by 1 order by 1;
?column? | count
----------+-------
0 | 79306
1 | 290
2 | 116
3 | 65
4 | 82
5 | 30
6 | 31
7 | 32
10 | 8
11 | 8
13 | 16
14 | 5
15 | 3
20 | 8
(14 rows)

and this from HEAD plus the patch:

postgres=# select usec/1000000, count(*) from plpatch group by 1 order by 1;
?column? | count
----------+-------
0 | 79305
1 | 325
2 | 85
3 | 49
4 | 68
5 | 50
6 | 45
7 | 35
8 | 14
9 | 8
10 | 6
11 | 10
(12 rows)

The worst-case transaction time has dropped by nearly a factor of 2:

postgres=# select * from plhead order by usec desc limit 20;
client | trans | usec | f | epoch | lsb
--------+-------+----------+---+------------+--------
2 | 6379 | 20621910 | 0 | 1189280557 | 664207
6 | 5992 | 20621175 | 0 | 1189280557 | 665970
7 | 5795 | 20621024 | 0 | 1189280557 | 666353
1 | 6327 | 20620833 | 0 | 1189280557 | 663606
3 | 6463 | 20620277 | 0 | 1189280557 | 663895
4 | 6383 | 20620260 | 0 | 1189280557 | 664000
5 | 6209 | 20620077 | 0 | 1189280557 | 665060
0 | 6269 | 20619875 | 0 | 1189280557 | 664935
6 | 8182 | 15191784 | 0 | 1189280655 | 87859
3 | 8810 | 15191637 | 0 | 1189280655 | 86802
2 | 8700 | 15185120 | 0 | 1189280655 | 86742
5 | 8479 | 14078513 | 0 | 1189280653 | 978339
1 | 8618 | 14077106 | 0 | 1189280653 | 978216
7 | 7930 | 14076905 | 0 | 1189280653 | 978832
4 | 8704 | 14076429 | 0 | 1189280653 | 977877
0 | 8557 | 14076249 | 0 | 1189280653 | 977477
0 | 6717 | 13932179 | 0 | 1189280576 | 65288
1 | 6775 | 13931973 | 0 | 1189280576 | 65387
6 | 6410 | 13931493 | 0 | 1189280576 | 67192
7 | 6201 | 13931140 | 0 | 1189280576 | 69247
(20 rows)

postgres=# select * from plpatch order by usec desc limit 20;
client | trans | usec | f | epoch | lsb
--------+-------+----------+---+------------+--------
6 | 6008 | 11833702 | 0 | 1189281093 | 646851
0 | 6140 | 11833041 | 0 | 1189281093 | 645738
2 | 6289 | 11809343 | 0 | 1189281093 | 616734
4 | 6315 | 11808044 | 0 | 1189281093 | 617505
3 | 6344 | 11807762 | 0 | 1189281093 | 616970
7 | 5802 | 11807641 | 0 | 1189281093 | 617932
5 | 6183 | 11806964 | 0 | 1189281093 | 618060
1 | 6163 | 11805494 | 0 | 1189281093 | 616679
7 | 8175 | 11027973 | 0 | 1189281239 | 675499
2 | 8725 | 11019066 | 0 | 1189281239 | 674305
5 | 8828 | 10997331 | 0 | 1189281239 | 674953
0 | 8541 | 10987629 | 0 | 1189281239 | 673773
1 | 8799 | 10713734 | 0 | 1189281239 | 673217
4 | 8897 | 10705975 | 0 | 1189281239 | 672743
6 | 8364 | 10702875 | 0 | 1189281239 | 677163
3 | 8814 | 10701467 | 0 | 1189281239 | 674369
3 | 9223 | 9158554 | 0 | 1189281258 | 202934
5 | 9234 | 9143744 | 0 | 1189281258 | 203073
7 | 8493 | 9099174 | 0 | 1189281258 | 204092
4 | 9306 | 9097074 | 0 | 1189281258 | 202402
(20 rows)

So on the strength of that, I'm going to go ahead and commit the patch,
but I'd be interested to see benchmarks from people with access to
better hardware.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-08 21:00:55
Message-ID: 200709081400.56045.josh@agliodbs.com
Lists: pgsql-hackers pgsql-patches

Tom,

> So on the strength of that, I'm going to go ahead and commit the patch,
> but I'd be interested to see benchmarks from people with access to
> better hardware.

Is the latest version of the patch from -patches the one I should test?

--
Josh Berkus
PostgreSQL @ Sun
San Francisco


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-08 21:05:56
Message-ID: 46E30EB4.7020304@bluegap.ch
Lists: pgsql-hackers pgsql-patches

Hello Tom,

Tom Lane wrote:
> So on the strength of that, I'm going to go ahead and commit the patch,
> but I'd be interested to see benchmarks from people with access to
> better hardware.

I've just completed two dbt2 test runs on a mid-level system with 4GB
RAM and a 7-disk SATA RAID 1+0 w/ BBU: once with code as of 2007/09/05
18:00 (which is *before* the first lazy xid commit) and once with CVS
HEAD (2007/09/08) + latestCompletedXid.patch.

Here are the results from the first test run (test run 33, without lazy
xid):

> $ cat 33/driver/results.out
> Response Time (s)
> Transaction % Average : 90th % Total Rollbacks %
> ------------ ----- --------------------- ----------- --------------- -----
> Delivery 3.97 3.745 : 7.844 11844 0 0.00
> New Order 45.35 3.844 : 7.692 135192 1352 1.01
> Order Status 3.95 2.728 : 6.371 11764 0 0.00
> Payment 42.74 2.649 : 6.349 127415 0 0.00
> Stock Level 4.00 2.172 : 5.634 11915 0 0.00
> ------------ ----- --------------------- ----------- --------------- -----
>
> 1103.45 new-order transactions per minute (NOTPM)
> 120.1 minute duration
> 0 total unknown errors
> 1003 second(s) ramping up

And that's with HEAD +latestCompletedXid.patch (test run 34):

> $ cat 34/driver/results.out
> Response Time (s)
> Transaction % Average : 90th % Total Rollbacks %
> ------------ ----- --------------------- ----------- --------------- -----
> Delivery 3.96 3.843 : 8.223 11760 0 0.00
> New Order 45.28 4.049 : 8.451 134398 1300 0.98
> Order Status 3.97 2.877 : 6.815 11777 0 0.00
> Payment 42.80 2.745 : 6.718 127027 0 0.00
> Stock Level 4.00 2.280 : 6.129 11859 0 0.00
> ------------ ----- --------------------- ----------- --------------- -----
>
> 1097.71 new-order transactions per minute (NOTPM)
> 120.1 minute duration
> 0 total unknown errors
> 1003 second(s) ramping up

Both tests ran for two hours with 100 warehouses and 50 connections.
shared_buffers was set to 1024MB and effective_cache_size to 3800MB; all
other settings were left at their defaults.

Regards

Markus


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Markus Schiltknecht <markus(at)bluegap(dot)ch>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-08 21:18:34
Message-ID: 200709081418.35059.josh@agliodbs.com
Lists: pgsql-hackers pgsql-patches

Markus,

> I've just completed two dbt2 test runs on a mid-level system with 4GB
> RAM and a 7-disk SATA RAID 1+0 w/ BBU: once with code as of 2007/09/05
> 18:00 (which is *before* the first lazy xid commit) and once with CVS
> HEAD (2007/09/08) + latestCompletedXid.patch.

Hmmm. Your results are within the margin of error for DBT2, so they show no
real difference. What we need for this is a heavy-read workload, though; on
a workload like DBT2 (which is mostly writes) I wouldn't expect lazy-XID to
help much.

I'll see if I can find something appropriate. Maybe Jan's TPC-W
implementation?

--
Josh Berkus
PostgreSQL @ Sun
San Francisco


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Markus Schiltknecht <markus(at)bluegap(dot)ch>
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-08 21:29:05
Message-ID: 6102.1189286945@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> Hmmm. Your results are within the margin of error for DBT2, so they
> show no real difference. What we need for this is a heavy-read
> workload, though; on a workload like DBT2 (which is mostly writes) I
> wouldn't expect lazy-XID to help much.

Lazy-XID doesn't help on a write-mostly workload, but I would have
expected to see some benefit from the latestCompletedXid patch.

The rough tests I just finished suggest that the win to look for is
improvement of the very tail of the distribution --- well beyond the
90th percentile point, which is the most we can see in Markus'
printout. Can we get 95th and 99th percentile response times?

For comparison, in my little test I got 0.828546 vs 0.873957 as the
99th percentile pgbench transaction times.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-08 21:29:34
Message-ID: 6119.1189286974@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> Is the latest version of the patch from -patches the one I should test?

Yeah, the posted patch is the same as what I committed.

regards, tom lane


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-08 21:39:45
Message-ID: 87r6l8x2e6.fsf@oxford.xeocode.com
Lists: pgsql-hackers pgsql-patches

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> My desktop machine has a single consumer-grade IDE drive, and even with
> fsync off and synchronous_commit off, it can barely make 190 tps sustained
> pgbench throughput --- it's just disk write bound all the time. On a run
> with 8 clients, 10000 transactions per client, DB scale factor 25, I get
> this distribution of transaction times from CVS HEAD:

Wouldn't you expect to see more of an effect on cpu-bound environments? Is
there any i/o going on while these locks are being held? I suppose there's a
WAL commit record that has to be synced?

Have you tried with a smaller scale factor?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Josh Berkus" <josh(at)agliodbs(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>, "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-08 21:51:18
Message-ID: 87myvwx1ux.fsf@oxford.xeocode.com
Lists: pgsql-hackers pgsql-patches


"Josh Berkus" <josh(at)agliodbs(dot)com> writes:

> Hmmm. Your results are within the margin of error for DBT2, so they show no
> real difference. What we need for this is a heavy-read workload, though; on
> a workload like DBT2 (which is mostly writes) I wouldn't expect lazy-XID to
> help much.

The TPM is definitely within the margin of error, but the 90th percentile
response times seem worrisome. They're already over 5s, which counts as a
failure under the TPC-C rules. From what we've seen, once you push the scale
factor high enough that those numbers even get close to 5s, you get very
unstable results.

The published benchmarks seem to keep the maximum around 5s, which puts the
90th percentile around half that. That seems overly conservative, but I
wonder what reason they have for doing it that way.

> I'll see if I can find something appropriate. Maybe Jan's TPC-W
> implementation?

We could just rejigger the percentages on the transactions. I think Stock
Level and Order Status are entirely read-only but I would have to check. Stock
Level is a very intensive query though so I wouldn't suggest raising the
percentage on that.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com


From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgreSQL(dot)org
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-09 00:26:02
Message-ID: 46E33D9A.9000009@phlo.org
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> This patch implements Florian's idea about how to manage snapshot xmax
> without the ugly and performance-losing tactic of taking XidGenLock and
> ProcArrayLock at the same time. I had to do a couple of slightly klugy
> things to get bootstrap and prepared transactions to work, but on the
> whole it seems at least as clean as the code we have now. Comments?

I'm a bit late, but I still have a few comments:

1) I wonder why you made XidCacheRemoveRunningXids take the latestXid as
an extra argument, though - it should be able to figure that out on
its own, I'd have thought. Is that just for consistency, or did I miss
something?

2) I don't see how either the childXids array or the XID cache in the
proc array could ever not be in ascending xid order. We do guarantee
that a child xid is always greater than its parent. This guarantee
enables a few optimizations. First, as you say in the comments,
finding the largest xid when committing becomes trivial. But more
important, if we can assume that the proc array xid cache is always
sorted, we can get rid of the exclusive lock during subxact abort.

We will *always* remove the last n xids from the cache, so we just
need to reset subxids.nxids, and not touch the actual array at all.
(We might later set the now-unused entries to InvalidTransactionId,
but that's only cosmetic).

Regarding 2): removing subxids from the proc array cannot affect xmin
calculations, because the toplevel xid is always smaller than all its
children. So we only need to care about snapshots themselves.

But they won't care much whether they find an aborted sub xid in the
proc array or not. If not, the xid is neither running nor committed,
so it's assumed to have crashed. If it's in the snapshot, it's
treated as in-progress.

So I believe the code could be simplified a bit if we just made it a rule
that both childXids and the cache in the proc array are always in ascending
xid order, and we could get rid of the exclusive lock during subxact abort.

greetings, Florian Pflug
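
A tiny C sketch of the truncation idea in the mail above, purely to illustrate
the shape of the argument - subxid_cache_t and remove_trailing_subxids are
made-up names modelling the per-backend subxid cache, not the real structures,
and whether the exclusive lock can really be dropped is exactly what the
follow-ups below debate:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t xid_t;

#define MAX_CACHED_SUBXIDS 64

/* Toy model of a per-backend subxid cache kept in ascending xid order. */
typedef struct {
    int   nxids;
    xid_t xids[MAX_CACHED_SUBXIDS];
} subxid_cache_t;

/* If the cache is sorted, aborting a subtransaction (plus its own
 * subcommitted children) always discards the newest entries, i.e. a suffix
 * of the array.  The removal then reduces to shrinking nxids; the array
 * contents need not be rewritten at all. */
static void remove_trailing_subxids(subxid_cache_t *cache, int n)
{
    assert(n >= 0 && n <= cache->nxids);
    cache->nxids -= n;
    for (int i = cache->nxids; i < cache->nxids + n; i++)
        cache->xids[i] = 0;     /* cosmetic only; 0 stands in for InvalidTransactionId */
}

int main(void)
{
    subxid_cache_t cache = { .nxids = 4, .xids = {101, 102, 105, 107} };
    remove_trailing_subxids(&cache, 2);     /* the two newest subxids aborted */
    for (int i = 0; i < cache.nxids; i++)
        printf("%u ", (unsigned) cache.xids[i]);
    printf("\n");                           /* prints: 101 102 */
    return 0;
}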


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc: pgsql-patches(at)postgreSQL(dot)org
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-09 02:46:05
Message-ID: 10093.1189305965@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

"Florian G. Pflug" <fgp(at)phlo(dot)org> writes:
> 1) I wonder why you made XidCacheRemoveRunningXids take the latestXid as
> an extra argument, though - it should be able to figure that out on
> its own, I'd have thought. Is that just for consistency, or did I miss
> something?

Well, we were going to have to figure the latestXid in xact.c anyway in
the main-xact path, so I thought it was better to have it responsible
in both cases. It's a judgment call though.

> 2) I don't see how either the childXids array or the XID cache in the
> proc array could ever not be in ascending xid order. We do guarantee
> that a child xid is always greater than its parent.

It's the relationships among siblings that worry me. It's likely that
we could guarantee this with a bit of care and some Asserts in xact.c's
xid-list manipulations, but it's not quite obvious that it must be so
--- for instance using an lcons instead of lappend might break it.

> This guarantee
> enables a few optimizations. First, as you say in the comments,
> finding the largest xid when committing becomes trivial. But more
> important, if we can assume that the proc array xid cache is always
> sorted, we can get rid of the exclusive lock during subxact abort.

> We will *always* remove the last n xids from the cache, so we just
> need to reset subxids.nxids, and not touch the actual array at all.
> (We might later set the now-unused entries to InvalidTransactionId,
> but that's only cosmetic).

I'm not convinced that that works when the subxids cache has overflowed
earlier in the transaction.

In any case, the bottom line is that it's very unlikely that that code
costs enough to be worth optimizing. When you think about the total
number of cycles that have been expended on each subtransaction
(especially one that has an xid), another comparison or two is lost in
the noise. As long as the code isn't O(N^2) ... which it isn't AFAICS
... I would rather keep it simple and robust.

regards, tom lane


From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgreSQL(dot)org
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-09 06:16:25
Message-ID: 46E38FB9.1030406@phlo.org
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> "Florian G. Pflug" <fgp(at)phlo(dot)org> writes:
>> 2) I don't see how either the childXids array or the XID cache in the proc
>> array could ever not be in ascending xid order. We do guarantee that a
>> child xid is always greater than its parent.
>
> It's the relationships among siblings that worry me. It's likely that we
> could guarantee this with a bit of care and some Asserts in xact.c's xid-list
> manipulations, but it's not quite obvious that it must be so --- for instance
> using an lcons instead of lappend might break it.

Hm.. AFAICT, a transaction can only ever have one currently "live"
subtransaction. In other words, for any node in the transaction tree, all
subtrees except for one are always aborted.

Since we do guarantee that a node has an xid smaller than that of any node
below it, it seems the only place we have to be careful is
AtSubCommit_childXids.

>
>> This guarantee enables a few optimizations. First, as you say in the
>> comments, finding the largest xid when committing becomes trivial. But more
>> important, if we can assume that the proc array xid cache is always
>> sorted, we can get rid of the exclusive lock during subxact abort.
>
>> We will *always* remove the last n xids from the cache, so we just need to
>> reset subxids.nxids, and not touch the actual array at all. (We might later
>> set the now-unused entries to InvalidTransactionId, but that's only
>> cosmetic).
>
> I'm not convinced that that works when the subxids cache has overflowed
> earlier in the transaction.

It should. AFAICS, the subxid cache always shows the xids along the "path"
of the transaction tree that leads to the currently active subtransaction -
if it hasn't overflowed, that is. If it has, the list is cut off at some
point. But being cut off only makes the number of elements we remove
smaller; it doesn't change the fact that we remove some number of trailing
elements.

> In any case, the bottom line is that it's very unlikely that that code costs
> enough to be worth optimizing. When you think about the total number of
> cycles that have been expended on each subtransaction (especially one that
> has an xid), another comparison or two is lost in the noise. As long as the
> code isn't O(N^2) ... which it isn't AFAICS ... I would rather keep it simple
> and robust.

The real win is that because of the guaranteed order, removing xids from the
cache gets less messy - it reduces to just changing subxids.nxids, which in
turn allows us to hold only a *shared* proc array lock while doing the
removal. To quote my original mail:

"Removing subxids from the proc array cannnot effect xmin
calculations, because the toplevel xid is always small than all it's
children. So we only need to care about snapshots themselves.

But they won't care much if they find an aborted sub xid in the
proc array or now. If not, the xid is neither running nor committed,
so it's assumes to have crashed. If it's in the snapshot, it's
treated as in-progress."

greetings, Florian Pflug
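
For context, the quoted paragraph is about how a snapshot treats an xid that a
backend no longer advertises. A rough, self-contained C sketch of that decision
(snapshot_t, xid_committed and xid_visible_in_snapshot are illustrative
stand-ins, not the real tqual.c logic): an xid listed in the snapshot is
treated as in progress, and one that is neither listed nor committed is treated
as aborted/crashed, so its rows are invisible either way.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t xid_t;

typedef struct {
    xid_t        xmin;    /* every xid below this is known finished */
    xid_t        xmax;    /* every xid at or above this is seen as in progress */
    const xid_t *xip;     /* xids this snapshot considers in progress */
    int          nxip;
} snapshot_t;

/* Stand-in for a commit-log (pg_clog) lookup. */
static bool xid_committed(xid_t xid)
{
    return xid == 201;    /* pretend only 201 committed; everything else aborted */
}

/* Should changes made by the given xid be visible to this snapshot? */
static bool xid_visible_in_snapshot(xid_t xid, const snapshot_t *snap)
{
    if (xid >= snap->xmax)
        return false;                      /* too new: treated as in progress */
    if (xid < snap->xmin)
        return xid_committed(xid);         /* finished before the snapshot */
    for (int i = 0; i < snap->nxip; i++)
        if (snap->xip[i] == xid)
            return false;                  /* listed: treated as in progress */
    /* Not listed: consult the commit log.  An aborted (or crashed) xid is
     * invisible whether or not it still lingers in the proc array. */
    return xid_committed(xid);
}

int main(void)
{
    xid_t in_progress[] = {205};
    snapshot_t snap = { .xmin = 200, .xmax = 210, .xip = in_progress, .nxip = 1 };
    printf("%d\n", xid_visible_in_snapshot(201, &snap));   /* 1: committed */
    printf("%d\n", xid_visible_in_snapshot(203, &snap));   /* 0: aborted */
    printf("%d\n", xid_visible_in_snapshot(205, &snap));   /* 0: in progress */
    return 0;
}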


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-09 15:22:53
Message-ID: 18533.1189351373@sss.pgh.pa.us
Lists: pgsql-hackers pgsql-patches

"Florian G. Pflug" <fgp(at)phlo(dot)org> writes:
> This guarantee enables a few optimizations. First, as you say in the
> comments, finding the largest xid when committing becomes trivial. But more
> important, if we can assume that the proc array xid cache is always
> sorted, we can get rid of the exclusive lock during subxact abort.

No, we can't, because subxact abort still has to advance latestCompletedXid.

regards, tom lane


From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-patches(at)postgresql(dot)org
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-09 22:52:04
Message-ID: 46E47914.8000504@phlo.org
Lists: pgsql-hackers pgsql-patches

Tom Lane wrote:
> "Florian G. Pflug" <fgp(at)phlo(dot)org> writes:
>> This guarantee enables a few optimizations. First, as you say in the
>> comments, finding the largest xid when committing becomes trivial. But more
>> important, if we can assume that the proc array xid cache is always
>> sorted, we can get rid of the exclusive lock during subxact abort.
>
> No, we can't, because subxact abort still has to advance latestCompletedXid.

I don't think so. I didn't notice that when initially reading your patch -
but it seems that updating latestCompletedXid during subxact abort is actually
worsening performance (only very slightly, though).

The whole point of updating latestCompletedXid when aborting a toplevel xact
is to prevent the following corner case:
.) Some transaction commits.
.) Only readonly transactions
   (all xmins = xmaxs = latestCompletedXid+1).
.) One large transaction that aborts
   (xid = GetNewTransactionId() = latestCompletedXid+1).
.) Only readonly transactions again for a long period
   (all xmins = xmaxs = latestCompletedXid+1).

If the ABORT didn't update latestCompletedXid, then we'd not be able to remove
rows created by that one large transaction (which aborted), because all the
xmins would still be latestCompletedXid+1, which is not > the xid of the
aborted transaction.

So updating latestCompletedXid is only *necessary* during COMMITs - during
ABORTs it's just done for efficiency reasons.

But for subtransactions, there is no efficiency gain at all, because the
toplevel xid already bounds the xmin. In fact, updating latestCompletedXid
during subxact abort moves the snapshot's xmax unnecessarily far into the
future, which leads to larger snapshots, meaning more cycles spent scanning
them.

I do feel that I explained the idea rather badly in my initial response to
your patch, though - I'll post a (hopefully) better explanation to the
hackers list shortly.

greetings, Florian Pflug
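
To make the corner case above concrete, a small numeric walkthrough in C
(plain integers only, no PostgreSQL code; all variable names are made up): if
the aborted transaction's xid never advances latestCompletedXid, every later
read-only snapshot keeps xmin = latestCompletedXid + 1, which never gets past
the aborted xid, so its dead rows stay unremovable.

#include <stdint.h>
#include <stdio.h>

typedef uint32_t xid_t;

int main(void)
{
    xid_t latest_completed = 100;   /* last xid that bumped latestCompletedXid */
    xid_t aborted_xid      = 101;   /* the large transaction that aborted */

    /* Read-only transactions compute xmin = xmax = latestCompletedXid + 1. */
    xid_t readonly_xmin = latest_completed + 1;

    /* Dead rows of the aborted xid are only removable once every snapshot's
     * xmin has advanced past it. */
    printf("abort does not advance the counter: removable = %d\n",
           aborted_xid < readonly_xmin);          /* 0: stuck forever */

    /* If the abort does advance latestCompletedXid, later read-only
     * snapshots get xmin = 102 and the rows become removable. */
    latest_completed = aborted_xid;
    readonly_xmin = latest_completed + 1;
    printf("abort advances the counter:         removable = %d\n",
           aborted_xid < readonly_xmin);          /* 1 */
    return 0;
}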


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-patches(at)postgresql(dot)org>
Subject: Re: WIP patch for latestCompletedXid method of computing snapshot xmax
Date: 2007-09-11 11:04:08
Message-ID: 878x7dzcnr.fsf@oxford.xeocode.com
Lists: pgsql-hackers pgsql-patches


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>>> This patch implements Florian's idea about how to manage snapshot xmax
>>> without the ugly and performance-losing tactic of taking XidGenLock and
>>> ProcArrayLock at the same time. I had to do a couple of slightly klugy
>>> things to get bootstrap and prepared transactions to work, but on the
>>> whole it seems at least as clean as the code we have now. Comments?
>
>> Just that it will be fascinating to see what effects this has on the
>> benchmarks.
>
> Yeah, I was hoping to get some benchmarks before deciding whether it's
> worth the risk of pushing this into 8.3. I'm off trying pgbench now,
> but if anyone wants to try something more serious like DBT2 ...

I ran some DBT2 tests but haven't been able to show any performance effects
in either average or worst-case response times.

I think that's for a few reasons:

1) This is only a dual-processor machine I'm playing with and we scale well on
two processors already.

2) TPC-C doesn't have many read-only transactions

3) The deadlocks I found earlier cause enough noise in the response times to
hide any subtler effects.

We may have to wait until the next time Sun runs their 1,000-process monster
Java benchmark to see if it helps there.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com