Avoiding repeated snapshot computation

From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Avoiding repeated snapshot computation
Date: 2011-11-26 15:52:50
Message-ID: CABOikdMsJ4OsxtA7XBV2quhKYUo_4105fJF4N+uyRoyBAzSuuQ@mail.gmail.com
Lists: pgsql-hackers

In some recent benchmarks and profile data, I saw GetSnapshotData at or
near the very top. For smaller numbers of clients it can account for
10-20% of the time, but with more clients I have seen it take up as much
as 40% of the sample time. Unfortunately, the machine on which I was
running these tests is currently not available, so I don't have the
exact numbers, but the observation is broadly accurate. Our recent work
on separating the hot members of PGPROC into a separate array should
definitely reduce data cache misses and reduce the GetSnapshotData time,
but it probably still accounts for a large enough critical section on a
highly contended lock.

Now that we have reduced the run time of the function itself, I think we
should try to reduce the number of times the function is called. Robert
proposed a way to reduce the number of calls per transaction. I think we
can go one step further and reduce the number of calls across
transactions.

One major problem today could be the way LWLock works. If the lock is
currently held in SHARED mode by some backend and some other backend now
requests it in SHARED mode, it will immediately get it. That's probably
the right thing to do, because you don't want the reader to wait when
the lock is readily available. But in the case of GetSnapshotData(),
every reader is doing exactly the same thing: computing a snapshot based
on the same shared state, and each would compute exactly the same
snapshot (if we ignore the fact that we don't include the caller's XID
in the xip array, but that's a minor detail). And because of the way
LWLock works, more and more readers keep getting in to compute the
snapshot until the exclusive waiters finally get a window to sneak in,
as processes slowly start sleeping while waiting for exclusive access.
To depict it, the four transactions below make overlapping calls to
GetSnapshotData(), so the total critical section starts when the first
caller enters it and ends when the last caller exits.

Txn1 ------[ SHARED ]---------------------
Txn2 --------[ SHARED ]-------------------
Txn3 -----------------[ SHARED ]-------------
Txn4 -------------------------------------------[ SHARED ]---------
     |<-------------------------- Total Time ----------------------->|

A couple of ideas come to mind to solve this issue.

A snapshot once computed will remain valid for every call irrespective
of its origin until at least one transaction ends. So we can store the
last computed snapshot in some shared area and reuse it for all
subsequent GetSnapshotData calls. The shared snapshot will get
invalidated when some transaction ends by calling
ProcArrayEndTransaction(). I tried this approach and saw a 15%
improvement for 32-80 clients on the 32 core HP IA box with pgbench -s
100 -N tests. Not bad, but I think this can be improved further.
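
To make this a bit more concrete, here is a minimal C sketch of the sort
of shared cache I have in mind. All the names are invented and the
locking is reduced to comments, so treat it as an illustration rather
than the patch itself:

/* Illustration only: simplified stand-ins for the real PostgreSQL types. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

typedef struct SnapshotData
{
    TransactionId xmin;
    TransactionId xmax;
    uint32_t      xcnt;
    TransactionId xip[64];      /* sized to the max backend count in reality */
} SnapshotData;

/* One shared copy of the last computed snapshot, protected by ProcArrayLock. */
typedef struct SharedSnapshotCache
{
    bool         valid;         /* cleared whenever a transaction ends */
    SnapshotData snap;
} SharedSnapshotCache;

/*
 * Called with ProcArrayLock held in SHARED mode.  Returns true if the cached
 * snapshot could be reused, in which case the caller can skip scanning the
 * ProcArray entirely; otherwise the caller computes a snapshot the usual way
 * and republishes it here.
 */
static bool
TryCopyCachedSnapshot(SharedSnapshotCache *cache, SnapshotData *snap)
{
    if (!cache->valid)
        return false;
    memcpy(snap, &cache->snap, sizeof(SnapshotData));
    return true;
}

/*
 * Called from ProcArrayEndTransaction() with ProcArrayLock held in EXCLUSIVE
 * mode: any snapshot computed before this point is no longer reusable.
 */
static void
InvalidateSharedSnapshot(SharedSnapshotCache *cache)
{
    cache->valid = false;
}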

What we can do is this: when a transaction comes to compute its
snapshot, it checks whether some other transaction is already computing
a snapshot for itself. If so, it just sleeps on the lock. When the other
process finishes computing the snapshot, it saves the snapshot in a
shared area and wakes up all processes waiting for the snapshot. Those
processes then just copy the snapshot from the shared area and are done.
This will not only reduce the total CPU consumption by avoiding
repetitive work, but would also reduce the total time for which
ProcArrayLock is held in SHARED mode by avoiding a pipeline of
GetSnapshotData calls. I am currently trying the shared work queue
mechanism to implement this, but I am sure we can do it in some other
way too.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-26 17:13:47
Message-ID: CA+TgmoYOcJAqtdLyTkFEeLSmgbNGEbcMnrG26gm7NyqMaTDrFg@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 26, 2011 at 10:52 AM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
> What we can do is when a transaction comes to compute its snapshot, it
> checks if some other transaction is already computing a snapshot for
> itself. If so, it just sleeps on the lock. When the other process
> finishes computing the snapshot, it saves the snapshot is a shared
> area and wakes up all processes waiting for the snapshot. All those
> processes then just copy the snapshot from the shared area and they
> are done. This will not only reduce the total CPU consumption by
> avoiding repetitive work, but would also reduce the total time for
> which ProcArrayLock is held in SHARED mode by avoiding pipeline of
> GetSnapshotData calls. I am currently trying the shared work queue
> mechanism to implement this, but I am sure we can do it this in some
> other way too.

I don't quite understand how this is going to work. Transaction A
ends, invalidating the shared snapshot. Now transactions B, C, and D
come along wishing to take snapshots. B begins taking a snapshot, so
C and D wait (blech!) for that to be complete. Then, they start
copying the snapshot. Meanwhile, transaction E ends, invalidating the
shared snapshot, and then transaction F comes along, wanting to take a
snapshot. If F constructs a new shared snapshot, then doesn't that
overwrite the same memory area C and D were busy copying? It seems
like F needs some way to know that C and D are done with the old
shared snapshot before it starts writing a new one. Furthermore, it's
hard to understand how this could be a net improvement in the general
case, because now both B and F are copying everything twice (once to
the shared area and once to backend-local memory) instead of just once
(to backend-local memory) and C and D are sleeping (yikes!). That
could maybe be a gain at very high concurrency when spinlock
contention is eating us alive, but it doesn't seem like a good idea in
general. Maybe I'm missing something; did you mean to attach a patch?

I had a different idea for avoiding repeated snapshot computation:
save the snapshot in backend-private memory. When a process takes
ProcArrayLock in exclusive mode, it increments a 64-bit counter.
When a process takes ProcArrayLock in shared mode, it can check
whether the counter has changed; if not, it can reuse the snapshot it
took before. I think there might be a way to tinker with the snapshot
management so as to do this without any more copying than we do
presently, since CurrentSnapshotData and SecondarySnapshotData are
basically treated as scratch-pads anyhow.
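
Very roughly, and with invented names, the bookkeeping might look like
this (the actual snapshot construction, subxid handling and locking are
all elided):

/* Sketch only; not actual ProcArray code. */
#include <stdbool.h>
#include <stdint.h>

/* In shared memory: incremented every time ProcArrayLock is taken in
 * exclusive mode, i.e. whenever the set of running XIDs can change. */
static uint64_t procarray_change_count = 0;

/* In backend-local memory: the counter value observed when we last built
 * a snapshot, and whether we have one cached at all. */
static uint64_t my_snapshot_change_count = 0;
static bool     my_snapshot_valid = false;

/* With ProcArrayLock held in SHARED mode: can we simply reuse the
 * snapshot we built last time? */
static bool
CachedSnapshotStillGood(void)
{
    return my_snapshot_valid &&
           my_snapshot_change_count == procarray_change_count;
}

/* After (re)building a snapshot, still under the shared lock, remember
 * the counter value it corresponds to. */
static void
RememberSnapshotVersion(void)
{
    my_snapshot_change_count = procarray_change_count;
    my_snapshot_valid = true;
}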

Another idea would be to try to avoid taking ProcArrayLock at all.
For example, suppose we again have a 64-bit counter in shared memory.
A transaction wishing to do ProcArrayEndTransaction() takes
ProcArrayLock in exclusive mode, increments the counter, memory
barrier, clears its XID, memory barrier, increments the counter,
releases ProcArrayLock. A transaction wishing to take a snapshot
first reads the counter. If the value read from the counter is even,
then we continue like this: memory barrier, take snapshot, memory
barrier, recheck counter. If we find that the counter is unchanged,
then nobody did ProcArrayEndTransaction() while we were taking the
snapshot, and we're good to go. If the counter changed then we take
ProcArrayLock in shared mode and retake the snapshot. (We also take
ProcArrayLock in shared mode if the initial value we read was odd.)
This seems like it would all but eliminate contention on the spinlock
protecting ProcArrayLock, but I'm not sure how much overhead it would
add in other cases. In particular, if we have to retake the snapshot
more than very occasionally, it's not going to be good.
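
A loose sketch of that protocol, using C11 atomics as stand-ins for
whatever fence primitives we would really use, and with the actual
snapshot construction elided:

/* Sketch only; not actual PostgreSQL code. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static _Atomic uint64_t procarray_counter = 0; /* even = stable, odd = update in progress */

/* Writer side, i.e. ProcArrayEndTransaction(), with ProcArrayLock held
 * in EXCLUSIVE mode. */
void
ProcArrayEndTransactionSketch(void)
{
    atomic_fetch_add(&procarray_counter, 1);   /* counter becomes odd */
    /* ... clear our XID in the shared array ... */
    atomic_fetch_add(&procarray_counter, 1);   /* counter becomes even again */
}

/* Reader side: returns true if the lock-free attempt produced a
 * consistent snapshot; false means "fall back to taking ProcArrayLock
 * in shared mode and retake the snapshot". */
bool
TryGetSnapshotDataUnlocked(void)
{
    uint64_t before = atomic_load(&procarray_counter);

    if (before & 1)
        return false;                          /* a writer is mid-update */

    /* ... scan the shared state and build the snapshot ... */

    /* If nobody ran ProcArrayEndTransaction() meanwhile, the snapshot we
     * just built is consistent. */
    return atomic_load(&procarray_counter) == before;
}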

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-26 17:32:15
Message-ID: CABOikdNLnUyBJeEL+0i4KOaJqRx0_hfvTPPj2x_G1Cgx4d+keQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Sat, Nov 26, 2011 at 10:52 AM, Pavan Deolasee
> <pavan(dot)deolasee(at)gmail(dot)com> wrote:
>> What we can do is when a transaction comes to compute its snapshot, it
>> checks if some other transaction is already computing a snapshot for
>> itself. If so, it just sleeps on the lock. When the other process
>> finishes computing the snapshot, it saves the snapshot is a shared
>> area and wakes up all processes waiting for the snapshot. All those
>> processes then just copy the snapshot from the shared area and they
>> are done. This will not only reduce the total CPU consumption by
>> avoiding repetitive work, but would also reduce the total time for
>> which ProcArrayLock is held in SHARED mode by avoiding pipeline of
>> GetSnapshotData calls. I am currently trying the shared work queue
>> mechanism to implement this, but I am sure we can do it this in some
>> other way too.
>
> I don't quite understand how this is going to work.

I will try, keeping it simple.

> Transaction A
> ends, invaliding the shared snapshot.  Now transactions B, C, and D
> come along wishing to take snapshots.  B begins taking a snapshot, so
> C and D wait (blech!)

Yeah, C and D wait. But that's better than all of them concurrently
doing the same computation.

> for that to be complete.  Then, they start
> copying the snapshot.

And they are holding the ProcArrayLock in shared mode.

> Meanwhile, transaction E ends, invalidating the
> shared snapshot,

E can't end until B, C and D are done with copying the snapshot.

> and then transaction F comes along, wanting to take a
> snapshot.  If F constructs a new shared snapshot, then doesn't that
> overwrite the same memory area C and D were busy copying?  It seems
> like F needs some way to know that C and D are done with the old
> shared snapshot before it starts writing a new one.

Right. And that can easily be achieved by holding a shared lock on the
ProcArray. The snapshot is invalidated while holding the exclusive lock,
and set/copied while holding the shared lock. I am assuming that setting
a boolean (valid/invalid) can safely be done under a shared lock, but I
may be wrong.

>  Furthermore, it's
> hard to understand how this could be a net improvement in the general
> case, because now both B and F are copying everything twice (once to
> the shared area and one to backend-local memory) instead of just once
> (to backend-local memory) and C and D are sleeping (yikes!).

Yes, B and F pay the price of a double copy. But I think it can be a net
saving because C and D (and, hopefully, many more) don't need to
recompute the snapshot by going over a potentially large ProcArray. So,
as I demonstrated in the illustration, the total time for which
ProcArrayLock is held will be significantly less with a large number of
clients.

>  That
> could maybe be a gain at very high concurrency when spinlock
> contention is eating us alive, but it doesn't seem like a good idea in
> general.  Maybe I'm missing something; did you mean to attach a patch?
>

I am attaching the patch. It contains some unrelated changes (I don't
know why; maybe I took the diff against some other branch), but you will
get the idea. It needs improvement, though, because it just checks
whether a valid shared snapshot is available and copies that; if not, it
goes ahead and computes one for itself. We need something more
intelligent so that we know a snapshot computation is in progress and we
should wait instead of building our own. This patch showed a 15-20%
improvement on the HP box that we are using for the benchmarks.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Attachment Content-Type Size
shared-snapshot-v2.patch application/octet-stream 12.7 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-26 18:56:27
Message-ID: 29164.1322333787@sss.pgh.pa.us
Lists: pgsql-hackers

Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> writes:
> On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Furthermore, it's
>> hard to understand how this could be a net improvement in the general
>> case, because now both B and F are copying everything twice (once to
>> the shared area and one to backend-local memory) instead of just once
>> (to backend-local memory) and C and D are sleeping (yikes!).

> Yes, B and F pay a price of double copy. But I think it can be a net
> saving because C and D (and many more hopefully) don't need to
> recompute the snapshot again by going over a potentially large
> ProcArray.

Like Robert, I'm finding it hard to believe that this isn't going to be
a net pessimization in as many or more cases as it'd be a win. Also,
didn't we just get rid of most of the issue of "going over a large
ProcArray" with the move of hot members to a separate array?

In the end, getting a snapshot is exactly about copying information out
of shared memory. Perhaps it would be more productive to think about
how we could further modify the ProcArray representation to reduce the
impedance mismatch between it and the snapshot representation, rather
than introducing a second shared-memory copy of the same information.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-26 20:42:12
Message-ID: 201111262142.13130.andres@anarazel.de
Lists: pgsql-hackers

Hi,

On Saturday, November 26, 2011 04:52:50 PM Pavan Deolasee wrote:
> I think now that we have reduced the run time of the function itself,
> we should now try to reduce the number of times the function is
> called. Robert proposed a way to reduce the number of calls per
> transaction. I think we can go one more step further and reduce the
> number for across the transactions.
You could also try whether it makes a difference to reduce SnapshotData
to one cacheline instead of two. The data itself fits into one without
problems. Trivial patch attached.
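
Roughly, the effect I am after is the usual padding game with 8-byte
pointers and 4-byte counters; a generic illustration (not the actual
SnapshotData declaration) on a 64bit ABI:

#include <stdint.h>

/* Each 4-byte counter sitting alone in front of an 8-byte pointer costs
 * 4 bytes of padding: */
struct padded
{
    void     *xip;              /* offset  0, 8 bytes                      */
    uint32_t  xcnt;             /* offset  8, 4 bytes + 4 bytes of padding */
    void     *subxip;           /* offset 16, 8 bytes                      */
    uint32_t  subxcnt;          /* offset 24, 4 bytes + 4 bytes tail pad   */
};                              /* sizeof == 32 */

/* Grouping the counters lets them share one 8-byte slot: */
struct packed
{
    void     *xip;              /* offset  0, 8 bytes */
    void     *subxip;           /* offset  8, 8 bytes */
    uint32_t  xcnt;             /* offset 16, 4 bytes */
    uint32_t  subxcnt;          /* offset 20, 4 bytes */
};                              /* sizeof == 24 */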

More generally, I think we should check this for the more commonly used
structures; we have many with too much padding.

Andres

Attachment Content-Type Size
snapshotdata-one-cacheline.patch text/x-patch 1.2 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-26 20:52:17
Message-ID: 4427.1322340737@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> You could also try if it makes a difference reducing SnapshotData to one
> instead of two cachelines. The data itself fits into one without problems.
> Trivial patch attached.

On what grounds do you argue that this patch gets SnapshotData into one
cacheline today (and on what hardware), or that it will continue to do
so in the future? If we're this close to the edge then any future
addition to the struct will destroy the point. I'd just as soon keep
the fields in a logical order.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-26 21:18:21
Message-ID: 201111262218.21421.andres@anarazel.de
Lists: pgsql-hackers

On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > You could also try if it makes a difference reducing SnapshotData to one
> > instead of two cachelines. The data itself fits into one without
> > problems. Trivial patch attached.
> On what grounds do you argue that this patch gets SnapshotData into one
> cacheline today (and on what hardware), or that it will continue to do
> so in the future? If we're this close to the edge then any future
> addition to the struct will destroy the point. I'd just as soon keep
> the fields in a logical order.
To my knowledge there is no current and supported (i.e. I don't count Itanium)
hardware that doesn't have 64bit cachelines except in the embedded market.
I also don't see a movement towards 128bit cpus or anything else requiring
bigger pointers.
If platforms with 128bit cachelines or such would gain popularity - nothing
would be lost by that change.
Which change are you "afraid" of?

Now, the holes I talked about were obviously on a 64bit machine. But
there are a few situations where improving the layout so it fits 8-byte
alignment for pointers makes the situation worse on 32bit. Looking a bit
carefully at the changes one makes should catch those problems.

Also, it's not *that* close to overflowing again. It was 72 bytes
before; now it's 56 on a 64bit x86 machine with the Linux ABI.

I agree that logical order is important. I don't propose changing all
structs in pg mechanically.

It's sad that there is no sensible method to let the compiler safely
reorder struct members across compilers (and compiler versions) and
platforms...

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-26 22:28:30
Message-ID: 201111262328.30225.andres@anarazel.de
Lists: pgsql-hackers

On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> I'd just as soon keep the fields in a logical order.
Btw, I don't think the new order is necessarily worse than the old one.

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-26 22:39:23
Message-ID: CA+Tgmoae8yoVAOVM7PGO5FhLMhN2LL3K3A0iz+4vWuZGWdFisA@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
>> I'd just as soon keep the fields in a logical order.
> Btw, I don't think the new order is necessarily worse than the old one.

You forget to attach the benchmark results.

My impression is that cache lines on modern hardware are 64 or 128
*bytes*, in which case this wouldn't matter a bit.

But testing is even better than guessing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-26 22:49:19
Message-ID: 201111262349.20067.andres@anarazel.de
Lists: pgsql-hackers

On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
> On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> >> I'd just as soon keep the fields in a logical order.
> >
> > Btw, I don't think the new order is necessarily worse than the old one.
>
> You forget to attach the benchmark results.
>
> My impression is that cache lines on modern hardware are 64 or 128
> *bytes*, in which case this wouldn't matter a bit.
All current x86 cpus use 64bytes. The 2nd 128bit reference was a typo. Sorry
for that.
And why is 72=>56 *bytes* (I even got that one right) not relevant for 64byte
cachelines?

And yea. I didn't add benchmark results. I don't think I *have* to do that
when making suggestions to somebody trying to improve something specific. I
also currently don't have hardware where I can sensibly run at a high enough
concurrency to see that GetSnapshotData takes ~40% of runtime.
To my knowledge, additional cacheline references around synchronized
access can hurt...

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-26 23:51:06
Message-ID: 201111270051.06904.andres@anarazel.de
Lists: pgsql-hackers

On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
> On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> >> I'd just as soon keep the fields in a logical order.
> >
> > Btw, I don't think the new order is necessarily worse than the old one.
>
> You forget to attach the benchmark results.
>
> My impression is that cache lines on modern hardware are 64 or 128
> *bytes*, in which case this wouldn't matter a bit.
>
> But testing is even better than guessing.
Being prodded like that, I ran a very quick benchmark on my workstation.
Unfortunately that means I cannot work during that time, which is why I
kept it rather short...

That machine has 2 E5520s @ 2.27GHz, which means 2 (sockets) * 4 (cores)
* 2 (threads), and 24GB of RAM.

Data was initialized with: pgbench -h /tmp/ --unlogged-tables -i -s 20 pgbench

pgbench -h /tmp/ pgbench -S -j 16 -c 16 -T 60

origsnap: 92825.743958 93145.110901 93389.915474 93175.482351
reordersnap: 93560.183272 93613.333494 93495.263012 93523.368489

pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 60

origsnap: 81846.743329 81545.175672 81702.755226 81576.576435
81228.154119 81546.047708 81421.436262
reordersnap: 81823.479196 81787.784508 81820.242145 81790.263415
81762.421592 81496.333144 81732.088876

At that point I noticed I had accidentally run with a nearly stock
config... An even shorter run with a more appropriate config yielded:

pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 20

origsnap: 102234.664020 102003.449741 102119.509053 101722.410387
101973.651318 102056.440561
reordersnap: 103444.877879 103385.888808 103302.318923 103372.659486
103330.157612 103313.833821

Looks like a win to me. Even on this comparatively small machine.

Andres


From: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-28 19:55:28
Message-ID: CABwTF4UhinaNb+jPTVmcznew1d7HstH8Nqw9t0Ag86tCYji-QQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 26, 2011 at 6:51 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
> > On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres(at)anarazel(dot)de>
> wrote:
> > > On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> > >> I'd just as soon keep the fields in a logical order.
> > >
> > > Btw, I don't think the new order is necessarily worse than the old one.
> >
> > You forget to attach the benchmark results.
> >
> > My impression is that cache lines on modern hardware are 64 or 128
> > *bytes*, in which case this wouldn't matter a bit.
> >
> > But testing is even better than guessing.
> Being prodded like that I ran a very quick benchmark on my workstation.
> Unfortunately that means I cannot work during the time which is why I kept
> it
> rather short...
>
> That machine has 2 E5520(at)2(dot)27GHz which means 2(sockets) * 4(cores) *
> 2(threads) and 24GB of ram.
>
> Data was initialized with: pgbench -h /tmp/ --unlogged-tables -i -s 20
> pgbench
>
>
> pgbench -h /tmp/ pgbench -S -j 16 -c 16 -T 60
>
> origsnap: 92825.743958 93145.110901 93389.915474 93175.482351
> reordersnap: 93560.183272 93613.333494 93495.263012 93523.368489
>
> pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 60
>
> origsnap: 81846.743329 81545.175672 81702.755226 81576.576435
> 81228.154119 81546.047708 81421.436262
> reordersnap: 81823.479196 81787.784508 81820.242145 81790.263415
> 81762.421592 81496.333144 81732.088876
>
> At that point I noticed I had accidentally run with a nearly stock
> config...
> An even shorter run with a more approrioate config yielded:
>
>
> pgbench -h /tmp/ pgbench -S -j 32 -c 32 -T 20
>
> origsnap: 102234.664020 102003.449741 102119.509053 101722.410387
> 101973.651318 102056.440561
> reordersnap: 103444.877879 103385.888808 103302.318923 103372.659486
> 103330.157612 103313.833821
>
>
>
> Looks like a win to me. Even on this comparatively small machine.
>

This may not be necessary, but can you please share the modified config
you used for the last run?

I tabulated your last results to make them more readable, and added
columns to show the improvement.

origsnap         reordersnap      diff           %age improvement
------------------------------------------------------------------
102234.664020    103444.877879    1210.213859    1.1837607827
102003.449741    103385.888808    1382.439067    1.3552865815
102119.509053    103302.318923    1182.809870    1.1582604352
101722.410387    103372.659486    1650.249099    1.6223063263
101973.651318    103330.157612    1356.506294    1.3302517626
102056.440561    103313.833821    1257.393260    1.2320567454

That looks like a win to me too. We're getting a little over 1% improvement
for free!

Maybe submitting this patch to the commitfest might help get some serious
consideration from a reviewer.

Regards,
--
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-29 01:41:32
Message-ID: 201111290241.32799.andres@anarazel.de
Lists: pgsql-hackers

Hi Gurjeet,

On Monday, November 28, 2011 08:55:28 PM Gurjeet Singh wrote:
> On Sat, Nov 26, 2011 at 6:51 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
> > > On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres(at)anarazel(dot)de>
> >
> > wrote:
> > > > On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
> > > >> I'd just as soon keep the fields in a logical order.
> > > >
> > > > Btw, I don't think the new order is necessarily worse than the old
> > > > one.
> > >
> > > You forget to attach the benchmark results.
> > >
> > > My impression is that cache lines on modern hardware are 64 or 128
> > > *bytes*, in which case this wouldn't matter a bit.
> > >
> > > But testing is even better than guessing.
> >
> > Being prodded like that I ran a very quick benchmark on my workstation.
> > Unfortunately that means I cannot work during the time which is why I
> > kept it
> > rather short...
> >
> > That machine has 2 E5520(at)2(dot)27GHz which means 2(sockets) * 4(cores) *
> > 2(threads) and 24GB of ram.
> >
> > Data was initialized with: pgbench -h /tmp/ --unlogged-tables -i -s 20
> > pgbench
> >
> >
> > pgbench -h /tmp/ pgbench -S -j 16 -c 16 -T 60
> > ...
> > Looks like a win to me. Even on this comparatively small machine.
> This may not be necessary, but can you please share the modified config you
> used for the last run.
Can't right now because I don't have access from here, but I will do so.
I don't think it's particularly interesting though.

> I tabulated your last results to make it more readable, and added columns
> to show the improvement.
>
> origsnap reordersnap diff %age improvement
> ------------------------------------------------------------------
> 102234.66402 103444.877879 1210.213859 1.1837607827
> 102003.449741 103385.888808 1382.439067 1.3552865815
> 102119.509053 103302.318923 1182.80987 1.1582604352
> 101722.410387 103372.659486 1650.249099 1.6223063263
> 101973.651318 103330.157612 1356.506294 1.3302517626
> 102056.440561 103313.833821 1257.39326 1.2320567454
>
> That looks like a win to me too. We're getting a little over 1% improvement
> for free!
At least if we can convince Tom that such a change would be ok ;)

I would like to see somebody running a benchmark on a machine with higher
concurrency...

> Maybe submitting this patch to the commitfest might help get some serious
> consideration from a reviewer.
It wasn't actually meant as a patch but as a comment, but oh well ;).
Will add it.

Andres


From: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-29 05:12:21
Message-ID: CABOikdP99LD1BM_6EKJV=dUGEDAmb0J5hVSeWcT1TUFSebZz8w@mail.gmail.com
Lists: pgsql-hackers

On Sun, Nov 27, 2011 at 12:26 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> writes:
>> On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> Furthermore, it's
>>> hard to understand how this could be a net improvement in the general
>>> case, because now both B and F are copying everything twice (once to
>>> the shared area and one to backend-local memory) instead of just once
>>> (to backend-local memory) and C and D are sleeping (yikes!).
>
>> Yes, B and F pay a price of double copy. But I think it can be a net
>> saving because C and D (and many more hopefully) don't need to
>> recompute the snapshot again by going over a potentially large
>> ProcArray.
>
> Like Robert, I'm finding it hard to believe that this isn't going to be
> a net pessimization in as many or more cases as it'd be a win.  Also,
> didn't we just get rid of most of the issue of "going over a large
> ProcArray" with the move of hot members to a separate array?
>

Yeah, separating the PGPROC members has helped a lot. But that does not
reduce the number of GetSnapshotData calls; it only makes each call less
expensive. As I said, I saw a 15-20% improvement with a patch that isn't
even slightly tuned, so I am optimistic about the potential of the
approach.

> In the end, getting a snapshot is exactly about copying information out
> of shared memory.  Perhaps it would be more productive to think about
> how we could further modify the ProcArray representation to reduce the
> impedance mismatch between it and the snapshot representation, rather
> than introducing a second shared-memory copy of the same information.
>

I think that's a good idea. We need a representation that needs minimal
processing to derive the snapshot. One idea is to maintain a bitmap of
currently running transactions in shared memory. The size of the bitmap
has to be significantly larger than the ProcArray itself because there
can be gaps, but even then, since it's just a bit per XID, we can have a
bitmap representing a window of, say, 16K or 32K consecutive XIDs. We
can also have a summary bitmap to quickly find whether there is at least
one running transaction in any given chunk. If there are long-running
transactions which don't fit in the bitmap, we can track them separately
in an array. A snapshot is then just a copy of this bitmap along with
some additional information.
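
As a very rough sketch of what I mean (all names and sizes invented, and
locking and XID wraparound completely ignored):

/* Sketch only; not actual PostgreSQL code. */
#include <stdbool.h>
#include <stdint.h>

#define XID_WINDOW  32768                      /* consecutive XIDs covered */
#define WORD_BITS   64

typedef uint32_t TransactionId;

typedef struct RunningXidBitmap
{
    TransactionId base;                        /* XID represented by bit 0 */
    uint64_t bits[XID_WINDOW / WORD_BITS];     /* 1 = transaction running */
    uint64_t summary[XID_WINDOW / WORD_BITS / 64]; /* 1 = word has a bit set */
} RunningXidBitmap;

/* Mark an XID as running.  The summary bit lets a snapshot skip whole
 * 64 * 64 = 4096-XID chunks that contain no running transaction at all. */
static void
bitmap_set_running(RunningXidBitmap *bm, TransactionId xid)
{
    uint32_t bit = xid - bm->base;             /* assumes xid lies in the window */

    bm->bits[bit / WORD_BITS] |= UINT64_C(1) << (bit % WORD_BITS);
    bm->summary[bit / (WORD_BITS * 64)] |= UINT64_C(1) << ((bit / WORD_BITS) % 64);
}

/* Snapshot side: is this XID still in progress? */
static bool
bitmap_is_running(const RunningXidBitmap *bm, TransactionId xid)
{
    uint32_t bit = xid - bm->base;

    return (bm->bits[bit / WORD_BITS] >> (bit % WORD_BITS)) & 1;
}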

Will try to think more about it and see if I can test this idea. But
any comments welcome.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-29 15:19:15
Message-ID: 201111291519.pATFJFM22502@momjian.us
Lists: pgsql-hackers

Pavan Deolasee wrote:
> On Sun, Nov 27, 2011 at 12:26 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> writes:
> >> On Sat, Nov 26, 2011 at 10:43 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >>> Furthermore, it's
> >>> hard to understand how this could be a net improvement in the general
> >>> case, because now both B and F are copying everything twice (once to
> >>> the shared area and one to backend-local memory) instead of just once
> >>> (to backend-local memory) and C and D are sleeping (yikes!).
> >
> >> Yes, B and F pay a price of double copy. But I think it can be a net
> >> saving because C and D (and many more hopefully) don't need to
> >> recompute the snapshot again by going over a potentially large
> >> ProcArray.
> >
> > Like Robert, I'm finding it hard to believe that this isn't going to be
> > a net pessimization in as many or more cases as it'd be a win. Also,
> > didn't we just get rid of most of the issue of "going over a large
> > ProcArray" with the move of hot members to a separate array?
> >
>
> Yeah, separating the PGPROC members has helped a lot. But that does
> not reduce the number of GetSnapshotData calls. It only makes each
> call less expensive. As I said, I had seen 15-20% improvement with not
> even a slightly tuned patch, so I am optimistic about the potential of
> the approach.

Agreed. I think there is great potential here. We have been stumped at
how to reduce the overhead of this for years, and it seems you are now
on a promising path.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Gurjeet Singh <singh(dot)gurjeet(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-29 22:00:50
Message-ID: 201111292300.51042.andres@anarazel.de
Lists: pgsql-hackers

Hi,

On Monday, November 28, 2011 08:55:28 PM Gurjeet Singh wrote:
> This may not be necessary, but can you please share the modified config you
> used for the last run.
I just copied the .conf I had for some independent development.

max_connections = 100
listen_addresses = ''
port=5432
shared_buffers = 2GB
temp_buffers = 64MB
work_mem = 96MB
maintenance_work_mem = 1GB
effective_cache_size = 20GB

log_line_prefix = '%d %a %c %v %x %m ' # special values:
log_statement = 'ddl'

#decreases variance
track_activities = off
track_counts = off
autovacuum = off

update_process_title = off
logging_collector = off
log_destination = stderr
log_min_messages = error
log_checkpoints = on
log_autovacuum_min_duration=0
synchronous_commit = off

checkpoint_segments = 30
checkpoint_timeout = 30min
checkpoint_completion_target = 0.8
log_error_verbosity = verbose

But about the only thing really important is the shared_buffers difference
(commenting it out nearly yielded the former speed)

Andres


From: Ants Aasma <ants(dot)aasma(at)eesti(dot)ee>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-11-30 00:03:23
Message-ID: CA+CSw_u1Pv37wn8ovfH0D3yq8dsXV1qxrfzJqPzE1H3G3hXKPQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Nov 29, 2011 at 7:12 AM, Pavan Deolasee
<pavan(dot)deolasee(at)gmail(dot)com> wrote:
> I think that a good idea. We need a representation that needs minimum
> processing to derive the snapshot.

I was looking over the generated code for GetSnapshotData to see if there
is any low hanging fruit for micro-optimization. The assembly mostly looks
pretty tight, but there are 3 function calls to TransactionIdPrecedes and
TransactionIdFollowsOrEquals. All the parameters are known to be normal
xids, so there are duplicated checks for that and a lot of movs for the calling
convention. I wonder if replacing them with special case macros would be
a good idea. In that case the whole check will compile down to one cmp
instruction. I'm running a set of benchmarks now on my laptop, but I guess
the difference will mostly become noticeable on beefier hardware when
ProcArray lock is heavily contended. Attached is a patch, if anyone wishes
to give it a go.
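
For reference, the sort of special-case macro I mean is below. It
assumes the caller already knows both XIDs are normal, so the
modulo-2^31 comparison collapses to a single subtraction and sign test
(in-tree this would use PostgreSQL's TransactionId and int32 typedefs):

/* Sketch only; callers must guarantee both arguments are normal XIDs. */
#include <stdint.h>

typedef uint32_t TransactionId;

#define NormalTransactionIdPrecedes(id1, id2) \
    ((int32_t) ((id1) - (id2)) < 0)

#define NormalTransactionIdFollowsOrEquals(id1, id2) \
    ((int32_t) ((id1) - (id2)) >= 0)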

--
Ants Aasma

Attachment Content-Type Size
opt-snapshot-txcmp.patch text/x-patch 1.6 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Subject: Re: Avoiding repeated snapshot computation
Date: 2011-12-13 17:52:30
Message-ID: CA+TgmoaKu-h=1oHFqwUw+9Qg4zkyvS8bWcW6Kiw2qxFp3JvC1w@mail.gmail.com
Lists: pgsql-hackers

On Sat, Nov 26, 2011 at 5:49 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Saturday, November 26, 2011 11:39:23 PM Robert Haas wrote:
>> On Sat, Nov 26, 2011 at 5:28 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> > On Saturday, November 26, 2011 09:52:17 PM Tom Lane wrote:
>> >> I'd just as soon keep the fields in a logical order.
>> >
>> > Btw, I don't think the new order is necessarily worse than the old one.
>>
>> You forget to attach the benchmark results.
>>
>> My impression is that cache lines on modern hardware are 64 or 128
>> *bytes*, in which case this wouldn't matter a bit.
> All current x86 cpus use 64bytes. The 2nd 128bit reference was a typo. Sorry
> for that.
> And why is 72=>56 *bytes* (I even got that one right) not relevant for 64byte
> cachelines?

OK, so I got around to looking at this again today; sorry for the
delay. I agree that 72 -> 56 bytes is very relevant for 64-byte cache
lines. I had not realized the structure was as big as that.

> And yea. I didn't add benchmark results. I don't think I *have* to do that
> when making suggestions to somebody trying to improve something specific. I
> also currently don't have hardware where I can sensibly run at a high enough
> concurrency to see that GetSnapshotData takes ~40% of runtime.
> Additional cacheline references around synchronized access can hurt to my
> knowledge...

I tested this on Nate Boley's 32-core box today with 32 clients doing a
select-only pgbench test. Results of individual 5-minute runs:

results.master.32:tps = 171701.947919 (including connections establishing)
results.master.32:tps = 227777.430112 (including connections establishing)
results.master.32:tps = 218257.478461 (including connections establishing)
results.master.32:tps = 226425.964855 (including connections establishing)
results.master.32:tps = 218687.662373 (including connections establishing)
results.master.32:tps = 219819.451605 (including connections establishing)
results.master.32:tps = 216800.131634 (including connections establishing)
results.snapshotdata-one-cacheline.32:tps = 181997.531357 (including connections establishing)
results.snapshotdata-one-cacheline.32:tps = 171505.145440 (including connections establishing)
results.snapshotdata-one-cacheline.32:tps = 226970.348605 (including connections establishing)
results.snapshotdata-one-cacheline.32:tps = 169725.115763 (including connections establishing)
results.snapshotdata-one-cacheline.32:tps = 219548.690522 (including connections establishing)
results.snapshotdata-one-cacheline.32:tps = 175440.705722 (including connections establishing)
results.snapshotdata-one-cacheline.32:tps = 217154.173823 (including connections establishing)

There's no statistically significant difference between those two data
sets; if anything, the results with the patch look like they might be
worse, although I believe that's an artifact - some runs seem to
mysteriously come out in the 170-180 range instead of the 215-225
range, with nothing in between. But even if you only average the good
runs out of each set there's no material difference.

Having said that, I am coming around to the view that we should apply
this patch anyway, just because it reduces memory consumption. Since
the size change crosses a power-of-two boundary, I believe it will
actually cut down the size of a palloc'd chunk for a SnapshotData
object from 128 bytes to 64 bytes. I doubt we can measure the benefit
of that even on a microbenchmark unless someone has a better idea for
making PostgreSQL take ridiculous numbers of snapshots than I do. It
still seems like a good idea, though: a penny saved is a penny earned.
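
To spell the arithmetic out (assuming the allocator rounds small
requests up to the next power of two, which is my understanding of how
aset.c sizes chunks in this range), a throwaway check:

#include <stdio.h>

/* Hypothetical mirror of power-of-two chunk rounding for small requests. */
static size_t
chunk_size(size_t request)
{
    size_t size = 8;                    /* assumed minimum chunk size */

    while (size < request)
        size <<= 1;
    return size;
}

int
main(void)
{
    printf("72-byte struct -> %zu-byte chunk\n", chunk_size(72));  /* prints 128 */
    printf("56-byte struct -> %zu-byte chunk\n", chunk_size(56));  /* prints 64 */
    return 0;
}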

With respect to Tom's objections downthread, I don't think that the
new ordering is significantly more confusing than the old one.
xcnt/subxcnt/xip/subxip doesn't seem measurably less clear than
xcnt/xip/subxcnt/subxip, and we're not normally in the habit of
letting such concerns get in the way of squeezing out alignment
padding.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding repeated snapshot computation
Date: 2012-08-17 01:02:10
Message-ID: 20120817010210.GK30286@momjian.us
Lists: pgsql-hackers


Did we ever make a decision on this patch?

---------------------------------------------------------------------------

On Sat, Nov 26, 2011 at 09:22:50PM +0530, Pavan Deolasee wrote:
> On some recent benchmarks and profile data, I saw GetSnapshotData
> figures at the very top or near top. For lesser number of clients, it
> can account for 10-20% of time, but more number of clients I have seen
> it taking up as much as 40% of sample time. Unfortunately, the machine
> of which I was running these tests is currently not available and so I
> don't have the exact numbers. But the observation is almost correct.
> Our recent work on separating the hot members of PGPROC in a separate
> array would definitely reduce data cache misses ans reduce the
> GetSnapshotData time, but it probably still accounts for a large
> enough critical section for a highly contended lock.
>
> I think now that we have reduced the run time of the function itself,
> we should now try to reduce the number of times the function is
> called. Robert proposed a way to reduce the number of calls per
> transaction. I think we can go one more step further and reduce the
> number for across the transactions.
>
> One major problem today could be because the way LWLock works. If the
> lock is currently held in SHARED mode by some backend and some other
> backend now requests it in SHARED mode, it will immediately get it.
> Thats probably the right thing to do because you don't want the reader
> to really wait when the lock is readily available. But in the case of
> GetSnapshotData(), every reader is doing exactly the same thing; they
> are computing a snapshot based on the same shared state and would
> compute exactly the same snapshot (if we ignore the fact that we don't
> include caller's XID in xip array, but thats a minor detail). And
> because the way LWLock works, more and more readers would get in to
> compute the snapshot, until the exclusive waiters get a window to
> sneak in, either because more and more processes slowly start sleeping
> for exclusive access. To depict it, the four transactions make
> overlapping calls for GetSnapshotData() and hence the total critical
> section starts when the first caller enters it and the ends when the
> last caller exits.
>
> Txn1 ------[ SHARED ]---------------------
> Txn2 --------[ SHARED ]-------------------
> Txn3 -----------------[ SHARED ]-------------
> Txn4 -------------------------------------------[ SHARED
> ]---------
> |<---------------Total Time ------------------------------------>|
>
> Couple of ideas come to mind to solve this issue.
>
> A snapshot once computed will remain valid for every call irrespective
> of its origin until at least one transaction ends. So we can store the
> last computed snapshot in some shared area and reuse it for all
> subsequent GetSnapshotData calls. The shared snapshot will get
> invalidated when some transaction ends by calling
> ProcArrayEndTransaction(). I tried this approach and saw a 15%
> improvement for 32-80 clients on the 32 core HP IA box with pgbench -s
> 100 -N tests. Not bad, but I think this can be improved further.
>
> What we can do is when a transaction comes to compute its snapshot, it
> checks if some other transaction is already computing a snapshot for
> itself. If so, it just sleeps on the lock. When the other process
> finishes computing the snapshot, it saves the snapshot is a shared
> area and wakes up all processes waiting for the snapshot. All those
> processes then just copy the snapshot from the shared area and they
> are done. This will not only reduce the total CPU consumption by
> avoiding repetitive work, but would also reduce the total time for
> which ProcArrayLock is held in SHARED mode by avoiding pipeline of
> GetSnapshotData calls. I am currently trying the shared work queue
> mechanism to implement this, but I am sure we can do it this in some
> other way too.
>
> Thanks,
> Pavan
>
> --
> Pavan Deolasee
> EnterpriseDB     http://www.enterprisedb.com
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Avoiding repeated snapshot computation
Date: 2012-08-20 15:39:05
Message-ID: CA+TgmoZ7nVYbWB_rDsMUMAJamLQ9fPWvHfXVMMnA1qAxTeKY-w@mail.gmail.com
Lists: pgsql-hackers

On Thu, Aug 16, 2012 at 9:02 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Did we ever make a decision on this patch?

I committed it as 1fc3d18faa8f4476944bc6854be0f7f6adf4aec8.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company