Re: Wait free LW_SHARED acquisition - v0.2

From: Peter Geoghegan <pg(at)heroku(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Wait free LW_SHARED acquisition - v0.2
Date: 2014-02-02 03:47:29
Message-ID: CAM3SWZQeuJVhUA49cSm6mzf7OHxqa2p5ZaWPpdTQ6R+sPW2hGQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Feb 1, 2014 at 1:41 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> However, I tested the
>> most recent revision from your git remote on the AWS instance.

> But that was before my fix, right. Except you managed to timetravel :)

Heh, okay. So Nathan Boley has generously made available a machine
with 4 AMD Opteron 6272s. I've performed the same benchmark on that
server. However, I thought it might be interesting to circle back and
get some additional numbers for the AWS instance already tested - I'd
like to see what it looks like after your recent tweaks to fix the
regression. The single-client performance of that instance seems to be
markedly better than that of Nathan's server.

Tip: AWS command line tools + S3 are a great way to easily publish
bulky pgbench-tools results, once you figure out how to set your S3
bucket's access policy to allow public HTTP access. It has similar
advantages to rsync, and just works with a minimum of fuss.
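
In case it saves anyone a few minutes: the upload step can be scripted
with boto3 along these lines. This is only a rough sketch; the bucket
name and results path are placeholders, and it assumes the bucket
already exists with static website hosting enabled and credentials
configured.

    # Sketch: push a pgbench-tools results directory to S3 with public-read
    # ACLs. "my-benchmark-results" and the results path are placeholders.
    import mimetypes
    import os

    import boto3

    def publish_results(results_dir, bucket="my-benchmark-results"):
        s3 = boto3.client("s3")
        for root, _, files in os.walk(results_dir):
            for name in files:
                path = os.path.join(root, name)
                key = os.path.relpath(path, results_dir)
                ctype = mimetypes.guess_type(path)[0] or "binary/octet-stream"
                # public-read is what makes the report reachable over plain HTTP
                s3.upload_file(path, bucket, key,
                               ExtraArgs={"ACL": "public-read",
                                          "ContentType": ctype})

    publish_results("results/html")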

Anyway, I don't think that the new, third c3.8xlarge-rwlocks testset
tells us much of anything:
http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/c38xlarge-rwlocks/

Here are the results of a benchmark on Nathan Boley's 64-core, 4
socket server: http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/amd-4-socket-rwlocks/

Perhaps I should have gone past 64 clients, because in the document
"Lock Scaling Analysis on Intel Xeon Processors" [1], Intel write:

"This implies that with oversubscription (more threads running than
available logical CPUs), the performance of spinlocks can depend
heavily on the exact OS scheduler behavior, and may change drastically
with operating system or VM updates."

I haven't bothered with higher client counts, though, because Andres
noted it's the same with 90 clients on this AMD system. Andres: Do you
see big problems when # clients < # logical cores on the affected
Intel systems?

There is only a marginal improvement in performance on this big
4-socket system. Andres informs me privately that he has reproduced the
problem on multiple new 4-socket Intel servers, so it seems reasonable
to suppose that this is more or less an Intel thing.

The Intel document [1] further notes:

"As the number of threads polling the status of a lock address
increases, the time it takes to process those polling requests will
increase. Initially, the latency to transfer data across socket
boundaries will always be an order of magnitude longer than the
on-chip cache-to-cache transfer latencies. Such cross-socket
transfers, if they are not effectively minimized by software, will
negatively impact the performance of any lock algorithm that depends
on them."

So, I think it's fair to say, given what we now know from Andres'
numbers and the numbers I got from Nathan's server, that this patch is
closer to being something that addresses a particularly unfortunate
pathology on many-socket Intel systems than it is to being a general
performance optimization. Based on the above quoted passage, it isn't
unreasonable to suppose other vendors or architectures could be
affected, but that isn't in evidence. While I welcome the use of
atomic operations in the context of LW_SHARED acquisition as general
progress, I think that to the outside world my proposed messaging is
more useful. It's not quite a bug-fix, but if you're using a
many-socket Intel server, you're *definitely* going to want to use a
PostgreSQL version that is unaffected. You may well not want to take
on the burden of waiting for 9.4, or waiting for it to fully
stabilize.

I note that Andres has a feature branch of this backported to Postgres
9.2, no doubt because of a request from a 2ndQuadrant customer. I have
to wonder if we should think about making this available with a
configure switch in one or more back branches. I think that the
complete back-porting of the fsync request queue issue's fix in commit
758728 could be considered a precedent - that too was a fix for a
really bad worst-case that was encountered fairly infrequently in the
wild. It's sort of horrifying to have red-hot spinlocks in production,
so that seems like the kind of thing we should make an effort to
address for those running multi-socket systems. Of those running
Postgres on new multi-socket systems, the reality is that the majority
are running on Intel hardware. Unfortunately, everyone knows that
Intel will soon be the only game in town when it comes to high-end
x86_64 servers, which contributes to my feeling that we need to target
back branches. We should do something about the possible regression
with older compilers using the fallback first, though.

It would be useful to know more about the nature of the problem that
made such an appreciable difference in Andres' original post. Again,
through private e-mail, I saw perf profiles from affected servers and
an unaffected though roughly comparable server (i.e. Nathan's 64-core
AMD server). Andres observed that the "stalled-cycles-frontend" and
"stalled-cycles-backend" Linux perf events varied hugely depending on
whether these Intel systems were patched or unpatched. They were about
the same on the AMD system to begin with.
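
For anyone who wants to repeat that kind of measurement, it amounts to
running perf stat against those two events while the benchmark is
underway. A trivial wrapper might look like the following (just a
sketch; the 30 second system-wide window is an arbitrary choice, and
on a shared box you'd probably want "-p <backend pid>" instead of -a):

    # Sketch: count stalled-cycles-frontend/backend system-wide for 30 seconds.
    # perf stat writes its summary to stderr, so that's what we print; running
    # this may require root or a relaxed kernel.perf_event_paranoid setting.
    import subprocess

    events = "stalled-cycles-frontend,stalled-cycles-backend,cycles"
    result = subprocess.run(
        ["perf", "stat", "-e", events, "-a", "sleep", "30"],
        capture_output=True, text=True)
    print(result.stderr)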

[1] http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/xeon-lock-scaling-analysis-paper.pdf
--
Peter Geoghegan
