Re: MVCC catalog access

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: MVCC catalog access
Date: 2013-06-20 14:35:14
Message-ID: 20130620143514.GA16659@awork2.anarazel.de
Lists: pgsql-hackers

On 2013-06-20 09:45:26 -0400, Robert Haas wrote:
> > With that setup one can create quite a noticeable overhead for the mvcc
> > patch (best of 5):
> >
> > master-optimize:
> > tps = 1261.629474 (including connections establishing)
> > tps = 15121.648834 (excluding connections establishing)
> >
> > dev-optimize:
> > tps = 773.719637 (including connections establishing)
> > tps = 2804.239979 (excluding connections establishing)
> >
> > Most of the time, in both patched and unpatched, is by far spent in
> > GetSnapshotData. I think the reason this shows a far higher overhead
> > than what you previously measured is that a) in your test the other
> > backends were idle, in mine they actually modify PGXACT which causes
> > noticeable cacheline bouncing b) I have a higher number of connections &
> > max_connections
> >
> > A quick test shows that even with max_connections=600, 400 background,
> > and 100 foreground pgbenches there's noticeable overhead:
> > master-optimize:
> > tps = 2221.226711 (including connections establishing)
> > tps = 31203.259472 (excluding connections establishing)
> > dev-optimize:
> > tps = 1629.734352 (including connections establishing)
> > tps = 4754.449726 (excluding connections establishing)
> >
> > Now I grant that's a somewhat harsh test for postgres, but I don't
> > think it's entirely unreasonable and the performance impact is quite
> > stark.
>
> It's not entirely unreasonable, but it *is* mostly unreasonable.

Well, sure. So are the tests that you ran. But that's *completely*
fine. Unlike when evaluating whether a performance improvement is worth
its complexity, we're not trying to measure real-world improvements
here. We're trying to test the worst cases we can think of by stressing
potential pain points, even if those cases aren't interesting in
themselves. If we can't find a relevant regression in those with
something akin to microbenchmarks, it's less likely that there are
performance regressions.

The "not entirely unreasonable" point is just about making sure you're
not testing something entirely irrelevant. Say, performance of a 1TB
database when shared_buffers is set to 64k. Or testing DDL performance
while locking pg_class exclusively.

The test was specifically chosen to (a rough sketch of such a setup
follows below):
* do uncached syscache lookups (-C) to measure the impact of the added
  GetSnapshotData() calls
* make individual GetSnapshotData() calls slower (all backends have an
  xid assigned)
* contend on ProcArrayLock but not much else (high number of clients in
  the background)
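To make that concrete, here's a minimal sketch of such a setup. The
script names, client counts and durations are made up, and
txid_current() is just one simple way to force xid assignment; the
actual runs used the numbers quoted above:

-- background.sql (hypothetical): each transaction assigns an xid, so every
-- background backend shows up in PGXACT, without taking any table locks.
-- Run with something like: pgbench -n -c 400 -j 8 -T 300 -f background.sql
SELECT txid_current();

-- foreground.sql (hypothetical): trivial read-only work, but run with -C so
-- every transaction opens a fresh connection and has to repopulate its
-- syscache, i.e. does uncached catalog lookups:
--   pgbench -n -C -c 100 -j 8 -T 300 -f foreground.sql
SELECT 1;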

> I
> mean, nobody is going to run 1000 connections in the background that
> do nothing but thrash PGXACT on a real system. I just can't get
> concerned about that.

In the original mail I did retry it with 400 connections and the
regression is still pretty big. And the "background" clients could also
be doing something that's not that likely to be blocked by global locks.
Say, operating on temporary or unlogged tables. Or just acquiring a
single row-level lock and then continuing to do read-only work in a read
committed transaction.

I think we both can come up with workloads where at least part of the
above is present. But imo that doesn't really matter.

> What I am concerned about is that there may be
> other, more realistic workloads that show similar regressions. But I
> don't know how to find out whether that's actually the case.

So, given the results from that test and the profile I got, where
GetSnapshotData was by far the most expensive thing, a more
representative test might be something like a read-only pgbench with a
moderately high number of short-lived connections. I wouldn't be
surprised if that still showed performance problems.

If that's not enough, something like:
BEGIN;
SELECT * FROM my_client WHERE client_id = :id FOR UPDATE;
SELECT * FROM key_table WHERE key = :random;
...
SELECT * FROM key_table WHERE key = :random;
COMMIT;

will surely still show the problem.
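To drive that as a pgbench custom script, the :id and :random
placeholders just need definitions at the top, e.g. (ranges and file
name are made up; this uses the \setrandom syntax of that era's pgbench,
newer versions spell it \set id random(1, 1000)):

\setrandom id 1 1000
\setrandom random 1 100000
-- followed by the BEGIN ... COMMIT block above, run with something like:
--   pgbench -n -c 100 -j 8 -T 300 -f lock_then_read.sql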

> On the
> IBM POWER box where I tested this, it's not even GetSnapshotData()
> that kills you; it's the system CPU scheduler.

I haven't tried yet, but I'd guess the above setup shows the difference
with fewer than 400 clients. That might make it more reasonable to run
there.

> But I'm still on the fence about whether this is really a valid test.

I think it shows that we need to be careful, do further performance
evaluations, and/or alleviate the pain by making things cheaper (say, a
"ddl counter" in shared memory that would allow caching snapshots for
syscache lookups). If that artificial test hadn't shown problems I'd
have voted for just going ahead and not worrying further.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
