From: Josh Berkus <josh(at)agliodbs(dot)com>
To: postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: max_wal_senders must die
Date: 2010-10-19 00:20:01
Message-ID: 4CBCE431.4050608@agliodbs.com

Hackers,

What purpose is served, exactly, by max_wal_senders?

In order for a standby to connect, it must have a superuser login, and
replication connections must be enabled in pg_hba.conf. How is having
one more setting in one more file you have to enable on the master
benefitting anyone?
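
Concretely, this is roughly what the master side already requires
(names, addresses and values are only examples):

    # postgresql.conf
    wal_level = hot_standby        # or 'archive'
    max_wal_senders = 3            # the setting in question

    # pg_hba.conf
    host  replication  postgres  192.168.0.0/24  md5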

Under what bizarre set of circumstances would anyone have runaway
connections from replicas to the master?

Proposed that we simply remove this setting in 9.1. The real maximum
wal senders should be whatever max_connections is.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-19 11:14:30
Message-ID: 4CBD7D96.10006@kaltenbrunner.cc

Josh Berkus wrote:
> Hackers,
>
> What purpose is served, exactly, by max_wal_senders?
>
> In order for a standby to connect, it must have a superuser login, and
> replication connections must be enabled in pg_hba.conf. How is having
> one more setting in one more file you have to enable on the master
> benefitting anyone?
>
> Under what bizarre set of circumstances would anyone have runaway
> connections from replicas to the master?
>
> Proposed that we simply remove this setting in 9.1. The real maximum
> wal senders should be whatever max_connections is.

I disagree - limiting the maximum number of replication connections is
important for my use cases.
Replication connections are significantly more heavyweight than a normal
connection, and right now I, for example, simply use this setting to
prevent stupid mistakes (especially in virtualized/cloud-style environments).

What we really should look into is using a less privileged role - or a
dedicated replication role - and using the existing per-role connection
limit feature. That feature is unlimited by default, people can change it
just as for any other role, and we can get rid of that GUC.
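
Roughly what I have in mind, with a made-up role name (today the role
would still have to be a superuser):

    CREATE ROLE standby_repl LOGIN SUPERUSER CONNECTION LIMIT 3;
    -- the limit can later be changed without touching postgresql.conf:
    ALTER ROLE standby_repl CONNECTION LIMIT 5;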

Stefan


From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-19 11:18:07
Message-ID: AANLkTi=hGC4UpNetGxndYfsDsaMcvefrAC+y9CXUOJ=n@mail.gmail.com

On Tue, Oct 19, 2010 at 13:14, Stefan Kaltenbrunner
<stefan(at)kaltenbrunner(dot)cc> wrote:
> Josh Berkus wrote:
>>
>> Hackers,
>>
>> What purpose is served, exactly, by max_wal_senders?
>>
>> In order for a standby to connect, it must have a superuser login, and
>> replication connections must be enabled in pg_hba.conf.  How is having one
>> more setting in one more file you have to enable on the master benefitting
>> anyone?
>>
>> Under what bizarre set of circumstances would anyone have runaway
>> connections from replicas to the master?
>>
>> Proposed that we simply remove this setting in 9.1.  The real maximum wal
>> senders should be whatever max_connections is.
>
> I disagree - limiting the maximum number of replication connections is
> important for my use cases.
> Replication connections are significantly more heavyweight than a normal
> connection, and right now I, for example, simply use this setting to
> prevent stupid mistakes (especially in virtualized/cloud-style
> environments).
>
> What we really should look into is using a less privileged role - or a
> dedicated replication role - and using the existing per-role connection
> limit feature. That feature is unlimited by default, people can change it
> just as for any other role, and we can get rid of that GUC.

+1 for being able to control it that way - that should keep it simple
for the newbie usecase, while retaining the ability for fine-grained
control for those who need it.

I think it's already on the TODO for 9.1 to use a separate role for it...

If we want something fixed *now*, should we perhaps just bump the
*default* value for max_wal_senders to 5 or something?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Magnus Hagander <magnus(at)hagander(dot)net>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-19 11:23:38
Message-ID: 4CBD7FBA.7020401@kaltenbrunner.cc

Magnus Hagander wrote:
> On Tue, Oct 19, 2010 at 13:14, Stefan Kaltenbrunner
> <stefan(at)kaltenbrunner(dot)cc> wrote:
>> Josh Berkus wrote:
>>> Hackers,
>>>
>>> What purpose is served, exactly, by max_wal_senders?
>>>
>>> In order for a standby to connect, it must have a superuser login, and
>>> replication connections must be enabled in pg_hba.conf. How is having one
>>> more setting in one more file you have to enable on the master benefitting
>>> anyone?
>>>
>>> Under what bizarre set of circumstances would anyone have runaway
>>> connections from replicas to the master?
>>>
>>> Proposed that we simply remove this setting in 9.1. The real maximum wal
>>> senders should be whatever max_connections is.
>> I disagree - limiting the maximum number of replication connections is
>> important for my use cases.
>> Replication connections are significantly more heavyweight than a normal
>> connection, and right now I, for example, simply use this setting to
>> prevent stupid mistakes (especially in virtualized/cloud-style
>> environments).
>>
>> What we really should look into is using a less privileged role - or a
>> dedicated replication role - and using the existing per-role connection
>> limit feature. That feature is unlimited by default, people can change it
>> just as for any other role, and we can get rid of that GUC.
>
> +1 for being able to control it that way - that should keep it simple
> for the newbie usecase, while retaining the ability for fine-grained
> control for those who need it.
>
> I think it's already on the TODO for 9.1 to use a separate role for it...

I think we had some plans to do that - I wonder how hard it would be to
just do the dedicated role thing for now (maybe with the only constraint
that it can only be used on a replication connection) and then look into
making it (technically) less privileged later?

>
> If we want something fixed *now*, should we perhaps just bump the
> *default* value for max_wal_senders to 5 or something?

or accept -1 for "unlimited" and use that as the default; that would fix
part of the complaint from Josh, but you would still have to restart the
master to impose a limit...

Stefan


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-10-19 15:59:35
Message-ID: 4CBDC067.2020408@agliodbs.com

Stefan, Dimitri,

> I disagree - limiting the maximum number of replication connections is
> important for my usecases.

Can you explain more? I clearly don't understand your use case.

> If we want something fixed *now*, should we perhaps just bump the
> *default* value for max_wal_senders to 5 or something?

If we're not going to remove it, then the default should be -1, which
should mean "same as max_connections".

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-19 16:06:37
Message-ID: 4CBDC20D.70201@2ndquadrant.com

Josh Berkus wrote:
> Under what bizarre set of circumstances would anyone have runaway
> connections from replicas to the master?

Cloud computing deployments where additional replicas are brought up
automatically in response to demand. It's easy to imagine a situation
where a standby instance is spawned, starts to sync, and that additional
load triggers *another* standby to come on board; repeat until the
master is doing nothing but servicing standby sync requests.
max_wal_senders provides a safety valve for that.

I think Magnus's idea to bump the default to 5 triages the worst of the
annoyance here, without dropping the feature (which has uses) or waiting
for new development to complete. I'd be in favor of just committing
that change right now, before it gets forgotten about, and then if
nobody else gets around to further work at least something improved here
for 9.1.

--
Greg Smith, 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-19 16:18:41
Message-ID: 4CBDC4E1.8000108@agliodbs.com

On 10/19/2010 09:06 AM, Greg Smith wrote:
> I think Magnus's idea to bump the default to 5 triages the worst of the
> annoyance here, without dropping the feature (which has uses) or waiting
> for new development to complete. I'd be in favor of just committing
> that change right now, before it gets forgotten about, and then if
> nobody else gets around to further work at least something improved here
> for 9.1.

Heck, even *I* could write that patch, if we're agreed. Although you
can commit it.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-19 19:00:24
Message-ID: AANLkTi=jm0=evkve0D+UXpMfjSnW4Tx-23gFbGPRp3YW@mail.gmail.com

On Tue, Oct 19, 2010 at 12:18 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 10/19/2010 09:06 AM, Greg Smith wrote:
>>
>> I think Magnus's idea to bump the default to 5 triages the worst of the
>> annoyance here, without dropping the feature (which has uses) or waiting
>> for new development to complete.  I'd be in favor of just committing
>> that change right now, before it gets forgotten about, and then if
>> nobody else gets around to further work at least something improved here
>> for 9.1.
>
> Heck, even *I* could write that patch, if we're agreed.  Although you can
> commit it.

Setting max_wal_senders to a non-zero value causes additional work to
be done every time a transaction commits, aborts, or is prepared.
It's possible that this overhead is too trivial to be worth worrying
about; I haven't looked at it closely.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-19 19:32:57
Message-ID: 7445.1287516777@sss.pgh.pa.us

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Tue, Oct 19, 2010 at 12:18 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> On 10/19/2010 09:06 AM, Greg Smith wrote:
>>> I think Magnus's idea to bump the default to 5 triages the worst of the
>>> annoyance here, without dropping the feature (which has uses) or waiting
>>> for new development to complete.

> Setting max_wal_senders to a non-zero value causes additional work to
> be done every time a transaction commits, aborts, or is prepared.

Yes. This isn't just a numeric parameter; it's also a boolean
indicating "do I want to pay the overhead to be prepared to be a
replication master?". Josh has completely failed to make a case that
that should be the default. In fact, the system would fail to start
at all if we just changed the default for max_wal_senders and not the
default for wal_level.
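
Concretely, the two defaults would have to move together, e.g. (values
purely illustrative):

    wal_level = archive          # current default is 'minimal'
    max_wal_senders = 5          # current default is 0

Change only the second and the postmaster will refuse to start.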

regards, tom lane


From: Rob Wultsch <wultsch(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-19 20:25:39
Message-ID: AANLkTinXUXoAGaf_nmaWoNw69UiuDVqO3AKh4qpkSe1n@mail.gmail.com

On Tue, Oct 19, 2010 at 12:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Tue, Oct 19, 2010 at 12:18 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> On 10/19/2010 09:06 AM, Greg Smith wrote:
>>>> I think Magnus's idea to bump the default to 5 triages the worst of the
>>>> annoyance here, without dropping the feature (which has uses) or waiting
>>>> for new development to complete.
>
>> Setting max_wal_senders to a non-zero value causes additional work to
>> be done every time a transaction commits, aborts, or is prepared.
>
> Yes.  This isn't just a numeric parameter; it's also a boolean
> indicating "do I want to pay the overhead to be prepared to be a
> replication master?".  Josh has completely failed to make a case that
> that should be the default.  In fact, the system would fail to start
> at all if we just changed the default for max_wal_senders and not the
> default for wal_level.
>
>                        regards, tom lane

If the variable is altered such that it is dynamic, could it not be
updated by the postmaster when a connection attempts to begin
replicating?

--
Rob Wultsch
wultsch(at)gmail(dot)com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-19 20:35:02
Message-ID: AANLkTikA6OZSz9mW14=X4S8_uNQ0YnAKrBDWsVTjctmk@mail.gmail.com

On Tue, Oct 19, 2010 at 3:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Tue, Oct 19, 2010 at 12:18 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>> On 10/19/2010 09:06 AM, Greg Smith wrote:
>>>> I think Magnus's idea to bump the default to 5 triages the worst of the
>>>> annoyance here, without dropping the feature (which has uses) or waiting
>>>> for new development to complete.
>
>> Setting max_wal_senders to a non-zero value causes additional work to
>> be done every time a transaction commits, aborts, or is prepared.
>
> Yes.  This isn't just a numeric parameter; it's also a boolean
> indicating "do I want to pay the overhead to be prepared to be a
> replication master?".  Josh has completely failed to make a case that
> that should be the default.  In fact, the system would fail to start
> at all if we just changed the default for max_wal_senders and not the
> default for wal_level.

On a slightly tangential note, it would be really nice to be able to
change things like wal_level and max_wal_senders on the fly. ISTM
that needing to change the setting is actually the smaller portion of
the gripe; the bigger annoyance is that you bring the system down,
change one setting, bring it back up, take a hot backup, and then
realize that you have to shut the system down again to change the
other setting, because you forgot to do it the first time. Since the
error message you get on the slave side is pretty explicit, the sheer
fact of needing to change max_wal_senders is only a minor
inconvenience; but the possible need to take down the system a second
time is a major one.

One of the problems with making these and similar settings changeable
on the fly is that, to be safe, we can't declare that the value has been
increased until we can verify that every still-running backend has
picked up the increased value and begun acting upon it. We have no
architecture for knowing when that has happened; the postmaster just
signals its children and forgets about it. If you had a shared memory
array (similar to the ProcArray, but perhaps with a separate set of
locks) in which backends wrote their last-known values of GUCs that
fall into this category, it would be possible to write a function that
returns the answer to the question "What is the least value of
wal_level that exists in any backend in the system?". Maybe that's
not the right architecture; I'm not sure. But we should see if we can
figure out something that will work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-19 20:54:56
Message-ID: 4CBE05A0.1090406@agliodbs.com


>> Yes. This isn't just a numeric parameter; it's also a boolean
>> indicating "do I want to pay the overhead to be prepared to be a
>> replication master?".

Since this is the first time I've heard of the overhead, it would be
hard for me to have taken that into consideration. If there was
discussion about this ever, I missed it. That explains why we changed
the default in RC1 though, which was a surprise to me.

What is the overhead exactly? What specific work do we do? Link?

>> Josh has completely failed to make a case that
>> that should be the default. In fact, the system would fail to start
>> at all if we just changed the default for max_wal_senders and not the
>> default for wal_level.

Well, now that you mention it, I also think that "hot standby" should be
the default. Yes, I know about the overhead, but I also think that the
number of our users who want easy replication *far* outnumber the users
who care about an extra 10% WAL overhead.

> On a slightly tangential note, it would be really nice to be able to
> change things like wal_level and max_wal_senders on the fly.

This would certainly help solve the problem. Heck, if we could even
just change them with a reload (rather than a restart) it would be a
vast improvement.

> ISTM
> that needing to change the setting is actually the smaller portion of
> the gripe; the bigger annoyance is that you bring the system down,
> change one setting, bring it back up, take a hot backup, and then
> realize that you have to shut the system down again to change the
> other setting, because you forgot to do it the first time. Since the
> error message you get on the slave side is pretty explicit, the sheer
> fact of needing to change max_wal_senders is only a minor
> inconvenience; but the possible need to take down the system a second
> time is a major one.

You've summed up the problem nicely. I'll note that even though I've
already set up 3 production systems using SR, I still mess this up about
1/3 of the time and have to restart and reclone. There are just too many
settings.
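
For reference, this is roughly the full set of knobs involved in getting
SR going today (all names and values are just examples):

    # master, postgresql.conf
    wal_level = hot_standby
    max_wal_senders = 3
    wal_keep_segments = 32
    # master, pg_hba.conf
    host  replication  postgres  10.0.0.0/24  md5
    # standby, postgresql.conf
    hot_standby = on
    # standby, recovery.conf
    standby_mode = 'on'
    primary_conninfo = 'host=master port=5432 user=postgres'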

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-20 05:06:31
Message-ID: 4CBE78D7.8070909@2ndquadrant.com

Josh Berkus wrote:
> Well, now that you mention it, I also think that "hot standby" should be
> the default. Yes, I know about the overhead, but I also think that the
> number of our users who want easy replication *far* outnumber the users
> who care about an extra 10% WAL overhead.
>

I think this whole situation is similar to the resistance to increasing
default_statistics_target. There's additional overhead added by
enabling both of these settings, in return for making it more likely for
the average person to see useful behavior by default. If things are
rejiggered so the advanced user has to turn things off in order to get
optimal performance when not using these features, in return for the
newbie being more likely to get things working in basic form, that's
probably a net win for PostgreSQL advocacy. But much like
default_statistics_target, there needs to be some more formal work done
on quantifying just how bad each of these overheads really are first.
We lost 3-7% on multiple simple benchmarks between 8.3 and 8.4[1]
because of that retuning for ease of use on real-world workloads, and
that's not something you want to repeat too often.

[1] http://suckit.blog.hu/2009/09/29/postgresql_history and the tests
Jignesh did while at Sun

--
Greg Smith, 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-20 13:29:04
Message-ID: AANLkTim-Hna6qiUvZYK=876cNwDxptZ8e7FJ2ns4oQXq@mail.gmail.com

On Wed, Oct 20, 2010 at 1:06 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Josh Berkus wrote:
>>
>> Well, now that you mention it, I also think that "hot standby" should be
>> the default.  Yes, I know about the overhead, but I also think that the
>> number of our users who want easy replication *far* outnumber the users
>> who care about an extra 10% WAL overhead.
>
> I think this whole situation is similar to the resistance to increasing
> default_statistics_target.  There's additional overhead added by enabling
> both of these settings, in return for making it more likely for the average
> person to see useful behavior by default.  If things are rejiggered so the
> advanced user has to turn things off in order to get optimal performance
> when not using these features, in return for the newbie being more likely to
> get things working in basic form, that's probably a net win for PostgreSQL
> advocacy.  But much like default_statistics_target, there needs to be some
> more formal work done on quantifying just how bad each of these overheads
> really are first.  We lost 3-7% on multiple simple benchmarks between 8.3
> and 8.4[1] because of that retuning for ease of use on real-world workloads,
> and that's not something you want to repeat too often.

Exactly. It doesn't take many 3-7% slowdowns to add up to being 50%
or 100% slower, and that sucks. In fact, I'm still not convinced that
we were wise to boost default_statistics_target as much as we did. I
argued for a smaller boost at the time.

Actually, I think the best thing for default_statistics_target might
be to scale the target based on the number of rows in the table, e.g.
given N rows:

10 + (N / 1000), if N < 40,000
46 + (N / 10000), if 40,000 < N < 3,540,000
400, if N > 3,540,000
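
Just to sanity-check the arithmetic, here is that rule as a throwaway SQL
expression (nothing like this exists in the code; it's only meant to show
the shape of the curve):

    SELECT n,
           CASE WHEN n <= 40000   THEN 10 + n / 1000
                WHEN n <= 3540000 THEN 46 + n / 10000
                ELSE 400 END AS proposed_target
    FROM (VALUES (2000), (40000), (2000000), (10000000)) AS t(n);
    -- gives 12, 50, 246 and 400 respectively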

Consider a table with 2,000 rows. With default_statistics_target =
100, we can store up to 100 MCVs; and we break the remaining ~1900
values up into 100 buckets with 19 values/bucket. In most cases, that
is probably overkill. Where you tend to run into problems with
inadequate statistics is with the values that are not quite common
enough to be an MCV, but are still significantly more common than
their companions in the same bucket. However, with only 19 values in
a bucket, you're probably not going to have that problem. If you
scale the table down to 1000 rows you now have 9 values in a bucket,
which makes it *really* unlikely you're going to have that problem.
On the other hand, on a table with 4 million rows, it is entirely
likely that there could be more than 100 values whose frequencies are
worth tracking individually, and odds are good also that even if the
planning time is a little longer to no purpose, it'll still be small
relative to the query execution time. It's unfortunately
impractical for the size of the MCV list to track linearly with the
size of the table, because there are O(n^2) algorithms in use, but I
think some kind of graduated scheme might enable us to buy back some
of that lost performance without damaging real workloads very much.
Possibly even helping real workloads, because you may very well join
large fact tables against small dimension tables, and odds are good
that under the present scheme the fact tables have more statistics
than they really need.
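
For anyone who wants to poke at this, the per-column override and the
resulting stats are already visible today (table and column names are
made up):

    ALTER TABLE fact ALTER COLUMN customer_id SET STATISTICS 400;
    ANALYZE fact;
    SELECT most_common_vals, most_common_freqs, histogram_bounds
      FROM pg_stats
     WHERE tablename = 'fact' AND attname = 'customer_id';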

As to replication, I don't believe the contention that most people
will want to use replication. Many will, and that is fine, but many
also won't. The world is full of development and test machines where
replication is a non-issue, and some people won't even run it in
production because the nature of their application makes the data on
that box non-essential, or because they replicate with Bucardo or
Slony. I completely agree that we should make it easier to get
replication set up without (multiple) server restarts, but imposing a
performance overhead on untuned systems is not the right way to do it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-20 14:19:02
Message-ID: 25878.1287584342@sss.pgh.pa.us

Greg Smith <greg(at)2ndquadrant(dot)com> writes:
> Josh Berkus wrote:
>> Well, now that you mention it, I also think that "hot standby" should be
>> the default. Yes, I know about the overhead, but I also think that the
>> number of our users who want easy replication *far* outnumber the users
>> who care about an extra 10% WAL overhead.

> ... But much like
> default_statistics_target, there needs to be some more formal work done
> on quantifying just how bad each of these overheads really are first.

Quite. Josh, have you got any evidence showing that the penalty is
only 10%? There are cases, such as COPY and ALTER TABLE, where
you'd be looking at 2X or worse penalties, because of the existing
optimizations that avoid writing WAL at all for operations where a
single final fsync can serve the purpose. I'm not sure what the
penalty for "typical" workloads is, partly because I'm not sure what
should be considered a "typical" workload for this purpose.
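
For concreteness, the sort of case I have in mind looks like this (table
and file names are made up):

    BEGIN;
    CREATE TABLE big_load (id int, payload text);
    COPY big_load FROM '/tmp/big_load.csv' CSV;
    COMMIT;

With wal_level = minimal the copied data never has to hit WAL, because
the table was created in the same transaction; at 'archive' or
'hot_standby' all of it must be WAL-logged.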

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-20 14:32:49
Message-ID: 4CBEFD91.8070003@enterprisedb.com

On 20.10.2010 17:19, Tom Lane wrote:
> Greg Smith<greg(at)2ndquadrant(dot)com> writes:
>> Josh Berkus wrote:
>>> Well, now that you mention it, I also think that "hot standby" should be
>>> the default. Yes, I know about the overhead, but I also think that the
>>> number of our users who want easy replication *far* outnumber the users
>>> who care about an extra 10% WAL overhead.
>
>> ... But much like
>> default_statistics_target, there needs to be some more formal work done
>> on quantifying just how bad each of these overheads really are first.
>
> Quite. Josh, have you got any evidence showing that the penalty is
> only 10%? There are cases, such as COPY and ALTER TABLE, where
> you'd be looking at 2X or worse penalties, because of the existing
> optimizations that avoid writing WAL at all for operations where a
> single final fsync can serve the purpose. I'm not sure what the
> penalty for "typical" workloads is, partly because I'm not sure what
> should be considered a "typical" workload for this purpose.

Going from wal_level='minimal' to 'archive' incurs the penalty on
WAL-logging COPY etc. That's a big penalty. However, the difference
between wal_level='archive' and wal_level='hot_standby' should be tiny.

The big reason for separating those two in 9.0 was that it's all new
code with new ways to fail and, yes, new bugs. It's not smart to expose
people who are not interested in using hot standby to those issues. But
maybe we feel more comfortable merging 'archive' and 'hot_standby'
levels in 9.1.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-20 14:40:15
Message-ID: 26313.1287585615@sss.pgh.pa.us

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Going from wal_level='minimal' to 'archive' incurs the penalty on
> WAL-logging COPY etc. That's a big penalty. However, the difference
> between wal_level='archive' and wal_level='hot_standby' should be tiny.

I'm not sure I believe that either, because of the costs associated with
logging lock acquisitions.

We really need some actual benchmarks in this area, rather than
handwaving ...

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-20 14:53:58
Message-ID: 1287586228-sup-6789@alvh.no-ip.org

Excerpts from Robert Haas's message of Wed Oct 20 10:29:04 -0300 2010:

> Actually, I think the best thing for default_statistics_target might
> be to scale the target based on the number of rows in the table, e.g.
> given N rows:
>
> 10 + (N / 1000), if N < 40,000
> 46 + (N / 10000), if 40,000 < N < 3,540,000
> 400, if N > 3,540,000
>
> Consider a table with 2,000 rows. With default_statistics_target =
> 100, we can store up to 100 MCVs; and we break the remaining ~1900
> values up into 100 buckets with 19 values/bucket.

Maybe what should be done about this is to have separate sizes for the
MCV list and the histogram, where the MCV list is automatically sized
during ANALYZE.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-20 15:00:28
Message-ID: AANLkTi=QsypMBh0dfr1DFzC2pMmN5wmaCfWW1UtPiM-L@mail.gmail.com

On Wed, Oct 20, 2010 at 10:53 AM, Alvaro Herrera
<alvherre(at)commandprompt(dot)com> wrote:
> Excerpts from Robert Haas's message of Wed Oct 20 10:29:04 -0300 2010:
>
>> Actually, I think the best thing for default_statistics_target might
>> be to scale the target based on the number of rows in the table, e.g.
>> given N rows:
>>
>> 10 + (N / 1000), if N < 40,000
>> 46 + (N / 10000), if 40,000 < N < 3,540,000
>> 400, if N > 3,540,000
>>
>> Consider a table with 2,000 rows.  With default_statistics_target =
>> 100, we can store up to 100 MCVs; and we break the remaining ~1900
>> values up into 100 buckets with 19 values/bucket.
>
> Maybe what should be done about this is to have separate sizes for the
> MCV list and the histogram, where the MCV list is automatically sized
> during ANALYZE.

I thought about that, but I'm not sure there's any particular
advantage. Automatically scaling the histogram seems just as useful
as automatically scaling the MCV list - both things will tend to
reduce the estimation error. For a table with 2,000,000 rows,
automatically setting the statistics target from 100 to the value that
would be computed by the above formula, which happens to be 246, will
help the 101st-246th most common values, because they will now be
MCVs. It will also help all the remaining values, both because
you've pulled 146 fairly common values out of the histogram buckets
and also because each bucket now contains ~8130 values rather than
~20,000.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-20 19:40:52
Message-ID: AANLkTimqyhbNpw+E1JKciFeESovszoWe6S=q1Xn1_NfW@mail.gmail.com

On Wed, Oct 20, 2010 at 6:29 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Exactly.  It doesn't take many 3-7% slowdowns to add up to being 50%
> or 100% slower, and that sucks.  In fact, I'm still not convinced that
> we were wise to boost default_statistics_target as much as we did.  I
> argued for a smaller boost at the time.

Well we don't want to let ourselves be paralyzed by FUD so it was
important to identify specific concerns and then tackle those
concerns. Once we identified the worst-case planning cases we profiled
them and found that the inflection point of the curve was fairly
clearly above 100 but that there were cases where values below 1,000
caused problems. So I'm pretty happy with the evidence-based approach.

The problem with being overly conservative is that it gives free rein
to the folks who were shouting that we should just set the default to
1,000. They weren't wrong that the 10 was overly conservative and in
the absence of evidence 1,000 was just as reasonable.

> Actually, I think the best thing for default_statistics_target might
> be to scale the target based on the number of rows in the table, e.g.
> given N rows:

The number of buckets needed isn't related to the population size --
it's related to how wide the ranges you'll be estimating selectivity
for are. That is, with our current code, if you're selecting tuples
within a range a..b and that range happens to be the same size as the
bucket size then you'll get an accurate estimate with a fixed 95th
percentile precision regardless of the size of the table (to an
approximation).

I'm not sure how our selectivity works at all for the degenerate case
of selecting for specific values. I don't understand how histograms
are useful for such estimates at all. I think the MCV lists are
basically an attempt to overcome this problem and as you point out I'm
not sure the statistics target is really the right thing to control
them -- but since I don't think there's any real statistics behind
them I'm not sure there's any right way to control them.

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-20 20:12:00
Message-ID: AANLkTimzaty5a3cs1qabY5B-GQOmof3gOUuuDDyzJUBj@mail.gmail.com

On Wed, Oct 20, 2010 at 3:40 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Wed, Oct 20, 2010 at 6:29 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Exactly.  It doesn't take many 3-7% slowdowns to add up to being 50%
>> or 100% slower, and that sucks.  In fact, I'm still not convinced that
>> we were wise to boost default_statistics_target as much as we did.  I
>> argued for a smaller boost at the time.
>
> Well we don't want to let ourselves be paralyzed by FUD so it was
> important to identify specific concerns and then tackle those
> concerns. Once we identified the worst-case planning cases we profiled
> them and found that the inflection point of the curve was fairly
> clearly above 100 but that there were cases where values below 1,000
> caused problems. So I'm pretty happy with the evidence-based approach.

The inflection point of the curve was certainly a good thing for us to
look at but the fact remains that we took a hit on a trivial
benchmark, and we can't afford to take too many of those.

>> Actually, I think the best thing for default_statistics_target might
>> be to scale the target based on the number of rows in the table, e.g.
>> given N rows:
>
> The number of buckets needed isn't related to the population size --
> it's related to how wide the ranges you'll be estimating selectivity
> for are. That is, with our current code, if you're selecting tuples
> within a range a..b and that range happens to be the same size as the
> bucket size then you'll get an accurate estimate with a fixed 95th
> percentile precision regardless of the size of the table (to an
> approximation).

If you have a WHERE clause of the form WHERE x > some_constant, then
the effects vary depending on how that constant is chosen. If it's
the median value, then as you say the statistics target doesn't matter
much at all; but that's not necessarily representative of real life.
For example, suppose x is a date and the constant is Monday of the
current week. As the table grows, the present week's data becomes a
smaller and smaller fraction of the table data. When it gets to be a
tiny fraction of the very last histogram bucket, the estimates start
to get progressively worse. At some point you have to give up and
partition the table for other reasons anyway, but having to do it
because the statistics are off is inexcusable. We've seen people hit
this precise issue on -performance a few times.
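
In query form, the pattern looks something like this (table and column
names invented):

    SELECT count(*) FROM events
     WHERE event_time >= date_trunc('week', now());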

> I'm not sure how our selectivity works at all for the degenerate case
> of selecting for specific values. I don't understand how histograms
> are useful for such estimates at all. I think the MCV lists are
> basically an attempt to overcome this problem and as you point out I'm
>> not sure the statistics target is really the right thing to control
> them -- but since I don't think there's any real statistics behind
> them I'm not sure there's any right way to control them.

If you have a WHERE clause of the form WHERE x = some_constant, then
you get a much better estimate if some_constant is an MCV. If the
constant is not an MCV, however, you still get better estimates,
because the estimation code knows that no non-MCV can occur more
frequently than any MCV, so increasing the number of MCVs pushes those
estimates closer to reality. It is especially bad when the frequency
"falls off a cliff" at a certain point in the distribution e.g. if
there are 243 values that occur much more frequently than any others,
a stats target of 250 will do much better than 225. But even if
that's not an issue, it still helps. The bottom line here is that I
can't remember any message, ever, on -performance, or any incident
within my personal experience, where it was necessary to increase the
statistics target beyond 50-100 on a table with 10K rows. However,
there are certainly cases where we've recommended that for big tables,
which means there are also people out there who have a performance
problem on a big table but haven't asked for help and therefore
haven't gotten that advice.
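
The cap in question is simply the frequency of the least common MCV,
which you can read straight out of the stats (names invented):

    SELECT most_common_freqs[array_length(most_common_freqs, 1)]
             AS least_mcv_freq
      FROM pg_stats
     WHERE tablename = 't' AND attname = 'x';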

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-20 20:33:41
Message-ID: AANLkTimDe72_SBN6Mq0nSpV7pGEeaa4o3HxEh3htx07R@mail.gmail.com

On Wed, Oct 20, 2010 at 1:12 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Oct 20, 2010 at 3:40 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
>> On Wed, Oct 20, 2010 at 6:29 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
>>> Actually, I think the best thing for default_statistics_target might
>>> be to scale the target based on the number of rows in the table, e.g.
>>> given N rows:
>>
>> The number of buckets needed isn't related to the population size --
>> it's related to how wide the ranges you'll be estimating selectivity
>> for are.
>
> As the table grows, the present week's data becomes a
> smaller and smaller fraction of the table data.

That's an interesting point. I wonder if we can expose this in some
way that allows users to specify the statistics target in terms of
something more meaningful to them that doesn't change as the ranges in
the table grow. Or even gather stats on the size of the ranges being
queried.

> If you have a WHERE clause of the form WHERE x = some_constant, then
> you get a much better estimate if some_constant is an MCV.  If the
> constant is not an MCV, however, you still get better estimates,
> because the estimation code knows that no non-MCV can occur more
> frequently than any MCV, so increasing the number of MCVs pushes those
> estimates closer to reality.  It is especially bad when the frequency
> "falls off a cliff" at a certain point in the distribution e.g. if
> there are 243 values that occur much more frequently than any others,
> a stats target of 250 will do much better than 225.

It sounds like what we really need here is some way to characterize the
distribution of frequencies. Instead of just computing an upper bound
we should have a kind of histogram showing how many values occur
precisely once, how many occur twice, three times, etc. Or perhaps we
only need to know the most common frequency per bucket. Or, hm...

--
greg


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date: 2010-10-20 22:15:15
Message-ID: 4CBF69F3.2070003@agliodbs.com


>> Maybe what should be done about this is to have separate sizes for the
>> MCV list and the histogram, where the MCV list is automatically sized
>> during ANALYZE.

It's been suggested multiple times that we should base our sample size
on a % of the table, or at least offer that as an option. I've pointed
out (with math, which Simon wrote a prototype for) that doing
block-based sampling instead of random-row sampling would allow us to
collect, say, 2% of a very large table without more I/O than we're doing
now.

Nathan Boley has also shown that we could get tremendously better
estimates without additional sampling if our statistics collector
recognized common patterns such as normal, linear and geometric
distributions. Right now our whole stats system assumes a completely
random distribution.

So, I think we could easily be quite a bit smarter than just increasing
the MCV. Although that might be a nice start.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-20 22:17:31
Message-ID: 4CBF6A7B.9060103@agliodbs.com


> Quite. Josh, have you got any evidence showing that the penalty is
> only 10%? There are cases, such as COPY and ALTER TABLE, where
> you'd be looking at 2X or worse penalties, because of the existing
> optimizations that avoid writing WAL at all for operations where a
> single final fsync can serve the purpose. I'm not sure what the
> penalty for "typical" workloads is, partly because I'm not sure what
> should be considered a "typical" workload for this purpose.

If we could agree on some workloads, I could run some benchmarks. I'm
not sure what those would be though, given that COPY and ALTER TABLE
aren't generally included in most benchmarks. I could see how
everything else is affected, though.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date: 2010-10-20 22:38:45
Message-ID: AANLkTinC_z9wL3Vb+n_mcGwwsM58-r=Xea_BPghmZNTp@mail.gmail.com

On Wed, Oct 20, 2010 at 3:15 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>
>>> Maybe what should be done about this is to have separate sizes for the
>>> MCV list and the histogram, where the MCV list is automatically sized
>>> during ANALYZE.
>
> It's been suggested multiple times that we should base our sample size
> on a % of the table, or at least offer that as an option.

Why? Afaict this has been suggested multiple times by people who don't
justify it in any way except with handwavy -- larger samples are
better. The sample size is picked based on what sample statistics
tells us we need to achieve a given 95th percentile confidence
interval for the bucket size given.

Robert pointed out one reason we would want smaller buckets for larger
tables but nobody has explained why we would want smaller confidence
intervals for the same size buckets. That amounts to querying larger
tables for the same percentage of the table but wanting more precise
estimates than you want for smaller tables.

>  I've pointed
> out (with math, which Simon wrote a prototype for) that doing
> block-based sampling instead of random-row sampling would allow us to
> collect, say, 2% of a very large table without more I/O than we're doing
> now.

Can you explain when this would and wouldn't bias the sample for the
users so they can decide whether to use it or not?

> Nathan Boley has also shown that we could get tremendously better
> estimates without additional sampling if our statistics collector
> recognized common patterns such as normal, linear and geometric
> distributions.  Right now our whole stats system assumes a completely
> random distribution.

That's interesting, I hadn't seen that.

> So, I think we could easily be quite a bit smarter than just increasing
> the MCV.  Although that might be a nice start.

I think increasing the MCV is too simplistic since we don't really
have any basis for any particular value. I think what we need are some
statistics nerds to come along and say here's this nice tool from
which you can make the following predictions and understand how
increasing or decreasing the data set size affects the accuracy of the
predictions.

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date: 2010-10-20 22:54:42
Message-ID: AANLkTi=h5+ar4LBngmcZVcbRct9R7t9Qk7+rHBxpb4Y4@mail.gmail.com

On Wed, Oct 20, 2010 at 6:38 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Wed, Oct 20, 2010 at 3:15 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>>>> Maybe what should be done about this is to have separate sizes for the
>>>> MCV list and the histogram, where the MCV list is automatically sized
>>>> during ANALYZE.
>>
>> It's been suggested multiple times that we should base our sample size
>> on a % of the table, or at least offer that as an option.
>
> Why? Afaict this has been suggested multiple times by people who don't
> justify it in any way except with handwavy -- larger samples are
> better. The sample size is picked based on what sample statistics
> tells us we need to achieve a given 95th percentile confidence
> interval for the bucket size given.
>
> Robert pointed out one reason we would want smaller buckets for larger
> tables but nobody has explained why we would want smaller confidence
> intervals for the same size buckets. That amounts to querying larger
> tables for the same percentage of the table but wanting more precise
> estimates than you want for smaller tables.

Yes, I think a percentage of the table is going to break down either
at the high end or the low end. Hand-waving (but based on
experience), for a 1000 row table a statistics target of 10 is
probably approximately right and 100 is too much and 1 is too little.
But for a 1,000,000 row table 10,000 is probably too much and even
1,000 is pushing it. So using a constant percentage of table rows
doesn't feel right. I had a thought today that it might make sense to
use an exponential curve, like min(2 * N^(1/3), 10). I can't really
justify that mathematically, but that doesn't mean it won't work well
in practice.

>> So, I think we could easily be quite a bit smarter than just increasing
>> the MCV.  Although that might be a nice start.
>
> I think increasing the MCV is too simplistic since we don't really
> have any basis for any particular value. I think what we need are some
> statistics nerds to come along and say here's this nice tool from
> which you can make the following predictions and understand how
> increasing or decreasing the data set size affects the accuracy of the
> predictions.

I'm not sure that's realistic, because everything depends on what
queries you're running, and you can get arbitrary answers by
postulating arbitrary queries. However, this does not make me excited
about "doing nothing".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date: 2010-10-20 23:13:46
Message-ID: AANLkTinsV6b38vFf-j3o_4J2bp3SiK_h6uuV7hsOdxPu@mail.gmail.com

On Wed, Oct 20, 2010 at 3:54 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> Yes, I think a percentage of the table is going to break down either
> at the high end or the low end.  Hand-waving (but based on
> experience), for a 1000 row table a statistics target of 10 is
> probably approximately right and 100 is too much and 1 is too little.
> But for a 1,000,000 row table 10,000 is probably too much and even
> 1,000 is pushing it.  So using a constant percentage of table rows
> doesn't feel right.  I had a thought today that it might make sense to
> use an exponential curve, like min(2 * N^(1/3), 10).  I can't really
> justify that mathematically, but that doesn't mean it won't work well
> in practice.

Well we can analyze it but as you said later, it all depends on what
queries you're running. If we want to aim for the same confidence
interval at all times, ie that the estimated frequency is accurate to
within +/- x% 95% of the time then:

If we're querying ranges a..b which represent a constant percentage of
the table we need a fixed number of buckets and a sample size that
varies very little with respect to the size of the table (effectively
constant).

If we're querying ranges a..b which are constant sized and therefore
represent a smaller percentage of the table as it grows then we need a
number of buckets that's proportional to the size of the table. The
sample size is proportional to the number of buckets (ie, it's a
constant sized sample per bucket).

If we're querying for a specific value which isn't one of the most
common values then I'm not clear how to characterize the accuracy or
precision of our current estimates let alone how they would vary if we
changed our sample sizes.

If we need to estimate ndistinct then we clearly need a sample of the
table the size of which is proportional to the size of the table. And
in practice to get accurate results it has to be a fairly high
percentage -- effectively meaning we should read the whole table.

>> I think increasing the MCV is too simplistic since we don't really
>> have any basis for any particular value. I think what we need are some
>> statistics nerds to come along and say here's this nice tool from
>> which you can make the following predictions and understand how
>> increasing or decreasing the data set size affects the accuracy of the
>> predictions.
>
> I'm not sure that's realistic, because everything depends on what
> queries you're running, and you can get arbitrary answers by
> postulating arbitrary queries.  However, this does not make me excited
> about "doing nothing".

Well our planner only needs to answer specific questions. We just
need stats capable of answering "how many occurrences of x are there"
and "how many values are in the range x..y" for the normal estimation
functions. We have the latter but if there's a stat we're missing for
calculating the former more robustly that would be great. We also
need ndistinct but that's another story.
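
In EXPLAIN terms (table name invented), those are the estimates behind:

    EXPLAIN SELECT * FROM t WHERE x = 42;
    EXPLAIN SELECT * FROM t WHERE x BETWEEN 100 AND 200;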

--
greg


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date: 2010-10-21 01:03:37
Message-ID: 4CBF9169.4030408@agliodbs.com


> Why? Afaict this has been suggested multiple times by people who don't
> justify it in any way except with handwavy -- larger samples are
> better. The sample size is picked based on what sample statistics
> tells us we need to achieve a given 95th percentile confidence
> interval for the bucket size given.

I also just realized that I confused myself ... we don't really want
more MCVs. What we want is more *samples* to derive a small number of
MCVs. Right now # of samples and number of MCVs is inextricably bound,
and they shouldn't be. On larger tables, you're correct that we don't
necessarily want more MCVs, we just need more samples to figure out
those MCVs accurately.

> Can you explain when this would and wouldn't bias the sample for the
> users so they can decide whether to use it or not?

Sure. There's some good math in various ACM papers for this. The
basics are that block-based sampling should be accompanied by an
increased sample size, or you are lowering your confidence level. But
since block-based sampling allows you to increase your sample size
without increasing I/O or RAM usage, you *can* take a larger sample ...
a *much* larger sample if you have small rows.

The algorithms for deriving stats from a block-based sample are a bit
more complex, because the code needs to determine the level of physical
correlation in the blocks sampled and skew the stats based on that. So
there would be an increase in CPU time. As a result, we'd probably give
some advice like "random sampling for small tables, block-based for
large ones".

> I think increasing the MCV is too simplistic since we don't really
> have any basis for any particular value. I think what we need are some
> statistics nerds to come along and say here's this nice tool from
> which you can make the following predictions and understand how
> increasing or decreasing the data set size affects the accuracy of the
> predictions.

Agreed.

Nathan?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date: 2010-10-21 01:16:27
Message-ID: AANLkTimyGht=_Z-tbGkCp8QCO8bfpaWKDsCJwjXYNi0K@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 20, 2010 at 6:03 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> I also just realized that I confused myself ... we don't really want
> more MCVs.  What we want is more *samples* to derive a small number of
> MCVs.  Right now # of samples and number of MCVs is inexorably bound,
> and they shouldn't be.  On larger tables, you're correct that we don't
> necessarily want more MCVs, we just need more samples to figure out
> those MCVs accurately.

I don't see why the MCVs would need a particularly large sample size
to calculate accurately. Have you done any tests on the accuracy of
the MCV list?

Robert explained why having more MCVs might be useful because we use
the frequency of the least common MCV as an upper bound on the
frequency of any value in the MCV. That seems logical but it's all
about the number of MCV entries not the accuracy of them. And mostly
what it tells me is that we need a robust statistical method and the
data structures it requires for estimating the frequency of a single
value.

Binding the length of the MCV list to the size of the histogram is
arbitrary but so would any other value and I haven't seen anyone
propose any rationale for any particular value. The only rationale I
can see is that we probably want to take roughly the same amount of
space as the existing stats -- and that means we probably want it to
be roughly the same size.

--
greg


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date: 2010-10-21 01:41:36
Message-ID: 4CBF9A50.8040604@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> I don't see why the MCVs would need a particularly large sample size
> to calculate accurately. Have you done any tests on the accuracy of
> the MCV list?

Yes, although I don't have them at my fingertips. In sum, though, you
can't take 10,000 samples from a 1b row table and expect to get a
remotely accurate MCV list.

A while back I did a fair bit of reading on ndistinct and large tables
from the academic literature. The consensus of many papers was that it
took a sample of at least 3% (or 5% for block-based) of the table in
order to have 95% confidence in ndistinct of 3X. I can't imagine that
MCV is easier than this.

> And mostly
> what it tells me is that we need a robust statistical method and the
> data structures it requires for estimating the frequency of a single
> value.

Agreed.

> Binding the length of the MCV list to the size of the histogram is
> arbitrary but so would any other value and I haven't seen anyone
> propose any rationale for any particular value.

histogram size != sample size. It is in our code, but that's a bug and
not a feature.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date: 2010-10-21 01:49:23
Message-ID: AANLkTimHJEE-0guj0j8-a7S_ioPWaMNKUQ9Y902K2CXJ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 20, 2010 at 7:13 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Wed, Oct 20, 2010 at 3:54 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> Yes, I think a percentage of the table is going to break down either
>> at the high end or the low end.  Hand-waving (but based on
>> experience), for a 1000 row table a statistics target of 10 is
>> probably approximately right and 100 is too much and 1 is too little.
>> But for a 1,000,000 row table 10,000 is probably too much and even
>> 1,000 is pushing it.  So using a constant percentage of table rows
>> doesn't feel right.  I had a thought today that it might make sense to
>> use an exponential curve, like min(2 * N^(1/3), 10).  I can't really
>> justify that mathematically, but that doesn't mean it won't work well
>> in practice.
>
> Well we can analyze it but as you said later, it all depends on what
> queries you're running. If we want to aim for the same confidence
> interval at all times, ie that the estimated frequency is accurate to
> within +/- x% 95% of the time then:
>
> If we're querying ranges a..b which represent a constant percentage of
> the table we need a fixed number of buckets and a sample size that
> varies very little with respect to the size of the table (effectively
> constant).
>
> If we're querying ranges a..b which are constant sized and therefore
> represent a smaller percentage of the table as it grows then we need a
> number of buckets that's proportional to the size of the table. The
> sample size is proportional to the number of buckets (ie, it's a
> constant sized sample per bucket).
>
> If we're querying for a specific value which isn't one of the most
> common values then I'm not clear how to characterize the accuracy or
> precision of our current estimates let alone how they would vary if we
> changed our sample sizes.

I think that sums it up pretty well. There's no one right formula. I
think this problem needs an empirical approach rather than a
statistical analysis. We know that it's impractical for the stats
target to be linear in the table size. We also know that constant
values are excessive for small tables and sometimes inadequate for
large ones. Therefore, we should pick something that grows, but
sublinearly. Discuss. :-)
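
Purely as an illustration of that shape (the constants and the clamp
here are made up for the example, not a proposal):

#include <math.h>

/*
 * Hypothetical sublinear statistics-target curve, for illustration only:
 * grows with the cube root of the estimated row count, clamped to an
 * arbitrary range of [10, 1000].
 */
static int
stats_target_for_rows(double reltuples)
{
    double      t = 2.0 * cbrt(reltuples);

    if (t < 10.0)
        return 10;
    if (t > 1000.0)
        return 1000;
    return (int) t;
}

That gives 20 for a 1,000-row table, about 200 at 1,000,000 rows, and
hits the 1,000 cap around 125 million rows -- growing, but much more
slowly than the table.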

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Nathan Boley <npboley(at)gmail(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date: 2010-10-21 01:53:13
Message-ID: AANLkTikVAH5dM-CD0YJacCXuEg7Tz9eCUs0V0-eN9DvO@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Robert explained why having more MCVs might be useful because we use
> the frequency of the least common MCV as an upper bound on the
> frequency of any value in the MCV.

Where is that being used? The only non-MCV frequency estimate that I
recall seeing is ( nrows - n_ndistinct_rows )/ndistinct. Obviously
changing the number of mcv's affects this by lowering
n_ndistinct_rows, but it's always a pretty coarse estimate.

>  Binding the length of the MCV list to the size of the histogram is
> arbitrary but so would any other value

Wouldn't the best approach be to stop adding MCV's/histogram buckets
when adding new ones doesn't decrease your prediction error
'substantially'?

One very hacky threshold heuristic is to stop adding MCV's when a
simple equality select ( SELECT col FROM table WHERE col == VALUE )
would switch the plan from an index to a sequential scan ( or vice
versa, although with the current code this would never happen ). ie,
if the non_mcv frequency estimate is 0.1% ( producing an index scan ),
but adding the MCV gives us an estimate of 5% ( probably producing a seq
scan ) then add that value as an mcv. More sophisticated variations
might also consider plan changes to very suboptimal joins; even more
sophisticated would be to stop when the MAX( curr - optimal plan /
optimal plan ) was below some threshold, say 20%, over a bunch of
recently executed queries.

A similar approach would work for histogram bins, except the queries
of interest are inequality rather than equality selections.

-Nathan


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-21 01:54:19
Message-ID: AANLkTimcaOSMCRjQpctatoqzVQospi=vhDjV_FfLNBsy@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 20, 2010 at 6:17 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> Quite.  Josh, have you got any evidence showing that the penalty is
>> only 10%?  There are cases, such as COPY and ALTER TABLE, where
>> you'd be looking at 2X or worse penalties, because of the existing
>> optimizations that avoid writing WAL at all for operations where a
>> single final fsync can serve the purpose.  I'm not sure what the
>> penalty for "typical" workloads is, partly because I'm not sure what
>> should be considered a "typical" workload for this purpose.
>
> If we could agree on some workloads, I could run some benchmarks.  I'm
> not sure what those would be though, given that COPY and ALTER TABLE
> aren't generally included in most benchmarks.  I could see how
> everything else is affected, though.

I think this whole thing is a complete non-starter. Are we seriously
talking about shipping a configuration that will slow down COPY by 2X
or more, just so that someone who wants replication can do it by
changing one fewer parameter? I find it impossible to believe that's
a good decision, and IMHO we should be focusing on how to make the
parameters PGC_SIGHUP rather than PGC_POSTMASTER, which would give us
most of the same benefits without throwing away hard-won performance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Nathan Boley <npboley(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date: 2010-10-21 02:00:20
Message-ID: AANLkTinRiR-TnNsW7O+S4H=J-RkA+Txj1RcZQJ5oErL9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 20, 2010 at 9:53 PM, Nathan Boley <npboley(at)gmail(dot)com> wrote:
>> Robert explained why having more MCVs might be useful because we use
>> the frequency of the least common MCV as an upper bound on the
>> frequency of any value in the MCV.
>
> Where is that being used?

var_eq_const

> The only non-MCV frequency estimate that I
> recall seeing is ( nrows - n_ndistinct_rows  )/ndistinct. Obviously
> changing the number of mcv's affects this by lowering
> n_ndistinct_rows, but it's always a pretty coarse estimate.

That one's used, too, but the other is used as an upper bound.
n_distinct tends to come out too small on large tables, so that
formula is prone to overestimation. Actually, both formulas are prone
to overestimation.
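
For reference, the shape of the estimate being discussed is roughly
this (a simplified sketch from memory, with invented variable names,
not the actual selfuncs.c code):

/*
 * Simplified sketch of the equality estimate for a value that is not
 * in the MCV list.
 */
static double
nonmcv_selectivity(double sumcommon,    /* combined frequency of the MCVs */
                   double ndistinct,    /* estimated distinct values */
                   int nmcv,            /* number of MCV entries */
                   double minmcvfreq)   /* frequency of least common MCV */
{
    double      selec = (1.0 - sumcommon) / (ndistinct - nmcv);

    /* the least common MCV's frequency acts as an upper bound */
    if (selec > minmcvfreq)
        selec = minmcvfreq;
    return selec;
}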

>>  Binding the length of the MCV list to the size of the histogram is
>> arbitrary but so would any other value
>
> Wouldn't the best approach be to stop adding MCV's/histogram buckets
> when adding new ones doesn't decrease your prediction error
> 'substantially'?
>
> One very hacky threshold heuristic is to stop adding MCV's when a
> simple equality select (  SELECT col FROM table WHERE col == VALUE )
> would switch the plan from an index to a sequential scan ( or vice
> versa, although with the current code this would never happen ). ie,
> if the non_mcv frequency estimate is 0.1% ( producing an index scan ),

When this happens depends on the values of a whole boat-load of GUCs...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Nathan Boley <npboley(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date: 2010-10-21 02:10:39
Message-ID: AANLkTinsGBjF=wh-N0XuujWB_es8VmAqgGMq=OwfxM7Y@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> That one's used, too, but the other is used as an upper bound.
> n_distinct tends to come out too small on large tables, so that
> formula is prone to overestimation.  Actually, both formulas are prone
> to overestimation.
>

Right - thanks.

> When this happens depends on the values of a whole boat-load of GUCs...

Well, then doesn't the 'proper' number of buckets depend on a whole
boat-load of GUCs...

-Nathan


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date: 2010-10-21 02:32:21
Message-ID: 1287628341.7085.90.camel@jd-desktop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2010-10-20 at 15:15 -0700, Josh Berkus wrote:
> >> Maybe what should be done about this is to have separate sizes for the
> >> MCV list and the histogram, where the MCV list is automatically sized
> >> during ANALYZE.
>
> It's been suggested multiple times that we should base our sample size
> on a % of the table, or at least offer that as an option. I've pointed
> out (with math, which Simon wrote a prototype for) that doing
> block-based sampling instead of random-row sampling would allow us to
> collect, say, 2% of a very large table without more I/O than we're doing
> now.
>
> Nathan Boley has also shown that we could get tremendously better
> estimates without additional sampling if our statistics collector
> recognized common patterns such as normal, linear and geometric
> distributions. Right now our whole stats system assumes a completely
> random distribution.
>
> So, I think we could easily be quite a bit smarter than just increasing
> the MCV. Although that might be a nice start.

References would be nice.

JD

>
> --
> -- Josh Berkus
> PostgreSQL Experts Inc.
> http://www.pgexperts.com
>

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date: 2010-10-21 02:45:07
Message-ID: 10155.1287629107@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
> A while back I did a fair bit of reading on ndistinct and large tables
> from the academic literature. The consensus of many papers was that it
> took a sample of at least 3% (or 5% for block-based) of the table in
> order to have 95% confidence in ndistinct of 3X. I can't imagine that
> MCV is easier than this.

You've got that entirely backwards. ndistinct is hard because there
could be a whole lot of values with very small occurrence counts,
and sampling is not going to help you distinguish between things
that occur but once and things that occur only two or three times.
But whether the table has a lot of the first or a lot of the second
means a difference of 2X or 3X on ndistinct.

MCVs, on the other hand, are MCVs precisely because they occur a lot,
and so they are highly likely to show up in a sample. You *can* get a
decent estimate of the first few MCVs from a sample, you just have to be
cautious about not believing something is an MCV if it's only a small
proportion of your sample.
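
A rough illustration of why, taking the roughly 300 * statistics_target
rows that ANALYZE samples as a ballpark: a value with true frequency p
is missed entirely by an n-row sample with probability about (1 - p)^n.
With n = 30,000, a value at p = 1% shows up ~300 times in expectation
and the chance of missing it completely is on the order of e^{-300},
regardless of whether the table has 10^6 or 10^9 rows. A value that
occurs only a handful of times in a billion rows, by contrast, is
almost never sampled at all -- which is exactly why ndistinct is the
hard case.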

regards, tom lane


From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-21 03:01:39
Message-ID: 4CBFAD13.10601@2ndquadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:
> If we could agree on some workloads, I could run some benchmarks. I'm
> not sure what those would be though, given that COPY and ALTER TABLE
> aren't generally included in most benchmarks.

You can usefully and easily benchmark this by timing a simple pgbench
initialization at a decently large scale. The COPY used to populate the
giant accounts table takes advantage of the WAL bypass fast path if
available, and you can watch performance tank the minute one of the
options that disables it is turned on.

--
Greg Smith, 2ndQuadrant US greg(at)2ndQuadrant(dot)com Baltimore, MD
PostgreSQL Training, Services and Support www.2ndQuadrant.us


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: default_statistics_target WAS: max_wal_senders must die
Date: 2010-10-21 04:03:31
Message-ID: AANLkTi=1WULMrNxYUQite54cWuDx_-SowiDUNh+593nh@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 20, 2010 at 6:41 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> A while back I did a fair bit of reading on ndistinct and large tables
> from the academic literature.  The consensus of many papers was that it
> took a sample of at least 3% (or 5% for block-based) of the table in
> order to have 95% confidence in ndistinct of 3X.  I can't imagine that
> MCV is easier than this.

Interestingly I also read up on this but found a different and even
more pessimistic conclusion. Basically, unless you're willing to read
about 50% or more of the table you can't make useful estimates at all,
and even then the estimates are pretty unreliable. Which makes a lot
of sense, since a handful of entries can easily completely change
ndistinct.

> histogram size != sample size.  It is in our code, but that's a bug and
> not a feature.

For the histogram there's a solid statistical reason why the two are related.

For ndistinct I agree you would need to sample a proportion of the
table and from what I read you really want that proportion to be 100%.

For the MCV I'm not entirely clear yet what the right answer is. It's
possible you're right but then I don't see a good algorithm for
calculating MCVs accurately for large sample sizes using a reasonable
amount of resources.

--
greg


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-21 20:21:12
Message-ID: 4CC0A0B8.9050103@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/20/10 6:54 PM, Robert Haas wrote:
> I find it impossible to believe that's
> a good decision, and IMHO we should be focusing on how to make the
> parameters PGC_SIGHUP rather than PGC_POSTMASTER, which would give us
> most of the same benefits without throwing away hard-won performance.

I'd be happy to accept that. Is it possible, though?

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-21 20:23:46
Message-ID: AANLkTiki-9KcFHAMjBfE5C+OcfNVhxtivW=K_920_j5y@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 21, 2010 at 4:21 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> On 10/20/10 6:54 PM, Robert Haas wrote:
>> I find it impossible to believe that's
>> a good decision, and IMHO we should be focusing on how to make the
>> parameters PGC_SIGHUP rather than PGC_POSTMASTER, which would give us
>> most of the same benefits without throwing away hard-won performance.
>
> I'd be happy to accept that.  Is it possible, though?

I sketched an outline of the problem AIUI here:

http://archives.postgresql.org/pgsql-hackers/2010-10/msg01348.php

I think it's possible; I'm not quite sure how hard it is.
Unfortunately, I've not had as much PG-hacking time lately as I'd
like...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Greg Smith <greg(at)2ndquadrant(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-22 00:32:41
Message-ID: 201010220032.o9M0WfB01500@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:
> On Wed, Oct 20, 2010 at 3:40 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> > On Wed, Oct 20, 2010 at 6:29 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> >> Exactly.  It doesn't take many 3-7% slowdowns to add up to being 50%
> >> or 100% slower, and that sucks.  In fact, I'm still not convinced that
> >> we were wise to boost default_statistics_target as much as we did.  I
> >> argued for a smaller boost at the time.
> >
> > Well we don't want to let ourselves be paralyzed by FUD so it was
> > important to identify specific concerns and then tackle those
> > concerns. Once we identified the worst-case planning cases we profiled
> > them and found that the inflection point of the curve was fairly
> > clearly above 100 but that there were cases where values below 1,000
> > caused problems. So I'm pretty happy with the evidence-based approach.
>
> The inflection point of the curve was certainly a good thing for us to
> look at but the fact remains that we took a hit on a trivial
> benchmark, and we can't afford to take too many of those.

Agreed. If people start wondering if our new major releases are perhaps
_slower_ than previous ones, we have lost a huge amount of momentum.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-22 00:33:08
Message-ID: 201010220033.o9M0X8801629@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas wrote:
> On Thu, Oct 21, 2010 at 4:21 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> > On 10/20/10 6:54 PM, Robert Haas wrote:
> >> I find it impossible to believe that's
> >> a good decision, and IMHO we should be focusing on how to make the
> >> parameters PGC_SIGHUP rather than PGC_POSTMASTER, which would give us
> >> most of the same benefits without throwing away hard-won performance.
> >
> > I'd be happy to accept that.  Is it possible, though?
>
> I sketched an outline of the problem AIUI here:
>
> http://archives.postgresql.org/pgsql-hackers/2010-10/msg01348.php
>
> I think it's possible; I'm not quite sure how hard it is.
> Unfortunately, I've not had as much PG-hacking time lately as I'd
> like...

Have we documented these TODOs?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-27 01:25:34
Message-ID: AANLkTi=Wf3ELSuYooRUYH1sOUFmqaOvt5y1Py3ZyKy+m@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Oct 21, 2010 at 8:33 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> Robert Haas wrote:
>> On Thu, Oct 21, 2010 at 4:21 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> > On 10/20/10 6:54 PM, Robert Haas wrote:
>> >> I find it impossible to believe that's
>> >> a good decision, and IMHO we should be focusing on how to make the
>> >> parameters PGC_SIGHUP rather than PGC_POSTMASTER, which would give us
>> >> most of the same benefits without throwing away hard-won performance.
>> >
>> > I'd be happy to accept that.  Is it possible, though?
>>
>> I sketched an outline of the problem AIUI here:
>>
>> http://archives.postgresql.org/pgsql-hackers/2010-10/msg01348.php
>>
>> I think it's possible; I'm not quite sure how hard it is.
>> Unfortunately, I've not had as much PG-hacking time lately as I'd
>> like...
>
> Have we documented these TODOs?

I have not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-27 08:08:53
Message-ID: 1288166933.1587.1246.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, 2010-10-19 at 15:32 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > On Tue, Oct 19, 2010 at 12:18 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> >> On 10/19/2010 09:06 AM, Greg Smith wrote:
> >>> I think Magnus's idea to bump the default to 5 triages the worst of the
> >>> annoyance here, without dropping the feature (which has uses) or waiting
> >>> for new development to complete.
>
> > Setting max_wal_senders to a non-zero value causes additional work to
> > be done every time a transaction commits, aborts, or is prepared.
>
> Yes.

Sorry guys, but that is completely wrong. There is no additional work to
be done each time a transaction commits, even with sync rep. And I don't
mean "nearly zero", I mean nada.

> This isn't just a numeric parameter; it's also a boolean
> indicating "do I want to pay the overhead to be prepared to be a
> replication master?".

Agreed, but it's to do with wal_level.

> Josh has completely failed to make a case that
> that should be the default.

Agreed.

> In fact, the system would fail to start
> at all if we just changed the default for max_wal_senders and not the
> default for wal_level.

Agree with that as a problem.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-27 17:05:11
Message-ID: 4CC85BC7.6030408@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


>> Josh has completely failed to make a case that
>> that should be the default.
>
> Agreed.

In what way have I failed to make a case?

1. The more settings our users need to change to make replication work,
the more difficult and frustrating it is for them. See Robert's example
of the current work path earlier in this thread.

2. Therefore it benefits our users if as many settings as can be set
without penalty default to replication-compatible values.

3. If there is no specific performance penalty for the master being
willing to accept a replication connection, then placing a limit on the
# of potential replication connections is an obscure high-end corner
case, and the common case is the user who wants no limit.

This seems like a very simple case of making things easier for our users
... setting a default to what most users want. Why is it opaque to you
guys?

In each step of working on replication management and configuration, I
see this list focusing on high-end corner cases and ignoring 99% of our
users, who just want a "replication = on" switch. I've seen this in
this discussion, and I've seen it in the sync rep discussion. While
it's fun to talk about huge pools of servers with quorum commit and
load-balancing connection ratios, 99% of our users just have 2 servers
they want to replicate, and do it in a low administration way.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-27 19:10:58
Message-ID: 1288206658.1587.2119.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2010-10-27 at 10:05 -0700, Josh Berkus wrote:
> >> Josh has completely failed to make a case that
> >> that should be the default.
> >
> > Agreed.
>
> In what way have I failed to make a case?

I just removed a huge hurdle on the journey to simplification. That
doesn't mean I think you have come up with an acceptable solution,
though there probably is one.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndQuadrant(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-27 19:33:55
Message-ID: 26037.1288208035@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>>> Josh has completely failed to make a case that
>>> that should be the default.
>>
>> Agreed.

> In what way have I failed to make a case?

You're assuming that we should set up the default behavior to support
replication and penalize those who aren't using it. Considering that
we haven't even *had* replication until now, it seems a pretty safe
bet that the majority of our users aren't using it and won't appreciate
that default. We routinely expend large amounts of effort to avoid
cross-version performance regressions, and I don't see that this one
is acceptable when others aren't.

I entirely agree that it ought to be easier to set up replication.
But there's a difference between having a big red EASY button for people
to push, and pushing it for them.

regards, tom lane


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Simon Riggs <simon(at)2ndQuadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndQuadrant(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-27 19:40:12
Message-ID: 1288208412.8120.104.camel@jd-desktop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2010-10-27 at 15:33 -0400, Tom Lane wrote:
> Josh Berkus <josh(at)agliodbs(dot)com> writes:
> >>> Josh has completely failed to make a case that
> >>> that should be the default.
> >>
> >> Agreed.
>
> > In what way have I failed to make a case?
>
> You're assuming that we should set up the default behavior to support
> replication and penalize those who aren't using it.
> Considering that
> we haven't even *had* replication until now, it seems a pretty safe
> bet that the majority of our users aren't using it and won't appreciate
> that default. We routinely expend large amounts of effort to avoid
> cross-version performance regressions, and I don't see that this one
> is acceptable when others aren't.
>
> I entirely agree that it ought to be easier to set up replication.
> But there's a difference between having a big red EASY button for people
> to push, and pushing it for them.

Replication is an option, not a requirement. So +1 on Tom's argument
here.

>
> regards, tom lane
>

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-10-27 20:42:42
Message-ID: 4CC88EC2.2080803@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> You're assuming that we should set up the default behavior to support
> replication and penalize those who aren't using it.

What's the penalty? Simon just said that there isn't one.

And there's a difference between saying that I "failed to make a case"
vs. "the cost is too great". Saying the former is saying that my
argument lacks merit (or content) entirely, rather than saying that it's
not sufficient. I made a case, the case just didn't persuade you ... yet.

> I entirely agree that it ought to be easier to set up replication.
> But there's a difference between having a big red EASY button for people
> to push, and pushing it for them.

If we have a single boolean GUC called "replication", I would be happy.
Even if it defaulted to "off".

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Greg Smith <greg(at)2ndquadrant(dot)com>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-27 21:33:41
Message-ID: AANLkTim3XgPtCYM3Nh8ADun=nmVx+pbzNgTE_RP7KhL9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 27, 2010 at 12:33 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> You're assuming that we should set up the default behavior to support
> replication and penalize those who aren't using it.  Considering that
> we haven't even *had* replication until now, it seems a pretty safe
> bet that the majority of our users aren't using it and won't appreciate
> that default.  We routinely expend large amounts of effort to avoid
> cross-version performance regressions, and I don't see that this one
> is acceptable when others aren't.

So I think we're talking about two different things.

a) whether to default to enabling the no-wal optimizations for newly
created tables which break replication

b) whether to impose a limit on the number of replication slaves by default

c) whether we can make these flags changeable without restarting

I think (a) is a non-starter but if we can achieve (b) and (c) without
(a) then we're in pretty good shape. Someone would still have to
enable replication manually but they could do it without restarting,
and once they do it that one parameter would be sufficient, there
wouldn't be any hidden options lying in wait to trap them.

I think (c) is doable -- if we remember when the last non-logged
operation was, we can refuse replication connections unless replication
is active and the replica isn't requesting any WAL from prior to the
last unlogged operation.

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-10-27 22:01:38
Message-ID: 29182.1288216898@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus <josh(at)agliodbs(dot)com> writes:
>> You're assuming that we should set up the default behavior to support
>> replication and penalize those who aren't using it.

> What's the penalty? Simon just said that there isn't one.

I don't know what Simon is thinking, but I think he's nuts. There is
obvious extra overhead in COMMIT:

/*
 * Wake up all walsenders to send WAL up to the COMMIT record
 * immediately if replication is enabled
 */
if (max_wal_senders > 0)
    WalSndWakeup();

which AFAICT is injecting multiple kernel calls into what's not only
a hot-spot but a critical section (ie, any error -> PANIC).

That's not even considering the extra WAL that is generated when you
move up from wal_level = "minimal". That's probably the bigger
performance issue in practice.

> And there's a difference between saying that I "failed to make a case"
> vs. "the cost is too great".

I said, and meant, that you didn't make the case at all; you just
presumed it was obvious that we should change the defaults to be
replication-friendly. I don't think it is. As I said, I think that
only a minority of our users are going to want replication.

regards, tom lane


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-10-27 23:13:42
Message-ID: 4CC8B226.6050603@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> That's not even considering the extra WAL that is generated when you
> move up from wal_level = "minimal". That's probably the bigger
> performance issue in practice.

Yeah, I think we've established that we can't change that.

> I said, and meant, that you didn't make the case at all; you just
> presumed it was obvious that we should change the defaults to be
> replication-friendly. I don't think it is. As I said, I think that
> only a minority of our users are going to want replication.

50% of PGX's active clients have either already converted to 9.0
replication or have scheduled a conversion with us. I expect that to be
80% by the time 9.1 comes out, and the main reason why it's not 100% is
that a few clients specifically need Slony (partial replication or
similar) or ad-hoc replication systems.

Every time I do a walk-through of how to do replication at a PG event
it's packed. I've talked to dozens of people who are planning to
implement 9.0 replication at conferences, and it's outpaced "how does it
compare to MySQL" for stuff people ask me about at booths.

From where I sit, you're *dramatically* underestimating the demand for
replication. Maybe other people haven't had the same experiences, but
I'm seeing an avalanche of demand.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-10-27 23:22:37
Message-ID: 1288221757.15416.23.camel@jd-desktop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2010-10-27 at 16:13 -0700, Josh Berkus wrote:
> > That's not even considering the extra WAL that is generated when you
> > move up from wal_level = "minimal". That's probably the bigger
> > performance issue in practice.
>
> Yeah, I think we've established that we can't change that.
>
> > I said, and meant, that you didn't make the case at all; you just
> > presumed it was obvious that we should change the defaults to be
> > replication-friendly. I don't think it is. As I said, I think that
> > only a minority of our users are going to want replication.
>
> 50% of PGX's active clients have either already converted to 9.0
> replication or have scheduled a conversion with us. I expect that to be
> 80% by the time 9.1 comes out, and the main reason why it's not 100% is
> that a few clients specifically need Slony (partial replication or
> similar) or ad-hoc replication systems.

That's interesting. ZERO % of CMD's clients have converted to 9.0 and
many have no current inclination to do so because they are already
easily served by Londiste, Slony, DRBD or Log Shipping.

I would also agree that the minority of our users will want replication.
The majority of CMD customers, PGX customers, EDB Customers will want
replication but that is by far NOT the majority of our (.Org) users.

Sincerely,

Joshua D. Drake

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-10-27 23:45:04
Message-ID: 4CC8B980.7090501@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> I would also agree that the minority of our users will want replication.
> The majority of CMD customers, PGX customers, EDB Customers will want
> replication but that is by far NOT the majority of our (.Org) users.

That just means that *you don't know* how many .org users plan to
implement replication, whether it's a minority or majority.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-10-27 23:52:38
Message-ID: AANLkTikpdOBQGi4tnTE_eM8m+=Zi-XKfTf8xf=MoVepK@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 27, 2010 at 7:45 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
>> I would also agree that the minority of our users will want replication.
>> The majority of CMD customers, PGX customers, EDB Customers will want
>> replication but that is by far NOT the majority of our (.Org) users.
>
> That just means that *you don't know* how many .org users plan to
> implement replication, whether it's a minority or majority.

None of us know. What I do know is that I don't want PostgreSQL to be
slower out of the box.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: "Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-10-27 23:53:55
Message-ID: 1288223635.15416.27.camel@jd-desktop
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, 2010-10-27 at 19:52 -0400, Robert Haas wrote:
> On Wed, Oct 27, 2010 at 7:45 PM, Josh Berkus <josh(at)agliodbs(dot)com> wrote:
> >> I would also agree that the minority of our users will want replication.
> >> The majority of CMD customers, PGX customers, EDB Customers will want
> >> replication but that is by far NOT the majority of our (.Org) users.
> >
> > That just means that *you don't know* how many .org users plan to
> > implement replication, whether it's a minority or majority.
>
> None of us know. What I do know is that I don't want PostgreSQL to be
> slower out of the box.

Poll TIME!

JD

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>

--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579
Consulting, Training, Support, Custom Development, Engineering
http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-10-27 23:57:43
Message-ID: 4CC8BC77.6070200@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> None of us know. What I do know is that I don't want PostgreSQL to be
> slower out of the box.

Understandable. So it seems like the answer is getting replication down
to one configuration variable for the common case. That eliminates the
cycle of "oops, need to set X and restart/reload" without paying
performance penalties on standalone servers.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-28 15:45:11
Message-ID: 1288280368-sup-3462@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Excerpts from Tom Lane's message of Wed Oct 27 19:01:38 -0300 2010:

> I don't know what Simon is thinking, but I think he's nuts. There is
> obvious extra overhead in COMMIT:
>
> /*
> * Wake up all walsenders to send WAL up to the COMMIT record
> * immediately if replication is enabled
> */
> if (max_wal_senders > 0)
> WalSndWakeup();
>
> which AFAICT is injecting multiple kernel calls into what's not only
> a hot-spot but a critical section (ie, any error -> PANIC).

Hmm, I wonder if that could be moved out of the critical section
somehow. Obviously the point here is to allow wal senders to react
before we write to clog (which is expensive by itself); but it seems
hard to wake up some other process without incurring exactly the same
cost (which is basically SetLatch) ... the only difference is that it
would be a single one instead of one per walsender.

BTW I note that there are no elog(ERROR) calls in that code path at all,
because syscall errors are ignored, so PANIC is not a concern (as the
code stands currently, at least). ISTM it would be good to have a
comment on SetLatch stating that it's used inside critical sections,
though.

--
Álvaro Herrera <alvherre(at)commandprompt(dot)com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: max_wal_senders must die
Date: 2010-10-28 15:55:27
Message-ID: 17626.1288281327@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> BTW I note that there are no elog(ERROR) calls in that code path at all,
> because syscall errors are ignored, so PANIC is not a concern (as the
> code stands currently, at least). ISTM it would be good to have a
> comment on SetLatch stating that it's used inside critical sections,
> though.

Yeah, I was thinking the same while reading the code yesterday. It
already notes that it's used in interrupt handlers, but the critical
section angle is an additional reason not to elog(ERROR).

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-11-13 04:09:06
Message-ID: 201011130409.oAD496x22133@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Josh Berkus wrote:
>
> > None of us know. What I do know is that I don't want PostgreSQL to be
> > slower out of the box.
>
> Understandable. So it seems like the answer is getting replication down
> to one configuration variable for the common case. That eliminates the
> cycle of "oops, need to set X and restart/reload" without paying
> performance penalties on standalone servers.

Right. I propose that we set max_wal_senders to unlimited when
wal_level = hot_standby. When they tell us they are using hot_standby
via wal_level, why make them change another setting (max_wal_senders)?

Basically, we don't need to turn everything on by default, but some
settings should trigger other behavior automatically.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-11-13 04:27:53
Message-ID: 9556.1289622473@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> Right. I propose that we set max_wal_senders to unlimited when
> wal_level = hot_standby.

It's a memory allocation parameter ... you can't just set it to
"unlimited", at least not without a nontrivial amount of work.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-11-13 13:56:09
Message-ID: AANLkTimxcFDjvYyxLiiRBOnSCB1xJbO6xWx9k7TeSqoA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 12, 2010 at 11:27 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>> Right.  I propose that we set max_wal_senders to unlimited when
>> wal_level = hot_standby.
>
> It's a memory allocation parameter ... you can't just set it to
> "unlimited", at least not without a nontrivial amount of work.

Yes. This thread would benefit from less uneducated speculation and
more examination of what the parameter actually does. I've looked at
how it's set up in the hopes of finding an easy fix and haven't come
up with anything all that wonderful yet. In particular, there is this
code, which gets run on every commit (unless max_wal_senders is zero):

/* Wake up all walsenders */
void
WalSndWakeup(void)
{
    int         i;

    for (i = 0; i < max_wal_senders; i++)
        SetLatch(&WalSndCtl->walsnds[i].latch);
}

Notice that this code is able to iterate over all of the WAL senders
that might exist without taking any sort of lock. You do NOT want to
unnecessarily iterate over the entire procarray here. The obvious fix
is to keep an array that contains only the latches for the WAL senders
that actually exist at the moment, but then you have to insert a lock
acquisition and release in here, or risk backends failing to see an
update to the shared variable that identifies the end of the list,
which seems like a pretty bad idea too.

One idea I've had is that we might want to think about defining an
operation that is effectively "store, with a memory barrier". For
example, on x86, this could be implemented using xchg. I think if you
have a single-word variable in shared memory that is always updated
using a locked store, then individual backends should be able to read
it unlocked without risk of seeing a stale value. So in the above
example you could have an array in shared memory and a variable
updated in the manner just described indicating how many slots of the
array are in use, and backends could iterate up to the end of the
array without needing a lock on the number of slots. Of course, you
still have to figure out how to compact the array when a WAL sender
exits, but maybe that could be done with an extra layer of
indirection.
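
A minimal sketch of what such a primitive might look like on x86 (the
name is invented for illustration; this isn't proposing an actual API):

/*
 * Illustrative "store with full memory barrier" built on x86 XCHG.
 * XCHG with a memory operand is implicitly locked, so it acts as a
 * full barrier; the "memory" clobber keeps the compiler from
 * reordering around it.
 */
static inline void
store_with_barrier(volatile int *ptr, int val)
{
    __asm__ __volatile__("xchgl %0, %1"
                         : "+r" (val), "+m" (*ptr)
                         :
                         : "memory");
}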

Come to think of it, I'm not really sure I understand what protects
SetLatch() against memory ordering hazards. Is that actually safe?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-11-13 15:07:21
Message-ID: 24987.1289660841@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> One idea I've had is that we might want to think about defining an
> operation that is effectively "store, with a memory barrier". For
> example, on x86, this could be implemented using xchg. I think if you
> have a single-word variable in shared memory that is always updated
> using a locked store, then individual backends should be able to read
> it unlocked without risk of seeing a stale value.

You're still guilty of fuzzy thinking here. What does "stale" mean?
To do anything useful, you need to be able to fetch the value, execute
some sequence of operations that depends on the value, and be assured
that the value you fetched remains relevant throughout that sequence.
A memory barrier by itself doesn't help.

I have seen one or two places where we could use a memory barrier
primitive that doesn't include a lock, but they had to do with ensuring
that different writes would be seen to have occurred in a particular
order. It is not obvious how we could use one here.

Now, one could also argue that commit is already a sufficiently
heavyweight operation that taking/releasing one more LWLock won't
matter too much. But it would be nice to back that argument with
some evidence.

> Come to think of it, I'm not really sure I understand what protects
> SetLatch() against memory ordering hazards. Is that actually safe?

Hmm ... that's a good question. It certainly *looks* like it could
malfunction on machines with weak memory ordering.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-11-13 22:19:55
Message-ID: AANLkTi=QjQLMSCXcTNhVY1q32=UV3S4vkkFuXD-BJhMq@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Nov 13, 2010 at 10:07 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> One idea I've had is that we might want to think about defining an
>> operation that is effectively "store, with a memory barrier".  For
>> example, on x86, this could be implemented using xchg.  I think if you
>> have a single-word variable in shared memory that is always updated
>> using a locked store, then individual backends should be able to read
>> it unlocked without risk of seeing a stale value.
>
> You're still guilty of fuzzy thinking here.  What does "stale" mean?

Well, that obviously depends on what algorithm you use to add and
remove items from the array. I know my thinking is fuzzy; I was
trying to say "I'm mulling over whether we could do something like
this" rather than "I know exactly how to make this work".

*thinks it over*

Here's an algorithm that might work. Let's assume that we're still
going to allocate an array of size max_wal_senders, but we want to
make max_wal_senders relatively large butl avoid iterating over the
entire array on each commit. Store a variable in shared memory called
FirstWalSenderOffset which is always updated using a memory barrier
operation. This value is -1 if there are no currently connected WAL
senders, or the array slot of a currently connected walsender if there
is one. Each array slot also has a structure member called
NextWalSenderOffset, so that the list of connected WAL senders is
effectively stored as a linked list, but with array slot numbers
rather than pointers.

To add a WAL sender, we initialize the new array slot, setting
NextWalSenderOffset to FirstWalSenderOffset, and then set
FirstWalSenderOffset to the slot number of the newly allocated slot.
To remove a WAL sender, we loop through the notional linked list and,
when we find the "pointer" to the slot we want to remove, we override
it with a pointer to the slot to which the removed entry points. All
of these stores are done using the store-with-memory-barrier, so that
readers can iterate over the linked list without a lock. However, we
can't let two processes try to MODIFY the list at the same time (at
least, not without a more powerful memory ordering primitive; CMPXCHG
might be enough) so just protect list modification with a global
spinlock; updates won't be frequent.

This algorithm makes the assumption that it's OK for a scan to miss
WAL senders that are added mid-scan. But the current code makes that
assumption, too: a new WAL sender could grab an array slot we've
already passed.
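
To make the shape of that a bit more concrete, here is a rough C sketch.
StoreWithBarrier() is a stand-in for the hypothetical store-with-barrier
primitive (e.g. xchg on x86), and the slot layout is simplified:

#define MAX_WAL_SENDERS 8

typedef struct WalSndSlot
{
    int     pid;                    /* 0 if the slot is unused */
    int     NextWalSenderOffset;    /* next live slot, or -1 */
} WalSndSlot;

/* both of these live in shared memory */
static WalSndSlot   WalSndArray[MAX_WAL_SENDERS];
static volatile int FirstWalSenderOffset = -1;

/* placeholder: a store that also acts as a memory barrier */
#define StoreWithBarrier(ptr, val)  (*(ptr) = (val))

/* caller has initialized WalSndArray[slot] and holds the list spinlock */
static void
add_walsender(int slot)
{
    StoreWithBarrier(&WalSndArray[slot].NextWalSenderOffset,
                     FirstWalSenderOffset);
    StoreWithBarrier(&FirstWalSenderOffset, slot);
}

/* caller holds the list spinlock */
static void
remove_walsender(int slot)
{
    if (FirstWalSenderOffset == slot)
        StoreWithBarrier(&FirstWalSenderOffset,
                         WalSndArray[slot].NextWalSenderOffset);
    else
    {
        int     cur = FirstWalSenderOffset;

        while (WalSndArray[cur].NextWalSenderOffset != slot)
            cur = WalSndArray[cur].NextWalSenderOffset;
        StoreWithBarrier(&WalSndArray[cur].NextWalSenderOffset,
                         WalSndArray[slot].NextWalSenderOffset);
    }
}

/* readers (e.g. at commit) can walk the list without any lock */
static void
scan_walsenders(void)
{
    int     cur;

    for (cur = FirstWalSenderOffset; cur != -1;
         cur = WalSndArray[cur].NextWalSenderOffset)
    {
        /* inspect WalSndArray[cur] */
    }
}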

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-11-14 20:45:00
Message-ID: 4CE04A4C.3090003@enterprisedb.com
Lists: pgsql-hackers

On 13.11.2010 17:07, Tom Lane wrote:
> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>> Come to think of it, I'm not really sure I understand what protects
>> SetLatch() against memory ordering hazards. Is that actually safe?
>
> Hmm ... that's a good question. It certainly *looks* like it could
> malfunction on machines with weak memory ordering.

Can you elaborate?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: max_wal_senders must die
Date: 2010-11-14 20:55:33
Message-ID: 10468.1289768133@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> On 13.11.2010 17:07, Tom Lane wrote:
>> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>>> Come to think of it, I'm not really sure I understand what protects
>>> SetLatch() against memory ordering hazards. Is that actually safe?
>>
>> Hmm ... that's a good question. It certainly *looks* like it could
>> malfunction on machines with weak memory ordering.

> Can you elaborate?

Weak memory ordering means that stores into shared memory initiated by
one processor are not guaranteed to be observed to occur in the same
sequence by another processor. This implies first that the latch code
could malfunction all by itself, if two processes manipulate a latch at
about the same time, and second (probably much less likely) that there
could be a malfunction involving a process that's waited on a latch not
seeing the shared-memory status updates that another process did "before"
setting the latch.

This is not at all hypothetical --- my first attempt at rewriting the
sinval signaling code, a couple years back, failed on PPC machines in
the buildfarm because of exactly this type of issue.

The quick-and-dirty way to fix this is to attach a spinlock to each
latch, because we already have memory ordering sync instructions in
the spinlock primitives. Doing better would probably involve developing
a new set of processor-specific primitives --- which would be pretty
easy for the processors where we have gcc inline asm, but it would take
some research for the platforms where we're relying on magic OS-provided
subroutines.

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-15 07:15:08
Message-ID: 4CE0DDFC.7090801@enterprisedb.com
Lists: pgsql-hackers

On 14.11.2010 22:55, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> On 13.11.2010 17:07, Tom Lane wrote:
>>> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>>>> Come to think of it, I'm not really sure I understand what protects
>>>> SetLatch() against memory ordering hazards. Is that actually safe?
>>>
>>> Hmm ... that's a good question. It certainly *looks* like it could
>>> malfunction on machines with weak memory ordering.
>
>> Can you elaborate?
>
> Weak memory ordering means that stores into shared memory initiated by
> one processor are not guaranteed to be observed to occur in the same
> sequence by another processor. This implies first that the latch code
> could malfunction all by itself, if two processes manipulate a latch at
> about the same time, and second (probably much less likely) that there
> could be a malfunction involving a process that's waited on a latch not
> seeing the shared-memory status updates that another process did "before"
> setting the latch.
>
> This is not at all hypothetical --- my first attempt at rewriting the
> sinval signaling code, a couple years back, failed on PPC machines in
> the buildfarm because of exactly this type of issue.

Hmm, SetLatch only sets one flag, so I don't see how it could
malfunction all by itself. And I would've thought that declaring the
Latch variable "volatile" prevents rearrangements.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-15 13:22:32
Message-ID: AANLkTikU5n9q4-3sCP-EwUThMV2WutpMMeCqBjnY9hYt@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 15, 2010 at 2:15 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>> Can you elaborate?
>>
>> Weak memory ordering means that stores into shared memory initiated by
>> one processor are not guaranteed to be observed to occur in the same
>> sequence by another processor.  This implies first that the latch code
>> could malfunction all by itself, if two processes manipulate a latch at
>> about the same time, and second (probably much less likely) that there
>> could be a malfunction involving a process that's waited on a latch not
>> seeing the shared-memory status updates that another process did "before"
>> setting the latch.
>>
>> This is not at all hypothetical --- my first attempt at rewriting the
>> sinval signaling code, a couple years back, failed on PPC machines in
>> the buildfarm because of exactly this type of issue.
>
> Hmm, SetLatch only sets one flag, so I don't see how it could malfunction
> all by itself. And I would've thought that declaring the Latch variable
> "volatile" prevents rearrangements.

It's not a question of code rearrangement. Suppose at time zero, the
latch is unset, but owned. At approximately the same time, SetLatch()
is called in one process and WaitLatch() in another process.
SetLatch() sees that the latch is not set and sends SIGUSR1 to the
other process. The other process receives the signal but, since
waiting is not yet set, it ignores the signal. It then drains the
self-pipe and examines latch->is_set. But as it turns out, the update
by the process which called SetLatch() isn't yet visible to this
process, because this process has a copy of those bytes in some
internal cache that isn't guaranteed to be fully coherent. So even
though SetLatch() already changed latch->is_set to true, it still
looks false here. Therefore, we go to sleep on the latch.

At this point, we are very likely screwed. If we're lucky, yet a
third process will come along, also see the latch as still unset (even
though it is actually set), and set it again, waking up the owner. But if we're
unlucky, by the time that third process comes along, the memory update
will have become visible everywhere and all future calls to SetLatch()
will exit quickly, leaving the poor shmuck who waited on the latch
sleeping for all eternity.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-15 13:45:40
Message-ID: 4CE13984.1070900@enterprisedb.com
Lists: pgsql-hackers

On 15.11.2010 15:22, Robert Haas wrote:
> On Mon, Nov 15, 2010 at 2:15 AM, Heikki Linnakangas
> <heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>>>> Can you elaborate?
>>>
>>> Weak memory ordering means that stores into shared memory initiated by
>>> one processor are not guaranteed to be observed to occur in the same
>>> sequence by another processor. This implies first that the latch code
>>> could malfunction all by itself, if two processes manipulate a latch at
>>> about the same time, and second (probably much less likely) that there
>>> could be a malfunction involving a process that's waited on a latch not
>>> seeing the shared-memory status updates that another process did "before"
>>> setting the latch.
>>>
>>> This is not at all hypothetical --- my first attempt at rewriting the
>>> sinval signaling code, a couple years back, failed on PPC machines in
>>> the buildfarm because of exactly this type of issue.
>>
>> Hmm, SetLatch only sets one flag, so I don't see how it could malfunction
>> all by itself. And I would've thought that declaring the Latch variable
>> "volatile" prevents rearrangements.
>
> It's not a question of code rearrangement.

Rearrangement of code, rearrangement of CPU instructions, or
rearrangement of the order the changes in the memory become visible to
other processes. The end result is the same.

> Suppose at time zero, the
> latch is unset, but owned. At approximately the same time, SetLatch()
> is called in one process and WaitLatch() in another process.
> SetLatch() sees that the latch is not set and sends SIGUSR1 to the
> other process. The other process receives the signal but, since
> waiting is not yet set, it ignores the signal. It then drains the
> self-pipe and examines latch->is_set. But as it turns out, the update
> by the process which called SetLatch() isn't yet visible to this
> process, because this process has a copy of those bytes in some
> internal cache that isn't guaranteed to be fully coherent. So even
> though SetLatch() already changed latch->is_set to true, it still
> looks false here. Therefore, we go to sleep on the latch.

Surely marking the latch pointer volatile would force the store to
is_set to be flushed, if not immediately, at least before the kill()
system call. No?

Looking at Tom's patch in sinvaladt.c, the problem there was that the
store of the shared maxMsgNum variable could become visible to other
processes before the store of the message itself. Using a volatile
pointer for maxMsgNum would not have helped with that, because the
operations on other variables might still be rearranged with respect to
the store of maxMsgNum. SetLatch is simpler, there is only one variable
(ok, two, but your scenario didn't involve a change in owner_pid). It
seems safe to assume that the store becomes visible before the system call.
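
To spell that out -- a heavily simplified sketch, not the real
sinvaladt.c code or names:

#define MAXNUMMESSAGES 4096

typedef struct { int id; } SharedInvalidationMessage;

/* both in shared memory */
static SharedInvalidationMessage buffer[MAXNUMMESSAGES];
static volatile int maxMsgNum = 0;

static void
insert_sinval_message(SharedInvalidationMessage msg)
{
    int     n = maxMsgNum;

    buffer[n % MAXNUMMESSAGES] = msg;   /* plain store into the buffer */
    maxMsgNum = n + 1;                  /* volatile store, but nothing
                                         * forces the buffer store to
                                         * become visible to readers
                                         * first */
}

A reader that fetches maxMsgNum without a lock can therefore see the
bumped counter while the message slot still contains stale data.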

Tom's other scenario, where changing some other variable in shared
memory might not have become visible to other processes when SetLatch()
runs, seems more plausible (if harder to run into in practice). But if
the variable is meant to be examined by other processes, then you should
use a lock to protect it. Otherwise you'll have concurrency issues
anyway. Or at least use a volatile pointer, I believe it's safe to
assume that two operations using a volatile pointer will not be
rearranged wrt. each other.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-15 13:55:01
Message-ID: AANLkTikMUJ3SwCFt8PK5P9Jqjukj+mHDGLwRXCtZk-+-@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 15, 2010 at 8:45 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
>> It's not a question of code rearrangement.
>
> Rearrangement of code, rearrangement of CPU instructions, or rearrangement
> of the order the changes in the memory become visible to other processes.
> The end result is the same.

I'll let Tom speak to this, because he understands it better than I
do, but I don't think this is really true. Rearrangement of code is
something that the compiler has control over, and volatile addresses
that issue by preventing the compiler from making certain assumptions,
but the order in which memory operations become visible is a result of
the architecture of the memory bus, and the compiler doesn't directly
do anything about that. It won't for example emit a cmpxchg
instruction rather than a simple store just because you declared the
variable volatile.

>>  Suppose at time zero, the
>> latch is unset, but owned.  At approximately the same time, SetLatch()
>> is called in one process and WaitLatch() in another process.
>> SetLatch() sees that the latch is not set and sends SIGUSR1 to the
>> other process.  The other process receives the signal but, since
>> waiting is not yet set, it ignores the signal.  It then drains the
>> self-pipe and examines latch->is_set.  But as it turns out, the update
>> by the process which called SetLatch() isn't yet visible to this
>> process, because this process has a copy of those bytes in some
>> internal cache that isn't guaranteed to be fully coherent.  So even
>> though SetLatch() already changed latch->is_set to true, it still
>> looks false here.  Therefore, we go to sleep on the latch.
>
> Surely marking the latch pointer volatile would force the store to is_set to
> be flushed, if not immediately, at least before the kill() system call. No?

I don't think so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-15 14:51:58
Message-ID: 13400.1289832718@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>>> Hmm, SetLatch only sets one flag, so I don't see how it could malfunction
>>> all by itself. And I would've thought that declaring the Latch variable
>>> "volatile" prevents rearrangements.
>>
>> It's not a question of code rearrangement.

Precisely. "volatile" prevents the compiler from rearranging the
instruction sequence in a way that would *issue* stores out-of-order.
However, that doesn't prevent the hardware from *executing* the stores
out-of-order from the point of view of a different processor. As Robert
noted, the risk cases here come from caching; in particular, that a
dirty cache line might get flushed to main memory later than some other
dirty cache line. There are some architectures that guarantee that this
won't happen (no doubt at significant expenditure of gates). There are
others that don't, preferring to optimize the single-processor case.
On those, you need an "msync" type of instruction to force dirty cache
lines out to main memory between any two stores whose effects had better
become visible to another processor in a certain order. I'm not sure if
it's universal, but on PPC there are actually two different sync
concepts involved, a write barrier that does the above and a read
barrier that ensures the reading processor is up-to-date.
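
For illustration only, such primitives might be spelled along these
lines (the names are invented here); whether lwsync is really the right
PPC instruction, and what to do on platforms where we have no inline
asm, is exactly the research that's needed:

#if defined(__powerpc__) || defined(__powerpc64__)
/* PPC: explicit sync instructions are required */
#define pg_write_barrier()  __asm__ __volatile__ ("lwsync" : : : "memory")
#define pg_read_barrier()   __asm__ __volatile__ ("lwsync" : : : "memory")
#elif defined(__i386__) || defined(__x86_64__)
/* x86 keeps ordinary stores ordered and ordinary loads ordered, so a
 * compiler barrier suffices for these two cases */
#define pg_write_barrier()  __asm__ __volatile__ ("" : : : "memory")
#define pg_read_barrier()   __asm__ __volatile__ ("" : : : "memory")
#else
#error "write/read barriers not implemented for this platform"
#endif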

> I believe it's safe to
> assume that two operations using a volatile pointer will not be
> rearranged wrt. each other.

This is entirely wrong, so far as cross-processor visibility of changes
is concerned. Whether it should be true in some ideal reading of the C
spec is not relevant: there are common architectures in which it is not
true.

The window for trouble is normally pretty small, and in particular any
kernel call or context swap is usually going to force an msync. So I'm
not sure that there are any risk spots in practice right now with
SetLatch. But if we expand our use of it, we can be 100% certain we'll
get bit eventually.

> Tom's other scenario, where changing some other variable in shared
> memory might not have become visible to other processes when SetLatch()
> runs, seems more plausible (if harder to run into in practice). But if
> the variable is meant to be examined by other processes, then you should
> use a lock to protect it.

In that case, of what use is the latch stuff? The whole point with that
(or at least a lot of the point) is to not have to take locks.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-15 15:09:17
Message-ID: AANLkTi=BPxJ3PJ8Z6dUf3e=s0B0BXAFuWiJOMT6E1dfF@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 15, 2010 at 9:51 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>>>> Hmm, SetLatch only sets one flag, so I don't see how it could malfunction
>>>> all by itself. And I would've thought that declaring the Latch variable
>>>> "volatile" prevents rearrangements.
>>>
>>> It's not a question of code rearrangement.
>
> Precisely.  "volatile" prevents the compiler from rearranging the
> instruction sequence in a way that would *issue* stores out-of-order.
> However, that doesn't prevent the hardware from *executing* the stores
> out-of-order from the point of view of a different processor.  As Robert
> noted, the risk cases here come from caching; in particular, that a
> dirty cache line might get flushed to main memory later than some other
> dirty cache line.  There are some architectures that guarantee that this
> won't happen (no doubt at significant expenditure of gates).

And in fact if this (interesting!) video is any indication, that
problem is only going to get worse as core counts go up. This guy
built a lock-free, wait-free hash table implementation that can run on
a system with hundreds of cores. I'm just guessing here, but I
strongly suspect that keeping memory in full sync across that many
processors would just kill performance, so they shrug their shoulders
and don't. The application programmer gets to pick up the pieces.

http://video.google.com/videoplay?docid=2139967204534450862#

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-15 15:26:43
Message-ID: 4CE15133.8000109@enterprisedb.com
Lists: pgsql-hackers

On 15.11.2010 16:51, Tom Lane wrote:
> Heikki Linnakangas<heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> I believe it's safe to
>> assume that two operations using a volatile pointer will not be
>> rearranged wrt. each other.
>
> This is entirely wrong, so far as cross-processor visibility of changes
> is concerned.

Ok.

In SetLatch, is it enough to add the SpinLockAcquire() call *after*
checking that is_set is not already set? Ie. still do the quick exit
without holding a lock. Or do we need a memory barrier operation before
the fetch, to ensure that we see if the other process has just cleared
the flag with ResetLatch()? Presumably ResetLatch() needs to call
SpinLockAcquire() anyway to ensure that other processes see the clearing
of the flag.
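
Something like this is what I have in mind -- just a sketch, and it
assumes we add a slock_t "mutex" field to the Latch struct for the
purpose:

void
SetLatch(volatile Latch *latch)
{
    /* unlocked quick exit, as now */
    if (latch->is_set)
        return;

    SpinLockAcquire(&latch->mutex);     /* doubles as a memory barrier */
    latch->is_set = true;
    SpinLockRelease(&latch->mutex);

    /* then signal the owner / write the self-pipe byte, as now */
}

void
ResetLatch(volatile Latch *latch)
{
    SpinLockAcquire(&latch->mutex);
    latch->is_set = false;
    SpinLockRelease(&latch->mutex);
}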

>> Tom's other scenario, where changing some other variable in shared
>> memory might not have become visible to other processes when SetLatch()
>> runs, seems more plausible (if harder to run into in practice). But if
>> the variable is meant to be examined by other processes, then you should
>> use a lock to protect it.
>
> In that case, of what use is the latch stuff? The whole point with that
> (or at least a lot of the point) is to not have to take locks.

The use case for a latch is to wake up another process to examine some
other piece of shared memory (or a file or something else), and take action
based on that other state if needed. Access to that other piece of
shared memory needs locking or some other means of concurrency control,
regardless of the mechanism used to wake up the other process.

Take the walsender latches for example. The "other piece of shared
memory" is the current WAL flush location. The latch is used to wake up
a walsender after flushing some WAL. The latch itself doesn't protect
the access to the WAL flush pointer in any way, GetFlushRecPtr() uses a
spinlock for that.
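
So the walsender main loop is shaped roughly like this (simplified and
from memory; the real calls take more arguments and also handle
shutdown, timeouts and so on):

for (;;)
{
    XLogRecPtr  flushPtr;

    ResetLatch(&MyWalSnd->latch);

    flushPtr = GetFlushRecPtr();        /* spinlock-protected read */
    if (XLByteLT(sentPtr, flushPtr))
    {
        /* read the newly flushed WAL and send it, advancing sentPtr */
    }
    else
        WaitLatch(&MyWalSnd->latch);    /* sleep until someone sets the latch */
}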

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-15 16:12:00
Message-ID: 15085.1289837520@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> In SetLatch, is it enough to add the SpinLockAcquire() call *after*
> checking that is_set is not already set? Ie. still do the quick exit
> without holding a lock. Or do we need a memory barrier operation before
> the fetch, to ensure that we see if the other process has just cleared
> the flag with ResetLatch()? Presumably ResetLatch() needs to call
> SpinLockAcquire() anyway to ensure that other processes see the clearing
> of the flag.

Hmm ... I just remembered the reason why we didn't use a spinlock in
these functions already. Namely, that it's unsafe for a signal handler
to try to acquire a spinlock that the interrupted code might be holding.
So I think a bit more thought is needed here. Maybe we need to bite the
bullet and do memory barriers ...

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-18 21:13:43
Message-ID: AANLkTik9Py9zLr9UvAeY1m=GUmtHvdD2HExVOD4qM26R@mail.gmail.com
Lists: pgsql-hackers

On Mon, Nov 15, 2010 at 11:12 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> In SetLatch, is it enough to add the SpinLockAcquire() call *after*
>> checking that is_set is not already set? Ie. still do the quick exit
>> without holding a lock. Or do we need a memory barrier operation before
>> the fetch, to ensure that we see if the other process has just cleared
>> the flag with ResetLatch()? Presumably ResetLatch() needs to call
>> SpinLockAcquire() anyway to ensure that other processes see the clearing
>> of the flag.
>
> Hmm ... I just remembered the reason why we didn't use a spinlock in
> these functions already.  Namely, that it's unsafe for a signal handler
> to try to acquire a spinlock that the interrupted code might be holding.
> So I think a bit more thought is needed here.  Maybe we need to bite the
> bullet and do memory barriers ...

The signal handler just checks a process-local, volatile variable
called "waiting" (which should be fine) and then sends a self-pipe
byte. It deliberately *doesn't* take a spinlock. So unless I'm
missing something (which is perfectly possible) protecting a few more
things with a spinlock should be safe enough.
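
That is, the handler amounts to roughly this (paraphrasing the
self-pipe code from memory):

#include <errno.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t waiting = 0;
static int  selfpipe_writefd;

static void
sendSelfPipeByte(void)
{
    char    dummy = 0;

    /* retry on EINTR; if the pipe is already full, a wakeup is pending */
    while (write(selfpipe_writefd, &dummy, 1) < 0 && errno == EINTR)
        ;
}

void
latch_sigusr1_handler(void)
{
    if (waiting)
        sendSelfPipeByte();
}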

Of course, there's still a potential *performance* problem if we end
up doing a kernel call while holding a spin lock, but I'm uncertain
whether we should worry about that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-18 22:17:10
Message-ID: 11792.1290118630@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Mon, Nov 15, 2010 at 11:12 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Hmm ... I just remembered the reason why we didn't use a spinlock in
>> these functions already. Namely, that it's unsafe for a signal handler
>> to try to acquire a spinlock that the interrupted code might be holding.

> The signal handler just checks a process-local, volatile variable
> called "waiting" (which should be fine) and then sends a self-pipe
> byte. It deliberately *doesn't* take a spinlock.

I'm not talking about latch_sigusr1_handler. I'm talking about the
several already-existing signal handlers that call SetLatch. Now maybe
it's possible to prove that none of those can fire in a process that's
mucking with the same latch in-line, but that sounds fragile as heck;
and anyway what of future usage? Given that precedent, somebody is
going to write something unsafe at some point, and it'll fail only often
enough to be seen in the field but not in our testing.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 02:17:40
Message-ID: AANLkTimysOTF2sajOf1E=YG1ETVFqaUQKkqi6fuwaKhP@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 18, 2010 at 5:17 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> On Mon, Nov 15, 2010 at 11:12 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> Hmm ... I just remembered the reason why we didn't use a spinlock in
>>> these functions already.  Namely, that it's unsafe for a signal handler
>>> to try to acquire a spinlock that the interrupted code might be holding.
>
>> The signal handler just checks a process-local, volatile variable
>> called "waiting" (which should be fine) and then sends a self-pipe
>> byte.  It deliberately *doesn't* take a spinlock.
>
> I'm not talking about latch_sigusr1_handler.  I'm talking about the
> several already-existing signal handlers that call SetLatch.  Now maybe
> it's possible to prove that none of those can fire in a process that's
> mucking with the same latch in-line, but that sounds fragile as heck;
> and anyway what of future usage?  Given that precedent, somebody is
> going to write something unsafe at some point, and it'll fail only often
> enough to be seen in the field but not in our testing.

Oh, I get it. You're right. We can't possibly assume that we're
not trying to set the latch for our own process, because that's the
whole point of having the self-pipe code in the first place.

How about changing the API so that the caller must use one function,
say, SetOwnedLatch(), to set a latch that they own, and another
function, say, SetNonOwnedLatch(), to set a latch that they do not
own? The first can simply set is_set (another process might fail to
see that the latch has been set, but the worst thing they can do is
set it over again, which should be fairly harmless) and send a
self-pipe byte. The second can take the spin lock, set is_set, read
the owner PID, release the spin lock, and send a signal. WaitLatch()
can similarly take the spin lock before reading is_set. This requires
the caller to know whether they are setting their own latch or someone
else's latch, but I don't believe that's a problem given current
usage.
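
In rough C, the split would look something like this (the "mutex" field
is a hypothetical addition to the Latch struct, and the function names
are just placeholders):

void
SetOwnedLatch(volatile Latch *latch)
{
    Assert(latch->owner_pid == MyProcPid);

    /* plain store: worst case, someone else redundantly sets it again */
    latch->is_set = true;
    sendSelfPipeByte();                 /* in case we're inside WaitLatch */
}

void
SetNonOwnedLatch(volatile Latch *latch)
{
    pid_t   owner;

    SpinLockAcquire(&latch->mutex);     /* provides the memory barrier */
    latch->is_set = true;
    owner = latch->owner_pid;
    SpinLockRelease(&latch->mutex);

    if (owner != 0)
        kill(owner, SIGUSR1);
}

/* WaitLatch() would likewise take latch->mutex before reading is_set. */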

I'm all in favor of having some memory ordering primitives so that we
can try to implement better algorithms, but if we use it here it
amounts to a fairly significant escalation in the minimum requirements
to compile PG (which is bad) rather than just a performance
optimization (which is good).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 04:38:14
Message-ID: 18241.1290141494@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I'm all in favor of having some memory ordering primitives so that we
> can try to implement better algorithms, but if we use it here it
> amounts to a fairly significant escalation in the minimum requirements
> to compile PG (which is bad) rather than just a performance
> optimization (which is good).

I don't believe there would be any escalation in compilation
requirements: we already have the ability to invoke stronger primitives
than these. What is needed is research to find out what the primitives
are called, on platforms where we aren't relying on direct asm access.

My feeling is it's time to bite the bullet and do that work. We
shouldn't cripple the latch operations because of laziness at the
outset.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 08:07:14
Message-ID: 201011190907.15109.andres@anarazel.de
Lists: pgsql-hackers

On Friday 19 November 2010 05:38:14 Tom Lane wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> > I'm all in favor of having some memory ordering primitives so that we
> > can try to implement better algorithms, but if we use it here it
> > amounts to a fairly significant escalation in the minimum requirements
> > to compile PG (which is bad) rather than just a performance
> > optimization (which is good).
>
> I don't believe there would be any escalation in compilation
> requirements: we already have the ability to invoke stronger primitives
> than these. What is needed is research to find out what the primitives
> are called, on platforms where we aren't relying on direct asm access.
>
> My feeling is it's time to bite the bullet and do that work. We
> shouldn't cripple the latch operations because of laziness at the
> outset.
I don't think developing the actual code is that hard - s_lock.c
contains nearly everything necessary.
A 'lock xchg' or similar is only marginally slower than the barrier-only
implementation. So doing a TAS() on a slock_t in private memory should be an
easy enough fallback implementation.
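
That is, something along these lines as the fallback (untested sketch):

#include "storage/s_lock.h"

static void
memory_barrier_fallback(void)
{
    slock_t     dummy;

    S_INIT_LOCK(&dummy);
    (void) TAS(&dummy);         /* lock xchg or equivalent: acts as a fence */
    S_UNLOCK(&dummy);
}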

So the complicated case seems to be !defined(HAS_TEST_AND_SET) which uses
spinlocks for that purpose - no idea where that is true these days.

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 14:14:58
Message-ID: AANLkTimHq5__AKGXRWM59rOfUgVPgewoyK1q-98ObUG6@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 18, 2010 at 11:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I'm all in favor of having some memory ordering primitives so that we
>> can try to implement better algorithms, but if we use it here it
>> amounts to a fairly significant escalation in the minimum requirements
>> to compile PG (which is bad) rather than just a performance
>> optimization (which is good).
>
> I don't believe there would be any escalation in compilation
> requirements: we already have the ability to invoke stronger primitives
> than these.  What is needed is research to find out what the primitives
> are called, on platforms where we aren't relying on direct asm access.

I don't believe that's correct, although it's possible that I may be
missing something. On any platform where we have TAS(), that should
be sufficient to set the flag, but how will we read the flag? A
simple fetch isn't guaranteed to be sufficient; for some
architectures, you might need to insert a read fence, and I don't
think we have anything like that defined right now. We've got
special-cases in s_lock.h for all kinds of crazy architectures; and
it's not obvious what would be needed. For example some operating
system I've never heard of called SINIX has this:

#include "abi_mutex.h"
typedef abilock_t slock_t;

#define TAS(lock) (!acquire_lock(lock))
#define S_UNLOCK(lock) release_lock(lock)
#define S_INIT_LOCK(lock) init_lock(lock)
#define S_LOCK_FREE(lock) (stat_lock(lock) == UNLOCKED)

It's far from obvious to me how to make this do what we need - I have
a sneaking suspicion it can't be done with those primitives at all -
and I bet neither of us has a machine on which it can be tested. Now
maybe we no longer care about supporting SINIX anyway, but the point
is that if we make this change, every platform for which we don't have
working TAS and read-fence operations becomes an unsupported platform.
Forget about --disable-spinlocks; there is no such thing. That
strikes me as an utterly unacceptable amount of collateral damage to
avoid a basically harmless API change, not to mention a ton of work.
You might be able to convince me that it's no longer important to
support platforms without a working spinlock implementation (although
I think it's rather nice that we can - might encourage someone to try
out PG and then contribute an implementation for their favorite
platform) but this is also going to break platforms that nominally
have TAS now (some of the TAS implementations aren't really TAS, as in
the above case, and we may not be able to easily determine what's
required for a read-fence even where TAS is a real TAS).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 14:16:24
Message-ID: AANLkTi=+dYw6Q_OCckT0jb612H8mfAqd+kGQ-QuVB+id@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 19, 2010 at 3:07 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> So the complicated case seems to be !defined(HAS_TEST_AND_SET) which uses
> spinlocks for that purpose - no idea where that is true these days.

Me neither, which is exactly the problem. Under Tom's proposal, any
architecture we don't explicitly provide for, breaks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 14:27:51
Message-ID: AANLkTikk=HJSGkHFEx7dom=beTki9PkXokGA3BD5d3GF@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 19, 2010 at 9:16 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Nov 19, 2010 at 3:07 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> So the complicated case seems to be !defined(HAS_TEST_AND_SET) which uses
>> spinlocks for that purpose - no idea where that is true these days.
>
> Me neither, which is exactly the problem.  Under Tom's proposal, any
> architecture we don't explicitly provide for, breaks.

Just a small point of clarification - you need to have both that
unknown architecture, and that architecture has to have postgres
processes running simultaneously on different CPUs with different
caches that are incoherent to have those problems.

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 14:29:10
Message-ID: 201011191529.11332.andres@anarazel.de
Lists: pgsql-hackers

On Friday 19 November 2010 15:16:24 Robert Haas wrote:
> On Fri, Nov 19, 2010 at 3:07 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > So the complicated case seems to be !defined(HAS_TEST_AND_SET) which uses
> > spinlocks for that purpose - no idea where that is true these days.
> Me neither, which is exactly the problem. Under Tom's proposal, any
> architecture we don't explicitly provide for, breaks.
I doubt it's that much of a problem as !defined(HAS_TEST_AND_SET) will be so
slow that there would be noise from that side more often...

Besides, we can just jump into the kernel and back in that case (which the TAS
implementation already does), that does more than just a fence...

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 14:31:06
Message-ID: AANLkTim1hLMGODNZNcAB2g3FRfNXkNTMhNKObjMUkRY6@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 19, 2010 at 9:27 AM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
> On Fri, Nov 19, 2010 at 9:16 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Fri, Nov 19, 2010 at 3:07 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>> So the complicated case seems to be !defined(HAS_TEST_AND_SET) which uses
>>> spinlocks for that purpose - no idea where that is true these days.
>>
>> Me neither, which is exactly the problem.  Under Tom's proposal, any
>> architecture we don't explicitly provide for, breaks.
>
> Just a small point of clarification - you need to have both that
> unknown archtecture, and that architecture has to have postgres
> process running simultaneously on difference CPUs with different
> caches that are incoherent to have those problems.

Sure you do. But so what? Are you going to compile PostgreSQL and
implement TAS as a simple store and read-fence as a simple load? How
likely is that to work out well?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 14:35:44
Message-ID: AANLkTimPLKarL3OjCRvV7azqTyimB0Wg7gHcau+q1w9v@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 19, 2010 at 9:31 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

>> Just a small point of clarification - you need to have both that
>> unknown archtecture, and that architecture has to have postgres
>> process running simultaneously on difference CPUs with different
>> caches that are incoherent to have those problems.
>
> Sure you do.  But so what?  Are you going to compile PostgreSQL and
> implement TAS as a simple store and read-fence as a simple load?  How
> likely is that to work out well?

If I was trying to "port" PostgreSQL to some strange architecture, and
my strange architecture didn't have all the normal TAS and memory
barriers stuff because it was only a UP system with no cache, then yes,
and it would work out well ;-)

If it was some strange SMP architecture, I wouldn't expect *anything*
to work out well if the architecture doesn't have some sort of
TAS/memory barrier/cache-coherency stuff in it ;-)

a.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 14:36:36
Message-ID: 201011191536.36458.andres@anarazel.de
Lists: pgsql-hackers

On Friday 19 November 2010 15:29:10 Andres Freund wrote:
> Besides, we can just jump into the kernel and back in that case (which the
> TAS implementation already does), that does more than just a fence...
Or if you don't believe that is enough, initialize a lock on the stack, lock
and forget it...

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 14:38:37
Message-ID: AANLkTim=Eo8WsSPht9HPD6MERU34uM2_3AaMOd5N5iDW@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 19, 2010 at 9:29 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Friday 19 November 2010 15:16:24 Robert Haas wrote:
>> On Fri, Nov 19, 2010 at 3:07 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> > So the complicated case seems to be !defined(HAS_TEST_AND_SET) which uses
>> > spinlocks for that purpose - no idea where that is true these days.
>> Me neither, which is exactly the problem.  Under Tom's proposal, any
>> architecture we don't explicitly provide for, breaks.
> I doubt its that much of a problem as !defined(HAS_TEST_AND_SET) will be so
> slow that there would be noise from that side more often...
>
> Besides, we can just jump into the kernel and back in that case (which the TAS
> implementation already does), that does more than just a fence...

Eh, really? If there's a workaround for platforms for which we don't
know what the appropriate read-fencing incantation is, then I'd feel
more comfortable about doing this. But I don't see how to make that
work. The whole problem here is that API is designed in such a way
that the signal handler might be invoked when the lock that it needs
to grab is already held by the same process. The reason memory
barriers solve the problem is because they'll be atomically released
when we jump into the signal handler, but that is not true of a
spin-lock or a semaphore.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 14:49:45
Message-ID: AANLkTimh54Ka5JQAZQa0qzUv8t4RNtJYAH9ZEoeXGLmm@mail.gmail.com
Lists: pgsql-hackers

On Fri, Nov 19, 2010 at 9:35 AM, Aidan Van Dyk <aidan(at)highrise(dot)ca> wrote:
> On Fri, Nov 19, 2010 at 9:31 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> Just a small point of clarification - you need to have both that
>>> unknown archtecture, and that architecture has to have postgres
>>> process running simultaneously on difference CPUs with different
>>> caches that are incoherent to have those problems.
>>
>> Sure you do.  But so what?  Are you going to compile PostgreSQL and
>> implement TAS as a simple store and read-fence as a simple load?  How
>> likely is that to work out well?
>
> If I was trying to "port" PostgreSQL to some strange architecture, and
> my strange architecture didtt' have all the normal TAS and memory
> bariers stuff because it was only a UP system with no cache, then yes,
> and it would work out well ;-)

I get your point, but obviously this case isn't very interesting or
likely in 2010.

> If it was some strange SMP architecture, I wouldn't expect *anything*
> to work out well if the architecture doesn't have some sort of
> TAS/memory barrier/cache-coherency stuff in it ;-)

Well, you'd be pleasantly surprised to find that you could at least
get it to compile using --disable-spinlocks. Yeah, the performance
would probably be lousy and you might run out of semaphores, but at
least for basic stuff it would run. Ripping that out just to avoid an
API change in code we committed two months ago seems a bit extreme,
especially since it's also going to require implementing a read-fence
operation on every platform we want to continue supporting. Maybe you
could default the read-fence to just a simple read for platforms that
are not known to have an issue, but all the platforms where TAS is
calling some OS-provided routine that does mysterious magic under the
covers are going to need attention; and I just don't think that
cleaning up everything that's going to break is a very worthwhile
investment of our limited development resources, even if it doesn't
result in needlessly dropping platform support.

If we're going to work on memory primitives, I would much rather see
us put that effort into, say, implementing more efficient LWLock
algorithms to solve the bottlenecks that the MOSBENCH guys found,
rather than spending it on trying to avoid a minor API complication
for the latch facility.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 14:49:56
Message-ID: 201011191549.57372.andres@anarazel.de
Lists: pgsql-hackers

On Friday 19 November 2010 15:38:37 Robert Haas wrote:
> Eh, really? If there's a workaround for platforms for which we don't
> know what the appropriate read-fencing incantation is, then I'd feel
> more comfortable about doing this. But I don't see how to make that
> work. The whole problem here is that API is designed in such a way
> that the signal handler might be invoked when the lock that it needs
> to grab is already held by the same process. The reason memory
> barriers solve the problem is because they'll be atomically released
> when we jump into the signal handler, but that is not true of a
> spin-lock or a semaphore.
Well, it's not generally true - you are right there. But there is a wide range
of syscalls available where it's inherently true (which is what I sloppily
referred to). And you are allowed to call a, although quite restricted, set of
system calls even in signal handlers. I don't have the list for older POSIX
versions in mind, but for 2003 you can choose something from several like
write, lseek, setpgid which inherently have to serialize. And I am quite sure
there were sensible calls for earlier versions.

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 14:50:06
Message-ID: 201011191550.06550.andres@anarazel.de
Lists: pgsql-hackers

On Friday 19 November 2010 15:14:58 Robert Haas wrote:
> On Thu, Nov 18, 2010 at 11:38 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> >> I'm all in favor of having some memory ordering primitives so that we
> >> can try to implement better algorithms, but if we use it here it
> >> amounts to a fairly significant escalation in the minimum requirements
> >> to compile PG (which is bad) rather than just a performance
> >> optimization (which is good).
> >
> > I don't believe there would be any escalation in compilation
> > requirements: we already have the ability to invoke stronger primitives
> > than these. What is needed is research to find out what the primitives
> > are called, on platforms where we aren't relying on direct asm access.
>
> I don't believe that's correct, although it's possible that I may be
> missing something. On any platform where we have TAS(), that should
> be sufficient to set the flag, but how will we read the flag? A
> simple fetch isn't guaranteed to be sufficient; for some
> architectures, you might need to insert a read fence, and I don't
> think we have anything like that defined right now.
A TAS is both a read and write fence. After that you don't *need* to fetch it.
And even if it were only a write fence on some platforms - if we consistently
issue write fences at the relevant places that ought to be enough.

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 14:51:17
Message-ID: 201011191551.17695.andres@anarazel.de
Lists: pgsql-hackers

On Friday 19 November 2010 15:49:45 Robert Haas wrote:
> If we're going to work on memory primitives, I would much rather see
> us put that effort into, say, implementing more efficient LWLock
> algorithms to solve the bottlenecks that the MOSBENCH guys found,
> rather than spending it on trying to avoid a minor API complication
> for the latch facility.
But for that you will need more infrastructure in that area anyway.

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 14:54:58
Message-ID: AANLkTin8A3NJD47idGqP3Dtj1_e6oyiUceBUHErbtXra@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 19, 2010 at 9:51 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On Friday 19 November 2010 15:49:45 Robert Haas wrote:
>> If we're going to work on memory primitives, I would much rather see
>> us put that effort into, say, implementing more efficient LWLock
>> algorithms to solve the bottlenecks that the MOSBENCH guys found,
>> rather than spending it on trying to avoid a minor API complication
>> for the latch facility.
> But for that you will need more infrastructure in that area anyway.

True, but you don't have to do it all at once. You can continue to do
the same old stuff on the platforms you currently support, and use the
newer stuff on platforms where the right thing to do is readily
apparent, like x86 and x86_64. And people can add support for their
favorite platforms gradually over time, rather than having a flag day
where we stop supporting everything we don't know what to do with.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Aidan Van Dyk <aidan(at)highrise(dot)ca>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 14:58:39
Message-ID: AANLkTikjwMonzTY_6qufx9bNuKDZXmJo6o2nXnqDAzZ4@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 19, 2010 at 9:49 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Well, its not generally true - you are right there. But there is a wide range
> for syscalls available where its inherently true (which is what I sloppily
> referred to). And you are allowed to call a, although quite restricted, set of
> system calls even in signal handlers. I don't have the list for older posix
> versions in mind, but for 2003 you can choose something from several like
> write, lseek,setpgid which inherently have to serialize. And I am quite sure
> there were sensible calls for earlier versions.

Well, it's not quite enough just to call into the kernel to serialize
on "some point of memory", because your point is to make sure that
*this particular piece of memory* is coherent. It doesn't matter if
the kernel has proper fencing in its stuff if the memory it's
guarding is in another cacheline, because that won't *necessarily*
force cache coherency in your local lock/variable memory.

--
Aidan Van Dyk                                             Create like a god,
aidan(at)highrise(dot)ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 15:01:45
Message-ID: 29117.1290178905@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> If we're going to work on memory primitives, I would much rather see
> us put that effort into, say, implementing more efficient LWLock
> algorithms to solve the bottlenecks that the MOSBENCH guys found,
> rather than spending it on trying to avoid a minor API complication
> for the latch facility.

I haven't read all of this very long thread yet, but I will point out
that you seem to be arguing from the position that memory ordering
primitives will only be useful for the latch code. This is nonsense
of the first order. We already know that the sinval signalling
mechanism could use it to avoid needing a spinlock. I submit that
it's very likely that fixing communication bottlenecks elsewhere
will similarly require memory ordering primitives if we are to avoid
the stupid "use a lock" approach. I think it's time to build that
infrastructure.

BTW, I agree with Andres' point that we can probably default memory
barriers to be no-ops on unknown platforms. Weak memory ordering
isn't a common architectural choice. A look through s_lock.h suggests
that PPC and MIPS are the only supported arches that need to worry
about this.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 15:05:03
Message-ID: AANLkTinN6t=VFtgwLQYpZSUvktiPP_v1Xq1sPtoQ_KgJ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 19, 2010 at 10:01 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> If we're going to work on memory primitives, I would much rather see
>> us put that effort into, say, implementing more efficient LWLock
>> algorithms to solve the bottlenecks that the MOSBENCH guys found,
>> rather than spending it on trying to avoid a minor API complication
>> for the latch facility.
>
> I haven't read all of this very long thread yet, but I will point out
> that you seem to be arguing from the position that memory ordering
> primitives will only be useful for the latch code.  This is nonsense
> of the first order.  We already know that the sinval signalling
> mechanism could use it to avoid needing a spinlock.  I submit that
> it's very likely that fixing communication bottlenecks elsewhere
> will similarly require memory ordering primitives if we are to avoid
> the stupid "use a lock" approach.  I think it's time to build that
> infrastructure.

I completely agree, but I'm not too sure I want to drop support for
any platform for which we haven't yet implemented such primitives.
What's different about this case is that "fall back to taking the spin
lock" is not a workable option.

> BTW, I agree with Andres' point that we can probably default memory
> barriers to be no-ops on unknown platforms.  Weak memory ordering
> isn't a common architectural choice.  A look through s_lock.h suggests
> that PPC and MIPS are the only supported arches that need to worry
> about this.

That's good to hear. I'm more worried, however, about architectures
where we supposedly have TAS but it isn't really TAS but some
OS-provided "acquire a lock" primitive. That won't generalize nicely
to what we need for this case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 15:06:33
Message-ID: 201011191606.34417.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Friday 19 November 2010 15:58:39 Aidan Van Dyk wrote:
> On Fri, Nov 19, 2010 at 9:49 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Well, its not generally true - you are right there. But there is a wide
> > range for syscalls available where its inherently true (which is what I
> > sloppily referred to). And you are allowed to call a, although quite
> > restricted, set of system calls even in signal handlers. I don't have
> > the list for older posix versions in mind, but for 2003 you can choose
> > something from several like write, lseek,setpgid which inherently have
> > to serialize. And I am quite sure there were sensible calls for earlier
> > versions.
>
> Well, it's not quite enough just to call into the kernel to serialize
> on "some point of memory", because your point is to make sure that
> *this particular piece of memory* is coherent. It doesn't matter if
> the kernel has proper fencing in it's stuff if the memory it's
> guarding is in another cacheline, because that won't *necessarily*
> force cache coherency in your local lock/variable memory.
Yes and no. It provides the same guarantees as our current approach of using
spinlocks for exactly that - that it theoretically is not enough is an
independent issue (but *definitely* an issue).

Andres


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Aidan Van Dyk <aidan(at)highrise(dot)ca>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 15:12:26
Message-ID: 4CE693DA.3020802@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/19/2010 03:58 PM, Aidan Van Dyk wrote:
> Well, it's not quite enough just to call into the kernel to serialize
> on "some point of memory", because your point is to make sure that
> *this particular piece of memory* is coherent.

Well, that certainly doesn't apply to full fences, that are not specific
to a particular piece of memory. I'm thinking of 'mfence' on x86_64 or
'mf' on ia64.

Regards

Markus Wanner


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 15:36:06
Message-ID: 29758.1290180966@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> ... The reason memory
> barriers solve the problem is because they'll be atomically released
> when we jump into the signal handler, but that is not true of a
> spin-lock or a semaphore.

Hm, I wonder whether your concern is stemming from a wrong mental
model. There is nothing to "release". In my view, a memory barrier
primitive is a sequence point, having the properties that all writes
issued before the barrier shall become visible to other processors
before any writes after it, and also that no reads issued after the
barrier shall be executed until those writes have become visible.
(PPC can separate those two aspects, but I think we probably don't
need to get that detailed for our purposes.) On most processors,
the barrier primitive will just be ((void) 0) because they don't
deal in out-of-order writes anyway.
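
For illustration only -- the name pg_memory_barrier() is made up here, not an
existing macro -- such a primitive might default to (nearly) nothing and emit a
real fence instruction only on the architectures that need one:

/* Sketch of a fence primitive; names and platform tests are illustrative. */
#if defined(__powerpc__) || defined(__powerpc64__)
/* PPC: "sync" is a full hardware memory barrier */
#define pg_memory_barrier()   __asm__ __volatile__ ("sync" : : : "memory")
#elif defined(__mips__)
/* MIPS: SYNC orders all earlier loads/stores before all later ones */
#define pg_memory_barrier()   __asm__ __volatile__ ("sync" : : : "memory")
#else
/*
 * Strongly-ordered (or unknown) platforms: at minimum keep the compiler
 * from reordering loads/stores across this point.  (A true full fence
 * would still need something like a locked instruction for store-load
 * ordering, but for the write-visibility property described above this
 * is essentially the no-op case.)
 */
#define pg_memory_barrier()   __asm__ __volatile__ ("" : : : "memory")
#endif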

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 15:44:30
Message-ID: 29927.1290181470@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I completely agree, but I'm not too sure I want to drop support for
> any platform for which we haven't yet implemented such primitives.
> What's different about this case is that "fall back to taking the spin
> lock" is not a workable option.

The point I was trying to make is that the fallback position can
reasonably be a no-op.

> That's good to hear. I'm more worried, however, about architectures
> where we supposedly have TAS but it isn't really TAS but some
> OS-provided "acquire a lock" primitive. That won't generalize nicely
> to what we need for this case.

I did say we need some research ;-). We need to look into what's the
appropriate primitive for any such OSes that are available for PPC or
MIPS. I don't feel a need to be paranoid about it for other
architectures.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 15:51:00
Message-ID: 155.1290181860@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Markus Wanner <markus(at)bluegap(dot)ch> writes:
> Well, that certainly doesn't apply to full fences, that are not specific
> to a particular piece of memory. I'm thinking of 'mfence' on x86_64 or
> 'mf' on ia64.

Hm, what do those do exactly? We've never had any such thing in the
Intel-ish spinlock asm, but if out-of-order writes are possible I should
think we'd need 'em. Or does "lock xchgb" imply an mfence?

regards, tom lane


From: Markus Wanner <markus(at)bluegap(dot)ch>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 16:14:55
Message-ID: 4CE6A27F.60003@bluegap.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 11/19/2010 04:51 PM, Tom Lane wrote:
> Hm, what do those do exactly?

"Performs a serializing operation on all load-from-memory and
store-to-memory instructions that were issued prior the MFENCE
instruction." [1]

Given the memory ordering guarantees of x86, this instruction might only
be relevant for SMP systems, though.

> Or does "lock xchgb" imply an mfence?

Probably on older architectures (given the name "bus locked exchange"),
but OTOH I wouldn't bet on that still being true. Locking the entire bus
sounds like a prohibitively expensive operation with today's amounts of
cores per system.

Regards

Markus Wanner

[1]: random google hit on 'mfence':
http://siyobik.info/index.php?module=x86&id=170


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 16:18:58
Message-ID: 201011191718.59085.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Friday 19 November 2010 16:51:00 Tom Lane wrote:
> Markus Wanner <markus(at)bluegap(dot)ch> writes:
> > Well, that certainly doesn't apply to full fences, that are not specific
> > to a particular piece of memory. I'm thinking of 'mfence' on x86_64 or
> > 'mf' on ia64.
> Hm, what do those do exactly? We've never had any such thing in the
> Intel-ish spinlock asm, but if out-of-order writes are possible I should
> think we'd need 'em. Or does "lock xchgb" imply an mfence?
Out of order writes are definitely possible if you consider multiple
processors.
Locked statements like 'lock xaddl;' guarantee that the specific operands (or
their cachelines) are visible on all processors and are done atomically - but
it's not influencing the whole cache like mfence would.

Andres


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 16:25:57
Message-ID: 1162.1290183957@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> Locked statments like 'lock xaddl;' guarantee that the specific operands (or
> their cachelines) are visible on all processors and are done atomically - but
> its not influencing the whole cache like mfence would.

Where is this "locking the whole cache" meme coming from? What we're
looking for has nothing to do with locking anything. It's primarily
a directive to the processor to flush any dirty cache lines out to
main memory. It's not going to block any other processors.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Markus Wanner <markus(at)bluegap(dot)ch>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 16:31:42
Message-ID: 201011191731.42929.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Friday 19 November 2010 17:25:57 Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > Locked statments like 'lock xaddl;' guarantee that the specific operands
> > (or their cachelines) are visible on all processors and are done
> > atomically - but its not influencing the whole cache like mfence would.
> Where is this "locking the whole cache" meme coming from? What we're
> looking for has nothing to do with locking anything. It's primarily
> a directive to the processor to flush any dirty cache lines out to
> main memory. It's not going to block any other processors.
I was never talking about 'locking the whole cache' - I was talking about
flushing/fencing it like a "global" read/write barrier would. And "lock
xchgb/xaddl" does not imply anything for other cachelines but its own.

I only used 'locked' in the context of 'lock xaddl'.

Am I misunderstanding you?

Andres


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Markus Wanner <markus(at)bluegap(dot)ch>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 16:57:09
Message-ID: 1744.1290185829@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> I was never talking about 'locking the whole cache' - I was talking about
> flushing/fencing it like a "global" read/write barrier would. And "lock
> xchgb/xaddl" does not imply anything for other cachelines but its own.

If that's the case, why aren't the parallel regression tests falling
over constantly? My recollection is that when I broke the sinval code
by assuming strong memory ordering without spinlocks, it didn't take
long at all for the PPC buildfarm members to expose the problem.
If it's possible for Intel-ish processors to exhibit weak memory
ordering behavior, I'm quite sure that our current code would be showing
bugs everywhere.

The impression I had of current Intel designs is that they ensure global
cache coherency, ie if one processor has a dirty cache line the others
know that, and will go get the updated data before attempting to access
that piece of memory.

regards, tom lane


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 17:06:55
Message-ID: AANLkTin+Q2zbJ7pK=6fQbKooWF7W_LCWkenDgNi+ROJJ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 19, 2010 at 10:44 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I completely agree, but I'm not too sure I want to drop support for
>> any platform for which we haven't yet implemented such primitives.
>> What's different about this case is that "fall back to taking the spin
>> lock" is not a workable option.
>
> The point I was trying to make is that the fallback position can
> reasonably be a no-op.

Hmm, maybe you're right. I was assuming weak memory ordering was a
reasonably common phenomenon, but if it only applies to a very small
number of architectures and we're pretty confident we know which ones
they are, your approach would be far less frightening than I
originally thought. But is that really true?

I think it would be useful to try to build up a library of primitives
in this area. For this particular task, we really only need a
write-with-fence primitive and a read-with-fence primitive. On strong
memory ordering machines, these can just do a store and a read,
respectively; on weak memory ordering machines, they can insert
whatever fencing operations are needed on either the store side or the
load side. I think it would also be useful to provide macros for
compare-and-swap and fetch-and-add on platforms where they are
available. Then we could potentially write code like this:

#ifdef HAVE_COMPARE_AND_SWAP
...do it the lock-free way...
#else
...oh, well, do it with spinlocks...
#endif
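
For what it's worth, on compilers that have them the GCC __sync builtins would
be one way to back such macros. A sketch only -- the pg_* names are invented
here, not existing code:

#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1))
#define HAVE_COMPARE_AND_SWAP 1
/* if *ptr == oldval, atomically store newval into *ptr; true on success */
#define pg_compare_and_swap(ptr, oldval, newval) \
    __sync_bool_compare_and_swap((ptr), (oldval), (newval))
/* atomically add 'val' to *ptr, returning the value *ptr held beforehand */
#define pg_fetch_and_add(ptr, val) \
    __sync_fetch_and_add((ptr), (val))
#endif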

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 17:13:48
Message-ID: 2132.1290186828@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> I think it would be useful to try to build up a library of primitives
> in this area. For this particular task, we really only need a
> write-with-fence primitive and a read-with-fence primitive.

That's really entirely the wrong way to think about it. You need a
fence primitive, full stop. It's a sequence point, not an operation
in itself. It guarantees that reads/writes occurring before or after
it aren't resequenced around it. I don't even understand what "write
with fence" means --- is the write supposed to be fenced against other
writes before it, or other writes after it?

> I think it would also be useful to provide macros for
> compare-and-swap and fetch-and-add on platforms where they are
> available.

That would be a great deal more work, because it's not a no-op anywhere;
and our need for it is still rather hypothetical. I'm surprised to see
you advocating that when you didn't want to touch fencing a moment ago.

regards, tom lane


From: Florian Weimer <fweimer(at)bfk(dot)de>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Markus Wanner <markus(at)bluegap(dot)ch>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 17:14:17
Message-ID: 8239qx5j86.fsf@mid.bfk.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Andres Freund:

> I was never talking about 'locking the whole cache' - I was talking about
> flushing/fencing it like a "global" read/write barrier would. And "lock
> xchgb/xaddl" does not imply anything for other cachelines but its own.

My understanding is that once you've seen the result of an atomic
operation on i386 and amd64, you are guaranteed to observe all prior
writes performed by the thread which did the atomic operation, too.
Explicit fencing is only necessary if you need synchronization without
atomic operations.

--
Florian Weimer <fweimer(at)bfk(dot)de>
BFK edv-consulting GmbH http://www.bfk.de/
Kriegsstraße 100 tel: +49-721-96201-1
D-76133 Karlsruhe fax: +49-721-96201-99


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Markus Wanner <markus(at)bluegap(dot)ch>
Cc: Aidan Van Dyk <aidan(at)highrise(dot)ca>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 17:46:00
Message-ID: 2709.1290188760@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I wrote:
> Markus Wanner <markus(at)bluegap(dot)ch> writes:
>> Well, that certainly doesn't apply to full fences, that are not specific
>> to a particular piece of memory. I'm thinking of 'mfence' on x86_64 or
>> 'mf' on ia64.

> Hm, what do those do exactly?

I poked around in the Intel manuals a bit. They do have mfence (also
lfence and sfence) but so far as I can tell, those are only used to
manage loads and stores that are issued by special instructions that
explicitly mark the operation as weakly ordered. So the reason we're
not seeing bugs is presumably that C compilers don't generate such
instructions. Also, Intel architectures do guarantee cache consistency
across multiple processors (and it costs them a lot...)

I found a fairly interesting and detailed paper about memory fencing
in the Linux kernel:
http://www.rdrop.com/users/paulmck/scalability/paper/ordering.2007.09.19a.pdf

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Andres Freund" <andres(at)anarazel(dot)de>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>, "Bruce Momjian" <bruce(at)momjian(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 18:29:07
Message-ID: 4CE66D930200002500037C23@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> I think it would be useful to try to build up a library of
>> primitives in this area. For this particular task, we really
>> only need a write-with-fence primitive and a read-with-fence
>> primitive.
>
> That's really entirely the wrong way to think about it. You need
> a fence primitive, full stop. It's a sequence point, not an
> operation in itself. It guarantees that reads/writes occurring
> before or after it aren't resequenced around it. I don't even
> understand what "write with fence" means --- is the write supposed
> to be fenced against other writes before it, or other writes after
> it?

I was taking it to mean something similar to the memory guarantees
around synchronized blocks in Java. At the start of a synchronized
block you discard any cached data which you've previously read from
or written to main memory, and must read everything fresh from that
point. At the end of a synchronized block you must write any
locally written values to main memory, although you retain them in
your thread-local cache for possible re-use. Reads or writes from
outside the synchronized block can be "pulled into" the block and
reordered in among the reads and writes within the block (which may
also be reordered) unless there's another block to contain them.

It works fine once you have your head around it, and allows for
significant optimization in a heavily multi-threaded application. I
have no idea whether such a model would be useful for PostgreSQL. If
I understand Tom he is proposing what sounds roughly like what could
be achieved in the Java memory model by keeping all code for a
process within a single synchronized block, with the fence being a
point where you end it (flushing all local writes to main memory)
and start a new one (forcing a discard of locally cached data).

Of course I'm ignoring the locking aspect of synchronized blocks and
just discussing the memory access aspect of them. (A synchronized
block in Java always references some [any] Object, and causes an
exclusive lock to be held on the object from one end of the block to
the other.)

-Kevin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
Cc: "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Josh Berkus" <josh(at)agliodbs(dot)com>, "Andres Freund" <andres(at)anarazel(dot)de>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>, "Bruce Momjian" <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 18:51:36
Message-ID: 3930.1290192696@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

"Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> That's really entirely the wrong way to think about it. You need
>> a fence primitive, full stop. It's a sequence point, not an
>> operation in itself.

> I was taking it to mean something similar to the memory guarantees
> around synchronized blocks in Java. At the start of a synchronized
> block you discard any cached data which you've previously read from
> or written to main memory, and must read everything fresh from that
> point. At the end of a synchronized block you must write any
> locally written values to main memory, although you retain them in
> your thread-local cache for possible re-use.

That is basically the model that we have implemented in the spinlock
primitives: taking a spinlock corresponds to starting a "synchronized
block" and releasing the spinlock ends it. On processors that need
it, the spinlock macros include memory fence instructions that implement
the above semantics.

However, for lock-free interactions I think this model isn't terribly
helpful: it's not clear what is "inside" and what is "outside" the sync
block, and forcing your code into that model doesn't improve either
clarity or performance. What you typically need is a guarantee about
the order in which writes become visible. To give a concrete example,
the sinval bug I was mentioning earlier boiled down to assuming that a
write into an element of the sinval message array would become visible
to other processors before the change of the last-message pointer
variable became visible to them. Without a fence instruction, that
doesn't hold on WMO processors, and so they were able to fetch a stale
message value. In some cases you also need to guarantee the order of
reads.
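
A stripped-down sketch of that ordering requirement (the variable names and the
pg_memory_barrier() call are illustrative stand-ins, not the actual sinval
code):

/* writer: fill the array slot before advancing the counter */
messages[maxMsgNum % MAXNUMMESSAGES] = newMsg;
pg_memory_barrier();    /* slot contents must become visible first */
maxMsgNum++;

/* reader: fetch the counter before looking at any slot */
localMax = maxMsgNum;
pg_memory_barrier();    /* don't let slot reads be performed ahead of the fetch */
while (nextMsgNum < localMax)
    ProcessMessage(&messages[nextMsgNum++ % MAXNUMMESSAGES]);

Without the first fence, a WMO processor can make the maxMsgNum change visible
before the slot contents, which is exactly the stale-message failure described
above.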

regards, tom lane


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Josh Berkus" <josh(at)agliodbs(dot)com>, "Andres Freund" <andres(at)anarazel(dot)de>, "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>, "Robert Haas" <robertmhaas(at)gmail(dot)com>, "Aidan Van Dyk" <aidan(at)highrise(dot)ca>, "Bruce Momjian" <bruce(at)momjian(dot)us>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 19:00:06
Message-ID: 4CE674D60200002500037C2E@gw.wicourts.gov
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> What you typically need is a guarantee about the order in which
> writes become visible.

> In some cases you also need to guarantee the order of reads.

Doesn't that suggest different primitives?

-Kevin


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 19:03:27
Message-ID: 201011192003.27712.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Friday 19 November 2010 18:46:00 Tom Lane wrote:
> I wrote:
> > Markus Wanner <markus(at)bluegap(dot)ch> writes:
> >> Well, that certainly doesn't apply to full fences, that are not specific
> >> to a particular piece of memory. I'm thinking of 'mfence' on x86_64 or
> >> 'mf' on ia64.
> >
> > Hm, what do those do exactly?
>
> I poked around in the Intel manuals a bit. They do have mfence (also
> lfence and sfence) but so far as I can tell, those are only used to
> manage loads and stores that are issued by special instructions that
> explicitly mark the operation as weakly ordered. So the reason we're
> not seeing bugs is presumably that C compilers don't generate such
> instructions.
Well. Some memcpy() implementations use string (or SIMD) operations which are
weakly ordered though.

> Also, Intel architectures do guarantee cache consistency
> across multiple processors (and it costs them a lot...)
Only if you are talking about the *same* locations though. See example 8.2.3.4

Combined with:
"
For the Intel486 and Pentium processors, the LOCK# signal is always asserted
on the bus during a LOCK operation, even if the area of memory being locked is
cached in the processor. For the P6 and more recent processor families, if
the area of memory being locked during a LOCK operation is cached in the
processor that is performing the LOCK operation as write-back memory and is
completely contained in a cache line, the processor may not assert the LOCK#
signal on the bus. Instead, it will modify the memory location internally and
allow it’s cache coherency mechanism to ensure that the operation is carried
out atomically. This operation is called “cache locking.” The cache coherency
mechanism automatically prevents two or more processors that have cached the
same area of memory from simultaneously modifying data in that area.
"

Which means something like (in intel's terminology) can happen:

initially x = 0

P1: mov [_X], 1
P1: lock xchg Y, 1

P2. lock xchg [_Z], 1
P2: mov r1, [_X]

A valid result is that r1 on P2 is 0.

I think that is not biting pg because it always uses the same spinlocks at the
reading and writing side - but I am not that sure about that.

Andres


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Markus Wanner <markus(at)bluegap(dot)ch>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 19:33:45
Message-ID: 201011192033.45551.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Friday 19 November 2010 20:03:27 Andres Freund wrote:
> Which means something like (in intel's terminology) can happen:
>
> initially x = 0
>
> P1: mov [_X], 1
> P1: lock xchg Y, 1
>
> P2. lock xchg [_Z], 1
> P2: mov r1, [_X]
>
> A valid result is that r1 on P2 is 0.
>
> I think that is not biting pg because it always uses the same spinlocks at
> the reading and writing side - but I am not that sure about that.
Which also seems to mean that a simple read memory barrier implemented as __asm__
__volatile__("lock; xaddl $0, ???") is not enough unless you use the
same address for all those barriers, which would cause horrible cacheline
bouncing.

Am I missing something?

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 20:44:10
Message-ID: AANLkTikw5n3ymjofBeo-f-77kgKKBprFxC+WwyL6Zn+m@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 19, 2010 at 1:51 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> However, for lock-free interactions I think this model isn't terribly
> helpful: it's not clear what is "inside" and what is "outside" the sync
> block, and forcing your code into that model doesn't improve either
> clarity or performance.  What you typically need is a guarantee about
> the order in which writes become visible.  To give a concrete example,
> the sinval bug I was mentioning earlier boiled down to assuming that a
> write into an element of the sinval message array would become visible
> to other processors before the change of the last-message pointer
> variable became visible to them.  Without a fence instruction, that
> doesn't hold on WMO processors, and so they were able to fetch a stale
> message value.  In some cases you also need to guarantee the order of
> reads.

But what about timings vs. random other stuff? Like in this case
there's a problem if the signal arrives before the memory update to
latch->is_set becomes visible. I don't know what we need to do to
guarantee that.

This page seems to indicate that x86 is OK as far as this is concerned
- we can simply store a 1 and everyone will see it:

http://coding.derkeiler.com/Archive/Assembler/comp.lang.asm.x86/2004-08/0979.html

...but if we were to, say, increment a counter at that location, it
would not be safe without a LOCK prefix (further messages in the
thread indicate that you might also have a problem if the address in
question is unaligned).

It's not obvious to me, however, what might be required on other processors.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 22:59:01
Message-ID: 17975.1290207541@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> But what about timings vs. random other stuff? Like in this case
> there's a problem if the signal arrives before the memory update to
> latch->is_set becomes visible. I don't know what we need to do to
> guarantee that.

I don't believe there's an issue there. A context swap into the kernel
is certainly going to include msync. If you're afraid otherwise, you
could put an msync before the kill() call, but I think it's a waste of
effort.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Markus Wanner <markus(at)bluegap(dot)ch>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 23:08:07
Message-ID: 18143.1290208087@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund <andres(at)anarazel(dot)de> writes:
> On Friday 19 November 2010 18:46:00 Tom Lane wrote:
>> I poked around in the Intel manuals a bit. They do have mfence (also
>> lfence and sfence) but so far as I can tell, those are only used to
>> manage loads and stores that are issued by special instructions that
>> explicitly mark the operation as weakly ordered. So the reason we're
>> not seeing bugs is presumably that C compilers don't generate such
>> instructions.

> Well. Some memcpy() implementations use string (or SIMD) operations which are
> weakly ordered though.

I'd expect memcpy to msync at completion of the move if it does that
kind of thing. Otherwise it's failing to ensure that the move is really
done before it returns.

> "
> For the Intel486 and Pentium processors, the LOCK# signal is always asserted
> on the bus during a LOCK operation, even if the area of memory being locked is
> cached in the processor. For the P6 and more recent processor families, if
> the area of memory being locked during a LOCK operation is cached in the
> processor that is performing the LOCK operation as write-back memory and is
> completely contained in a cache line, the processor may not assert the LOCK#
> signal on the bus. Instead, it will modify the memory location internally and
> allow its cache coherency mechanism to ensure that the operation is carried
> out atomically. This operation is called cache locking. The cache coherency
> mechanism automatically prevents two or more processors that have cached the
> same area of memory from simultaneously modifying data in that area.
> "

Like it says, the cache coherency mechanism prevents this from being a
problem for us. Once the change is made in a processor's cache, it's
the cache's job to ensure that all processors see it --- and on Intel
architectures, the cache does take care of that.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Markus Wanner <markus(at)bluegap(dot)ch>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Josh Berkus <josh(at)agliodbs(dot)com>
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-19 23:57:38
Message-ID: 201011200057.39258.andres@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Saturday 20 November 2010 00:08:07 Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On Friday 19 November 2010 18:46:00 Tom Lane wrote:
> >> I poked around in the Intel manuals a bit. They do have mfence (also
> >> lfence and sfence) but so far as I can tell, those are only used to
> >> manage loads and stores that are issued by special instructions that
> >> explicitly mark the operation as weakly ordered. So the reason we're
> >> not seeing bugs is presumably that C compilers don't generate such
> >> instructions.
> >
> > Well. Some memcpy() implementations use string (or SIMD) operations which
> > are weakly ordered though.

> Like it says, the cache coherency mechanism prevents this from being a
> problem for us. Once the change is made in a processor's cache, it's
> the cache's job to ensure that all processors see it --- and on Intel
> architectures, the cache does take care of that.
Check example 8.2.3.4 of Intel manual volume 3A - in my opinion that makes my example correct.

Andres


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-20 19:05:33
Message-ID: AANLkTim8v6EMbA7LtS7AxYTiMPNxhnxuXz3=XYSkD6MW@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 19, 2010 at 5:59 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> But what about timings vs. random other stuff?  Like in this case
>> there's a problem if the signal arrives before the memory update to
>> latch->is_set becomes visible.  I don't know what we need to do to
>> guarantee that.
>
> I don't believe there's an issue there.  A context swap into the kernel
> is certainly going to include msync.  If you're afraid otherwise, you
> could put an msync before the kill() call, but I think it's a waste of
> effort.

So what DO we need to guard against here?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-20 21:07:24
Message-ID: 4162.1290287244@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> So what DO we need to guard against here?

I think the general problem can be stated as "process A changes two or
more values in shared memory in a fairly short span of time, and process
B, which is concurrently examining the same variables, sees those
changes occur in a different order than A thought it made them in".

In practice we do not need to worry about changes made with a kernel
call in between, as any sort of context swap will cause the kernel to
force cache synchronization.

Also, the intention is that the locking primitives will take care of
this for any shared structures that are protected by a lock. (There
were some comments upthread suggesting maybe our lock code is not
bulletproof; but if so that's something to fix in the lock code, not
a logic error in code using the locks.)

So what this boils down to is being an issue for shared data structures
that we access without using locks. As, for example, the latch
structures.

The other case that I can think of offhand is the signal multiplexing
flags. I think we're all right there so far as the flags themselves are
concerned because only one atomic update is involved on each side:
there's no possibility of inconsistency due to cache visibility skew.
But we'd be at some risk if we were using any such flag as a cue to go
look at some other shared-memory state.
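
To make the "flag as a cue" risk concrete, the dangerous pattern would look
roughly like this (names are purely hypothetical):

/* process A: publish some state, then raise the flag and signal */
shared->reason = REASON_FOO;        /* extra shared-memory state */
shared->flag = true;                /* the multiplexed-signal flag */
kill(pid, SIGUSR1);

/* process B, noticing the flag */
if (shared->flag)
{
    shared->flag = false;
    /*
     * On a WMO machine the store to shared->reason might not be visible
     * here yet, unless fences order the two stores in A and the flag
     * test against the read below.
     */
    HandleReason(shared->reason);
}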

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-21 00:49:30
Message-ID: AANLkTi=-p3hmqz2-zaPFhcQk_-htzy9p3ad5d+qS9DzM@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Nov 20, 2010 at 9:07 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> In practice we do not need to worry about changes made with a kernel
> call in between, as any sort of context swap will cause the kernel to
> force cache synchronization.
>

Note that not all kernel calls are equal these days. Some
(gettimeofday on Linux) are implemented as very lightweight calls that
don't do memory map changes and might not trigger the kinds of
synchronization you're talking about. Any IPC kernel calls like kill
will though I'm sure.

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-21 13:18:10
Message-ID: AANLkTiki6D5Z5CB7jaDm9uSC3iSTQrbGr1mCb0Z=duMq@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Nov 20, 2010 at 4:07 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Robert Haas <robertmhaas(at)gmail(dot)com> writes:
>> So what DO we need to guard against here?
>
> I think the general problem can be stated as "process A changes two or
> more values in shared memory in a fairly short span of time, and process
> B, which is concurrently examining the same variables, sees those
> changes occur in a different order than A thought it made them in".
>
> In practice we do not need to worry about changes made with a kernel
> call in between, as any sort of context swap will cause the kernel to
> force cache synchronization.
>
> Also, the intention is that the locking primitives will take care of
> this for any shared structures that are protected by a lock.  (There
> were some comments upthread suggesting maybe our lock code is not
> bulletproof; but if so that's something to fix in the lock code, not
> a logic error in code using the locks.)
>
> So what this boils down to is being an issue for shared data structures
> that we access without using locks.  As, for example, the latch
> structures.

So is the problem case a race involving owning/disowning a latch vs.
setting that same latch?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-22 11:54:55
Message-ID: 4CEA5A0F.1030602@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 21.11.2010 15:18, Robert Haas wrote:
> On Sat, Nov 20, 2010 at 4:07 PM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> Robert Haas<robertmhaas(at)gmail(dot)com> writes:
>>> So what DO we need to guard against here?
>>
>> I think the general problem can be stated as "process A changes two or
>> more values in shared memory in a fairly short span of time, and process
>> B, which is concurrently examining the same variables, sees those
>> changes occur in a different order than A thought it made them in".
>>
>> In practice we do not need to worry about changes made with a kernel
>> call in between, as any sort of context swap will cause the kernel to
>> force cache synchronization.
>>
>> Also, the intention is that the locking primitives will take care of
>> this for any shared structures that are protected by a lock. (There
>> were some comments upthread suggesting maybe our lock code is not
>> bulletproof; but if so that's something to fix in the lock code, not
>> a logic error in code using the locks.)
>>
>> So what this boils down to is being an issue for shared data structures
>> that we access without using locks. As, for example, the latch
>> structures.
>
> So is the problem case a race involving owning/disowning a latch vs.
> setting that same latch?

No. (or maybe that as well, but that's not what we've been concerned
about here). As far as I've understood correctly, the problem is that
process A does something like this:

/* set a shared variable */
((volatile bool *) shmem)->variable = true;
/* Wake up process B to notice that we changed the variable */
SetLatch();

And process B does this:

for (;;)
{
    ResetLatch();
    if (((volatile bool *) shmem)->variable)
        DoStuff();

    WaitLatch();
}

This is the documented usage pattern of latches. The problem arises if
process A runs just before ResetLatch, but the effect of setting the
shared variable doesn't become visible until after the if-test in
process B. Process B will clear the is_set flag in ResetLatch(), but it
will not DoStuff(), so it in effect misses the wakeup from process A and
goes back to sleep even though it would have work to do.

This situation doesn't arise in the current use of latches, because the
shared state comparable to shmem->variable in the above example is
protected by a spinlock. But it might become an issue in some future use
case.
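
One conceivable remedy, sketched with a hypothetical pg_memory_barrier() and
assuming it acts as a full fence where needed (this is not what the latch code
does today), is to fence on both sides:

/* process A */
((volatile bool *) shmem)->variable = true;
pg_memory_barrier();    /* flag must be visible before the latch is set */
SetLatch();

/* process B */
for (;;)
{
    ResetLatch();
    pg_memory_barrier();    /* is_set must be cleared before re-testing */
    if (((volatile bool *) shmem)->variable)
        DoStuff();

    WaitLatch();
}

With those two fences process B either sees the variable on its re-test, or
else A's SetLatch() lands after B's ResetLatch() and the latch is still set
when B reaches WaitLatch().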

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kevin Grittner <Kevin(dot)Grittner(at)wicourts(dot)gov>, Josh Berkus <josh(at)agliodbs(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Aidan Van Dyk <aidan(at)highrise(dot)ca>, Bruce Momjian <bruce(at)momjian(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Latches with weak memory ordering (Re: max_wal_senders must die)
Date: 2010-11-24 03:13:08
Message-ID: AANLkTimC+tpwAqePGQQcctmbQ_SBeTBLMHLnh_3yTTNB@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 22, 2010 at 6:54 AM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> On 21.11.2010 15:18, Robert Haas wrote:
>>
>> On Sat, Nov 20, 2010 at 4:07 PM, Tom Lane<tgl(at)sss(dot)pgh(dot)pa(dot)us>  wrote:
>>>
>>> Robert Haas<robertmhaas(at)gmail(dot)com>  writes:
>>>>
>>>> So what DO we need to guard against here?
>>>
>>> I think the general problem can be stated as "process A changes two or
>>> more values in shared memory in a fairly short span of time, and process
>>> B, which is concurrently examining the same variables, sees those
>>> changes occur in a different order than A thought it made them in".
>>>
>>> In practice we do not need to worry about changes made with a kernel
>>> call in between, as any sort of context swap will cause the kernel to
>>> force cache synchronization.
>>>
>>> Also, the intention is that the locking primitives will take care of
>>> this for any shared structures that are protected by a lock.  (There
>>> were some comments upthread suggesting maybe our lock code is not
>>> bulletproof; but if so that's something to fix in the lock code, not
>>> a logic error in code using the locks.)
>>>
>>> So what this boils down to is being an issue for shared data structures
>>> that we access without using locks.  As, for example, the latch
>>> structures.
>>
>> So is the problem case a race involving owning/disowning a latch vs.
>> setting that same latch?
>
> No. (or maybe that as well, but that's not what we've been concerned about
> here). As far as I've understood correctly, the problem is that process A
> does something like this:
>
> /* set a shared variable */
> ((volatile bool *) shmem)->variable = true;
> /* Wake up process B to notice that we changed the variable */
> SetLatch();
>
> And process B does this:
>
> for (;;)
> {
>  ResetLatch();
>  if (((volatile bool *) shmem)->variable)
>    DoStuff();
>
>  WaitLatch();
> }
>
> This is the documented usage pattern of latches. The problem arises if
> process A runs just before ResetLatch, but the effect of setting the shared
> variable doesn't become visible until after the if-test in process B.
> Process B will clear the is_set flag in ResetLatch(), but it will not
> DoStuff(), so it in effect misses the wakeup from process A and goes back to
> sleep even though it would have work to do.
>
> This situation doesn't arise in the current use of latches, because the
> shared state comparable to shmem->variable in the above example is protected
> by a spinlock. But it might become an issue in some future use case.

Eh, so, should we do anything about this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company