Re: Relation cache invalidation on replica

Lists: pgsql-hackers
From: Васильев Дмитрий <d(dot)vasilyev(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Relation cache invalidation on replica
Date: 2016-02-26 12:41:22
Message-ID: CAB-SwXY6oH=9twBkXJtgR4UC1NqT-vpYAtxCseME62ADwyK5OA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Session opened on replica doesn't see concurrently created indexes at this
time on master.

We have master and replica:

1. master: pgbench -i -s 10

2. replica:
explain (analyze,verbose) select * from pgbench_accounts where abalance = 1;

3. master:
ALTER INDEX pgbench_accounts_abalance_idx RENAME TO
pgbench_accounts_abalance_idx_delme;
CREATE INDEX CONCURRENTLY pgbench_accounts_abalance_idx ON pgbench_accounts
USING btree (abalance);
DROP INDEX pgbench_accounts_abalance_idx_delme;

4. at this time on replica:

explain (analyze,verbose) select * from pgbench_accounts where abalance = 1;
pgbench=# explain (analyze,verbose) select * from pgbench_accounts where
abalance = 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using pgbench_accounts_abalance_idx on public.pgbench_accounts
(cost=0.42..4.44 rows=1 width=97) (actual time=655.781..655.781 rows=0
loops=1)
Output: aid, bid, abalance, filler
Index Cond: (pgbench_accounts.abalance = 1)
Planning time: 9388.259 ms
Execution time: 655.900 ms
(5 rows)

pgbench=# explain (analyze,verbose) select * from pgbench_accounts where
abalance = 1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using pgbench_accounts_abalance_idx_delme on
public.pgbench_accounts (cost=0.42..4.44 rows=1 width=97) (actual
time=0.014..0.014 rows=0 loops=1)
Output: aid, bid, abalance, filler
Index Cond: (pgbench_accounts.abalance = 1)
Planning time: 0.321 ms
Execution time: 0.049 ms
(5 rows)

pgbench=# explain (analyze,verbose) select * from pgbench_accounts where
abalance = 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Seq Scan on public.pgbench_accounts (cost=0.00..28894.00 rows=1 width=97)
(actual time=3060.451..3060.451 rows=0 loops=1)
Output: aid, bid, abalance, filler
Filter: (pgbench_accounts.abalance = 1)
Rows Removed by Filter: 1000000
Planning time: 0.087 ms
Execution time: 3060.484 ms
(6 rows)

pgbench=# \d+ pgbench_accounts
Table "public.pgbench_accounts"
Column | Type | Modifiers | Storage | Stats target | Description
--------------------------------------------------+-----------
aid | integer | not null | plain | |
bid | integer | | plain | |
abalance | integer | | plain | |
filler | character(84) | | extended | |
Indexes:
"pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
"pgbench_accounts_abalance_idx" btree (abalance)
Options: fillfactor=100

​New opened session successfully uses this index.
Tested with PostgreSQL 9.5.1.

---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


From: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
To: Васильев Дмитрий <d(dot)vasilyev(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation cache invalidation on replica
Date: 2016-02-26 13:36:17
Message-ID: CAPpHfdswps_Wobo-0wnCzp5P1Od4L_utwQ7CZ7nxqz8pkG0X2A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 26, 2016 at 3:41 PM, Васильев Дмитрий <d(dot)vasilyev(at)postgrespro(dot)ru
> wrote:

> Session opened on replica doesn't see concurrently created indexes at this
> time on master.
>

As I get, on standby index is visible when you run SQL queries on catalog
tables (that is what \d+ does), but planner doesn't pick it. Assuming that
in new session planner picks this index, it seems to be bug for me.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From: Васильев Дмитрий <d(dot)vasilyev(at)postgrespro(dot)ru>
To: Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation cache invalidation on replica
Date: 2016-02-26 13:44:54
Message-ID: CAB-SwXaOhg=bhNHgd5YBzprYaZ+m0wWEJkqGjrKE=WDEJh=OFg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

This is obviously a bug because without "concurrently" create index this do
not reproduce.

---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company

2016-02-26 16:36 GMT+03:00 Alexander Korotkov <a(dot)korotkov(at)postgrespro(dot)ru>:

> On Fri, Feb 26, 2016 at 3:41 PM, Васильев Дмитрий <
> d(dot)vasilyev(at)postgrespro(dot)ru> wrote:
>
>> Session opened on replica doesn't see concurrently created indexes at
>> this time on master.
>>
>
> As I get, on standby index is visible when you run SQL queries on catalog
> tables (that is what \d+ does), but planner doesn't pick it. Assuming that
> in new session planner picks this index, it seems to be bug for me.
>
> ------
> Alexander Korotkov
> Postgres Professional: http://www.postgrespro.com
> The Russian Postgres Company
>


From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Relation cache invalidation on replica
Date: 2016-02-26 15:05:55
Message-ID: 56D069D3.7040407@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

The reason of the problem is that invalidation messages are not
delivered to replica after the end of concurrent create index.
Invalidation messages are included in xlog as part of transaction commit
record.
Concurrent index create is split into three transaction, last of which
is just performing inplace update of index tuple, marking it as valid
and invalidating cache. But as far as this transaction is not assigned
XID, no transaction record is created in WAL and send to replicas. As a
result, replica doesn't receive this invalidation messages.

To fix the problem it is just enough to assign XID to transaction.
It can be done by adding GetCurrentTransactionId() call to the end of
DefineIdnex function:

diff --git a/src/backend/commands/indexcmds.c
b/src/backend/commands/indexcmds.c
index 13b04e6..1024603 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -881,6 +881,12 @@ DefineIndex(Oid relationId,
CacheInvalidateRelcacheByRelid(heaprelid.relId);

/*
+ * Force WAL commit record to ensure that replica receives invalidation
+ * messages.
+ */
+ GetCurrentTransactionId();
+
+ /*
* Last thing to do is release the session-level lock on the
parent table.
*/
UnlockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);

On 26.02.2016 15:41, Васильев Дмитрий wrote:
> Session opened on replica doesn't see concurrently created indexes at
> this time on master.
>
> We have master and replica:
>
> 1. master: pgbench -i -s 10
>
> 2. replica:
> explain (analyze,verbose) select * from pgbench_accounts where
> abalance = 1;
>
> 3. master:
> ALTER INDEX pgbench_accounts_abalance_idx RENAME TO
> pgbench_accounts_abalance_idx_delme;
> CREATE INDEX CONCURRENTLY pgbench_accounts_abalance_idx ON
> pgbench_accounts USING btree (abalance);
> DROP INDEX pgbench_accounts_abalance_idx_delme;
>
> 4. at this time on replica:
>
> explain (analyze,verbose) select * from pgbench_accounts where
> abalance = 1;
> pgbench=# explain (analyze,verbose) select * from pgbench_accounts
> where abalance = 1;
> QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using pgbench_accounts_abalance_idx on
> public.pgbench_accounts (cost=0.42..4.44 rows=1 width=97) (actual
> time=655.781..655.781 rows=0 loops=1)
> Output: aid, bid, abalance, filler
> Index Cond: (pgbench_accounts.abalance = 1)
> Planning time: 9388.259 ms
> Execution time: 655.900 ms
> (5 rows)
>
> pgbench=# explain (analyze,verbose) select * from pgbench_accounts
> where abalance = 1;
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
> Index Scan using pgbench_accounts_abalance_idx_delme on
> public.pgbench_accounts (cost=0.42..4.44 rows=1 width=97) (actual
> time=0.014..0.014 rows=0 loops=1)
> Output: aid, bid, abalance, filler
> Index Cond: (pgbench_accounts.abalance = 1)
> Planning time: 0.321 ms
> Execution time: 0.049 ms
> (5 rows)
>
> pgbench=# explain (analyze,verbose) select * from pgbench_accounts
> where abalance = 1;
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------
> Seq Scan on public.pgbench_accounts (cost=0.00..28894.00 rows=1
> width=97) (actual time=3060.451..3060.451 rows=0 loops=1)
> Output: aid, bid, abalance, filler
> Filter: (pgbench_accounts.abalance = 1)
> Rows Removed by Filter: 1000000
> Planning time: 0.087 ms
> Execution time: 3060.484 ms
> (6 rows)
>
> pgbench=# \d+ pgbench_accounts
> Table "public.pgbench_accounts"
> Column | Type | Modifiers | Storage | Stats target | Description
> --------------------------------------------------+-----------
> aid | integer | not null | plain | |
> bid | integer | | plain | |
> abalance | integer | | plain | |
> filler | character(84) | | extended | |
> Indexes:
> "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
> "pgbench_accounts_abalance_idx" btree (abalance)
> Options: fillfactor=100
>
> ​New opened session successfully uses this index.
> Tested with PostgreSQL 9.5.1.
>
> ---
> Dmitry Vasilyev
> Postgres Professional: http://www.postgrespro.com
> Russian Postgres Company
>

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Relation cache invalidation on replica
Date: 2016-02-27 00:29:58
Message-ID: 20160227002958.peftvmcx4dxwe244@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-02-26 18:05:55 +0300, Konstantin Knizhnik wrote:
> The reason of the problem is that invalidation messages are not delivered to
> replica after the end of concurrent create index.
> Invalidation messages are included in xlog as part of transaction commit
> record.
> Concurrent index create is split into three transaction, last of which is
> just performing inplace update of index tuple, marking it as valid and
> invalidating cache. But as far as this transaction is not assigned XID, no
> transaction record is created in WAL and send to replicas. As a result,
> replica doesn't receive this invalidation messages.

Ugh, that's a fairly ugly bug.

> To fix the problem it is just enough to assign XID to transaction.
> It can be done by adding GetCurrentTransactionId() call to the end of
> DefineIdnex function:

I think it'd be a better idea to always create a commit record if
there's pending invalidation messages. It looks to me that that'd be a
simple addition to RecordTransactionCommit. Otherwise we'd end up with
something rather fragile, relying on people to recognize such
problems. E.g. Vacuum and analyze also have this problem.

Tom, everyone, do you see any reason not to go with such an approach?
Not sending invalidations we've queued up seems like it could cause
rather serious problems with HS and/or logical decoding.

Basically I'm thinking about assigning an xid
if (nrels != 0 || nmsgs != 0 || RelcacheInitFileInval)

since we really don't ever want to miss sending out WAL in those
cases. Strictly speaking we don't need an xid, but it seems rather
fragile to suddenly have commit records without one.

Andres


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation cache invalidation on replica
Date: 2016-02-27 00:33:52
Message-ID: CANP8+jK7sG1=rbGqq-8unRSZj0SGDDLjVm5uOW0-bZAO3hR-+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 27 February 2016 at 00:29, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2016-02-26 18:05:55 +0300, Konstantin Knizhnik wrote:
> > The reason of the problem is that invalidation messages are not
> delivered to
> > replica after the end of concurrent create index.
> > Invalidation messages are included in xlog as part of transaction commit
> > record.
> > Concurrent index create is split into three transaction, last of which is
> > just performing inplace update of index tuple, marking it as valid and
> > invalidating cache. But as far as this transaction is not assigned XID,
> no
> > transaction record is created in WAL and send to replicas. As a result,
> > replica doesn't receive this invalidation messages.
>
> Ugh, that's a fairly ugly bug.

Looking now.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation cache invalidation on replica
Date: 2016-02-27 01:16:34
Message-ID: CANP8+jKtJmfZcfwqe8p_DZwQKPUywq8hZ-KvLoe7ri3nPX0yLQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 27 February 2016 at 00:33, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:

> On 27 February 2016 at 00:29, Andres Freund <andres(at)anarazel(dot)de> wrote:
>
>> On 2016-02-26 18:05:55 +0300, Konstantin Knizhnik wrote:
>> > The reason of the problem is that invalidation messages are not
>> delivered to
>> > replica after the end of concurrent create index.
>> > Invalidation messages are included in xlog as part of transaction commit
>> > record.
>> > Concurrent index create is split into three transaction, last of which
>> is
>> > just performing inplace update of index tuple, marking it as valid and
>> > invalidating cache. But as far as this transaction is not assigned XID,
>> no
>> > transaction record is created in WAL and send to replicas. As a result,
>> > replica doesn't receive this invalidation messages.
>>
>> Ugh, that's a fairly ugly bug.
>
>
> Looking now.
>

If the above is true, then the proposed fix wouldn't work either.

No point in sending a cache invalidation message on the standby if you
haven't also written WAL, since the catalog re-read would just see the old
row.

heap_inplace_update() does write WAL, which blows away the starting premise.

So I'm not seeing this as an extant bug in an open source version of
PostgreSQL, in my current understanding.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Andres Freund <andres(at)anarazel(dot)de>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation cache invalidation on replica
Date: 2016-02-27 01:23:18
Message-ID: 20160227012318.fnjtmtg5q5qvllqk@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-02-27 01:16:34 +0000, Simon Riggs wrote:
> If the above is true, then the proposed fix wouldn't work either.
>
> No point in sending a cache invalidation message on the standby if you
> haven't also written WAL, since the catalog re-read would just see the old
> row.
>
> heap_inplace_update() does write WAL, which blows away the starting premise.

I'm not following here. heap_inplace_update() indeed writes WAL, but it
does *NOT* (and may not) assign an xid. Thus we're not emitting the
relcache invalidation queued in DefineIndex(), as
RecordTransactionCommit() currently skips emitting a commit record if
there's no xid.

> So I'm not seeing this as an extant bug in an open source version of
> PostgreSQL, in my current understanding.

But the first message in the thread demonstrated a reproducible problem?

Andres


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation cache invalidation on replica
Date: 2016-02-27 01:45:57
Message-ID: CANP8+jLBdAopmx-GRkmooa1=pNM3Nh=ymCEFRWjZTiKxLeJJmg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 27 February 2016 at 01:23, Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2016-02-27 01:16:34 +0000, Simon Riggs wrote:
> > If the above is true, then the proposed fix wouldn't work either.
> >
> > No point in sending a cache invalidation message on the standby if you
> > haven't also written WAL, since the catalog re-read would just see the
> old
> > row.
> >
> > heap_inplace_update() does write WAL, which blows away the starting
> premise.
>
> I'm not following here. heap_inplace_update() indeed writes WAL, but it
> does *NOT* (and may not) assign an xid. Thus we're not emitting the
> relcache invalidation queued in DefineIndex(), as
> RecordTransactionCommit() currently skips emitting a commit record if
> there's no xid.

OK.

Surely then the fix is to make heap_inplace_update() assign an xid? That
way any catalog change will always generate a commit record containing the
invalidation that goes with the change. No need to fix up the breakage
later.

The other heap_insert|update|delete functions (and similar) all assign xid,
so it is consistent for us to do that for inplace_update also.

diff --git a/src/backend/access/heap/heapam.c
b/src/backend/access/heap/heapam.c
index f443742..94282a0 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -6022,6 +6022,7 @@ heap_abort_speculative(Relation relation, HeapTuple
tuple)
void
heap_inplace_update(Relation relation, HeapTuple tuple)
{
+ TransactionId xid = GetCurrentTransactionId();
Buffer buffer;
Page page;
OffsetNumber offnum;

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
consistent_xid_assignment_for_inplace.v1.patch application/octet-stream 440 bytes

From: Andres Freund <andres(at)anarazel(dot)de>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation cache invalidation on replica
Date: 2016-02-27 02:14:05
Message-ID: 20160227021405.mcowwq3u2vcg4g6n@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2016-02-27 01:45:57 +0000, Simon Riggs wrote:
> Surely then the fix is to make heap_inplace_update() assign an xid? That
> way any catalog change will always generate a commit record containing the
> invalidation that goes with the change. No need to fix up the breakage
> later.

Well, we could, but it'd also break things where we rely on
heap_inplace_update not assigning an xid. I'm not seing why that's
better than my proposal of doing this
(http://archives.postgresql.org/message-id/20160227002958.peftvmcx4dxwe244%40alap3.anarazel.de),
by emitting a commit record in RecordTransactionCommmit() if nrels != 0
|| nmsgs != 0 || RelcacheInitFileInval
.

> The other heap_insert|update|delete functions (and similar) all assign xid,
> so it is consistent for us to do that for inplace_update also.

I don't think that follows. Inplace updates an absolute special case,
where are *not allowed* to include an xid in the tuple.

Andres


From: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation cache invalidation on replica
Date: 2016-02-27 07:52:12
Message-ID: 56D155AC.2080805@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 02/27/2016 04:16 AM, Simon Riggs wrote:
> On 27 February 2016 at 00:33, Simon Riggs <simon(at)2ndquadrant(dot)com <mailto:simon(at)2ndquadrant(dot)com>> wrote:
>
> On 27 February 2016 at 00:29, Andres Freund <andres(at)anarazel(dot)de <mailto:andres(at)anarazel(dot)de>> wrote:
>
> On 2016-02-26 18:05:55 +0300, Konstantin Knizhnik wrote:
> > The reason of the problem is that invalidation messages are not delivered to
> > replica after the end of concurrent create index.
> > Invalidation messages are included in xlog as part of transaction commit
> > record.
> > Concurrent index create is split into three transaction, last of which is
> > just performing inplace update of index tuple, marking it as valid and
> > invalidating cache. But as far as this transaction is not assigned XID, no
> > transaction record is created in WAL and send to replicas. As a result,
> > replica doesn't receive this invalidation messages.
>
> Ugh, that's a fairly ugly bug.
>
>
> Looking now.
>
>
> If the above is true, then the proposed fix wouldn't work either.
>
> No point in sending a cache invalidation message on the standby if you haven't also written WAL, since the catalog re-read would just see the old row.
>
> heap_inplace_update() does write WAL, which blows away the starting premise.
>
> So I'm not seeing this as an extant bug in an open source version of PostgreSQL, in my current understanding.
>

Inplace update really creates record in WAL and this is why index is marked as valid at replica.
But invalidation messages are sent only with transaction commit record and such record is not created in this case,
because there is no assigned XID.

This is a real bug which originally observed by one of our customers with different versions of Postgres (last one them have tried was 9.5.1).
Then we reproduced it locally and determined the reason of the problem.
Repro scenario is very simple: you just need to create large enough table (pgbench with scale factor 100 works well in my case)
so that "create index concurrently" takes substantial amount of time. If, while this statement is in progress, you will execute some query at replica which
uses this index, then it will cache state of relation without index. And after even when index is actually constructed, it will never be used in this backend (but other backends at replica will use it).

I am not sure about the best way of fixing the problem.
I have not tested Andreas proposal:

if (nrels != 0 || nmsgs != 0 || RelcacheInitFileInval)

if it actually fixes the problem.
Assigning XID in heap_inplace_update definitely should work.
It is better than forcing assignment XID in DefineIndex? I am not sure, because this problem seems to be related only with concurrent update
(but may be I am wrong).
At least not all inplace updates should cause catalog invalidation and so require XID assignment.

> --
> Simon Riggs http://www.2ndQuadrant.com/ <http://www.2ndquadrant.com/>
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation cache invalidation on replica
Date: 2016-02-28 09:53:36
Message-ID: CANP8+jJCFYnBQy0VJpnv+EKxHDsH8-ABv4QsJ+VD=R1_mw=YEQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 27 February 2016 at 07:52, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru
> wrote:

> On 02/27/2016 04:16 AM, Simon Riggs wrote:
>
> On 27 February 2016 at 00:33, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>
>> On 27 February 2016 at 00:29, Andres Freund <andres(at)anarazel(dot)de> wrote:
>>
>>> On 2016-02-26 18:05:55 +0300, Konstantin Knizhnik wrote:
>>> > The reason of the problem is that invalidation messages are not
>>> delivered to
>>> > replica after the end of concurrent create index.
>>> > Invalidation messages are included in xlog as part of transaction
>>> commit
>>> > record.
>>> > Concurrent index create is split into three transaction, last of which
>>> is
>>> > just performing inplace update of index tuple, marking it as valid and
>>> > invalidating cache. But as far as this transaction is not assigned
>>> XID, no
>>> > transaction record is created in WAL and send to replicas. As a result,
>>> > replica doesn't receive this invalidation messages.
>>>
>>> Ugh, that's a fairly ugly bug.
>>
>>
>> Looking now.
>>
>
> If the above is true, then the proposed fix wouldn't work either.
>
> No point in sending a cache invalidation message on the standby if you
> haven't also written WAL, since the catalog re-read would just see the old
> row.
>
> heap_inplace_update() does write WAL, which blows away the starting
> premise.
>
> So I'm not seeing this as an extant bug in an open source version of
> PostgreSQL, in my current understanding.
>
>
> Inplace update really creates record in WAL and this is why index is
> marked as valid at replica.
> But invalidation messages are sent only with transaction commit record and
> such record is not created in this case,
> because there is no assigned XID.
>
> This is a real bug which originally observed by one of our customers with
> different versions of Postgres (last one them have tried was 9.5.1).
> Then we reproduced it locally and determined the reason of the problem.
> Repro scenario is very simple: you just need to create large enough table
> (pgbench with scale factor 100 works well in my case)
> so that "create index concurrently" takes substantial amount of time. If,
> while this statement is in progress, you will execute some query at replica
> which
> uses this index, then it will cache state of relation without index. And
> after even when index is actually constructed, it will never be used in
> this backend (but other backends at replica will use it).
>

OK, so this is a fairly restricted bug. I was wondering how we'd missed it
for years. It wouldn't affect all backends, just those that accessed the
index before it was valid. New backends and restarts wouldn't be affected.

> I am not sure about the best way of fixing the problem.
> I have not tested Andreas proposal:
>
> if (nrels != 0 || nmsgs != 0 || RelcacheInitFileInval)
>
> if it actually fixes the problem.
> Assigning XID in heap_inplace_update definitely should work.
> It is better than forcing assignment XID in DefineIndex? I am not sure,
> because this problem seems to be related only with concurrent update
> (but may be I am wrong).
> At least not all inplace updates should cause catalog invalidation and so
> require XID assignment.
>

We have various proposals for fixing this, so on consideration here's what
I think we should do...

1. Ignore my first patch to always set an xid. Andres thought that this may
break something else could be true, so is not worth the risk.

2. Apply Konstantin's patch to fix this situation for the specific case
only.

3. Take Andres' idea and build that in as protection. We currently check
that nrels != 0 and throw an ERROR. We should do the same thing if there is
an invalidation event, so that we catch errors not just ignore them and
issue the commit anyway. This will check that there are no other cases in
other code.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
cache_inval_has_commit_rec.v1.patch application/octet-stream 730 bytes

From: Victor Yegorov <vyegorov(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation cache invalidation on replica
Date: 2017-03-13 06:22:22
Message-ID: CAGnEboihavcf2L8MepW6U10qdCjJ=Kbq6+rEXK1rmVwKBi4cOQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2016-02-28 11:53 GMT+02:00 Simon Riggs <simon(at)2ndquadrant(dot)com>:

> We have various proposals for fixing this, so on consideration here's what
> I think we should do...
>
> 1. Ignore my first patch to always set an xid. Andres thought that this
> may break something else could be true, so is not worth the risk.
>
> 2. Apply Konstantin's patch to fix this situation for the specific case
> only.
>
> 3. Take Andres' idea and build that in as protection. We currently check
> that nrels != 0 and throw an ERROR. We should do the same thing if there is
> an invalidation event, so that we catch errors not just ignore them and
> issue the commit anyway. This will check that there are no other cases in
> other code.
>

I have come across this old thread.

I think we're hitting this particular issue quite frequently when
rebuilding indexes on master after big-volume data manipulations.

We have `pgbouncer` in transaction mode for both, master and slave,
therefore it's quite typical to have sessions on slave, that
were using indexes before those we're re-created. Sad, but right now
maintenance is a big performance hit for our applications,
we have to re-connect them to start using indexes again.

Are there any chances to get fix for this issue released in 10.0 and,
perhaps, backpatched also?

--
Victor Yegorov


From: Andres Freund <andres(at)anarazel(dot)de>
To: Victor Yegorov <vyegorov(at)gmail(dot)com>,Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>,Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>,PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation cache invalidation on replica
Date: 2017-03-13 07:22:22
Message-ID: 8B83B90B-3968-41BF-9309-16F9F37FD56A@anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On March 12, 2017 11:22:22 PM PDT, Victor Yegorov <vyegorov(at)gmail(dot)com> wrote:
>2016-02-28 11:53 GMT+02:00 Simon Riggs <simon(at)2ndquadrant(dot)com>:
>
>> We have various proposals for fixing this, so on consideration here's
>what
>> I think we should do...
>>
>> 1. Ignore my first patch to always set an xid. Andres thought that
>this
>> may break something else could be true, so is not worth the risk.
>>
>> 2. Apply Konstantin's patch to fix this situation for the specific
>case
>> only.
>>
>> 3. Take Andres' idea and build that in as protection. We currently
>check
>> that nrels != 0 and throw an ERROR. We should do the same thing if
>there is
>> an invalidation event, so that we catch errors not just ignore them
>and
>> issue the commit anyway. This will check that there are no other
>cases in
>> other code.
>>
>
>I have come across this old thread.
>
>I think we're hitting this particular issue quite frequently when
>rebuilding indexes on master after big-volume data manipulations.
>
>We have `pgbouncer` in transaction mode for both, master and slave,
>therefore it's quite typical to have sessions on slave, that
>were using indexes before those we're re-created. Sad, but right now
>maintenance is a big performance hit for our applications,
>we have to re-connect them to start using indexes again.
>
>Are there any chances to get fix for this issue released in 10.0 and,
>perhaps, backpatched also?

I'm not at my computer right now, but I recall committing something like my approach.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


From: Victor Yegorov <vyegorov(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation cache invalidation on replica
Date: 2017-04-19 14:07:49
Message-ID: CAGnEbogjOh57Pf6MvpDWirW=5Q=UgH5-eDztbrxhPp51LWwM7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2017-03-13 9:22 GMT+02:00 Andres Freund <andres(at)anarazel(dot)de>:

> >I think we're hitting this particular issue quite frequently when
> >rebuilding indexes on master after big-volume data manipulations.
> >
> >We have `pgbouncer` in transaction mode for both, master and slave,
> >therefore it's quite typical to have sessions on slave, that
> >were using indexes before those we're re-created. Sad, but right now
> >maintenance is a big performance hit for our applications,
> >we have to re-connect them to start using indexes again.
> >
> >Are there any chances to get fix for this issue released in 10.0 and,
> >perhaps, backpatched also?
>
> I'm not at my computer right now, but I recall committing something like
> my approach.

Andres, can you point me on the commit you're mentioning, please?

--
Victor Yegorov


From: Andres Freund <andres(at)anarazel(dot)de>
To: Victor Yegorov <vyegorov(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation cache invalidation on replica
Date: 2017-04-19 20:08:28
Message-ID: 20170419200828.r5zcvddxicrndafp@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2017-04-19 17:07:49 +0300, Victor Yegorov wrote:
> 2017-03-13 9:22 GMT+02:00 Andres Freund <andres(at)anarazel(dot)de>:
>
> > >I think we're hitting this particular issue quite frequently when
> > >rebuilding indexes on master after big-volume data manipulations.
> > >
> > >We have `pgbouncer` in transaction mode for both, master and slave,
> > >therefore it's quite typical to have sessions on slave, that
> > >were using indexes before those we're re-created. Sad, but right now
> > >maintenance is a big performance hit for our applications,
> > >we have to re-connect them to start using indexes again.
> > >
> > >Are there any chances to get fix for this issue released in 10.0 and,
> > >perhaps, backpatched also?
> >
> > I'm not at my computer right now, but I recall committing something like
> > my approach.
>
>
> Andres, can you point me on the commit you're mentioning, please?

I was thinking of c6ff84b06a68b71719aa1aaa5f6704d8db1b51f8

- Andres


From: Victor Yegorov <vyegorov(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Relation cache invalidation on replica
Date: 2017-04-19 21:15:05
Message-ID: CAGnEbogWKmwSnfg3=8d5=BkkuLu91O3GVYLO_gjby46vk-ok=A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

2017-04-19 23:08 GMT+03:00 Andres Freund <andres(at)anarazel(dot)de>:

> I was thinking of c6ff84b06a68b71719aa1aaa5f6704d8db1b51f8
>

Thanks a lot!

I found, that this got into 9.6 already via the Release Notes:
https://www.postgresql.org/docs/current/static/release-9-6.html#AEN131397
Will certainly check this out soon!

--
Victor Yegorov