WAL logging problem in 9.4.3?

From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: WAL logging problem in 9.4.3?
Date: 2015-07-02 22:05:24
Message-ID: 20150702220524.GA9392@svana.org
Lists: pgsql-hackers

Hoi,

I ran into this in our CI setup and I think it's an actual bug. The
issue appears to be that when a table is created *and truncated* in a
single transaction, the WAL contains a truncate record it shouldn't,
such that if the database crashes you end up with a broken index.
It would also lose any data that was in the table at commit time.

I haven't tested 9.4.4 yet, though I don't see anything in the release
notes that resembles this.

Detail:

=== Start with an empty database

martijn(at)martijn-jessie:$ psql ctmp -h localhost -U username
Password for user username:
psql (9.4.3)
Type "help" for help.

ctmp=# begin;
BEGIN
ctmp=# create table test(id serial primary key);
CREATE TABLE
ctmp=# truncate table test;
TRUNCATE TABLE
ctmp=# commit;
COMMIT
ctmp=# select relname, relfilenode from pg_class where relname like 'test%';
relname | relfilenode
-------------+-------------
test | 16389
test_id_seq | 16387
test_pkey | 16393
(3 rows)

=== Note: if you do a CHECKPOINT here the issue doesn't happen
=== obviously.

ctmp=# \q
martijn(at)martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
[sudo] password for martijn:
-rw------- 1 messagebus ssl-cert 8192 Jul 2 23:34 /data/postgres/base/16385/16387
-rw------- 1 messagebus ssl-cert 0 Jul 2 23:34 /data/postgres/base/16385/16389
-rw------- 1 messagebus ssl-cert 8192 Jul 2 23:34 /data/postgres/base/16385/16393

=== Note the index file is 8KB.
=== At this point nuke the database server (in this case it was simply
=== destroying the container it was running in).

=== Dump the xlogs just to show what got recorded. Note there's a
=== truncate for the data file and the index file.

martijn(at)martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 |grep -wE '16389|16387|16393'
rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
rmgr: Storage len (rec/tot): 16/ 48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc: file create: base/16385/16387
rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc: file create: base/16385/16389
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc: file create: base/16385/16393
rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc: log: rel 1663/16385/16387
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc: file truncate: base/16385/16389 to 0 blocks
rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc: file truncate: base/16385/16393 to 0 blocks
pg_xlogdump: FATAL: error in WAL record at 0/16BE710: record with zero length at 0/16BE740

=== Start the DB up again

database_1 | LOG: database system was interrupted; last known up at 2015-07-02 21:08:05 UTC
database_1 | LOG: database system was not properly shut down; automatic recovery in progress
database_1 | LOG: redo starts at 0/16A92A8
database_1 | LOG: record with zero length at 0/16BE740
database_1 | LOG: redo done at 0/16BE710
database_1 | LOG: last completed transaction was at log time 2015-07-02 21:34:45.664989+00
database_1 | LOG: database system is ready to accept connections
database_1 | LOG: autovacuum launcher started

=== Oops, the index file is empty now

martijn(at)martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
-rw------- 1 messagebus ssl-cert 8192 Jul 2 23:37 /data/postgres/base/16385/16387
-rw------- 1 messagebus ssl-cert 0 Jul 2 23:34 /data/postgres/base/16385/16389
-rw------- 1 messagebus ssl-cert 0 Jul 2 23:37 /data/postgres/base/16385/16393

martijn(at)martijn-jessie:$ psql ctmp -h localhost -U username
Password for user username:
psql (9.4.3)
Type "help" for help.

=== And now the index is broken. I think the only reason it doesn't
=== complain about the data file is because zero bytes there is OK. But if
=== the table had data before it would be gone now.

ctmp=# select * from test;
ERROR: could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes

ctmp=# select version();
version
-----------------------------------------------------------------------------------------------
PostgreSQL 9.4.3 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
(1 row)

Hope this helps.
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer


From: Andres Freund <andres(at)anarazel(dot)de>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-02 22:21:02
Message-ID: 20150702222102.GG30708@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
> === Start with an empty database

My guess is you have wal_level = minimal?

> ctmp=# begin;
> BEGIN
> ctmp=# create table test(id serial primary key);
> CREATE TABLE
> ctmp=# truncate table test;
> TRUNCATE TABLE
> ctmp=# commit;
> COMMIT
> ctmp=# select relname, relfilenode from pg_class where relname like 'test%';
> relname | relfilenode
> -------------+-------------
> test | 16389
> test_id_seq | 16387
> test_pkey | 16393
> (3 rows)
>

> === Note the index file is 8KB.
> === At this point nuke the database server (in this case it was simply
> === destroying the container it was running in).

How did you continue from there? The container has persistent storage?
Or are you reapplying the WAL somewhere else?

> === Dump the xlogs just to show what got recorded. Note there's a
> === truncate for the data file and the index file.

That should be ok.

> martijn(at)martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 |grep -wE '16389|16387|16393'
> rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
> rmgr: Storage len (rec/tot): 16/ 48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc: file create: base/16385/16387
> rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc: log: rel 1663/16385/16387
> rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc: file create: base/16385/16389
> rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc: file create: base/16385/16393
> rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc: log: rel 1663/16385/16387
> rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc: file truncate: base/16385/16389 to 0 blocks
> rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc: file truncate: base/16385/16393 to 0 blocks
> pg_xlogdump: FATAL: error in WAL record at 0/16BE710: record with zero length at 0/16BE740

Note that the truncate will lead to a new, different, relfilenode.

> === Start the DB up again
>
> database_1 | LOG: database system was interrupted; last known up at 2015-07-02 21:08:05 UTC
> database_1 | LOG: database system was not properly shut down; automatic recovery in progress
> database_1 | LOG: redo starts at 0/16A92A8
> database_1 | LOG: record with zero length at 0/16BE740
> database_1 | LOG: redo done at 0/16BE710
> database_1 | LOG: last completed transaction was at log time 2015-07-02 21:34:45.664989+00
> database_1 | LOG: database system is ready to accept connections
> database_1 | LOG: autovacuum launcher started
>
> === Oops, the index file is empty now

That's probably just the old index file?

> martijn(at)martijn-jessie:$ sudo ls -l /data/postgres/base/16385/{16389,16387,16393}
> -rw------- 1 messagebus ssl-cert 8192 Jul 2 23:37 /data/postgres/base/16385/16387
> -rw------- 1 messagebus ssl-cert 0 Jul 2 23:34 /data/postgres/base/16385/16389
> -rw------- 1 messagebus ssl-cert 0 Jul 2 23:37 /data/postgres/base/16385/16393
>
> martijn(at)martijn-jessie:$ psql ctmp -h localhost -U username
> Password for user username:
> psql (9.4.3)
> Type "help" for help.
>
> === And now the index is broken. I think the only reason it doesn't
> === complain about the data file is because zero bytes there is OK. But if
> === the table had data before it would be gone now.
>
> ctmp=# select * from test;
> ERROR: could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes

Hm. I can't reproduce this. Can you include a bit more details about how
to reproduce?

Greetings,

Andres Freund


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 05:20:49
Message-ID: 20150703052048.GA20285@svana.org
Lists: pgsql-hackers

On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
> Hi,
>
> On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
> > === Start with an empty database
>
> My guess is you have wal_level = minimal?

Default config, was just initdb'd. So yes, the default wal_level =
minimal.

> > === Note the index file is 8KB.
> > === At this point nuke the database server (in this case it was simply
> > === destroying the container it was running in).
>
> How did you continue from there? The container has persistent storage?
> Or are you reapplying the WAL somewhere else?

The container has persistent storage on the host. What I think is
actually unusual is that the script that started postgres was missing
an 'exec', so postgres never got the signal to shut down.

> > martijn(at)martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 |grep -wE '16389|16387|16393'
> > rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
> > rmgr: Storage len (rec/tot): 16/ 48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc: file create: base/16385/16387
> > rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc: log: rel 1663/16385/16387
> > rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc: file create: base/16385/16389
> > rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc: file create: base/16385/16393
> > rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc: log: rel 1663/16385/16387
> > rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc: file truncate: base/16385/16389 to 0 blocks
> > rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc: file truncate: base/16385/16393 to 0 blocks
> > pg_xlogdump: FATAL: error in WAL record at 0/16BE710: record with zero length at 0/16BE740
>
> Note that the truncate will lead to a new, different, relfilenode.

Really? Comparing the relfilenodes gives the same values before and
after the truncate.
>
> > ctmp=# select * from test;
> > ERROR: could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes
>
> Hm. I can't reproduce this. Can you include a bit more details about how
> to reproduce?

Hmm, for me it is 100% reproducable. Are you familiar with Docker? I
can probably construct a Dockerfile that reproduces it pretty reliably.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 05:34:44
Message-ID: CAHGQGwEXta1tMF6o-FcDcjMgCF=jnZLj3v3EoePvsEN4aexJnA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 3, 2015 at 2:20 PM, Martijn van Oosterhout
<kleptog(at)svana(dot)org> wrote:
> On Fri, Jul 03, 2015 at 12:21:02AM +0200, Andres Freund wrote:
>> Hi,
>>
>> On 2015-07-03 00:05:24 +0200, Martijn van Oosterhout wrote:
>> > === Start with an empty database
>>
>> My guess is you have wal_level = minimal?
>
> Default config, was just initdb'd. So yes, the default wal_level =
> minimal.
>
>> > === Note the index file is 8KB.
>> > === At this point nuke the database server (in this case it was simply
>> > === destroying the container it was running in).
>>
>> How did you continue from there? The container has persistent storage?
>> Or are you reapplying the WAL somewhere else?
>
> The container has persistent storage on the host. What I think is
> actually unusual is that the script that started postgres was missing
> an 'exec', so postgres never got the signal to shut down.
>
>> > martijn(at)martijn-jessie:$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /data/postgres/pg_xlog/ 000000010000000000000001 |grep -wE '16389|16387|16393'
>> > rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 0/016A9240, prev 0/016A9200, bkp: 0000, desc: checkpoint: redo 0/16A9240; tli 1; prev tli 1; fpw true; xid 0/686; oid 16387; multi 1; offset 0; oldest xid 673 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; shutdown
>> > rmgr: Storage len (rec/tot): 16/ 48, tx: 0, lsn: 0/016A92D0, prev 0/016A92A8, bkp: 0000, desc: file create: base/16385/16387
>> > rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016B5E50, prev 0/016B5D88, bkp: 0000, desc: log: rel 1663/16385/16387
>> > rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016B5F10, prev 0/016B5E50, bkp: 0000, desc: file create: base/16385/16389
>> > rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BB028, prev 0/016BAFD8, bkp: 0000, desc: file create: base/16385/16393
>> > rmgr: Sequence len (rec/tot): 158/ 190, tx: 686, lsn: 0/016BE4F8, prev 0/016BE440, bkp: 0000, desc: log: rel 1663/16385/16387
>> > rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6B0, prev 0/016BE660, bkp: 0000, desc: file truncate: base/16385/16389 to 0 blocks
>> > rmgr: Storage len (rec/tot): 16/ 48, tx: 686, lsn: 0/016BE6E0, prev 0/016BE6B0, bkp: 0000, desc: file truncate: base/16385/16393 to 0 blocks
>> > pg_xlogdump: FATAL: error in WAL record at 0/16BE710: record with zero length at 0/16BE740
>>
>> Note that the truncate will lead to a new, different, relfilenode.
>
> Really? Comparing the relfilenodes gives the same values before and
> after the truncate.

Yep, the relfilenodes are not changed in this case because CREATE TABLE and
TRUNCATE were executed in the same transaction block.

>> > ctmp=# select * from test;
>> > ERROR: could not read block 0 in file "base/16385/16393": read only 0 of 8192 bytes
>>
>> Hm. I can't reproduce this. Can you include a bit more details about how
>> to reproduce?
>
> Hmm, for me it is 100% reproducable. Are you familiar with Docker? I
> can probably construct a Dockerfile that reproduces it pretty reliably.

I could reproduce the problem in the master branch by doing
the following steps.

1. start the PostgreSQL server with wal_level = minimal
2. execute the following SQL statements
begin;
create table test(id serial primary key);
truncate table test;
commit;
3. shutdown the server with immediate mode
4. restart the server (crash recovery occurs)
5. execute the following SQL statement
select * from test;

The optimization of the TRUNCATE operation that we can use when
CREATE TABLE and TRUNCATE are executed in the same transaction block
seems to cause the problem. In this case, only the index file truncation is
logged, and the index creation in btbuild() is not logged because wal_level
is minimal. Then at the subsequent crash recovery, the index file is truncated
to 0 bytes... A very simple fix is to log the index creation in that case,
but I'm not sure if that's OK to do...
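The failure mode can be sketched with a toy model (in Python, purely illustrative — this is not PostgreSQL code): under wal_level = minimal the index build writes its metapage directly to disk without a WAL record, so the only records replay ever sees for the index are the file create and the truncate.

```python
# Toy model of WAL replay for the index relation under wal_level = minimal.
# btbuild() writes the metapage directly to disk without a WAL record, so
# the WAL only contains "create" and "truncate" for the index file.

def run_server(wal, disk):
    # What the live server does during the transaction.
    wal.append(("create", "16393"))
    disk["16393"] = ["metapage"]          # btbuild: written, NOT WAL-logged
    wal.append(("truncate", "16393", 0))  # TRUNCATE: IS WAL-logged
    disk["16393"] = ["metapage"]          # rebuilt after truncate, again unlogged

def replay(wal):
    # Crash recovery: rebuild on-disk state purely from the WAL records.
    disk = {}
    for rec in wal:
        if rec[0] == "create":
            disk[rec[1]] = []
        elif rec[0] == "truncate":
            disk[rec[1]] = disk[rec[1]][:rec[2]]
    return disk

wal, live_disk = [], {}
run_server(wal, live_disk)
recovered = replay(wal)

print(len(live_disk["16393"]))   # 1 block: the metapage
print(len(recovered["16393"]))   # 0 blocks: "could not read block 0"
```

After replay the index file has zero blocks, matching the "read only 0 of 8192 bytes" error above.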

Regards,

--
Fujii Masao


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 06:01:37
Message-ID: 20150703060137.GB20285@svana.org
Lists: pgsql-hackers

On Fri, Jul 03, 2015 at 02:34:44PM +0900, Fujii Masao wrote:
> > Hmm, for me it is 100% reproducable. Are you familiar with Docker? I
> > can probably construct a Dockerfile that reproduces it pretty reliably.
>
> I could reproduce the problem in the master branch by doing
> the following steps.

Thank you, I wasn't sure if you could kill the server fast enough
without containers, but it looks like immediate mode is enough.

> 1. start the PostgreSQL server with wal_level = minimal
> 2. execute the following SQL statements
> begin;
> create table test(id serial primary key);
> truncate table test;
> commit;
> 3. shutdown the server with immediate mode
> 4. restart the server (crash recovery occurs)
> 5. execute the following SQL statement
> select * from test;
>
> The optimization of the TRUNCATE operation that we can use when
> CREATE TABLE and TRUNCATE are executed in the same transaction block
> seems to cause the problem. In this case, only index file truncation is
> logged, and index creation in btbuild() is not logged because wal_level
> is minimal. Then at the subsequent crash recovery, index file is truncated
> to 0 byte... Very simple fix is to log an index creation in that case,
> but not sure if that's ok to do..

Looks plausible to me.

For reference I attach a small tarball for reproduction with docker.

1. Unpack tarball into empty dir (it has three small files)
2. docker build -t test .
3. docker run -v /tmp/pgtest:/data test
4. docker run -v /tmp/pgtest:/data test

Data dir is in /tmp/pgtest

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer

Attachment Content-Type Size
postgresql-test.tgz application/x-gtar 987 bytes

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 14:44:10
Message-ID: CAHGQGwFtOMv5h8eKX6zf+9_0UxpcTsyH9xAnyv6KdGH4BonLiw@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 3, 2015 at 3:01 PM, Martijn van Oosterhout
<kleptog(at)svana(dot)org> wrote:
> On Fri, Jul 03, 2015 at 02:34:44PM +0900, Fujii Masao wrote:
>> > Hmm, for me it is 100% reproducable. Are you familiar with Docker? I
>> > can probably construct a Dockerfile that reproduces it pretty reliably.
>>
>> I could reproduce the problem in the master branch by doing
>> the following steps.
>
> Thank you, I wasn't sure if you could kill the server fast enough
> without containers, but it looks like immediate mode is enough.
>
>> 1. start the PostgreSQL server with wal_level = minimal
>> 2. execute the following SQL statements
>> begin;
>> create table test(id serial primary key);
>> truncate table test;
>> commit;
>> 3. shutdown the server with immediate mode
>> 4. restart the server (crash recovery occurs)
>> 5. execute the following SQL statement
>> select * from test;
>>
>> The optimization of the TRUNCATE operation that we can use when
>> CREATE TABLE and TRUNCATE are executed in the same transaction block
>> seems to cause the problem. In this case, only index file truncation is
>> logged, and index creation in btbuild() is not logged because wal_level
>> is minimal. Then at the subsequent crash recovery, index file is truncated
>> to 0 byte... Very simple fix is to log an index creation in that case,
>> but not sure if that's ok to do..

In 9.2 or before, this problem doesn't occur because no such error is thrown
even if an index file's size is zero. But in 9.3 or later, since the planner
tries to read the meta page of an index to get the height of the btree,
an empty index file causes such an error. The planner was changed that way by
commit 31f38f28, and the problem seems to be an oversight in that commit.

I'm not familiar with that change to the planner, but ISTM that we can
simply change _bt_getrootheight() so that 0 is returned if the index file is
empty, i.e., the meta page cannot be read, in order to work around the problem.
Thoughts?

Regards,

--
Fujii Masao


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 14:52:00
Message-ID: 28320.1435935120@sss.pgh.pa.us
Lists: pgsql-hackers

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> The optimization of the TRUNCATE operation that we can use when
> CREATE TABLE and TRUNCATE are executed in the same transaction block
> seems to cause the problem. In this case, only index file truncation is
> logged, and index creation in btbuild() is not logged because wal_level
> is minimal. Then at the subsequent crash recovery, index file is truncated
> to 0 byte... Very simple fix is to log an index creation in that case,
> but not sure if that's ok to do..

> In 9.2 or before, this problem doesn't occur because no such error is thrown
> even if an index file size is zero. But in 9.3 or later, since the planner
> tries to read the meta page of an index to get the height of the btree,
> an empty index file causes such error. The planner was changed that way by
> commit 31f38f28, and the problem seems to be an oversight of that commit.

What? You want to blame the planner for failing because the index was
left corrupt by broken WAL replay? A failure would occur anyway at
execution.

regards, tom lane


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 16:39:42
Message-ID: CAHGQGwGzxAyKLUm+z2n43-ee6u976jLmQAz=pPBVzSueD3CAig@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 3, 2015 at 11:52 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
>> The optimization of the TRUNCATE operation that we can use when
>> CREATE TABLE and TRUNCATE are executed in the same transaction block
>> seems to cause the problem. In this case, only index file truncation is
>> logged, and index creation in btbuild() is not logged because wal_level
>> is minimal. Then at the subsequent crash recovery, index file is truncated
>> to 0 byte... Very simple fix is to log an index creation in that case,
>> but not sure if that's ok to do..
>
>> In 9.2 or before, this problem doesn't occur because no such error is thrown
>> even if an index file size is zero. But in 9.3 or later, since the planner
>> tries to read the meta page of an index to get the height of the btree,
>> an empty index file causes such error. The planner was changed that way by
>> commit 31f38f28, and the problem seems to be an oversight of that commit.
>
> What? You want to blame the planner for failing because the index was
> left corrupt by broken WAL replay? A failure would occur anyway at
> execution.

Yep, right. I wasn't thinking of such an index with a file size of 0 as
corrupted, because the reported problem didn't happen before that commit was
added. But that's my fault. Such an index can cause an error even in other code paths.

Okay, so probably we need to change the WAL replay of TRUNCATE so that
the index file is truncated to one containing only the meta page instead of
an empty one. That is, the WAL replay of TRUNCATE would need to call
index_build() after smgrtruncate(), maybe.

Then how should we implement that? Invent a new WAL record type that
calls smgrtruncate() and index_build() during WAL replay? Or add a
special flag to the XLOG_SMGR_TRUNCATE record, and make WAL replay
call index_build() only if the flag is set? Any other good ideas?
Anyway, ISTM that we might need to add or modify a WAL record.

Regards,

--
Fujii Masao


From: Andres Freund <andres(at)anarazel(dot)de>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 16:49:31
Message-ID: 20150703164931.GI3291@awork2.anarazel.de
Lists: pgsql-hackers

On 2015-07-04 01:39:42 +0900, Fujii Masao wrote:
> Okay, so probably we need to change WAL replay of TRUNCATE so that
> the index file is truncated to one containing only meta page instead of
> empty one. That is, the WAL replay of TRUNCATE would need to call
> index_build() after smgrtruncate() maybe.
>
> Then how should we implement that? Invent new WAL record type that
> calls smgrtruncate() and index_build() during WAL replay? Or add the
> special flag to XLOG_SMGR_TRUNCATE record, and make WAL replay
> call index_build() only if the flag is found? Any other good idea?
> Anyway ISTM that we might need to add or modify WAL record.

It's easy enough to log something like a metapage with
log_newpage().

But the more interesting question is why that's not happening
today. RelationTruncateIndexes() does call index_build(), which
should end up WAL-logging the index creation.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 16:53:56
Message-ID: 10146.1435942436@sss.pgh.pa.us
Lists: pgsql-hackers

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> Okay, so probably we need to change WAL replay of TRUNCATE so that
> the index file is truncated to one containing only meta page instead of
> empty one. That is, the WAL replay of TRUNCATE would need to call
> index_build() after smgrtruncate() maybe.

That seems completely unworkable. For one thing, index_build would expect
to be able to do catalog lookups, but we can't assume that the catalogs
are in a good state yet.

I think the responsibility has to be on the WAL-writing end to emit WAL
instructions that lead to a correct on-disk state. Putting complex
behavior into the reading side is fundamentally misguided.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 17:02:29
Message-ID: 20150703170229.GJ3291@awork2.anarazel.de
Lists: pgsql-hackers

On 2015-07-03 18:49:31 +0200, Andres Freund wrote:
> But the more interesting question is why that's not happening
> today. RelationTruncateIndexes() does call index_build(), which
> should end up WAL-logging the index creation.

So that's because there's an XLogIsNeeded() check preventing it.

Maybe I'm just daft right now (35°C outside, 32 inside, so ...), but I'm
currently missing how the whole "skip WAL logging if the relation has just
been truncated" optimization can ever actually be crash-safe unless we
use a new relfilenode (which we don't!).

Sure, we do a heap_sync() at the end of the transaction. That's
nice and all. But it doesn't help if we crash and restart WAL apply
from a checkpoint before the table was created. Because that'll replay
the truncation.

That's much worse than just the indexes - the rows added by a COPY
without WAL logging will also be truncated away, no?
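That scenario can be sketched with another toy model (illustrative Python only, not PostgreSQL code): the COPY'd rows are flushed to disk by heap_sync() at commit, but the logged truncate record is still replayed over them during crash recovery.

```python
# Toy model: rows loaded by a WAL-skipping COPY are made durable by
# heap_sync() at commit, but a logged truncate record replayed after a
# crash still wipes them out.

def commit_path(wal, disk):
    wal.append(("create", "t"))
    disk["t"] = []
    wal.append(("truncate", "t", 0))  # TRUNCATE record IS logged
    disk["t"] += ["row1", "row2"]     # COPY: direct writes, no WAL
    # heap_sync() at commit: the rows are durably on disk -- but so is
    # the truncate record above.

def crash_recovery(wal, disk):
    # Redo starts at a checkpoint before the table existed, so every
    # record is replayed against the (durable) on-disk state.
    for rec in wal:
        if rec[0] == "create":
            disk.setdefault(rec[1], [])   # file already exists: no-op
        elif rec[0] == "truncate":
            disk[rec[1]] = disk[rec[1]][:rec[2]]

wal, disk = [], {}
commit_path(wal, disk)
crash_recovery(wal, disk)   # replays the truncate over the COPY'd rows
print(disk["t"])            # [] -- both rows are gone
```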


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 17:14:26
Message-ID: 20150703171426.GA2841@svana.org
Lists: pgsql-hackers

On Fri, Jul 03, 2015 at 12:53:56PM -0400, Tom Lane wrote:
> Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> > Okay, so probably we need to change WAL replay of TRUNCATE so that
> > the index file is truncated to one containing only meta page instead of
> > empty one. That is, the WAL replay of TRUNCATE would need to call
> > index_build() after smgrtruncate() maybe.
>
> That seems completely unworkable. For one thing, index_build would expect
> to be able to do catalog lookups, but we can't assume that the catalogs
> are in a good state yet.
>
> I think the responsibility has to be on the WAL-writing end to emit WAL
> instructions that lead to a correct on-disk state. Putting complex
> behavior into the reading side is fundamentally misguided.

Am I missing something? ISTM that if the truncate record were simply not
logged at all, everything would work fine. The whole point is that the
table was created in this transaction, and so if it exists, the table on
disk must be the correct representation.

The broken index is just one symptom. The heap also shouldn't be
truncated at all. If you insert a row before commit, then after replay
the tuple should still be there.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer


From: Andres Freund <andres(at)anarazel(dot)de>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 17:21:21
Message-ID: 20150703172121.GL3291@awork2.anarazel.de

On 2015-07-03 19:14:26 +0200, Martijn van Oosterhout wrote:
> Am I missing something? ISTM that if the truncate record were simply not
> logged at all, everything would work fine. The whole point is that the
> table was created in this transaction, so if it exists, the table on
> disk must be the correct representation.

That'd not work either. Consider:

BEGIN;
CREATE TABLE ...
INSERT;
TRUNCATE;
INSERT;
COMMIT;

If you replay that without a truncation wal record the second INSERT
will try to add stuff to already occupied space. And they can have
different lengths and stuff, so you cannot just ignore that fact.

> The broken index is just one symptom.

Agreed. I think the problem is something else though. Namely that we
reuse the relfilenode for heap_truncate_one_rel(). That's just entirely
broken afaics. We need to allocate a new relfilenode and write stuff
into that. Then we can forgo WAL logging the truncation record.

> If you insert a row before commit then after replay the tuple should be there still.

The insert would be WAL-logged. COPY skips WAL logging, though.
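To make the failure mode concrete, here is a toy replay model (an editor's illustration in Python; the page/slot layout and record shapes are invented, not PostgreSQL's): with the second INSERT WAL-logged against the same target tid as the first, replay only works if a truncate record sits between them.

```python
# Toy model of WAL replay for the scenario above. Invented record shapes;
# only the control flow mirrors the argument: the second WAL-logged INSERT
# targets the same slot as the first, so replay needs the truncate record.

def replay(records):
    """Replay a list of toy WAL records into {page_no: {slot: value}}."""
    pages = {}
    for rec in records:
        if rec["kind"] == "insert":
            page = pages.setdefault(rec["page"], {})
            if rec["slot"] in page:
                raise RuntimeError("slot already occupied")
            page[rec["slot"]] = rec["value"]
        elif rec["kind"] == "truncate":
            pages.clear()
    return pages

# WAL as actually written: a truncate record between the two inserts.
with_truncate = [
    {"kind": "insert", "page": 0, "slot": 1, "value": "row-a"},
    {"kind": "truncate"},
    {"kind": "insert", "page": 0, "slot": 1, "value": "row-b"},
]
print(replay(with_truncate))  # {0: {1: 'row-b'}}

# With the truncate record omitted, the second insert collides with the
# slot the first insert already filled.
without_truncate = [r for r in with_truncate if r["kind"] != "truncate"]
try:
    replay(without_truncate)
except RuntimeError as exc:
    print("replay failed:", exc)
```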


From: Andres Freund <andres(at)anarazel(dot)de>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 17:26:05
Message-ID: 20150703172605.GM3291@awork2.anarazel.de

On 2015-07-03 19:02:29 +0200, Andres Freund wrote:
> Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm
> right now missing how the whole "skip wal logging if relation has just
> been truncated" optimization can ever actually be crashsafe unless we
> use a new relfilenode (which we don't!).

We actually used to use a different relfilenode, but optimized that
away: cab9a0656c36739f59277b34fea8ab9438395869

commit cab9a0656c36739f59277b34fea8ab9438395869
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: Sun Aug 23 19:23:41 2009 +0000

Make TRUNCATE do truncate-in-place when processing a relation that was created
or previously truncated in the current (sub)transaction. This is safe since
if the (sub)transaction later rolls back, we'd just discard the rel's current
physical file anyway. This avoids unreasonable growth in the number of
transient files when a relation is repeatedly truncated. Per a performance
gripe a couple weeks ago from Todd Cook.

to me the reasoning here looks flawed.


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 19:57:36
Message-ID: 20150703195736.GB2841@svana.org

On Fri, Jul 03, 2015 at 07:21:21PM +0200, Andres Freund wrote:
> On 2015-07-03 19:14:26 +0200, Martijn van Oosterhout wrote:
> > Am I missing something? ISTM that if the truncate record were simply not
> > logged at all, everything would work fine. The whole point is that the
> > table was created in this transaction, so if it exists, the table on
> > disk must be the correct representation.
>
> That'd not work either. Consider:
>
> BEGIN;
> CREATE TABLE ...
> INSERT;
> TRUNCATE;
> INSERT;
> COMMIT;
>
> If you replay that without a truncation wal record the second INSERT
> will try to add stuff to already occupied space. And they can have
> different lengths and stuff, so you cannot just ignore that fact.

I was about to disagree with you by suggesting that if the table was
created in this transaction then WAL logging is skipped. But testing
shows that inserts are indeed logged, as you point out.

With inserts the WAL records look as follows (relfilenodes changed):

martijn(at)martijn-jessie:~/git/ctm/docker$ sudo /usr/lib/postgresql/9.4/bin/pg_xlogdump -p /tmp/pgtest/postgres/pg_xlog/ 000000010000000000000001 |grep -wE '16386|16384|16390'
rmgr: Storage len (rec/tot): 16/ 48, tx: 0, lsn: 0/016A79C8, prev 0/016A79A0, bkp: 0000, desc: file create: base/12139/16384
rmgr: Sequence len (rec/tot): 158/ 190, tx: 683, lsn: 0/016B4258, prev 0/016B2508, bkp: 0000, desc: log: rel 1663/12139/16384
rmgr: Storage len (rec/tot): 16/ 48, tx: 683, lsn: 0/016B4318, prev 0/016B4258, bkp: 0000, desc: file create: base/12139/16386
rmgr: Storage len (rec/tot): 16/ 48, tx: 683, lsn: 0/016B9468, prev 0/016B9418, bkp: 0000, desc: file create: base/12139/16390
rmgr: Sequence len (rec/tot): 158/ 190, tx: 683, lsn: 0/016BC938, prev 0/016BC880, bkp: 0000, desc: log: rel 1663/12139/16384
rmgr: Sequence len (rec/tot): 158/ 190, tx: 683, lsn: 0/016BCAF0, prev 0/016BCAA0, bkp: 0000, desc: log: rel 1663/12139/16384
rmgr: Heap len (rec/tot): 35/ 67, tx: 683, lsn: 0/016BCBB0, prev 0/016BCAF0, bkp: 0000, desc: insert(init): rel 1663/12139/16386; tid 0/1
rmgr: Btree len (rec/tot): 20/ 52, tx: 683, lsn: 0/016BCBF8, prev 0/016BCBB0, bkp: 0000, desc: newroot: rel 1663/12139/16390; root 1 lev 0
rmgr: Btree len (rec/tot): 34/ 66, tx: 683, lsn: 0/016BCC30, prev 0/016BCBF8, bkp: 0000, desc: insert: rel 1663/12139/16390; tid 1/1
rmgr: Storage len (rec/tot): 16/ 48, tx: 683, lsn: 0/016BCC78, prev 0/016BCC30, bkp: 0000, desc: file truncate: base/12139/16386 to 0 blocks
rmgr: Storage len (rec/tot): 16/ 48, tx: 683, lsn: 0/016BCCA8, prev 0/016BCC78, bkp: 0000, desc: file truncate: base/12139/16390 to 0 blocks
rmgr: Heap len (rec/tot): 35/ 67, tx: 683, lsn: 0/016BCCD8, prev 0/016BCCA8, bkp: 0000, desc: insert(init): rel 1663/12139/16386; tid 0/1
rmgr: Btree len (rec/tot): 20/ 52, tx: 683, lsn: 0/016BCD20, prev 0/016BCCD8, bkp: 0000, desc: newroot: rel 1663/12139/16390; root 1 lev 0
rmgr: Btree len (rec/tot): 34/ 66, tx: 683, lsn: 0/016BCD58, prev 0/016BCD20, bkp: 0000, desc: insert: rel 1663/12139/16390; tid 1/1

relname | relfilenode
-------------+-------------
test | 16386
test_id_seq | 16384
test_pkey | 16390
(3 rows)

And amazingly, the database cluster successfully recovers and there's no
error now. So the problem occurs *only* because there is no data in the
table at commit time. Which indicates that it's the 'newroot' record
that normally saves the day. And it's apparently generated by the
first insert.

> Agreed. I think the problem is something else though. Namely that we
> reuse the relfilenode for heap_truncate_one_rel(). That's just entirely
> broken afaics. We need to allocate a new relfilenode and write stuff
> into that. Then we can forgo WAL logging the truncation record.

Would that properly initialise the index though?

Anyway, this is way outside my expertise, so I'll bow out now. Let me
know if I can be of more assistance.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 20:05:17
Message-ID: 15205.1435953917@sss.pgh.pa.us

Martijn van Oosterhout <kleptog(at)svana(dot)org> writes:
> With inserts the WAL records look as follows (relfilenodes changed):
> ...
> And amazingly, the database cluster successfully recovers and there's no
> error now. So the problem occurs *only* because there is no data in the
> table at commit time. Which indicates that it's the 'newroot' record
> that normally saves the day. And it's apparently generated by the
> first insert.

Yeah, because the correct "empty" state of a btree index is to have a
metapage but no root page, so the first insert forces creation of a root
page. And, by chance, btree_xlog_newroot restores the metapage from
scratch, so this works even if the metapage had been missing or corrupt.

However, things would still break if the first access to the index was
a read attempt rather than an insert.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 22:32:16
Message-ID: 20150703223216.GO3291@awork2.anarazel.de

On 2015-07-03 19:26:05 +0200, Andres Freund wrote:
> On 2015-07-03 19:02:29 +0200, Andres Freund wrote:
> > Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm
> > right now missing how the whole "skip wal logging if relation has just
> > been truncated" optimization can ever actually be crashsafe unless we
> > use a new relfilenode (which we don't!).
>
> We actually used to use a different relfilenode, but optimized that
> away: cab9a0656c36739f59277b34fea8ab9438395869
>
> commit cab9a0656c36739f59277b34fea8ab9438395869
> Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> Date: Sun Aug 23 19:23:41 2009 +0000
>
> Make TRUNCATE do truncate-in-place when processing a relation that was created
> or previously truncated in the current (sub)transaction. This is safe since
> if the (sub)transaction later rolls back, we'd just discard the rel's current
> physical file anyway. This avoids unreasonable growth in the number of
> transient files when a relation is repeatedly truncated. Per a performance
> gripe a couple weeks ago from Todd Cook.
>
> to me the reasoning here looks flawed.

It looks to me like we need to renege on this a bit. I think we can still be
more efficient than the general codepath: we can drop the old
relfilenode immediately. But pg_class.relfilenode has to differ from the
old one after the truncation.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 22:38:37
Message-ID: 21533.1435963117@sss.pgh.pa.us

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2015-07-03 19:26:05 +0200, Andres Freund wrote:
>> commit cab9a0656c36739f59277b34fea8ab9438395869
>> Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
>> Date: Sun Aug 23 19:23:41 2009 +0000
>>
>> Make TRUNCATE do truncate-in-place when processing a relation that was created
>> or previously truncated in the current (sub)transaction. This is safe since
>> if the (sub)transaction later rolls back, we'd just discard the rel's current
>> physical file anyway. This avoids unreasonable growth in the number of
>> transient files when a relation is repeatedly truncated. Per a performance
>> gripe a couple weeks ago from Todd Cook.
>>
>> to me the reasoning here looks flawed.

> It looks to me like we need to renege on this a bit. I think we can still be
> more efficient than the general codepath: we can drop the old
> relfilenode immediately. But pg_class.relfilenode has to differ from the
> old one after the truncation.

Why exactly? The first truncation in the (sub)xact would have assigned a
new relfilenode, why do we need another one? The file in question will
go away on crash/rollback in any case, and no other transaction can see
it yet.

I'm prepared to believe that some bit of logic is doing the wrong thing in
this state, but I do not agree that truncate-in-place is unworkable.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-03 23:25:23
Message-ID: 20150703232523.GP3291@awork2.anarazel.de

On 2015-07-03 18:38:37 -0400, Tom Lane wrote:
> Why exactly? The first truncation in the (sub)xact would have assigned a
> new relfilenode, why do we need another one? The file in question will
> go away on crash/rollback in any case, and no other transaction can see
> it yet.

Consider:

BEGIN;
CREATE TABLE;
INSERT largeval;
TRUNCATE;
INSERT 1;
COPY;
INSERT 2;
COMMIT;

INSERT 1 is going to be WAL logged. For that to work correctly TRUNCATE
has to be WAL logged, as otherwise there'll be conflicting/overlapping
tuples on the target page.

But:

The truncation itself is not fully WAL-logged, and neither is the COPY. Both
rely on heap_sync()/immedsync(). For that to be correct, the current
relfilenode's truncation may *not* be WAL-logged, because the contents
of the COPY or the truncation itself will only be on-disk, not in the
WAL.

Only being on-disk but not in the WAL is a problem if we crash and
replay the truncate record.

> I'm prepared to believe that some bit of logic is doing the wrong
> thing in this state, but I do not agree that truncate-in-place is
> unworkable.

Unless we're prepared to make everything that potentially WAL logs
something do the rel->rd_createSubid == mySubid dance, I can't see
that working.


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-06 13:43:15
Message-ID: CAHGQGwGGM2OBh4WOTd8wWqxKzRy+WdJX+-fxxPCWNz4eeJ1aQQ@mail.gmail.com

On Sat, Jul 4, 2015 at 2:26 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2015-07-03 19:02:29 +0200, Andres Freund wrote:
>> Maybe I'm just daft right now (35C outside, 32 inside, so ...), but I'm
>> right now missing how the whole "skip wal logging if relation has just
>> been truncated" optimization can ever actually be crashsafe unless we
>> use a new relfilenode (which we don't!).

Agreed... When I ran the following test scenario, I found that
the loaded data disappeared after the crash recovery.

1. start PostgreSQL server with wal_level = minimal
2. execute the following SQL statements
\copy (SELECT num FROM generate_series(1,10) num) to /tmp/num.csv with csv
BEGIN;
CREATE TABLE test (i int primary key);
TRUNCATE TABLE test;
\copy test from /tmp/num.csv with csv
COMMIT;
SELECT COUNT(*) FROM test; -- returns 10

3. shutdown the server with immediate mode
4. restart the server
5. execute the following SQL statement after crash recovery ends
SELECT COUNT(*) FROM test; -- returns 0..

In #2, 10 rows were copied and the transaction was committed.
The subsequent statement of "select count(*)" obviously returned 10.
However, after crash recovery, in #5, the same statement returned 0.
That is, the 10 loaded (and committed) rows were lost after the crash.

> We actually used to use a different relfilenode, but optimized that
> away: cab9a0656c36739f59277b34fea8ab9438395869
>
> commit cab9a0656c36739f59277b34fea8ab9438395869
> Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
> Date: Sun Aug 23 19:23:41 2009 +0000
>
> Make TRUNCATE do truncate-in-place when processing a relation that was created
> or previously truncated in the current (sub)transaction. This is safe since
> if the (sub)transaction later rolls back, we'd just discard the rel's current
> physical file anyway. This avoids unreasonable growth in the number of
> transient files when a relation is repeatedly truncated. Per a performance
> gripe a couple weeks ago from Todd Cook.
>
> to me the reasoning here looks flawed.

Before this commit, when I ran the above test scenario, no data loss happened.

Regards,

--
Fujii Masao


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-06 15:14:40
Message-ID: 27532.1436195680@sss.pgh.pa.us

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> On Sat, Jul 4, 2015 at 2:26 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> We actually used to use a different relfilenode, but optimized that
>> away: cab9a0656c36739f59277b34fea8ab9438395869
>>
>> commit cab9a0656c36739f59277b34fea8ab9438395869
>> Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
>> Date: Sun Aug 23 19:23:41 2009 +0000
>>
>> Make TRUNCATE do truncate-in-place when processing a relation that was created
>> or previously truncated in the current (sub)transaction. This is safe since
>> if the (sub)transaction later rolls back, we'd just discard the rel's current
>> physical file anyway. This avoids unreasonable growth in the number of
>> transient files when a relation is repeatedly truncated. Per a performance
>> gripe a couple weeks ago from Todd Cook.
>>
>> to me the reasoning here looks flawed.

> Before this commit, when I ran the above test scenario, no data loss happened.

Actually I think what is broken here is COPY's test to decide whether it
can omit writing WAL:

* Check to see if we can avoid writing WAL
*
* If archive logging/streaming is not enabled *and* either
* - table was created in same transaction as this COPY
* - data is being written to relfilenode created in this transaction
* then we can skip writing WAL. It's safe because if the transaction
* doesn't commit, we'll discard the table (or the new relfilenode file).
* If it does commit, we'll have done the heap_sync at the bottom of this
* routine first.

The problem with that analysis is that it supposes that, if we crash and
recover, the WAL replay sequence will not touch the data. What's killing
us in this example is the replay of the TRUNCATE, but that is not the only
possibility. For example consider this modification of Fujii-san's test
case:

BEGIN;
CREATE TABLE test (i int primary key);
INSERT INTO test VALUES(-1);
\copy test from /tmp/num.csv with csv
COMMIT;
SELECT COUNT(*) FROM test;

The COUNT() correctly says 11 rows, but after crash-and-recover,
only the row with -1 is there. This is because the INSERT writes
out an INSERT+INIT WAL record, which we happily replay, clobbering
the data added later by COPY.

We might have to give up on this COPY optimization :-(. I'm not
sure what would be a safe rule for deciding that we can skip WAL
logging in this situation, but I am pretty sure that it would
require keeping information we don't currently keep about what's
happened earlier in the transaction.

regards, tom lane
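The loss Tom describes can be sketched as a toy model (an editor's illustration; the function names and record shapes are invented, not PostgreSQL's): COPY's rows reach disk only via heap_sync(), so when recovery replays the earlier INSERT+INIT record, which rebuilds the page unconditionally, the copied rows vanish.

```python
# Toy model of the INSERT + optimized COPY failure. Invented shapes; the
# point is only that an "init page" record is replayed unconditionally.

def run_transaction():
    """Return (disk_page, wal) as of COMMIT."""
    wal = []
    # INSERT INTO test VALUES(-1): first tuple on a fresh page, logged
    # as an INSERT+INIT record that carries the whole page contents.
    disk_page = [-1]
    wal.append({"kind": "insert+init", "tuples": [-1]})
    # COPY under the wal_level=minimal optimization: ten rows are written
    # to the page and made durable with heap_sync(); nothing goes to WAL.
    disk_page.extend(range(1, 11))
    return disk_page, wal

def crash_recover(disk_page, wal):
    """Replay WAL over what heap_sync() left on disk."""
    page = list(disk_page)
    for rec in wal:
        if rec["kind"] == "insert+init":
            # INIT replay reinitializes the page from the record alone,
            # regardless of what is already there.
            page = list(rec["tuples"])
    return page

disk_page, wal = run_transaction()
print(len(disk_page))                 # 11 rows visible before the crash
print(crash_recover(disk_page, wal))  # [-1]: the COPY'd rows are gone
```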


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-06 15:21:23
Message-ID: 20150706152123.GK8902@alap3.anarazel.de

On 2015-07-06 11:14:40 -0400, Tom Lane wrote:
> BEGIN;
> CREATE TABLE test (i int primary key);
> INSERT INTO test VALUES(-1);
> \copy test from /tmp/num.csv with csv
> COMMIT;
> SELECT COUNT(*) FROM test;
>
> The COUNT() correctly says 11 rows, but after crash-and-recover,
> only the row with -1 is there. This is because the INSERT writes
> out an INSERT+INIT WAL record, which we happily replay, clobbering
> the data added later by COPY.

ISTM any WAL logged action that touches a relfilenode essentially needs
to disable further optimization based on the knowledge that the relation
is new.

> We might have to give up on this COPY optimization :-(.

A crazy, not well thought through bandaid for the INSERT+INIT case would
be to force COPY to use a new page when using the SKIP_WAL codepath.

> I'm not sure what would be a safe rule for deciding that we can skip
> WAL logging in this situation, but I am pretty sure that it would
> require keeping information we don't currently keep about what's
> happened earlier in the transaction.

It'd not be impossible to add more state to the relcache entry for the
relation. Whether it's likely that we'd find all the places that'd need
updating that state, I'm not sure.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-06 15:49:54
Message-ID: 28415.1436197794@sss.pgh.pa.us

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2015-07-06 11:14:40 -0400, Tom Lane wrote:
>> The COUNT() correctly says 11 rows, but after crash-and-recover,
>> only the row with -1 is there. This is because the INSERT writes
>> out an INSERT+INIT WAL record, which we happily replay, clobbering
>> the data added later by COPY.

> ISTM any WAL logged action that touches a relfilenode essentially needs
> to disable further optimization based on the knowledge that the relation
> is new.

After a bit more thought, I think it's not so much "any WAL logged action"
as "any unconditionally-replayed action". INSERT+INIT breaks this
example because heap_xlog_insert will unconditionally replay the action,
even if the page is valid and has the same or a newer LSN. Similarly, TRUNCATE
is problematic because we redo it unconditionally (and in that case it's
hard to see an alternative).

> It'd not be impossible to add more state to the relcache entry for the
> relation. Whether it's likely that we'd find all the places that'd need
> updating that state, I'm not sure.

Yeah, the sticking point is mainly being sure that the state is correctly
tracked, both now and after future changes. We'd need to identify a state
invariant that we could be pretty confident we'd not break.

One idea I had was to allow the COPY optimization only if the heap file is
physically zero-length at the time the COPY starts. That would still be
able to optimize in all the cases we care about making COPY fast for.
Rather than reverting cab9a0656c36739f, which would re-introduce a
different performance problem, perhaps we could have COPY create a new
relfilenode when it does this. That should be safe if the table was
previously empty.

regards, tom lane
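Tom's proposed rule reduces to a small predicate. The sketch below is an editor's illustration with invented names and simplified arguments, not PostgreSQL code:

```python
def copy_can_skip_wal(wal_level_minimal, created_in_this_xact, heap_file_bytes):
    """Sketch of the tightened rule: in addition to today's conditions,
    only skip WAL when the heap file is physically empty at COPY start."""
    return wal_level_minimal and created_in_this_xact and heap_file_bytes == 0

# Fujii-san's CREATE + TRUNCATE + COPY case: the file is zero length
# after TRUNCATE, so the fast path still applies.
print(copy_can_skip_wal(True, True, 0))     # True

# Tom's INSERT-before-COPY variant: the page holding -1 makes the file
# non-empty, so COPY falls back to WAL logging and nothing is clobbered.
print(copy_can_skip_wal(True, True, 8192))  # False
```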


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-09 16:52:36
Message-ID: CAHGQGwHGMeqA7PbRj9e-cgA7-Sy09c+Ysyy=6Ts=vLOynpU2Hg@mail.gmail.com

On Tue, Jul 7, 2015 at 12:49 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
>> On 2015-07-06 11:14:40 -0400, Tom Lane wrote:
>>> The COUNT() correctly says 11 rows, but after crash-and-recover,
>>> only the row with -1 is there. This is because the INSERT writes
>>> out an INSERT+INIT WAL record, which we happily replay, clobbering
>>> the data added later by COPY.
>
>> ISTM any WAL logged action that touches a relfilenode essentially needs
>> to disable further optimization based on the knowledge that the relation
>> is new.
>
> After a bit more thought, I think it's not so much "any WAL logged action"
> as "any unconditionally-replayed action". INSERT+INIT breaks this
> example because heap_xlog_insert will unconditionally replay the action,
> even if the page is valid and has the same or a newer LSN. Similarly, TRUNCATE
> is problematic because we redo it unconditionally (and in that case it's
> hard to see an alternative).
>
>> It'd not be impossible to add more state to the relcache entry for the
>> relation. Whether it's likely that we'd find all the places that'd need
>> updating that state, I'm not sure.
>
> Yeah, the sticking point is mainly being sure that the state is correctly
> tracked, both now and after future changes. We'd need to identify a state
> invariant that we could be pretty confident we'd not break.
>
> One idea I had was to allow the COPY optimization only if the heap file is
> physically zero-length at the time the COPY starts.

This seems not helpful for the case where TRUNCATE is executed
before COPY. No?

> That would still be
> able to optimize in all the cases we care about making COPY fast for.
> Rather than reverting cab9a0656c36739f, which would re-introduce a
> different performance problem, perhaps we could have COPY create a new
> relfilenode when it does this. That should be safe if the table was
> previously empty.

So, if COPY is executed multiple times in the same transaction,
only the first COPY can be optimized?

After second thought, I'm thinking that we can safely optimize the
COPY if no problematic WAL records like INSERT+INIT or TRUNCATE
have been generated since the current REDO location, or if the table
was created in the same transaction. That is, even if INSERT or
TRUNCATE is executed after the table creation, if a CHECKPOINT happens
subsequently, we don't need to log the COPY. The subsequent crash
recovery will not replay such problematic WAL records. So the example
cases where we can optimize COPY are:

BEGIN
CREATE TABLE
COPY
COPY -- subsequent COPY also can be optimized

BEGIN
CREATE TABLE
TRUNCATE
CHECKPOINT
COPY

BEGIN
CREATE TABLE
INSERT
CHECKPOINT
COPY

Crash recovery can start from the previous REDO location (i.e., the REDO
location of the last checkpoint record). So we might need to check
whether such problematic WAL records have been generated since the
previous REDO location instead of the current one.

Regards,

--
Fujii Masao
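Fujii-san's checkpoint-based rule can likewise be sketched as a predicate (an editor's illustration; LSNs are simplified to integers and all names are invented):

```python
def copy_can_skip_wal(problematic_record_lsns, redo_lsn):
    """Skip WAL for COPY only if every problematic record for the table
    (INSERT+INIT, TRUNCATE, ...) lies at or before the REDO point that
    crash recovery would start from, so it can never be replayed."""
    return all(lsn <= redo_lsn for lsn in problematic_record_lsns)

# CREATE TABLE, TRUNCATE logged at LSN 100, then CHECKPOINT moves the
# REDO point to 200: the later COPY may skip WAL.
print(copy_can_skip_wal([100], 200))  # True

# TRUNCATE at LSN 300 with the REDO point still at 200: recovery could
# replay the truncate, so the COPY must be WAL-logged.
print(copy_can_skip_wal([300], 200))  # False
```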


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-09 17:27:43
Message-ID: 22784.1436462863@sss.pgh.pa.us

Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
> On Tue, Jul 7, 2015 at 12:49 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> One idea I had was to allow the COPY optimization only if the heap file is
>> physically zero-length at the time the COPY starts.

> This seems not helpful for the case where TRUNCATE is executed
> before COPY. No?

Huh? The heap file would be zero length in that case.

> So, if COPY is executed multiple times in the same transaction,
> only the first COPY can be optimized?

This is true, and I don't think we should care, especially not if we're
going to take risks of incorrect behavior in order to optimize that
third-order case. The fact that we're dealing with this bug at all should
remind us that this stuff is harder than it looks. I want a simple,
reliable, back-patchable fix, and I do not believe that what you are
suggesting would be any of those.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-09 18:23:15
Message-ID: 20150709182315.GG10242@alap3.anarazel.de

On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
> One idea I had was to allow the COPY optimization only if the heap file is
> physically zero-length at the time the COPY starts. That would still be
> able to optimize in all the cases we care about making COPY fast for.
> Rather than reverting cab9a0656c36739f, which would re-introduce a
> different performance problem, perhaps we could have COPY create a new
> relfilenode when it does this. That should be safe if the table was
> previously empty.

I'm not convinced that cab9a0656c36739f needs to survive in that
form. To me only allowing one COPY to benefit from the wal_level =
minimal optimization has a significantly higher cost than
cab9a0656c36739f.

My tentative guess is that the best course is to

a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
truncation replay issue.

b) Force new pages to be used when using the heap_sync mode in
COPY. That avoids the INIT danger you found. It seems rather
reasonable to avoid using pages that have already been the target of
WAL logging here in general.

Andres


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-09 23:06:11
Message-ID: 29916.1436483171@sss.pgh.pa.us

Andres Freund <andres(at)anarazel(dot)de> writes:
> On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
>> Rather than reverting cab9a0656c36739f, which would re-introduce a
>> different performance problem, perhaps we could have COPY create a new
>> relfilenode when it does this. That should be safe if the table was
>> previously empty.

> I'm not convinced that cab9a0656c36739f needs to survive in that
> form. To me only allowing one COPY to benefit from the wal_level =
> minimal optimization has a significantly higher cost than
> cab9a0656c36739f.

What evidence have you got to base that value judgement on?

cab9a0656c36739f was based on an actual user complaint, so we have good
evidence that there are people out there who care about the cost of
truncating a table many times in one transaction. On the other hand,
I know of no evidence that anyone's depending on multiple sequential
COPYs, nor intermixed COPY and INSERT, to be fast. The original argument
for having this COPY optimization at all was to make restoring pg_dump
scripts in a single transaction fast; and that use-case doesn't care
about anything but a single COPY into a virgin table.

I think you're worrying about exactly the wrong case.

> My tentative guess is that the best course is to
> a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
> truncation replay issue.
> b) Force new pages to be used when using the heap_sync mode in
> COPY. That avoids the INIT danger you found. It seems rather
> reasonable to avoid using pages that have already been the target of
> WAL logging here in general.

And what reason is there to think that this would fix all the problems?
We know of those two, but we've not exactly looked hard for other cases.
Again, the only known field usage for the COPY optimization is the pg_dump
scenario; were that not so, we'd have noticed the problem long since.
So I don't have any faith that this is a well-tested area.

regards, tom lane


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-10 08:50:33
Message-ID: 559F8759.2090401@iki.fi
Lists: pgsql-hackers

On 07/10/2015 02:06 AM, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
>> On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
>>> Rather than reverting cab9a0656c36739f, which would re-introduce a
>>> different performance problem, perhaps we could have COPY create a new
>>> relfilenode when it does this. That should be safe if the table was
>>> previously empty.
>
>> I'm not convinced that cab9a0656c36739f needs to survive in that
>> form. To me only allowing one COPY to benefit from the wal_level =
>> minimal optimization has a significantly higher cost than
>> cab9a0656c36739f.
>
> What evidence have you got to base that value judgement on?
>
> cab9a0656c36739f was based on an actual user complaint, so we have good
> evidence that there are people out there who care about the cost of
> truncating a table many times in one transaction.

Yeah, if we specifically made that case cheap, in response to a
complaint, it would be a regression to make it expensive again. We might
get away with it in a major version, but we would hate to backpatch that.

>> My tentative guess is that the best course is to
>> a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
>> truncation replay issue.
>> b) Force new pages to be used when using the heap_sync mode in
>> COPY. That avoids the INIT danger you found. It seems rather
>> reasonable to avoid using pages that have already been the target of
>> WAL logging here in general.
>
> And what reason is there to think that this would fix all the problems?
> We know of those two, but we've not exactly looked hard for other cases.

Hmm. Perhaps that could be made to work, but it feels pretty fragile.
For example, you could have an insert trigger on the table that inserts
additional rows to the same table, and those inserts would be intermixed
with the rows inserted by COPY. You'll have to avoid that somehow.
Full-page images in general are a problem. If a checkpoint happens, and
a trigger modifies the page we're COPYing to in any way, you have the
same issue. Even reading a page can cause a full-page image of it to be
written: If you update a hint bit on the page while reading it, and
checksums are enabled, and a checkpoint happened since the page was last
updated, bang. I don't think that's a problem in this case because there
are no hint bits to be set on pages that we're COPYing to, but it's a
whole new subtle assumption.

I think we should
1. reliably and explicitly keep track of whether we've WAL-logged any
TRUNCATE, INSERT/UPDATE+INIT, or any other full-page-logging operations
on the relation, and
2. make sure we never skip WAL-logging again if we have.

Let's add a flag, rd_skip_wal_safe, to RelationData that's initially set
when a new relfilenode is created, i.e. whenever rd_createSubid or
rd_newRelfilenodeSubid is set. Whenever a TRUNCATE or a full-page image
(including INSERT/UPDATE+INIT) is WAL-logged, clear the flag. In copy.c,
only skip WAL-logging if the flag is still set. To deal with the case
that the flag gets cleared in the middle of COPY, also check the flag
whenever we're about to skip WAL-logging in heap_insert, and if it's
been cleared, ignore the HEAP_INSERT_SKIP_WAL option and WAL-log anyway.

Compared to the status quo, that disables the WAL-skipping optimization
in the scenario where you CREATE, INSERT, then COPY to a table in the
same transaction. I think that's acceptable.

(Alternatively, to handle the case that the flag gets cleared in the
middle of COPY, add another flag to RelationData indicating that a
WAL-skipping COPY is in-progress, and refrain from WAL-logging any
FPW-writing operations on the table when it's set (or any operations
whatsoever). That'd be more efficient, but it's such a rare corner case
that it hardly matters.)
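[Editor's note: the flag bookkeeping proposed above can be sketched as a toy model. This is illustrative Python, not PostgreSQL internals; the names mirror the proposal (rd_skip_wal_safe, HEAP_INSERT_SKIP_WAL) but the structure is a simplification.]

```python
# Toy model of the proposed rd_skip_wal_safe bookkeeping (illustrative,
# not real PostgreSQL code): the flag is set when a relation gets a new
# relfilenode, cleared by any WAL-logged truncate or full-page init,
# and re-checked per insert in case it was cleared mid-COPY.

class Relation:
    def __init__(self):
        # Set whenever rd_createSubid / rd_newRelfilenodeSubid would be set,
        # i.e. the relfilenode is new in this transaction.
        self.rd_skip_wal_safe = True

    def log_truncate_or_fpi(self):
        # Any XLOG_SMGR_TRUNCATE or INSERT/UPDATE+INIT record clears it.
        self.rd_skip_wal_safe = False

def heap_insert(rel, skip_wal_requested):
    """Return True if this insert must write a WAL record."""
    # Even if COPY asked to skip WAL, re-check the flag: a trigger may
    # have cleared it in the middle of the COPY.
    return not (skip_wal_requested and rel.rd_skip_wal_safe)

rel = Relation()
assert heap_insert(rel, skip_wal_requested=True) is False  # WAL skipped
rel.log_truncate_or_fpi()  # e.g. a trigger WAL-logged a full-page init
assert heap_insert(rel, skip_wal_requested=True) is True   # logged again
```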

- Heikki


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-10 08:59:32
Message-ID: 20150710085932.GJ10242@alap3.anarazel.de
Lists: pgsql-hackers

On 2015-07-09 19:06:11 -0400, Tom Lane wrote:
> What evidence have you got to base that value judgement on?
>
> cab9a0656c36739f was based on an actual user complaint, so we have good
> evidence that there are people out there who care about the cost of
> truncating a table many times in one transaction. On the other hand,
> I know of no evidence that anyone's depending on multiple sequential
> COPYs, nor intermixed COPY and INSERT, to be fast. The original argument
> for having this COPY optimization at all was to make restoring pg_dump
> scripts in a single transaction fast; and that use-case doesn't care
> about anything but a single COPY into a virgin table.

Well, you'll hardly have heard complaints about COPY, given that we
have behaved like this for a long while.

I definitely know of ETL-like processes that have relied on subsequent
COPYs into truncated relations being cheaper. I can't remember the same
for intermixed COPY and INSERT, but it wouldn't surprise me if somebody
mixed COPY and UPDATEs rather freely for ETL.

> I think you're worrying about exactly the wrong case.
>
> > My tentative guess is that the best course is to
> > a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
> > truncation replay issue.
> > b) Force new pages to be used when using the heap_sync mode in
> > COPY. That avoids the INIT danger you found. It seems rather
> > reasonable to avoid using pages that have already been the target of
> > WAL logging here in general.
>
> And what reason is there to think that this would fix all the
> problems?

Yea, that's the big problem.

> Again, the only known field usage for the COPY optimization is the pg_dump
> scenario; were that not so, we'd have noticed the problem long since.
> So I don't have any faith that this is a well-tested area.

You need to crash at the right moment. I don't think that's all that
frequently exercised...


From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-10 09:14:20
Message-ID: 20150710091420.GK340@alap3.anarazel.de
Lists: pgsql-hackers

On 2015-07-10 11:50:33 +0300, Heikki Linnakangas wrote:
> On 07/10/2015 02:06 AM, Tom Lane wrote:
> >Andres Freund <andres(at)anarazel(dot)de> writes:
> >>On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
> >>>Rather than reverting cab9a0656c36739f, which would re-introduce a
> >>>different performance problem, perhaps we could have COPY create a new
> >>>relfilenode when it does this. That should be safe if the table was
> >>>previously empty.
> >
> >>I'm not convinced that cab9a0656c36739f needs to survive in that
> >>form. To me only allowing one COPY to benefit from the wal_level =
> >>minimal optimization has a significantly higher cost than
> >>cab9a0656c36739f.
> >
> >What evidence have you got to base that value judgement on?
> >
> >cab9a0656c36739f was based on an actual user complaint, so we have good
> >evidence that there are people out there who care about the cost of
> >truncating a table many times in one transaction.
>
> Yeah, if we specifically made that case cheap, in response to a complaint,
> it would be a regression to make it expensive again. We might get away with
> it in a major version, but would hate to backpatch that.

Sure. But making COPY slower would also be a regression, of a
longer-standing behaviour, and with massively bigger impact if somebody
relies on it. I mean, creating a new relfilenode involves a couple of
heap and storage operations. Missing the skip-WAL optimization can
easily double or triple COPY durations.

I generally find it very dubious to re-use a relfilenode after a
truncation. I bet most hackers never knew we did that, and the rest
have probably forgotten it.

We can still retain a portion of the optimizations from cab9a0656c36739f
- there's no need to keep the old relfilenode's contents around after
all.

> >>My tentative guess is that the best course is to
> >>a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
> >> truncation replay issue.
> >>b) Force new pages to be used when using the heap_sync mode in
> >> COPY. That avoids the INIT danger you found. It seems rather
> >> reasonable to avoid using pages that have already been the target of
> >> WAL logging here in general.
> >
> >And what reason is there to think that this would fix all the problems?
> >We know of those two, but we've not exactly looked hard for other cases.
>
> Hmm. Perhaps that could be made to work, but it feels pretty fragile.

It does. I'm not very happy about this mess.

> For
> example, you could have an insert trigger on the table that inserts
> additional rows to the same table, and those inserts would be intermixed
> with the rows inserted by COPY.

That should be fine? As long as COPY only uses new pages, INSERT can
use the same ones without problem. I think...

> Full-page images in general are a problem.

With the above rules I don't think it'd be. They'd contain the previous
contents, and we'll not target them again with COPY.

> I think we should
> 1. reliably and explicitly keep track of whether we've WAL-logged any
> TRUNCATE, INSERT/UPDATE+INIT, or any other full-page-logging operations on
> the relation, and
> 2. make sure we never skip WAL-logging again if we have.
>
> Let's add a flag, rd_skip_wal_safe, to RelationData that's initially set
> when a new relfilenode is created, i.e. whenever rd_createSubid or
> rd_newRelfilenodeSubid is set. Whenever a TRUNCATE or a full-page image
> (including INSERT/UPDATE+INIT) is WAL-logged, clear the flag. In copy.c,
> only skip WAL-logging if the flag is still set. To deal with the case that
> the flag gets cleared in the middle of COPY, also check the flag whenever
> we're about to skip WAL-logging in heap_insert, and if it's been cleared,
> ignore the HEAP_INSERT_SKIP_WAL option and WAL-log anyway.

Am I missing something or will this break the BEGIN; TRUNCATE; COPY;
pattern we use ourselves and have suggested a number of times ?

Andres


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-10 10:23:28
Message-ID: CAHGQGwEuT9p0KzZ=sbo1QJUo6-a-RgaTvpiCNBgZ0cKN9Gk7mQ@mail.gmail.com
Lists: pgsql-hackers

On Fri, Jul 10, 2015 at 2:27 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Fujii Masao <masao(dot)fujii(at)gmail(dot)com> writes:
>> On Tue, Jul 7, 2015 at 12:49 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>> One idea I had was to allow the COPY optimization only if the heap file is
>>> physically zero-length at the time the COPY starts.
>
>> This seems not helpful for the case where TRUNCATE is executed
>> before COPY. No?
>
> Huh? The heap file would be zero length in that case.
>
>> So, if COPY is executed multiple times at the same transaction,
>> only first COPY can be optimized?
>
> This is true, and I don't think we should care, especially not if we're
> going to take risks of incorrect behavior in order to optimize that
> third-order case. The fact that we're dealing with this bug at all should
> remind us that this stuff is harder than it looks. I want a simple,
> reliable, back-patchable fix, and I do not believe that what you are
> suggesting would be any of those.

Maybe I'm missing something, but I'm starting to wonder why TRUNCATE
and INSERT (or indeed all the operations on a table created in the
current transaction) need to be WAL-logged while COPY can be
optimized. If no WAL records are generated on that table, the problem
we're talking about seems not to occur. This also seems safe and
doesn't degrade the performance of data loading. Thoughts?

Regards,

--
Fujii Masao


From: Andres Freund <andres(at)anarazel(dot)de>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-10 10:29:02
Message-ID: 20150710102902.GL340@alap3.anarazel.de
Lists: pgsql-hackers

On 2015-07-10 19:23:28 +0900, Fujii Masao wrote:
> Maybe I'm missing something. But I start wondering why TRUNCATE
> and INSERT (or even all the operations on the table created at
> the current transaction) need to be WAL-logged while COPY can be
> optimized. If no WAL records are generated on that table, the problem
> we're talking about seems not to occur. Also this seems safe and
> doesn't degrade the performance of data loading. Thought?

Skipping WAL logging means that you need to scan through the whole of
shared buffers to write out dirty buffers and fsync the segments. A
single insert WAL record is a couple of orders of magnitude cheaper than
that. Essentially, doing this just for COPY is a heuristic.


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-10 10:38:50
Message-ID: 559FA0BA.3080808@iki.fi
Lists: pgsql-hackers

On 07/10/2015 12:14 PM, Andres Freund wrote:
> On 2015-07-10 11:50:33 +0300, Heikki Linnakangas wrote:
>> On 07/10/2015 02:06 AM, Tom Lane wrote:
>>> cab9a0656c36739f was based on an actual user complaint, so we have good
>>> evidence that there are people out there who care about the cost of
>>> truncating a table many times in one transaction.
>>
>> Yeah, if we specifically made that case cheap, in response to a complaint,
>> it would be a regression to make it expensive again. We might get away with
>> it in a major version, but would hate to backpatch that.
>
> Sure. But making COPY slower would also be one. Of a longer standing
> behaviour, with massively bigger impact if somebody relies on it? I mean
> a new relfilenode includes a couple heap and storage options. Missing
> the skip wal optimization can easily double or triple COPY durations.

Completely disabling the skip-WAL optimization is not acceptable either,
IMO. It's a false dichotomy that we have to choose between those two
options. We'll have to consider the exact scenarios where we'd have to
disable the optimization vs. using a new relfilenode.

>>>> My tentative guess is that the best course is to
>>>> a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
>>>> truncation replay issue.
>>>> b) Force new pages to be used when using the heap_sync mode in
>>>> COPY. That avoids the INIT danger you found. It seems rather
>>>> reasonable to avoid using pages that have already been the target of
>>>> WAL logging here in general.
>>>
>>> And what reason is there to think that this would fix all the problems?
>>> We know of those two, but we've not exactly looked hard for other cases.
>>
>> Hmm. Perhaps that could be made to work, but it feels pretty fragile.
>
> It does. I'm not very happy about this mess.
>
>> For
>> example, you could have an insert trigger on the table that inserts
>> additional rows to the same table, and those inserts would be intermixed
>> with the rows inserted by COPY.
>
> That should be fine? As long as copy only uses new pages INSERT can use
> the same ones without problem. I think...
>
>> Full-page images in general are a problem.
>
> With the above rules I don't think it'd be. They'd contain the previous
> contents, and we'll not target them again with COPY.

Well, you really have to ensure that COPY never uses a page that any
other operation (INSERT, DELETE, UPDATE, hint-bit-update) has ever
touched and created a FPW for. The naive approach, where you just reset
the target block at beginning of COPY and use the HEAP_INSERT_SKIP_FSM
option is not enough. It's possible, but requires a lot more bookkeeping
than might seem at first glance.

>> I think we should
>> 1. reliably and explicitly keep track of whether we've WAL-logged any
>> TRUNCATE, INSERT/UPDATE+INIT, or any other full-page-logging operations on
>> the relation, and
>> 2. make sure we never skip WAL-logging again if we have.
>>
>> Let's add a flag, rd_skip_wal_safe, to RelationData that's initially set
>> when a new relfilenode is created, i.e. whenever rd_createSubid or
>> rd_newRelfilenodeSubid is set. Whenever a TRUNCATE or a full-page image
>> (including INSERT/UPDATE+INIT) is WAL-logged, clear the flag. In copy.c,
>> only skip WAL-logging if the flag is still set. To deal with the case that
>> the flag gets cleared in the middle of COPY, also check the flag whenever
>> we're about to skip WAL-logging in heap_insert, and if it's been cleared,
>> ignore the HEAP_INSERT_SKIP_WAL option and WAL-log anyway.
>
> Am I missing something or will this break the BEGIN; TRUNCATE; COPY;
> pattern we use ourselves and have suggested a number of times ?

Sorry, I was imprecise above. I meant "whenever an XLOG_SMGR_TRUNCATE
record is WAL-logged", rather than a "whenever a TRUNCATE [command] is
WAL-logged". TRUNCATE on a table that wasn't created in the same
transaction doesn't emit an XLOG_SMGR_TRUNCATE record, because it
creates a whole new relfilenode. So that's OK.

In the long-term, I'd like to refactor this whole thing so that we never
WAL-log any operations on a relation that's created in the same
transaction (when wal_level=minimal). Instead, at COMMIT, we'd fsync()
the relation, or if it's smaller than some threshold, WAL-log the
contents of the whole file at that point. That would move all that
more-difficult-than-it-seems-at-first-glance logic from COPY and
indexam's to a central location, and it would allow the same
optimization for all operations, not just COPY. But that probably isn't
feasible to backpatch.
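[Editor's note: the commit-time decision Heikki sketches above can be illustrated as follows. This is a toy model, not the real smgr/xact machinery; the 64-block threshold and the function names are made up for illustration.]

```python
# Sketch of the commit-time durability choice for a relation created in
# the same transaction (wal_level = minimal): WAL-log the whole file if
# it is small, otherwise fsync it. Threshold is a hypothetical value.

BLOCK_SIZE = 8192
WAL_LOG_THRESHOLD = 64 * BLOCK_SIZE  # assumption: 64 blocks

def at_commit(rel_size_bytes):
    """Decide how to make a newly-created relation durable at COMMIT."""
    if rel_size_bytes <= WAL_LOG_THRESHOLD:
        # Cheap for small relations, and avoids an fsync stall.
        return "wal-log whole file"
    # For big relations, writing the contents into WAL would double
    # the I/O; fsync the relation files instead.
    return "fsync relation"

assert at_commit(BLOCK_SIZE) == "wal-log whole file"
assert at_commit(10 * 1024 * 1024) == "fsync relation"
```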

- Heikki


From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-10 10:44:25
Message-ID: 20150710104425.GC26521@alap3.anarazel.de
Lists: pgsql-hackers

On 2015-07-10 13:38:50 +0300, Heikki Linnakangas wrote:
> In the long-term, I'd like to refactor this whole thing so that we never
> WAL-log any operations on a relation that's created in the same transaction
> (when wal_level=minimal). Instead, at COMMIT, we'd fsync() the relation, or
> if it's smaller than some threshold, WAL-log the contents of the whole file
> at that point. That would move all that
> more-difficult-than-it-seems-at-first-glance logic from COPY and indexam's
> to a central location, and it would allow the same optimization for all
> operations, not just COPY. But that probably isn't feasible to backpatch.

I don't think that's really realistic until we have a buffer manager
that lets you efficiently scan for all pages of a relation :(


From: "Todd A(dot) Cook" <tcook(at)blackducksoftware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-21 18:24:47
Message-ID: 55AE8E6F.7030504@blackducksoftware.com
Lists: pgsql-hackers

Hi,

This thread seemed to trail off without a resolution. Was anything done?
(See more below.)

On 07/09/15 19:06, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
>> On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
>>> Rather than reverting cab9a0656c36739f, which would re-introduce a
>>> different performance problem, perhaps we could have COPY create a new
>>> relfilenode when it does this. That should be safe if the table was
>>> previously empty.
>
>> I'm not convinced that cab9a0656c36739f needs to survive in that
>> form. To me only allowing one COPY to benefit from the wal_level =
>> minimal optimization has a significantly higher cost than
>> cab9a0656c36739f.
>
> What evidence have you got to base that value judgement on?
>
> cab9a0656c36739f was based on an actual user complaint, so we have good
> evidence that there are people out there who care about the cost of
> truncating a table many times in one transaction.

I'm the complainer mentioned in the cab9a0656c36739f commit message. :)

FWIW, we use a temp table to split a join across 4 largish tables
(10^8 rows or more each) and 2 small tables (10^6 rows each). We
write the results of joining the 2 largest tables into the temp
table, and then join that to the other 4. This gave significant
performance benefits because the planner would know the exact row
count of the 2-way join heading into the 4-way join. After commit
cab9a0656c36739f, we got another noticeable performance improvement
(I did timings before and after, but I can't seem to put my hands
on the numbers right now).

We do millions of these queries every day in batches. Each batch
reuses a single temp table (truncating it before each pair of joins)
so as to reduce the churn in the system catalogs. In case it matters,
the temp table is created with ON COMMIT DROP.

This was (and still is) done on 9.2.x.

HTH.

-- todd cook
-- tcook(at)blackducksoftware(dot)com

> On the other hand,
> I know of no evidence that anyone's depending on multiple sequential
> COPYs, nor intermixed COPY and INSERT, to be fast. The original argument
> for having this COPY optimization at all was to make restoring pg_dump
> scripts in a single transaction fast; and that use-case doesn't care
> about anything but a single COPY into a virgin table.
>
> I think you're worrying about exactly the wrong case.
>
>> My tentative guess is that the best course is to
>> a) Make heap_truncate_one_rel() create a new relfilenode. That fixes the
>> truncation replay issue.
>> b) Force new pages to be used when using the heap_sync mode in
>> COPY. That avoids the INIT danger you found. It seems rather
>> reasonable to avoid using pages that have already been the target of
>> WAL logging here in general.
>
> And what reason is there to think that this would fix all the problems?
> We know of those two, but we've not exactly looked hard for other cases.
> Again, the only known field usage for the COPY optimization is the pg_dump
> scenario; were that not so, we'd have noticed the problem long since.
> So I don't have any faith that this is a well-tested area.
>
> regards, tom lane
>
>


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: "Todd A(dot) Cook" <tcook(at)blackducksoftware(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-21 19:37:41
Message-ID: 20150721193741.GC30213@svana.org
Lists: pgsql-hackers

On Tue, Jul 21, 2015 at 02:24:47PM -0400, Todd A. Cook wrote:
> Hi,
>
> This thread seemed to trail off without a resolution. Was anything done?

Not that I can tell. I was the original poster of this thread. We've
worked around the issue by placing a CHECKPOINT command at the end of
the migration script. For us it's not a performance issue, more a
correctness one: tables were empty when they shouldn't have been.

I'm hoping a fix will appear in the 9.5 release, since we're intending
to release with that version. A forced checkpoint every now and then
probably won't be a serious problem though.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.
-- Arthur Schopenhauer


From: Andres Freund <andres(at)anarazel(dot)de>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: "Todd A(dot) Cook" <tcook(at)blackducksoftware(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-22 06:45:47
Message-ID: 20150722064547.GB5053@awork2.anarazel.de
Lists: pgsql-hackers

On 2015-07-21 21:37:41 +0200, Martijn van Oosterhout wrote:
> On Tue, Jul 21, 2015 at 02:24:47PM -0400, Todd A. Cook wrote:
> > Hi,
> >
> > This thread seemed to trail off without a resolution. Was anything done?
>
> Not that I can tell.

Heikki and I had some in-person conversation about it at a conference,
but we didn't really find anything we both liked...

> I was the original poster of this thread. We've
> worked around the issue by placing a CHECKPOINT command at the end of
> the migration script. For us it's not a performance issue, more a
> correctness one, tables were empty when they shouldn't have been.

If it's just correctness, you could just use wal_level = archive.

> I'm hoping a fix will appear in the 9.5 release, since we're intending
> to release with that version. A forced checkpoint every now and then
> probably won't be a serious problem though.

We're imo going to have to fix this in the back branches.

Andres


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-22 08:18:07
Message-ID: CANP8+jKN4V4MJEzFN_iEtdZ+1oM=YETxvmuu1YK4UMXQY2gaGw@mail.gmail.com
Lists: pgsql-hackers

On 10 July 2015 at 00:06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
> >> Rather than reverting cab9a0656c36739f, which would re-introduce a
> >> different performance problem, perhaps we could have COPY create a new
> >> relfilenode when it does this. That should be safe if the table was
> >> previously empty.
>
> > I'm not convinced that cab9a0656c36739f needs to survive in that
> > form. To me only allowing one COPY to benefit from the wal_level =
> > minimal optimization has a significantly higher cost than
> > cab9a0656c36739f.
>
> What evidence have you got to base that value judgement on?
>
> cab9a0656c36739f was based on an actual user complaint, so we have good
> evidence that there are people out there who care about the cost of
> truncating a table many times in one transaction. On the other hand,
> I know of no evidence that anyone's depending on multiple sequential
> COPYs, nor intermixed COPY and INSERT, to be fast. The original argument
> for having this COPY optimization at all was to make restoring pg_dump
> scripts in a single transaction fast; and that use-case doesn't care
> about anything but a single COPY into a virgin table.
>

We have to backpatch this fix, so it must be both simple and effective.

Heikki's suggestions may be best, maybe not, but they don't seem
backpatchable.

Tom's suggestion looks good. So does Andres' suggestion. I have coded both.

> And what reason is there to think that this would fix all the problems?

I don't think either suggested fix could be claimed to be a great
solution, since there is little principle here, only heuristics.
Heikki's solution would be the only safe way, but it is not
backpatchable.

Forcing SKIP_FSM to always extend has no negative side effects in other
code paths, AFAICS.

Patches attached. Martijn, please verify.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
fix_wal_logging_copy_truncate.v1.patch application/octet-stream 4.3 KB
fix_copy_zero_blocks.v1.patch application/octet-stream 784 bytes

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-22 16:21:22
Message-ID: 55AFC302.1060805@iki.fi
Lists: pgsql-hackers

On 07/22/2015 11:18 AM, Simon Riggs wrote:
> On 10 July 2015 at 00:06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>
>> Andres Freund <andres(at)anarazel(dot)de> writes:
>>> On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
>>>> Rather than reverting cab9a0656c36739f, which would re-introduce a
>>>> different performance problem, perhaps we could have COPY create a new
>>>> relfilenode when it does this. That should be safe if the table was
>>>> previously empty.
>>
>>> I'm not convinced that cab9a0656c36739f needs to survive in that
>>> form. To me only allowing one COPY to benefit from the wal_level =
>>> minimal optimization has a significantly higher cost than
>>> cab9a0656c36739f.
>>
>> What evidence have you got to base that value judgement on?
>>
>> cab9a0656c36739f was based on an actual user complaint, so we have good
>> evidence that there are people out there who care about the cost of
>> truncating a table many times in one transaction. On the other hand,
>> I know of no evidence that anyone's depending on multiple sequential
>> COPYs, nor intermixed COPY and INSERT, to be fast. The original argument
>> for having this COPY optimization at all was to make restoring pg_dump
>> scripts in a single transaction fast; and that use-case doesn't care
>> about anything but a single COPY into a virgin table.
>>
>
> We have to backpatch this fix, so it must be both simple and effective.
>
> Heikki's suggestions may be best, maybe not, but they don't seem
> backpatchable.
>
> Tom's suggestion looks good. So does Andres' suggestion. I have coded both.

Thanks. For comparison, I wrote a patch to implement what I had in mind.

When a WAL-skipping COPY begins, we add an entry for that relation in a
"pending-fsyncs" hash table. Whenever we perform any action on a heap
that would normally be WAL-logged, we check if the relation is in the
hash table, and skip WAL-logging if so.

That was a simplified explanation. In reality, when WAL-skipping COPY
begins, we also memorize the current size of the relation. Any actions
on blocks greater than the old size are not WAL-logged, and any actions
on smaller-numbered blocks are. This ensures that if you did any INSERTs
on the table before the COPY, any new actions on the blocks that were
already WAL-logged by the INSERT are also WAL-logged. And likewise if
you perform any INSERTs after (or during, by trigger) the COPY, and they
modify the new pages, those actions are not WAL-logged. So starting a
WAL-skipping COPY splits the relation into two parts: the first part
that is WAL-logged as usual, and the later part that is not WAL-logged.
(there is one loose end marked with XXX in the patch on this, when one
of the pages involved in a cold UPDATE is before the watermark and the
other is after)

The actual fsync() has been moved to the end of transaction, as we are
now skipping WAL-logging of any actions after the COPY as well.

And truncations complicate things further. If we emit a truncation WAL
record in the transaction, we also make an entry in the hash table to
record that. All operations on a relation that has been truncated must
be WAL-logged as usual, because replaying the truncate record will
destroy all data even if we fsync later. But we still optimize for
"BEGIN; CREATE; COPY; TRUNCATE; COPY;" style patterns, because if we
truncate a relation that has already been marked for fsync-at-COMMIT, we
don't need to WAL-log the truncation either.

This is more invasive than I'd like to backpatch, but I think it's the
simplest approach that works, and doesn't disable any of the important
optimizations we have.

>> And what reason is there to think that this would fix all the problems?
>
> I don't think either suggested fix could be claimed to be a great solution,
> since there is little principle here, only heuristic. Heikki's solution
> would be the only safe way, but is not backpatchable.

I can't get too excited about a half-fix that leaves you with data
corruption in some scenarios.

I wrote a little test script to test all these different scenarios
(attached). Both of your patches fail with the script.

- Heikki

Attachment Content-Type Size
fix-wal-level-minimal-heikki-1.patch application/x-patch 24.5 KB
test-wal-minimal.sh application/x-shellscript 2.5 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-23 08:52:33
Message-ID: CANP8+jLWXr+uskrHOSdcKXj+rEDWFrKgfYuC4TNhCuv7Po+jbA@mail.gmail.com

On 22 July 2015 at 17:21, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:

>
> When a WAL-skipping COPY begins, we add an entry for that relation in a
> "pending-fsyncs" hash table. Whenever we perform any action on a heap that
> would normally be WAL-logged, we check if the relation is in the hash
> table, and skip WAL-logging if so.
>
> That was a simplified explanation. In reality, when WAL-skipping COPY
> begins, we also memorize the current size of the relation. Any actions on
> blocks greater than the old size are not WAL-logged, and any actions on
> smaller-numbered blocks are. This ensures that if you did any INSERTs on
> the table before the COPY, any new actions on the blocks that were already
> WAL-logged by the INSERT are also WAL-logged. And likewise if you perform
> any INSERTs after (or during, by trigger) the COPY, and they modify the new
> pages, those actions are not WAL-logged. So starting a WAL-skipping COPY
> splits the relation into two parts: the first part that is WAL-logged as
> usual, and the later part that is not WAL-logged. (there is one loose end
> marked with XXX in the patch on this, when one of the pages involved in a
> cold UPDATE is before the watermark and the other is after)
>
> The actual fsync() has been moved to the end of transaction, as we are now
> skipping WAL-logging of any actions after the COPY as well.
>
> And truncations complicate things further. If we emit a truncation WAL
> record in the transaction, we also make an entry in the hash table to
> record that. All operations on a relation that has been truncated must be
> WAL-logged as usual, because replaying the truncate record will destroy all
> data even if we fsync later. But we still optimize for "BEGIN; CREATE;
> COPY; TRUNCATE; COPY;" style patterns, because if we truncate a relation
> that has already been marked for fsync-at-COMMIT, we don't need to WAL-log
> the truncation either.
>
>
> This is more invasive than I'd like to backpatch, but I think it's the
> simplest approach that works, and doesn't disable any of the important
> optimizations we have.

I didn't like it when I first read this, but I do now. As a by-product of
fixing the bug, it actually extends the optimization.

You can optimize this approach so we always write WAL unless one of the two
subid fields is set, so there is no need to call smgrIsSyncPending() every
time. I couldn't see where this depended upon wal_level, but I guess it's
there somewhere.

I'm unhappy about the call during MarkBufferDirtyHint(), which is just too
costly. The only way to do this cheaply is to specifically mark buffers as
BM_WAL_SKIPPED, so they do not need to be hinted. That flag would be
removed when we flush the buffers for the relation.

>
> And what reason is there to think that this would fix all the problems?
>>>
>>
>> I don't think either suggested fix could be claimed to be a great
>> solution,
>> since there is little principle here, only heuristic. Heikki's solution
>> would be the only safe way, but is not backpatchable.
>>
>
> I can't get too excited about a half-fix that leaves you with data
> corruption in some scenarios.
>

On further consideration, it seems obvious that Andres' suggestion would
not work for UPDATE or DELETE, so I now agree.

It does seem a big thing to backpatch; alternative suggestions?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: hlinnaka <hlinnaka(at)iki(dot)fi>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-23 18:38:13
Message-ID: CA+TgmobUs0unpdZF=nMhOGGX9EkxBH8ZzpAJi8FBsqrPyyJXOQ@mail.gmail.com

On Wed, Jul 22, 2015 at 12:21 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>
> This is more invasive than I'd like to backpatch, but I think it's the
> simplest approach that works, and doesn't disable any of the important
> optimizations we have.

Hmm, isn't HeapNeedsWAL() a lot more costly than RelationNeedsWAL()?
Should we be worried about that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-07-24 06:27:13
Message-ID: 55B1DAC1.3030804@iki.fi

On 07/23/2015 09:38 PM, Robert Haas wrote:
> On Wed, Jul 22, 2015 at 12:21 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>
>> This is more invasive than I'd like to backpatch, but I think it's the
>> simplest approach that works, and doesn't disable any of the important
>> optimizations we have.
>
> Hmm, isn't HeapNeedsWAL() a lot more costly than RelationNeedsWAL()?

Yes. But it's still very cheap, especially in the common case that the
pending syncs hash table is empty.

> Should we be worried about that?

It doesn't worry me.

- Heikki


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Simon Riggs <simon(at)2ndQuadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-10-21 14:53:44
Message-ID: 20151021145344.GW3391@alvherre.pgsql

Heikki Linnakangas wrote:

> Thanks. For comparison, I wrote a patch to implement what I had in mind.
>
> When a WAL-skipping COPY begins, we add an entry for that relation in a
> "pending-fsyncs" hash table. Whenever we perform any action on a heap that
> would normally be WAL-logged, we check if the relation is in the hash table,
> and skip WAL-logging if so.

I think this wasn't applied, was it?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2015-10-22 00:56:35
Message-ID: CAB7nPqQ3j5JOM2VT0MenkWvK5CN0ThMwru1c=kbVA81iBzJdyA@mail.gmail.com

On Wed, Oct 21, 2015 at 11:53 PM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> Heikki Linnakangas wrote:
>
>> Thanks. For comparison, I wrote a patch to implement what I had in mind.
>>
>> When a WAL-skipping COPY begins, we add an entry for that relation in a
>> "pending-fsyncs" hash table. Whenever we perform any action on a heap that
>> would normally be WAL-logged, we check if the relation is in the hash table,
>> and skip WAL-logging if so.
>
> I think this wasn't applied, was it?

No, it was not applied.
--
Michael


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-02-04 12:24:21
Message-ID: 56B342F5.1050502@iki.fi

On 22/10/15 03:56, Michael Paquier wrote:
> On Wed, Oct 21, 2015 at 11:53 PM, Alvaro Herrera
> <alvherre(at)2ndquadrant(dot)com> wrote:
>> Heikki Linnakangas wrote:
>>
>>> Thanks. For comparison, I wrote a patch to implement what I had in mind.
>>>
>>> When a WAL-skipping COPY begins, we add an entry for that relation in a
>>> "pending-fsyncs" hash table. Whenever we perform any action on a heap that
>>> would normally be WAL-logged, we check if the relation is in the hash table,
>>> and skip WAL-logging if so.
>>
>> I think this wasn't applied, was it?
>
> No, it was not applied.

I dropped the ball on this one back in July, so here's an attempt to
revive this thread.

I spent some time fixing the remaining issues with the prototype patch I
posted earlier, and rebased that on top of current git master. See attached.

Some review of that would be nice. If there are no major issues with it,
I'm going to create backpatchable versions of this for 9.4 and below.

- Heikki

Attachment Content-Type Size
0001-Fix-the-optimization-to-skip-WAL-logging-on-table-cr.patch text/x-diff 40.7 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-02-18 07:27:13
Message-ID: CAB7nPqR9mKteWyAPCpQw51uk2X1aUVG_N1CChUbFY1omMTWk3Q@mail.gmail.com

On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> I dropped the ball on this one back in July, so here's an attempt to revive
> this thread.
>
> I spent some time fixing the remaining issues with the prototype patch I
> posted earlier, and rebased that on top of current git master. See attached.
>
> Some review of that would be nice. If there are no major issues with it, I'm
> going to create backpatchable versions of this for 9.4 and below.

I am going to look into that very soon. For now and to not forget
about this bug, I have added an entry in the CF app:
https://commitfest.postgresql.org/9/528/
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-02-19 07:33:21
Message-ID: CAB7nPqTAP6_8c0nf8m=OTw16TsB7waE3Zz2HsxPuoercnC2qxA@mail.gmail.com

On Thu, Feb 18, 2016 at 4:27 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> I dropped the ball on this one back in July, so here's an attempt to revive
>> this thread.
>>
>> I spent some time fixing the remaining issues with the prototype patch I
>> posted earlier, and rebased that on top of current git master. See attached.
>>
>> Some review of that would be nice. If there are no major issues with it, I'm
>> going to create backpatchable versions of this for 9.4 and below.
>
> I am going to look into that very soon. For now and to not forget
> about this bug, I have added an entry in the CF app:
> https://commitfest.postgresql.org/9/528/

Worth noting that this patch does not address the problem with index
relations when a TRUNCATE is used in the same transaction as its
CREATE TABLE. Take this example with wal_level = minimal:
1) Run transaction
=# begin;
BEGIN
=# create table ab (a int primary key);
CREATE TABLE
=# truncate ab;
TRUNCATE TABLE
=# commit;
COMMIT
2) Restart server with immediate mode.
3) Failure:
=# table ab;
ERROR: XX001: could not read block 0 in file "base/16384/16388": read
only 0 of 8192 bytes
LOCATION: mdread, md.c:728

The case where a COPY is issued after TRUNCATE is fixed though, so
that's still an improvement.

Here are other comments.

+ /* Flush updates to relations that we didn't WAL-logged */
+ smgrDoPendingSyncs(true);
"Flush updates to relations that were not WAL-logged"?

+void
+FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
+{
+ FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
+}
islocal is always set to false; I'd rather remove this argument
from FlushRelationBuffersWithoutRelCache.

for (i = 0; i < nrels; i++)
+ {
smgrclose(srels[i]);
+ }
Looks like noise.

+ if (!found)
+ {
+ pending->truncated_to = InvalidBlockNumber;
+ pending->sync_above = nblocks;
+
+ elog(DEBUG2, "registering new pending sync for rel %u/%u/%u at
block %u",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+
+ }
+ else if (pending->sync_above == InvalidBlockNumber)
+ {
+ elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
+ pending->sync_above = nblocks;
+ }
+ else
Couldn't it happen here that, when (sync_above !=
InvalidBlockNumber), nblocks is higher than sync_above? In that
case we had better increase sync_above to nblocks, no?

+ if (!pendingSyncs)
+ createPendingSyncsHash();
+ pending = (PendingRelSync *) hash_search(pendingSyncs,
+ (void *) &rel->rd_node,
+ HASH_ENTER, &found);
This is lacking comments.

- if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT))
+ BufferGetTag(buffer, &rnode, &forknum, &blknum);
+ if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT) &&
+ !smgrIsSyncPending(rnode, blknum))
Here as well explaining in more details why the buffer does not need
to go through XLogSaveBufferForHint would be nice.
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-02-19 13:27:00
Message-ID: CAB7nPqSGFKUAFqPe5t30jeEA+V9yFMM4yJGa3SnkgY1RHzn7Dg@mail.gmail.com

On Fri, Feb 19, 2016 at 4:33 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Thu, Feb 18, 2016 at 4:27 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> On Thu, Feb 4, 2016 at 3:24 PM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>> I dropped the ball on this one back in July, so here's an attempt to revive
>>> this thread.
>>>
>>> I spent some time fixing the remaining issues with the prototype patch I
>>> posted earlier, and rebased that on top of current git master. See attached.
>>>
>>> Some review of that would be nice. If there are no major issues with it, I'm
>>> going to create backpatchable versions of this for 9.4 and below.
>>
>> I am going to look into that very soon. For now and to not forget
>> about this bug, I have added an entry in the CF app:
>> https://commitfest.postgresql.org/9/528/
>
> Worth noting that this patch does not address the problem with index
> relations when a TRUNCATE is used in the same transaction as its
> CREATE TABLE, take that for example when wal_level = minimal:
> 1) Run transaction
> =# begin;
> BEGIN
> =# create table ab (a int primary key);
> CREATE TABLE
> =# truncate ab;
> TRUNCATE TABLE
> =# commit;
> COMMIT
> 2) Restart server with immediate mode.
> 3) Failure:
> =# table ab;
> ERROR: XX001: could not read block 0 in file "base/16384/16388": read
> only 0 of 8192 bytes
> LOCATION: mdread, md.c:728
>
> The case where a COPY is issued after TRUNCATE is fixed though, so
> that's still an improvement.
>
> Here are other comments.
>
> + /* Flush updates to relations that we didn't WAL-logged */
> + smgrDoPendingSyncs(true);
> "Flush updates to relations there were not WAL-logged"?
>
> +void
> +FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
> +{
> + FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
> +}
> islocal is always set as false, I'd rather remove it this argument
> from FlushRelationBuffersWithoutRelCache.
>
> for (i = 0; i < nrels; i++)
> + {
> smgrclose(srels[i]);
> + }
> Looks like noise.
>
> + if (!found)
> + {
> + pending->truncated_to = InvalidBlockNumber;
> + pending->sync_above = nblocks;
> +
> + elog(DEBUG2, "registering new pending sync for rel %u/%u/%u at
> block %u",
> + rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
> +
> + }
> + else if (pending->sync_above == InvalidBlockNumber)
> + {
> + elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
> + rnode.spcNode, rnode.dbNode, rnode.relNode, nblocks);
> + pending->sync_above = nblocks;
> + }
> + else
> Here couldn't it be possible that when (sync_above !=
> InvalidBlockNumber), nblocks can be higher than sync_above? In which
> case we had better increase sync_above to nblocks, no?
>
> + if (!pendingSyncs)
> + createPendingSyncsHash();
> + pending = (PendingRelSync *) hash_search(pendingSyncs,
> + (void *) &rel->rd_node,
> + HASH_ENTER, &found);
> This is lacking comments.
>
> - if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT))
> + BufferGetTag(buffer, &rnode, &forknum, &blknum);
> + if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT) &&
> + !smgrIsSyncPending(rnode, blknum))
> Here as well explaining in more details why the buffer does not need
> to go through XLogSaveBufferForHint would be nice.

An additional one:
- XLogRegisterBuffer(0, newbuf, bufflags);
- if (oldbuf != newbuf)
- XLogRegisterBuffer(1, oldbuf, REGBUF_STANDARD);
In log_heap_update, the new buffer is now conditionally logged
depending on if the heap needs WAL or not.

Now during replay the following thing is done:
- oldaction = XLogReadBufferForRedo(record, (oldblk == newblk) ? 0 : 1,
- &obuffer);
+ if (oldblk == newblk)
+ oldaction = XLogReadBufferForRedo(record, 0, &obuffer);
+ else if (XLogRecHasBlockRef(record, 1))
+ oldaction = XLogReadBufferForRedo(record, 1, &obuffer);
+ else
+ oldaction = BLK_DONE;
Shouldn't we check for XLogRecHasBlockRef(record, 0) when the tuple is
updated on the same page?
--
Michael


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: hlinnaka(at)iki(dot)fi, alvherre(at)2ndquadrant(dot)com, simon(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-03-11 08:32:59
Message-ID: 20160311.173259.190313590.horiguchi.kyotaro@lab.ntt.co.jp

Hello, I have considered the original issue.

At Fri, 19 Feb 2016 22:27:00 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqSGFKUAFqPe5t30jeEA+V9yFMM4yJGa3SnkgY1RHzn7Dg(at)mail(dot)gmail(dot)com>
> > Worth noting that this patch does not address the problem with index
> > relations when a TRUNCATE is used in the same transaction as its

Focusing on this issue, what we should do is somehow build an empty
index just after an index truncation. The attached patch does the
following things to fix this.

- make index_build use ambuildempty when the relation on which
the index will be built is apparently empty, that is, when the
relation has no blocks.

- add one parameter "persistent" to ambuildempty(). It behaves as
before if the parameter is false; if true, it creates an empty
index on the MAIN fork and emits WAL records even if wal_level is minimal.

Creation of an index for an empty table can be safely done by
ambuildempty, since it creates the image for the init fork, which can
simply be copied as the main fork on initialization. And the heap is
always empty when RelationTruncateIndexes calls index_build.

For nonempty tables, ambuild properly initializes the new index.

The new parameter 'persistent' would better be a forknum, because
it actually represents the persistency of the index to be
created. But I'm out of time now...

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
Fix_wal_logging_problem_20160311.patch text/x-patch 13.5 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Masao Fujii <masao(dot)fujii(at)gmail(dot)com>, kleptog(at)svana(dot)org, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-03-15 17:21:34
Message-ID: CAB7nPqSVm-X1-w9i=U=DCyMxDxzfNT-41pqTSvh0DUmUgi8BQg@mail.gmail.com

On Fri, Mar 11, 2016 at 9:32 AM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> At Fri, 19 Feb 2016 22:27:00 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqSGFKUAFqPe5t30jeEA+V9yFMM4yJGa3SnkgY1RHzn7Dg(at)mail(dot)gmail(dot)com>
>> > Worth noting that this patch does not address the problem with index
>> > relations when a TRUNCATE is used in the same transaction as its
>
> Focusing this issue, what we should do is somehow building empty
> index just after a index truncation. The attached patch does the
> following things to fix this.
>
> - make index_build use ambuildempty when the relation on which
> the index will be built is apparently empty. That is, when the
> relation has no block.
> - add one parameter "persistent" to ambuildempty(). It behaves as
> before if the parameter is false. If not, it creates an empty
> index on MAIN_FORK and emits logs even if wal_level is minimal.

Hm. It seems to me that this patch is just a bandaid for the real
problem which is that we should not TRUNCATE the underlying index
relations when the TRUNCATE optimization is running. In short I would
let the empty routines in AM code paths alone, and just continue using
them for the generation of INIT_FORKNUM with unlogged relations. Your
patch is not something backpatchable anyway, I think.

> The new parameter 'persistent' would be better be forknum because
> it actually represents the persistency of the index to be
> created. But I'm out of time now..

I actually have some users running with wal_level = minimal. Even if
I don't think they use this optimization, we had better fix index
relations at the same time as table relations. I'll try to get some
time once the patch review storm dies down a little, unless someone
beats me to it first.
--
Michael


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: hlinnaka(at)iki(dot)fi, alvherre(at)2ndquadrant(dot)com, simon(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-03-16 02:01:58
Message-ID: 20160316.110158.196524017.horiguchi.kyotaro@lab.ntt.co.jp

Thank you for the comment.

I understand that this is not an urgent issue, so don't feel
obliged to reply.

At Tue, 15 Mar 2016 18:21:34 +0100, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqSVm-X1-w9i=U=DCyMxDxzfNT-41pqTSvh0DUmUgi8BQg(at)mail(dot)gmail(dot)com>
> On Fri, Mar 11, 2016 at 9:32 AM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > At Fri, 19 Feb 2016 22:27:00 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqSGFKUAFqPe5t30jeEA+V9yFMM4yJGa3SnkgY1RHzn7Dg(at)mail(dot)gmail(dot)com>
> >> > Worth noting that this patch does not address the problem with index
> >> > relations when a TRUNCATE is used in the same transaction as its
> >
> > Focusing this issue, what we should do is somehow building empty
> > index just after a index truncation. The attached patch does the
> > following things to fix this.
> >
> > - make index_build use ambuildempty when the relation on which
> > the index will be built is apparently empty. That is, when the
> > relation has no block.
> > - add one parameter "persistent" to ambuildempty(). It behaves as
> > before if the parameter is false. If not, it creates an empty
> > index on MAIN_FORK and emits logs even if wal_level is minimal.
>
> Hm. It seems to me that this patch is just a bandaid for the real
> problem which is that we should not TRUNCATE the underlying index
> relations when the TRUNCATE optimization is running.

The eventual problem is a 0-length index relation left just after
a relation truncation. We assume that an index with an empty
relation after a recovery is not valid. However, just skipping the
TRUNCATE of the index relation won't resolve it, since that in turn
leaves an index with garbage entries. Am I missing something?

Since the index relation should be "validly emptied" in place in
any case under the TRUNCATE optimization, I tried doing that with
TRUNCATE + ambuildempty, which can be redone properly, too.
Repeated TRUNCATEs issue eventually-useless WAL records, but that
is inevitable since we cannot foretell any succeeding TRUNCATEs.

(TRUNCATE+)COPY+INSERT seems to be another kind of problem, which
would be fixed by Heikki's patch.

> In short I would
> let the empty routines in AM code paths alone, and just continue using
> them for the generation of INIT_FORKNUM with unlogged relations. Your
> patch is not something backpatchable anyway I think.

It does seem un-backpatchable, if the change to the way
ambuildempty is called prevents backpatching.

> > The new parameter 'persistent' would be better be forknum because
> > it actually represents the persistency of the index to be
> > created. But I'm out of time now..
>
> I actually have some users running with wal_level to minimal, even if
> I don't think they use this optimization, we had better fix even index
> relations at the same time as table relations.. I'll try to get some
> time once the patch review storm goes down a little, except if someone
> beats me to it first.

OK, I understand that this is not an urgent issue. I'll move on to
another patch that needs review.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: David Steele <david(at)pgmasters(dot)net>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, michael(dot)paquier(at)gmail(dot)com
Cc: hlinnaka(at)iki(dot)fi, alvherre(at)2ndquadrant(dot)com, simon(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-03-22 16:38:42
Message-ID: 56F17512.6060808@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:

> Ok, I understand that this is not an issue in a hurry. I'll go to
> another patch that needs review.

Since we're getting towards the end of the CF is it time to pick this up
again?

Thanks,
--
-David
david(at)pgmasters(dot)net


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Masao Fujii <masao(dot)fujii(at)gmail(dot)com>, kleptog(at)svana(dot)org, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-03-23 00:52:39
Message-ID: CAB7nPqRLO7+nHfX9sd15jamRyW6kp+2x096ObT9yXV7N=ecdcQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 23, 2016 at 1:38 AM, David Steele <david(at)pgmasters(dot)net> wrote:
> On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:
>
>> Ok, I understand that this is not an issue in a hurry. I'll go to
>> another patch that needs review.
>
> Since we're getting towards the end of the CF is it time to pick this up
> again?

Perhaps not. This is a legit bug with an unfinished patch (see index
relation truncation) that is going to need a careful review. I don't
think that this should be impacted by the 4/8 feature freeze, so we
could still work on that after the embargo and we've had this bug for
months actually. FWIW, I am still planning to work on it once the CF
is done, in order to keep my manpower focused on actual patch reviews
as much as possible...
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Masao Fujii <masao(dot)fujii(at)gmail(dot)com>, kleptog(at)svana(dot)org, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-03-23 00:54:40
Message-ID: CAB7nPqRdcnQRLvBsBFKci8Y-a3bYmujpwSSr7if2BO+QXZkVcA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 23, 2016 at 9:52 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Wed, Mar 23, 2016 at 1:38 AM, David Steele <david(at)pgmasters(dot)net> wrote:
>> On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:
>>
>>> Ok, I understand that this is not an issue in a hurry. I'll go to
>>> another patch that needs review.
>>
>> Since we're getting towards the end of the CF is it time to pick this up
>> again?
>
> Perhaps not. This is a legit bug with an unfinished patch (see index
> relation truncation) that is going to need a careful review. I don't
> think that this should be impacted by the 4/8 feature freeze, so we
> could still work on that after the embargo and we've had this bug for
> months actually. FWIW, I am still planning to work on it once the CF
> is done, in order to keep my manpower focused on actual patch reviews
> as much as possible...

In short, we may want to bump that to next CF... I have already marked
this ticket as something to work on soonish on my side, so it does not
change much seen from here if it's part of the next CF. What we
should just make sure of is that we don't lose track of its existence.
--
Michael


From: David Steele <david(at)pgmasters(dot)net>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Masao Fujii <masao(dot)fujii(at)gmail(dot)com>, kleptog(at)svana(dot)org, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-03-23 02:11:12
Message-ID: 56F1FB40.4060605@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 3/22/16 8:54 PM, Michael Paquier wrote:
> On Wed, Mar 23, 2016 at 9:52 AM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> On Wed, Mar 23, 2016 at 1:38 AM, David Steele <david(at)pgmasters(dot)net> wrote:
>>> On 3/15/16 10:01 PM, Kyotaro HORIGUCHI wrote:
>>>
>>>> Ok, I understand that this is not an issue in a hurry. I'll go to
>>>> another patch that needs review.
>>>
>>> Since we're getting towards the end of the CF is it time to pick this up
>>> again?
>>
>> Perhaps not. This is a legit bug with an unfinished patch (see index
>> relation truncation) that is going to need a careful review. I don't
>> think that this should be impacted by the 4/8 feature freeze, so we
>> could still work on that after the embargo and we've had this bug for
>> months actually. FWIW, I am still planning to work on it once the CF
>> is done, in order to keep my manpower focused on actual patch reviews
>> as much as possible...
>
> In short, we may want to bump that to next CF... I have already marked
> this ticket as something to work on soonish on my side, so it does not
> change much seen from here if it's part of the next CF. What we should
> just be sure is not to lose track of its existence.

I would prefer not to bump it to the next CF unless we decide this will
not get fixed for 9.6.

--
-David
david(at)pgmasters(dot)net


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Masao Fujii <masao(dot)fujii(at)gmail(dot)com>, kleptog(at)svana(dot)org, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-03-23 03:45:12
Message-ID: CAB7nPqQhkTA2AgaQrdL2cT+crGdNOoQyzcE5YcgASSLCsiAhRA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 23, 2016 at 11:11 AM, David Steele <david(at)pgmasters(dot)net> wrote:
> I would prefer not to bump it to the next CF unless we decide this will
> not get fixed for 9.6.

It may make sense to add that to the list of open items for 9.6
instead. That's not a feature.
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Masao Fujii <masao(dot)fujii(at)gmail(dot)com>, kleptog(at)svana(dot)org, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-04-06 06:11:15
Message-ID: CAB7nPqT6C3=nFc5j51fmod7_XDOuSxtsMfR59vOO6pofDwSSMQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Wed, Mar 23, 2016 at 11:11 AM, David Steele <david(at)pgmasters(dot)net> wrote:
>> I would prefer not to bump it to the next CF unless we decide this will
>> not get fixed for 9.6.
>
> It may make sense to add that to the list of open items for 9.6
> instead. That's not a feature.

So I have moved this patch to the next CF for now, and will work on
fixing it rather soonishly as an effort to stabilize 9.6 as well as
back-branches.
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Masao Fujii <masao(dot)fujii(at)gmail(dot)com>, kleptog(at)svana(dot)org, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-07-28 07:59:15
Message-ID: CAB7nPqSHh+oS-X0oqp_m-6sX0OgVckEv9NEfNt1+YLCiMMCE3g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Apr 6, 2016 at 3:11 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> On Wed, Mar 23, 2016 at 11:11 AM, David Steele <david(at)pgmasters(dot)net> wrote:
>>> I would prefer not to bump it to the next CF unless we decide this will
>>> not get fixed for 9.6.
>>
>> It may make sense to add that to the list of open items for 9.6
>> instead. That's not a feature.
>
> So I have moved this patch to the next CF for now, and will work on
> fixing it rather soonishly as an effort to stabilize 9.6 as well as
> back-branches.

Well, not that soon in the end, but I am back on this... I have not
completely reviewed all the code yet, and the case of an index
relation referring to a relation optimized with truncate is still
broken, but for now here is a rebased patch if people are
interested. I am also going to get a TAP test out of my pocket to
ease testing.
--
Michael

Attachment Content-Type Size
fix-wal-level-minimal-michael-1.patch text/x-patch 24.5 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: David Steele <david(at)pgmasters(dot)net>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Masao Fujii <masao(dot)fujii(at)gmail(dot)com>, kleptog(at)svana(dot)org, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-07-29 07:54:42
Message-ID: CAB7nPqR+3JjS=JB3R=AxxkXCyEB-q77U-ERW7_uKAJCtWNTfrg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jul 28, 2016 at 4:59 PM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Wed, Apr 6, 2016 at 3:11 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>> On Wed, Mar 23, 2016 at 12:45 PM, Michael Paquier
>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>> On Wed, Mar 23, 2016 at 11:11 AM, David Steele <david(at)pgmasters(dot)net> wrote:
>>>> I would prefer not to bump it to the next CF unless we decide this will
>>>> not get fixed for 9.6.
>>>
>>> It may make sense to add that to the list of open items for 9.6
>>> instead. That's not a feature.
>>
>> So I have moved this patch to the next CF for now, and will work on
>> fixing it rather soonishly as an effort to stabilize 9.6 as well as
>> back-branches.
>
> Well, not that soon at the end, but I am back on that... I have not
> completely reviewed all the code yet, and the case of index relation
> referring to a relation optimized with truncate is still broken, but
> for now here is a rebased patch if people are interested. I am going
> to get as well a TAP tests out of my pocket to ease testing.

The patch I sent yesterday was based on an incorrect version. Attached
is a slightly-modified version of the last one I found here
(https://www.postgresql.org/message-id/56B342F5.1050502@iki.fi), which
is rebased on HEAD at ed0b228. I have also converted the test case
script from upthread into a TAP test in src/test/recovery that
covers 3 cases, and I included it in the patch:
1) CREATE + INSERT + COPY => crash
2) CREATE + trigger + COPY => crash
3) CREATE + TRUNCATE + COPY => incorrect number of rows.
The first two tests make the system crash, the third one reports an
incorrect number of rows.

This is registered in next CF by the way:
https://commitfest.postgresql.org/10/528/
Thoughts?
--
Michael

Attachment Content-Type Size
fix-wal-level-minimal-michael-2.patch invalid/octet-stream 41.6 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, alvherre(at)2ndquadrant(dot)com, simon(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-09-26 08:03:54
Message-ID: 20160926.170354.212384164.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello, I return to this before my other things :)

Though I haven't played with the patch yet..

At Fri, 29 Jul 2016 16:54:42 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqR+3JjS=JB3R=AxxkXCyEB-q77U-ERW7_uKAJCtWNTfrg(at)mail(dot)gmail(dot)com>
> > Well, not that soon at the end, but I am back on that... I have not
> > completely reviewed all the code yet, and the case of index relation
> > referring to a relation optimized with truncate is still broken, but
> > for now here is a rebased patch if people are interested. I am going
> > to get as well a TAP tests out of my pocket to ease testing.
>
> The patch I sent yesterday was based on an incorrect version. Attached
> is a slightly-modified version of the last one I found here
> (https://www.postgresql.org/message-id/56B342F5.1050502@iki.fi), which
> is rebased on HEAD at ed0b228. I have also converted the test case
> script of upthread into a TAP test in src/test/recovery that covers 3
> cases and I included that in the patch:
> 1) CREATE + INSERT + COPY => crash
> 2) CREATE + trigger + COPY => crash
> 3) CREATE + TRUNCATE + COPY => incorrect number of rows.
> The first two tests make the system crash, the third one reports an
> incorrect number of rows.

At first glance, managing sync_above and truncate_to is workable
for these cases, but it seems too complicated for the problem being
resolved.

This gives smgr the capability to manage pending page syncs. But
whether to postpone page syncs seems rather to be a matter for its
users, who are responsible for issuing WAL. In any case,
heap_register_sync doesn't use any internals of smgr, so I think
this approach binds smgr to Relation too tightly.

With this patch, many calls of RelationNeedsWAL, which just
accesses a local struct, are replaced with HeapNeedsWAL, which
eventually searches a hash added by the patch. In log_heap_update
in particular, it is called for every update of a single tuple (on
a relation that needs WAL).

Though I don't know how much it actually impacts performance, it
seems to me that we can live with truncated_to and sync_above in
RelationData and BufferNeedsWAL(rel, buf) instead of
HeapNeedsWAL(rel, buf). In any case, at most one entry per relation
seems to exist in the hash at any one time.

What do you think?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: david(at)pgmasters(dot)net, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Masao Fujii <masao(dot)fujii(at)gmail(dot)com>, kleptog(at)svana(dot)org, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-09-29 07:59:55
Message-ID: CAB7nPqT5x05tG7aut1yz+WJN76DqNz1Jzq46fSFtee4YbY0YcA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Hello, I return to this before my things:)
>
> Though I haven't played with the patch yet..

Be sure to run the test cases in the patch or base your tests on them then!

> Though I don't know how it actually impacts the perfomance, it
> seems to me that we can live with truncated_to and sync_above in
> RelationData and BufferNeedsWAL(rel, buf) instead of
> HeapNeedsWAL(rel, buf). Anyway up to one entry for one relation
> seems to exist at once in the hash.

TBH, I still think that the design of this patch as proposed is pretty
cool and easy to follow.
--
Michael


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, alvherre(at)2ndquadrant(dot)com, simon(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-09-29 13:02:57
Message-ID: 20160929.220257.67781565.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello,

At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqT5x05tG7aut1yz+WJN76DqNz1Jzq46fSFtee4YbY0YcA(at)mail(dot)gmail(dot)com>
> On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > Hello, I return to this before my things:)
> >
> > Though I haven't played with the patch yet..
>
> Be sure to run the test cases in the patch or base your tests on them then!

All items of 006_truncate_opt fail on ed0b228 and they are fixed
with the patch.

> > Though I don't know how it actually impacts the perfomance, it
> > seems to me that we can live with truncated_to and sync_above in
> > RelationData and BufferNeedsWAL(rel, buf) instead of
> > HeapNeedsWAL(rel, buf). Anyway up to one entry for one relation
> > seems to exist at once in the hash.
>
> TBH, I still think that the design of this patch as proposed is pretty
> cool and easy to follow.

It is clean from a certain viewpoint, but the additional hash,
especially the hash search on every HeapNeedsWAL call, seems
unacceptable to me. Do you find it acceptable?

The attached patch is a quick-and-dirty hack of Michael's patch,
just as a PoC of my proposal quoted above. It also passes the 006
test. The major changes are the following.

- Moved sync_above and truncated_to into RelationData.

- Cleaning up is done in AtEOXact_cleanup instead of explicitly
calling smgrDoPendingSyncs().

* BufferNeedsWAL (a replacement for HeapNeedsWAL) no longer
requires hash_search. It just refers to the additional members in
the given Relation.

X I feel that I dropped one of the features of the original patch
during the hack, but I don't recall which one clearly now :(

X I haven't considered relfilenode replacement, which didn't matter
for the original patch (but there are a few places to consider).

What do you think about this?

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
fix-wal-level-minimal-michael-horiguchi-1.patch text/x-patch 32.1 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: david(at)pgmasters(dot)net, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Masao Fujii <masao(dot)fujii(at)gmail(dot)com>, kleptog(at)svana(dot)org, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-10-02 12:43:46
Message-ID: CAB7nPqTKOyHkrBSxvvSBZCXvU9F8OT_uumXmST_awKsswQA5Sg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Sep 29, 2016 at 10:02 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Hello,
>
> At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqT5x05tG7aut1yz+WJN76DqNz1Jzq46fSFtee4YbY0YcA(at)mail(dot)gmail(dot)com>
>> On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
>> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>> > Hello, I return to this before my things:)
>> >
>> > Though I haven't played with the patch yet..
>>
>> Be sure to run the test cases in the patch or base your tests on them then!
>
> All items of 006_truncate_opt fail on ed0b228 and they are fixed
> with the patch.
>
>> > Though I don't know how it actually impacts the perfomance, it
>> > seems to me that we can live with truncated_to and sync_above in
>> > RelationData and BufferNeedsWAL(rel, buf) instead of
>> > HeapNeedsWAL(rel, buf). Anyway up to one entry for one relation
>> > seems to exist at once in the hash.
>>
>> TBH, I still think that the design of this patch as proposed is pretty
>> cool and easy to follow.
>
> It is clean from certain viewpoint but additional hash,
> especially hash-searching on every HeapNeedsWAL seems to me to be
> unacceptable. Do you see it accetable?
>
>
> The attached patch is quiiiccck-and-dirty-hack of Michael's patch
> just as a PoC of my proposal quoted above. This also passes the
> 006 test. The major changes are the following.
>
> - Moved sync_above and truncted_to into RelationData.
>
> - Cleaning up is done in AtEOXact_cleanup instead of explicit
> calling to smgrDoPendingSyncs().
>
> * BufferNeedsWAL (replace of HeapNeedsWAL) no longer requires
> hash_search. It just refers to the additional members in the
> given Relation.
>
> X I feel that I have dropped one of the features of the origitnal
> patch during the hack, but I don't recall it clearly now:(
>
> X I haven't consider relfilenode replacement, which didn't matter
> for the original patch. (but there's few places to consider).
>
> What do you think about this?

I have moved this patch to the next CF. (I still need to look at your patch.)
--
Michael


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, alvherre(at)2ndquadrant(dot)com, simon(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-11-01 02:00:46
Message-ID: 20161101.110046.147437718.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

At Sun, 2 Oct 2016 21:43:46 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqTKOyHkrBSxvvSBZCXvU9F8OT_uumXmST_awKsswQA5Sg(at)mail(dot)gmail(dot)com>
> On Thu, Sep 29, 2016 at 10:02 PM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > Hello,
> >
> > At Thu, 29 Sep 2016 16:59:55 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqT5x05tG7aut1yz+WJN76DqNz1Jzq46fSFtee4YbY0YcA(at)mail(dot)gmail(dot)com>
> >> On Mon, Sep 26, 2016 at 5:03 PM, Kyotaro HORIGUCHI
> >> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> >> > Hello, I return to this before my things:)
> >> >
> >> > Though I haven't played with the patch yet..
> >>
> >> Be sure to run the test cases in the patch or base your tests on them then!
> >
> > All items of 006_truncate_opt fail on ed0b228 and they are fixed
> > with the patch.
> >
> >> > Though I don't know how it actually impacts the perfomance, it
> >> > seems to me that we can live with truncated_to and sync_above in
> >> > RelationData and BufferNeedsWAL(rel, buf) instead of
> >> > HeapNeedsWAL(rel, buf). Anyway up to one entry for one relation
> >> > seems to exist at once in the hash.
> >>
> >> TBH, I still think that the design of this patch as proposed is pretty
> >> cool and easy to follow.
> >
> > It is clean from certain viewpoint but additional hash,
> > especially hash-searching on every HeapNeedsWAL seems to me to be
> > unacceptable. Do you see it accetable?
> >
> >
> > The attached patch is quiiiccck-and-dirty-hack of Michael's patch
> > just as a PoC of my proposal quoted above. This also passes the
> > 006 test. The major changes are the following.
> >
> > - Moved sync_above and truncted_to into RelationData.
> >
> > - Cleaning up is done in AtEOXact_cleanup instead of explicit
> > calling to smgrDoPendingSyncs().
> >
> > * BufferNeedsWAL (replace of HeapNeedsWAL) no longer requires
> > hash_search. It just refers to the additional members in the
> > given Relation.
> >
> > X I feel that I have dropped one of the features of the origitnal
> > patch during the hack, but I don't recall it clearly now:(
> >
> > X I haven't consider relfilenode replacement, which didn't matter
> > for the original patch. (but there's few places to consider).
> >
> > What do you think about this?
>
> I have moved this patch to next CF. (I still need to look at your patch.)

Thanks for considering that.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-11-08 20:39:18
Message-ID: CA+TgmoaYfuddzEW75jQPGjh8XjHFy=VTJdaKjfU6O-SYNEH9sQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> I dropped the ball on this one back in July, so here's an attempt to revive
> this thread.
>
> I spent some time fixing the remaining issues with the prototype patch I
> posted earlier, and rebased that on top of current git master. See attached.
>
> Some review of that would be nice. If there are no major issues with it, I'm
> going to create backpatchable versions of this for 9.4 and below.

Heikki:

Are you going to commit something here? This thread and patch are
now 14 months old, which is a long time to make people wait for a bug
fix. The status in the CF is "Ready for Committer" although I am not
sure if that's accurate.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-11-09 00:27:55
Message-ID: CAB7nPqR0XnW1uu25=MR48uZcmLO-TvSptx=JnRMVcD2fY=rS-A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Nov 9, 2016 at 5:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>> I dropped the ball on this one back in July, so here's an attempt to revive
>> this thread.
>>
>> I spent some time fixing the remaining issues with the prototype patch I
>> posted earlier, and rebased that on top of current git master. See attached.
>>
>> Some review of that would be nice. If there are no major issues with it, I'm
>> going to create backpatchable versions of this for 9.4 and below.
>
> Are you going to do commit something here? This thread and patch are
> now 14 months old, which is a long time to make people wait for a bug
> fix. The status in the CF is "Ready for Committer" although I am not
> sure if that's accurate.

"Needs Review" is definitely a better description of its current state.
The last time I had a look at this patch I thought that it was in
pretty good shape (not Horiguchi-san's version, but the one in
https://www.postgresql.org/message-id/CAB7nPqR+3JjS=JB3R=AxxkXCyEB-q77U-ERW7_uKAJCtWNTfrg@mail.gmail.com).
With some of the recent changes, surely it needs a second look;
things related to heap handling tend to rot quickly.

I'll look into it once again by the end of this week if Heikki does
not show up, the rest will be on him I am afraid...
--
Michael


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-11-09 06:55:52
Message-ID: CAB7nPqQYwyzacRh2uisSyvKZg0XfTbZVj+XHz8eWjZGJsEM5hQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Nov 9, 2016 at 9:27 AM, Michael Paquier
<michael(dot)paquier(at)gmail(dot)com> wrote:
> On Wed, Nov 9, 2016 at 5:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>> I dropped the ball on this one back in July, so here's an attempt to revive
>>> this thread.
>>>
>>> I spent some time fixing the remaining issues with the prototype patch I
>>> posted earlier, and rebased that on top of current git master. See attached.
>>>
>>> Some review of that would be nice. If there are no major issues with it, I'm
>>> going to create backpatchable versions of this for 9.4 and below.
>>
>> Are you going to do commit something here? This thread and patch are
>> now 14 months old, which is a long time to make people wait for a bug
>> fix. The status in the CF is "Ready for Committer" although I am not
>> sure if that's accurate.
>
> "Needs Review" is definitely a better definition of its current state.
> The last time I had a look at this patch I thought that it was in
> pretty good shape (not Horiguchi-san's version, but the one in
> https://www.postgresql.org/message-id/CAB7nPqR+3JjS=JB3R=AxxkXCyEB-q77U-ERW7_uKAJCtWNTfrg@mail.gmail.com).
> With some of the recent changes, surely it needs a second look, things
> related to heap handling tend to rot quickly.
>
> I'll look into it once again by the end of this week if Heikki does
> not show up, the rest will be on him I am afraid...

I have been able to hit a crash with recovery test 008:
(lldb) bt
* thread #1: tid = 0x0000, 0x00007fff96d48f06 libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff96d48f06 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff9102e4ec libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff8e5cc6df libsystem_c.dylib`abort + 129
    frame #3: 0x0000000106ef10f0 postgres`ExceptionalCondition(conditionName="!(( !( ((void) ((bool) (! (!((buffer) <= NBuffers && (buffer) >= -NLocBuffer)) || (ExceptionalCondition(\"!((buffer) <= NBuffers && (buffer) >= -NLocBuffer)\", (\"FailedAssertion\"), \"bufmgr.c\", 2593), 0)))), (buffer) != 0 ) ? ((bool) 0) : ((buffer) < 0) ? (LocalRefCount[-(buffer) - 1] > 0) : (GetPrivateRefCount(buffer) > 0) ))", errorType="FailedAssertion", fileName="bufmgr.c", lineNumber=2593) + 128 at assert.c:54
    frame #4: 0x0000000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) + 204 at bufmgr.c:2593
    frame #5: 0x000000010694e6ad postgres`HeapNeedsWAL(rel=0x00007f9454804118, buf=0) + 61 at heapam.c:9234
    frame #6: 0x000000010696d8bd postgres`visibilitymap_set(rel=0x00007f9454804118, heapBlk=1, heapBuf=0, recptr=50841176, vmBuf=118, cutoff_xid=866, flags='\x01') + 989 at visibilitymap.c:310
    frame #7: 0x000000010695d020 postgres`heap_xlog_visible(record=0x00007f94520035d0) + 896 at heapam.c:8148
    frame #8: 0x000000010695c582 postgres`heap2_redo(record=0x00007f94520035d0) + 242 at heapam.c:9107
    frame #9: 0x00000001069d132d postgres`StartupXLOG + 9181 at xlog.c:6950
    frame #10: 0x0000000106c9d783 postgres`StartupProcessMain + 339 at startup.c:216
    frame #11: 0x00000001069ee6ec postgres`AuxiliaryProcessMain(argc=2, argv=0x00007fff59316d80) + 1676 at bootstrap.c:420
    frame #12: 0x0000000106c98002 postgres`StartChildProcess(type=StartupProcess) + 322 at postmaster.c:5221
    frame #13: 0x0000000106c96031 postgres`PostmasterMain(argc=3, argv=0x00007f9451c04210) + 6033 at postmaster.c:1301
    frame #14: 0x0000000106bc30cf postgres`main(argc=3, argv=0x00007f9451c04210) + 751 at main.c:228
(lldb) up 1
frame #4: 0x0000000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) + 204 at bufmgr.c:2593
2590 {
2591 BufferDesc *bufHdr;
2592
-> 2593 Assert(BufferIsPinned(buffer));
2594
2595 if (BufferIsLocal(buffer))
2596 bufHdr = GetLocalBufferDescriptor(-buffer - 1);
--
Michael


From: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2016-12-02 04:39:42
Message-ID: CAJrrPGffYzGCLZLfg6Q-RNR5H8ayLjKhYhWOkmfTU9-Af5n+cA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Nov 9, 2016 at 5:55 PM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
wrote:

>
>
> On Wed, Nov 9, 2016 at 9:27 AM, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
> wrote:
> > On Wed, Nov 9, 2016 at 5:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com>
> wrote:
> >> On Thu, Feb 4, 2016 at 7:24 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
> wrote:
> >>> I dropped the ball on this one back in July, so here's an attempt to
> revive
> >>> this thread.
> >>>
> >>> I spent some time fixing the remaining issues with the prototype patch
> I
> >>> posted earlier, and rebased that on top of current git master. See
> attached.
> >>>
> >>> Some review of that would be nice. If there are no major issues with
> it, I'm
> >>> going to create backpatchable versions of this for 9.4 and below.
> >>
> >> Are you going to do commit something here? This thread and patch are
> >> now 14 months old, which is a long time to make people wait for a bug
> >> fix. The status in the CF is "Ready for Committer" although I am not
> >> sure if that's accurate.
> >
> > "Needs Review" is definitely a better definition of its current state.
> > The last time I had a look at this patch I thought that it was in
> > pretty good shape (not Horiguchi-san's version, but the one in
> > https://www.postgresql.org/message-id/CAB7nPqR+3JjS=JB3R=
> AxxkXCyEB-q77U-ERW7_uKAJCtWNTfrg(at)mail(dot)gmail(dot)com).
> > With some of the recent changes, surely it needs a second look, things
> > related to heap handling tend to rot quickly.
> >
> > I'll look into it once again by the end of this week if Heikki does
> > not show up, the rest will be on him I am afraid...
>
> I have been able to hit a crash with recovery test 008:
> (lldb) bt
> * thread #1: tid = 0x0000, 0x00007fff96d48f06 libsystem_kernel.dylib`__pthread_kill
> + 10, stop reason = signal SIGSTOP
> * frame #0: 0x00007fff96d48f06 libsystem_kernel.dylib`__pthread_kill +
> 10
> frame #1: 0x00007fff9102e4ec libsystem_pthread.dylib`pthread_kill + 90
> frame #2: 0x00007fff8e5cc6df libsystem_c.dylib`abort + 129
> frame #3: 0x0000000106ef10f0 postgres`ExceptionalCondition(conditionName="!((
> !( ((void) ((bool) (! (!((buffer) <= NBuffers && (buffer) >= -NLocBuffer))
> || (ExceptionalCondition(\"!((buffer) <= NBuffers && (buffer) >=
> -NLocBuffer)\", (\"FailedAssertion\"), \"bufmgr.c\", 2593), 0)))), (buffer)
> != 0 ) ? ((bool) 0) : ((buffer) < 0) ? (LocalRefCount[-(buffer) - 1] > 0) :
> (GetPrivateRefCount(buffer) > 0) ))", errorType="FailedAssertion",
> fileName="bufmgr.c", lineNumber=2593) + 128 at assert.c:54
> frame #4: 0x0000000106cf4a2c postgres`BufferGetBlockNumber(buffer=0)
> + 204 at bufmgr.c:2593
> frame #5: 0x000000010694e6ad postgres`HeapNeedsWAL(rel=0x00007f9454804118,
> buf=0) + 61 at heapam.c:9234
> frame #6: 0x000000010696d8bd postgres`visibilitymap_set(rel=0x00007f9454804118,
> heapBlk=1, heapBuf=0, recptr=50841176, vmBuf=118, cutoff_xid=866,
> flags='\x01') + 989 at visibilitymap.c:310
> frame #7: 0x000000010695d020 postgres`heap_xlog_visible(record=0x00007f94520035d0)
> + 896 at heapam.c:8148
> frame #8: 0x000000010695c582 postgres`heap2_redo(record=0x00007f94520035d0)
> + 242 at heapam.c:9107
> frame #9: 0x00000001069d132d postgres`StartupXLOG + 9181 at xlog.c:6950
> frame #10: 0x0000000106c9d783 postgres`StartupProcessMain + 339 at
> startup.c:216
> frame #11: 0x00000001069ee6ec postgres`AuxiliaryProcessMain(argc=2,
> argv=0x00007fff59316d80) + 1676 at bootstrap.c:420
> frame #12: 0x0000000106c98002 postgres`StartChildProcess(type=StartupProcess)
> + 322 at postmaster.c:5221
> frame #13: 0x0000000106c96031 postgres`PostmasterMain(argc=3,
> argv=0x00007f9451c04210) + 6033 at postmaster.c:1301
> frame #14: 0x0000000106bc30cf postgres`main(argc=3,
> argv=0x00007f9451c04210) + 751 at main.c:228
> (lldb) up 1
> frame #4: 0x0000000106cf4a2c postgres`BufferGetBlockNumber(buffer=0) +
> 204 at bufmgr.c:2593
> 2590 {
> 2591 BufferDesc *bufHdr;
> 2592
> -> 2593 Assert(BufferIsPinned(buffer));
> 2594
> 2595 if (BufferIsLocal(buffer))
> 2596 bufHdr = GetLocalBufferDescriptor(-buffer - 1);
>

The latest proposed patch still has problems.
It was closed in the 2016-11 commitfest with "moved to next CF" status,
since it is a bug-fix patch.
Please feel free to update the status once you submit the updated patch.

Regards,
Hari Babu
Fujitsu Australia


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-01-31 04:33:19
Message-ID: CAB7nPqRopm_VoGMOUTiMF5ZOLitEjkXqjYE+Kxb+nbd1-+OB4A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Dec 2, 2016 at 1:39 PM, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com> wrote:
> The latest proposed patch still having problems.
> Closed in 2016-11 commitfest with "moved to next CF" status because of a bug
> fix patch.
> Please feel free to update the status once you submit the updated patch.

And moved to CF 2017-03...
--
Michael


From: David Steele <david(at)pgmasters(dot)net>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-03-01 16:23:33
Message-ID: 9efe8fbe-b592-3a9f-dcad-b54ccb6966e0@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 1/30/17 11:33 PM, Michael Paquier wrote:
> On Fri, Dec 2, 2016 at 1:39 PM, Haribabu Kommi <kommi(dot)haribabu(at)gmail(dot)com> wrote:
>> The latest proposed patch still having problems.
>> Closed in 2016-11 commitfest with "moved to next CF" status because of a bug
>> fix patch.
>> Please feel free to update the status once you submit the updated patch.
> And moved to CF 2017-03...

Are there any plans to post a new patch? This thread is now 18 months
old and it would be good to get a resolution in this CF.

Thanks,

--
-David
david(at)pgmasters(dot)net


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: michael(dot)paquier(at)gmail(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-04-07 23:33:21
Message-ID: 20170407233321.lpruyvn6r4tuujmt@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Kyotaro HORIGUCHI wrote:

> The attached patch is quiiiccck-and-dirty-hack of Michael's patch
> just as a PoC of my proposal quoted above. This also passes the
> 006 test. The major changes are the following.
>
> - Moved sync_above and truncated_to into RelationData.

Interesting. I wonder if it's possible that a relcache invalidation
would cause these values to get lost for some reason, because that would
be dangerous.

I suppose the rationale is that this shouldn't happen because any
operation that does things this way must hold an exclusive lock on the
relation. But that doesn't guarantee that the relcache entry is
completely stable, does it? If we can get proof of that, then this
technique should be safe, I think.

In your version of the patch, which I spent some time skimming, I am
missing comments on various functions. I added some as I went along,
including one XXX indicating it must be filled.

RecordPendingSync() should really live in relcache.c (and probably get a
different name).

> X I feel that I have dropped one of the features of the origitnal
> patch during the hack, but I don't recall it clearly now:(

Hah :-)

> X I haven't consider relfilenode replacement, which didn't matter
> for the original patch. (but there's few places to consider).

Hmm ... Please provide.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment Content-Type Size
fix-wal-level-minimal-michael-horiguchi-2.patch text/plain 32.3 KB

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: michael(dot)paquier(at)gmail(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-04-07 23:36:25
Message-ID: 20170407233625.yjze4eopoihy2ozd@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I have claimed this patch as committer FWIW.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: michael(dot)paquier(at)gmail(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-04-07 23:42:17
Message-ID: 20170407234217.i345adtkknjfw6ss@alvherre.pgsql
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera wrote:

> I suppose the rationale is that this shouldn't happen because any
> operation that does things this way must hold an exclusive lock on the
> relation. But that doesn't guarantee that the relcache entry is
> completely stable, does it? If we can get proof of that, then this
> technique should be safe, I think.

It occurs to me that in order to test this we could run the recovery
tests (including Michael's new 006 file, which you didn't include in
your patch) under -D CLOBBER_CACHE_ALWAYS. I think that'd be sufficient
proof that it is solid.
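CLOBBER_CACHE_ALWAYS is a compile-time option, so the suite has to run against a build compiled with it. One way to do that, with configure flags typical of an assert-enabled development build (shown as an illustrative recipe, and expect it to run far slower than a normal build):

```shell
# Build with cache-clobbering enabled: CLOBBER_CACHE_ALWAYS forces a
# relcache flush at every opportunity, exposing any state that does not
# survive a relcache rebuild. Then run the recovery TAP suite.
./configure --enable-cassert --enable-tap-tests \
    CPPFLAGS='-DCLOBBER_CACHE_ALWAYS'
make -s
make -s -C src/test/recovery check
```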

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, michael(dot)paquier(at)gmail(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-04-08 00:38:35
Message-ID: 27309.1491611915@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
> Interesting. I wonder if it's possible that a relcache invalidation
> would cause these values to get lost for some reason, because that would
> be dangerous.

> I suppose the rationale is that this shouldn't happen because any
> operation that does things this way must hold an exclusive lock on the
> relation. But that doesn't guarantee that the relcache entry is
> completely stable,

It ABSOLUTELY is not safe. Relcache flushes can happen regardless of
how strong a lock you hold.

regards, tom lane


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: alvherre(at)2ndquadrant(dot)com, michael(dot)paquier(at)gmail(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-04-11 00:56:06
Message-ID: 20170411.095606.245908357.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello, thank you for looking this.

At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in <27309(dot)1491611915(at)sss(dot)pgh(dot)pa(dot)us>
> Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
> > Interesting. I wonder if it's possible that a relcache invalidation
> > would cause these values to get lost for some reason, because that would
> > be dangerous.
>
> > I suppose the rationale is that this shouldn't happen because any
> > operation that does things this way must hold an exclusive lock on the
> > relation. But that doesn't guarantee that the relcache entry is
> > completely stable,
>
> It ABSOLUTELY is not safe. Relcache flushes can happen regardless of
> how strong a lock you hold.
>
> regards, tom lane

Ugh. Yes, relcache invalidation can happen at any time and it resets
the added values. pg_stat_info misled me into thinking it could store
transient values. But I came up with another thought.

The reason I proposed it was that I thought a hash_search for every
buffer is not good. Instead, like pg_stat_info, we can link the
pending-sync hash entry to the Relation. This greatly reduces the
frequency of hash searching.

I'll post a new patch along these lines soon.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: alvherre(at)2ndquadrant(dot)com, michael(dot)paquier(at)gmail(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-04-11 08:33:41
Message-ID: 20170411.173341.257028732.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Tue, 11 Apr 2017 09:56:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170411(dot)095606(dot)245908357(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> Hello, thank you for looking this.
>
> At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in <27309(dot)1491611915(at)sss(dot)pgh(dot)pa(dot)us>
> > Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
> > > Interesting. I wonder if it's possible that a relcache invalidation
> > > would cause these values to get lost for some reason, because that would
> > > be dangerous.
> >
> > > I suppose the rationale is that this shouldn't happen because any
> > > operation that does things this way must hold an exclusive lock on the
> > > relation. But that doesn't guarantee that the relcache entry is
> > > completely stable,
> >
> > It ABSOLUTELY is not safe. Relcache flushes can happen regardless of
> > how strong a lock you hold.
> >
> > regards, tom lane
>
> Ugh. Yes, relcache invalidation happens anytime and it resets the
> added values. pg_stat_info deceived me that it can store
> transient values. But I came up with another thought.
>
> The reason I proposed it was I thought that hash_search for every
> buffer is not good. Instead, like pg_stat_info, we can link the

buffer => buffer modification

> pending-sync hash entry to Relation. This greately reduces the
> frequency of hash-searching.
>
> I'll post new patch in this way soon.

Here it is.

- Relation has new members no_pending_sync and pending_sync that
work as an instant cache of an entry in the pendingSync hash.

- Commit-time synchronizing is restored, as in Michael's patch.

- If the relfilenode is replaced, the pending_sync for the old node is
removed. Anyway, this is ignored on abort and meaningless on
commit.

- The TAP test is renamed to 012 since some new files have been added.

Accessing the pending-sync hash used to occur on every call of
HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of the
accessed relations had a pending sync. Almost all of those lookups are
eliminated as a result.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
fix-wal-level-minimal-michael-horiguchi-2.patch text/x-patch 95.8 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: alvherre(at)2ndquadrant(dot)com, michael(dot)paquier(at)gmail(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-04-11 08:38:12
Message-ID: 20170411.173812.133964522.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Sorry, what I have just sent was broken.

At Tue, 11 Apr 2017 17:33:41 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170411(dot)173341(dot)257028732(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> At Tue, 11 Apr 2017 09:56:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170411(dot)095606(dot)245908357(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > Hello, thank you for looking this.
> >
> > At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in <27309(dot)1491611915(at)sss(dot)pgh(dot)pa(dot)us>
> > > Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
> > > > Interesting. I wonder if it's possible that a relcache invalidation
> > > > would cause these values to get lost for some reason, because that would
> > > > be dangerous.
> > >
> > > > I suppose the rationale is that this shouldn't happen because any
> > > > operation that does things this way must hold an exclusive lock on the
> > > > relation. But that doesn't guarantee that the relcache entry is
> > > > completely stable,
> > >
> > > It ABSOLUTELY is not safe. Relcache flushes can happen regardless of
> > > how strong a lock you hold.
> > >
> > > regards, tom lane
> >
> > Ugh. Yes, relcache invalidation happens anytime and it resets the
> > added values. pg_stat_info deceived me that it can store
> > transient values. But I came up with another thought.
> >
> > The reason I proposed it was I thought that hash_search for every
> > buffer is not good. Instead, like pg_stat_info, we can link the
>
> buffer => buffer modification
>
> > pending-sync hash entry to Relation. This greately reduces the
> > frequency of hash-searching.
> >
> > I'll post new patch in this way soon.
>
> Here it is.

It contained trailing spaces and was missing the test script. This is
the correct patch.

> - Relation has new members no_pending_sync and pending_sync that
> works as instant cache of an entry in pendingSync hash.
>
> - Commit-time synchronizing is restored as Michael's patch.
>
> - If relfilenode is replaced, pending_sync for the old node is
> removed. Anyway this is ignored on abort and meaningless on
> commit.
>
> - TAP test is renamed to 012 since some new files have been added.
>
> Accessing pending sync hash occured on every calling of
> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> accessing relations has pending sync. Almost of them are
> eliminated as the result.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
fix-wal-level-minimal-michael-horiguchi-2.patch text/x-patch 44.3 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, david(at)pgmasters(dot)net, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, kleptog(at)svana(dot)org, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-04-13 04:52:40
Message-ID: CAB7nPqTRyica1d-zU+YckveFC876=Sc847etmk7TRgAS2pA9CA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> Sorry, what I have just sent was broken.

You can use PROVE_TESTS when running make check to select a subset of
tests you want to run. I use that all the time when working on patches
dedicated to certain code paths.
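PROVE_TESTS is a standard make variable in PostgreSQL's TAP test infrastructure; for example (the test file name here is illustrative):

```shell
# Run only the one TAP script of interest instead of the whole suite;
# the paths given in PROVE_TESTS are relative to the suite's directory.
cd src/test/recovery
make check PROVE_TESTS='t/012_truncate_opt.pl'
```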

>> - Relation has new members no_pending_sync and pending_sync that
>> works as instant cache of an entry in pendingSync hash.
>> - Commit-time synchronizing is restored as Michael's patch.
>> - If relfilenode is replaced, pending_sync for the old node is
>> removed. Anyway this is ignored on abort and meaningless on
>> commit.
>> - TAP test is renamed to 012 since some new files have been added.
>>
>> Accessing pending sync hash occurred on every calling of
>> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
>> accessing relations has pending sync. Almost of them are
>> eliminated as the result.

Did you actually test this patch? One of the logs added makes the
tests take a long time to run:
2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
STATEMENT: ANALYZE;
2017-04-13 12:12:25.766 JST [85492] LOG: BufferNeedsWAL: pendingSyncs
= 0x0, no_pending_sync = 0

- lsn = XLogInsert(RM_SMGR_ID,
- XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
+ rel->no_pending_sync= false;
+ rel->pending_sync = pending;
+ }
It seems to me that those flags and the pending_sync data should be
kept in the context of backend process and not be part of the Relation
data...

+void
+RecordPendingSync(Relation rel)
I don't think that I agree that this should be part of relcache.c. The
syncs should be tracked outside of the relation context.

Seeing how invasive this change is, I would also advocate for this
patch being a HEAD-only change; not many people are complaining about
this optimization of TRUNCATE missing when wal_level = minimal, and
this needs a very careful review.

Should I code something? Or Horiguchi-san, would you take care of it?
The previous crash I saw has been taken care of, but it's been really
some time since I looked at this patch...
--
Michael


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: alvherre(at)2ndquadrant(dot)com, michael(dot)paquier(at)gmail(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-04-13 06:29:35
Message-ID: 20170413.152935.100104316.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I'd like to add a supplementary explanation.

At Tue, 11 Apr 2017 17:38:12 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170411(dot)173812(dot)133964522(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> Sorry, what I have just sent was broken.
>
> At Tue, 11 Apr 2017 17:33:41 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170411(dot)173341(dot)257028732(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > At Tue, 11 Apr 2017 09:56:06 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170411(dot)095606(dot)245908357(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > > Hello, thank you for looking this.
> > >
> > > At Fri, 07 Apr 2017 20:38:35 -0400, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote in <27309(dot)1491611915(at)sss(dot)pgh(dot)pa(dot)us>
> > > > Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> writes:
> > > > > Interesting. I wonder if it's possible that a relcache invalidation
> > > > > would cause these values to get lost for some reason, because that would
> > > > > be dangerous.
> > > >
> > > > > I suppose the rationale is that this shouldn't happen because any
> > > > > operation that does things this way must hold an exclusive lock on the
> > > > > relation. But that doesn't guarantee that the relcache entry is
> > > > > completely stable,
> > > >
> > > > It ABSOLUTELY is not safe. Relcache flushes can happen regardless of
> > > > how strong a lock you hold.
> > > >
> > > > regards, tom lane
> > >
> > > Ugh. Yes, relcache invalidation happens anytime and it resets the

The pending locations are not stored in the relcache hash, so the
problem here is not invalidation but that Relation objects are created
as necessary, anywhere. Even if no invalidation happens, the same
thing will happen in a slightly different form.

> > > added values. pg_stat_info deceived me that it can store
> > > transient values. But I came up with another thought.
> > >
> > > The reason I proposed it was I thought that hash_search for every
> > > buffer is not good. Instead, like pg_stat_info, we can link the
> >
> > buffer => buffer modification
> >
> > > pending-sync hash entry to Relation. This greately reduces the
> > > frequency of hash-searching.
> > >
> > > I'll post new patch in this way soon.
> >
> > Here it is.
>
> It contained tariling space and missing test script. This is the
> correct patch.
>
> > - Relation has new members no_pending_sync and pending_sync that
> > works as instant cache of an entry in pendingSync hash.
> >
> > - Commit-time synchronizing is restored as Michael's patch.
> >
> > - If relfilenode is replaced, pending_sync for the old node is
> > removed. Anyway this is ignored on abort and meaningless on
> > commit.
> >
> > - TAP test is renamed to 012 since some new files have been added.
> >
> > Accessing pending sync hash occured on every calling of
> > HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> > accessing relations has pending sync. Almost of them are
> > eliminated as the result.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)2ndquadrant(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-04-13 09:42:19
Message-ID: 20170413.184219.106482305.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqTRyica1d-zU+YckveFC876=Sc847etmk7TRgAS2pA9CA(at)mail(dot)gmail(dot)com>
> On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > Sorry, what I have just sent was broken.
>
> You can use PROVE_TESTS when running make check to select a subset of
> tests you want to run. I use that all the time when working on patches
> dedicated to certain code paths.

Thank you for the information. Removing unwanted test scripts from the
t/ directories was an annoyance. This makes me happy.

> >> - Relation has new members no_pending_sync and pending_sync that
> >> works as instant cache of an entry in pendingSync hash.
> >> - Commit-time synchronizing is restored as Michael's patch.
> >> - If relfilenode is replaced, pending_sync for the old node is
> >> removed. Anyway this is ignored on abort and meaningless on
> >> commit.
> >> - TAP test is renamed to 012 since some new files have been added.
> >>
> >> Accessing pending sync hash occurred on every calling of
> >> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> >> accessing relations has pending sync. Almost of them are
> >> eliminated as the result.
>
> Did you actually test this patch? One of the logs added makes the
> tests a long time to run:

Maybe this patch requires make clean since it extends the
structure RelationData. (Perhaps I saw the same trouble.)

> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> STATEMENT: ANALYZE;
> 2017-04-13 12:12:25.766 JST [85492] LOG: BufferNeedsWAL: pendingSyncs
> = 0x0, no_pending_sync = 0
>
> - lsn = XLogInsert(RM_SMGR_ID,
> - XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> + rel->no_pending_sync= false;
> + rel->pending_sync = pending;
> + }
>
> It seems to me that those flags and the pending_sync data should be
> kept in the context of backend process and not be part of the Relation
> data...

I understand that the context of "backend process" means
storage.c local. I don't mind which context the data lives in,
but that was the only place I found that can get rid of frequent
hash searching. For pending deletions, just appending to a list
is enough and costs almost nothing; pending syncs, on the other
hand, need to be looked up, sometimes very frequently.

> +void
> +RecordPendingSync(Relation rel)
> I don't think that I agree that this should be part of relcache.c. The
> syncs are tracked should be tracked out of the relation context.

Yeah.. It's in storage.c in the latest patch. (Sorry for the
duplicate name). I think it is a kind of bond between smgr and
relation.

> Seeing how invasive this change is, I would also advocate for this
> patch as only being a HEAD-only change, not many people are
> complaining about this optimization of TRUNCATE missing when wal_level
> = minimal, and this needs a very careful review.

Agreed.

> Should I code something? Or Horiguchi-san, would you take care of it?
> The previous crash I saw has been taken care of, but it's been really
> some time since I looked at this patch...

My point is that a hash search on every tuple insertion should be
avoided even if it happens rarely. At one point this diverged a
bit from your original patch, but in the latest patch the
significant part (the pending-sync hash) is revived from the
original one.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Daniel Gustafsson <daniel(at)yesql(dot)se>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: michael(dot)paquier(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)2ndquadrant(dot)com, david(at)pgmasters(dot)net, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-09-05 10:05:01
Message-ID: B3EC34FC-A48E-41AA-8598-BFC5D87CB383@yesql.se
Lists: pgsql-hackers

> On 13 Apr 2017, at 11:42, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>
> At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqTRyica1d-zU+YckveFC876=Sc847etmk7TRgAS2pA9CA(at)mail(dot)gmail(dot)com>
>> On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
>> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>> Sorry, what I have just sent was broken.
>>
>> You can use PROVE_TESTS when running make check to select a subset of
>> tests you want to run. I use that all the time when working on patches
>> dedicated to certain code paths.
>
> Thank you for the information. Removing unwanted test scripts
> from t/ directories was annoyance. This makes me happy.
>
>>>> - Relation has new members no_pending_sync and pending_sync that
>>>> works as instant cache of an entry in pendingSync hash.
>>>> - Commit-time synchronizing is restored as Michael's patch.
>>>> - If relfilenode is replaced, pending_sync for the old node is
>>>> removed. Anyway this is ignored on abort and meaningless on
>>>> commit.
>>>> - TAP test is renamed to 012 since some new files have been added.
>>>>
>>>> Accessing pending sync hash occurred on every calling of
>>>> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
>>>> accessing relations has pending sync. Almost of them are
>>>> eliminated as the result.
>>
>> Did you actually test this patch? One of the logs added makes the
>> tests a long time to run:
>
> Maybe this patch requires make clean since it extends the
> structure RelationData. (Perhaps I saw the same trouble.)
>
>> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
>> STATEMENT: ANALYZE;
>> 2017-04-13 12:12:25.766 JST [85492] LOG: BufferNeedsWAL: pendingSyncs
>> = 0x0, no_pending_sync = 0
>>
>> - lsn = XLogInsert(RM_SMGR_ID,
>> - XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
>> + rel->no_pending_sync= false;
>> + rel->pending_sync = pending;
>> + }
>>
>> It seems to me that those flags and the pending_sync data should be
>> kept in the context of backend process and not be part of the Relation
>> data...
>
> I understand that the context of "backend process" means
> storage.c local. I don't mind the context on which the data is,
> but I found only there that can get rid of frequent hash
> searching. For pending deletions, just appending to a list is
> enough and costs almost nothing, on the other hand pendig syncs
> are required to be referenced, sometimes very frequently.
>
>> +void
>> +RecordPendingSync(Relation rel)
>> I don't think that I agree that this should be part of relcache.c. The
>> syncs are tracked should be tracked out of the relation context.
>
> Yeah.. It's in storage.c in the latest patch. (Sorry for the
> duplicate name). I think it is a kind of bond between smgr and
> relation.
>
>> Seeing how invasive this change is, I would also advocate for this
>> patch as only being a HEAD-only change, not many people are
>> complaining about this optimization of TRUNCATE missing when wal_level
>> = minimal, and this needs a very careful review.
>
> Agreed.
>
>> Should I code something? Or Horiguchi-san, would you take care of it?
>> The previous crash I saw has been taken care of, but it's been really
>> some time since I looked at this patch...
>
> My point is hash-search on every tuple insertion should be evaded
> even if it happens rearely. Once it was a bit apart from your
> original patch, but in the latest patch the significant part
> (pending-sync hash) is revived from the original one.

This patch has followed along since CF 2016-03, do we think we can reach a
conclusion in this CF? It was marked as "Waiting on Author"; based on
developments in this thread since then, I've changed it back to "Needs
Review" again.

cheers ./daniel


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: daniel(at)yesql(dot)se
Cc: michael(dot)paquier(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)2ndquadrant(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-09-08 07:30:01
Message-ID: 20170908.163001.53230385.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Thank you for your notification.

At Tue, 5 Sep 2017 12:05:01 +0200, Daniel Gustafsson <daniel(at)yesql(dot)se> wrote in <B3EC34FC-A48E-41AA-8598-BFC5D87CB383(at)yesql(dot)se>
> > On 13 Apr 2017, at 11:42, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> >
> > At Thu, 13 Apr 2017 13:52:40 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqTRyica1d-zU+YckveFC876=Sc847etmk7TRgAS2pA9CA(at)mail(dot)gmail(dot)com>
> >> On Tue, Apr 11, 2017 at 5:38 PM, Kyotaro HORIGUCHI
> >> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> >>> Sorry, what I have just sent was broken.
> >>
> >> You can use PROVE_TESTS when running make check to select a subset of
> >> tests you want to run. I use that all the time when working on patches
> >> dedicated to certain code paths.
> >
> > Thank you for the information. Removing unwanted test scripts
> > from t/ directories was annoyance. This makes me happy.
> >
> >>>> - Relation has new members no_pending_sync and pending_sync that
> >>>> works as instant cache of an entry in pendingSync hash.
> >>>> - Commit-time synchronizing is restored as Michael's patch.
> >>>> - If relfilenode is replaced, pending_sync for the old node is
> >>>> removed. Anyway this is ignored on abort and meaningless on
> >>>> commit.
> >>>> - TAP test is renamed to 012 since some new files have been added.
> >>>>
> >>>> Accessing pending sync hash occurred on every calling of
> >>>> HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> >>>> accessing relations has pending sync. Almost of them are
> >>>> eliminated as the result.
> >>
> >> Did you actually test this patch? One of the logs added makes the
> >> tests a long time to run:
> >
> > Maybe this patch requires make clean since it extends the
> > structure RelationData. (Perhaps I saw the same trouble.)
> >
> >> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> >> STATEMENT: ANALYZE;
> >> 2017-04-13 12:12:25.766 JST [85492] LOG: BufferNeedsWAL: pendingSyncs
> >> = 0x0, no_pending_sync = 0
> >>
> >> - lsn = XLogInsert(RM_SMGR_ID,
> >> - XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> >> + rel->no_pending_sync= false;
> >> + rel->pending_sync = pending;
> >> + }
> >>
> >> It seems to me that those flags and the pending_sync data should be
> >> kept in the context of backend process and not be part of the Relation
> >> data...
> >
> > I understand that the context of "backend process" means
> > storage.c local. I don't mind the context on which the data is,
> > but I found only there that can get rid of frequent hash
> > searching. For pending deletions, just appending to a list is
> > enough and costs almost nothing, on the other hand pendig syncs
> > are required to be referenced, sometimes very frequently.
> >
> >> +void
> >> +RecordPendingSync(Relation rel)
> >> I don't think that I agree that this should be part of relcache.c. The
> >> syncs are tracked should be tracked out of the relation context.
> >
> > Yeah.. It's in storage.c in the latest patch. (Sorry for the
> > duplicate name). I think it is a kind of bond between smgr and
> > relation.
> >
> >> Seeing how invasive this change is, I would also advocate for this
> >> patch as only being a HEAD-only change, not many people are
> >> complaining about this optimization of TRUNCATE missing when wal_level
> >> = minimal, and this needs a very careful review.
> >
> > Agreed.
> >
> >> Should I code something? Or Horiguchi-san, would you take care of it?
> >> The previous crash I saw has been taken care of, but it's been really
> >> some time since I looked at this patch...
> >
> > My point is hash-search on every tuple insertion should be evaded
> > even if it happens rearely. Once it was a bit apart from your
> > original patch, but in the latest patch the significant part
> > (pending-sync hash) is revived from the original one.
>
> This patch has followed along since CF 2016-03, do we think we can reach a
> conclusion in this CF? It was marked as "Waiting on Author”, based on
> developments since in this thread, I’ve changed it back to “Needs Review”
> again.

I managed to reload its context into my head. The patch doesn't
apply to the current master and needs some amendment. I'm going
to work on this.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: daniel(at)yesql(dot)se
Cc: michael(dot)paquier(at)gmail(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)2ndquadrant(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-09-12 04:14:41
Message-ID: 20170912.131441.20602611.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Hello,

At Fri, 08 Sep 2017 16:30:01 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170908(dot)163001(dot)53230385(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > >> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> > >> STATEMENT: ANALYZE;
> > >> 2017-04-13 12:12:25.766 JST [85492] LOG: BufferNeedsWAL: pendingSyncs
> > >> = 0x0, no_pending_sync = 0
> > >>
> > >> - lsn = XLogInsert(RM_SMGR_ID,
> > >> - XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> > >> + rel->no_pending_sync= false;
> > >> + rel->pending_sync = pending;
> > >> + }
> > >>
> > >> It seems to me that those flags and the pending_sync data should be
> > >> kept in the context of backend process and not be part of the Relation
> > >> data...
> > >
> > > I understand that the context of "backend process" means
> > > storage.c local. I don't mind the context on which the data is,
> > > but I found only there that can get rid of frequent hash
> > > searching. For pending deletions, just appending to a list is
> > > enough and costs almost nothing, on the other hand pendig syncs
> > > are required to be referenced, sometimes very frequently.
> > >
> > >> +void
> > >> +RecordPendingSync(Relation rel)
> > >> I don't think that I agree that this should be part of relcache.c. The
> > >> syncs are tracked should be tracked out of the relation context.
> > >
> > > Yeah.. It's in storage.c in the latest patch. (Sorry for the
> > > duplicate name). I think it is a kind of bond between smgr and
> > > relation.
> > >
> > >> Seeing how invasive this change is, I would also advocate for this
> > >> patch as only being a HEAD-only change, not many people are
> > >> complaining about this optimization of TRUNCATE missing when wal_level
> > >> = minimal, and this needs a very careful review.
> > >
> > > Agreed.
> > >
> > >> Should I code something? Or Horiguchi-san, would you take care of it?
> > >> The previous crash I saw has been taken care of, but it's been really
> > >> some time since I looked at this patch...
> > >
> > > My point is hash-search on every tuple insertion should be evaded
> > > even if it happens rearely. Once it was a bit apart from your
> > > original patch, but in the latest patch the significant part
> > > (pending-sync hash) is revived from the original one.
> >
> > This patch has followed along since CF 2016-03, do we think we can reach a
> > conclusion in this CF? It was marked as "Waiting on Author”, based on
> > developments since in this thread, I’ve changed it back to “Needs Review”
> > again.
>
> I manged to reload its context into my head. It doesn't apply on
> the current master and needs some amendment. I'm going to work on
> this.

Rebased and slightly modified.

Michael's latest patch, on which this patch piggybacks, seems to
work perfectly. The motive of my addition is to avoid the
frequent (specifically, per-tuple-modification) hash accesses
that occur while pending syncs exist. The hash contains at least
six entries.

The attached patch emits extra log messages, to be removed in the
final version, to show how much the addition reduces hash
accesses. As a basis for judging the worth of the additional
mechanism, I'll show an example set of queries below.

In the log messages, "r" is the relation oid, "b" is the buffer
number, and "hash" is the pointer to the backend-global hash
table for pending syncs. "ent" is the entry in the hash that
belongs to the relation, and "neg" is a flag indicating that the
existing pending-sync hash has no entry for the relation.

=# set log_min_messages to debug2;
=# begin;
=# create table test1(a text primary key);
> DEBUG: BufferNeedsWAL(r 2608, b 55): hash = (nil), ent=(nil), neg = 0
# relid=2608 buf=55, hash has not been created

=# insert into test1 values ('inserted row');
> DEBUG: BufferNeedsWAL(r 24807, b 0): hash = (nil), ent=(nil), neg = 0
# relid=24807, first buffer, hash has not been created

=# copy test1 from '/<somewhere>/copy_data.txt';
> DEBUG: BufferNeedsWAL(r 24807, b 0): hash = 0x171de00, ent=0x171f390, neg = 0
# hash created, pending sync entry linked, no longer needs hash access
# (repeats for the number of buffers)
COPY 200

=# create table test3(a text primary key);
> DEBUG: BufferNeedsWAL(r 2608, b 55): hash = 0x171de00, ent=(nil), neg = 1
# no pending sync entry for this relation, no longer needs hash access.

=# insert into test3 (select a from generate_series(0, 99) a);
> DEBUG: BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 0
> DEBUG: BufferNeedsWAL: accessing hash : not found
> DEBUG: BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 1
# This table no longer needs hash access, (repeats for the number of tuples)

=# truncate test3;
=# insert into test3 (select a from generate_series(0, 99) a);
> DEBUG: BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 0
> DEBUG: BufferNeedsWAL: accessing hash : found
> DEBUG: BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=0x171f340, neg = 0
# This table has pending sync but no longer needs hash access,
# (repeats for the number of tuples)

The hash is required in the case of relcache invalidation. When
ent = (nil) and neg = 0 but hash != (nil), it retries the hash
search and restores the previous state.

This mechanism avoids most of the hash accesses by replacing them
with just following a pointer. On the other hand, hash accesses
occur only after a relation truncation in the current
transaction. In other words, this won't be in effect unless table
truncation, copy, create as, alter table or refreshing a matview
occurs.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
fix-wal-level-minimal-michael-horiguchi-3.patch text/x-patch 33.7 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: tgl(at)sss(dot)pgh(dot)pa(dot)us
Cc: alvherre(at)2ndquadrant(dot)com, michael(dot)paquier(at)gmail(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-09-13 01:04:21
Message-ID: 20170913.100421.234955137.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Hello, (does this seem to be a top post?)

The CF status of this patch turned into "Waiting on Author" by
automated CI checking. However, I still don't get any error even
on the current master (69835bc) after make distclean. Also, the
"problematic" patch and my working branch have no differences
other than line shifts. (So I haven't posted a new one.)

I looked at heapam.c:2502, the location the CI complains about,
in my working branch, and found code different from what the
complaint shows.

https://travis-ci.org/postgresql-cfbot/postgresql/builds/274777750

1363 heapam.c:2502:18: error: ‘HEAP_INSERT_SKIP_WAL’ undeclared (first use in this function)
1364 if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))

heapam.c:2502(at)work branch
2502: /* XLOG stuff */
2503: if (BufferNeedsWAL(relation, buffer))

So I conclude that the CI machinery failed to apply the patch
correctly.

At Thu, 13 Apr 2017 15:29:35 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170413(dot)152935(dot)100104316(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > > > I'll post new patch in this way soon.
> > >
> > > Here it is.
> >
> > It contained tariling space and missing test script. This is the
> > correct patch.
> >
> > > - Relation has new members no_pending_sync and pending_sync that
> > > works as instant cache of an entry in pendingSync hash.
> > >
> > > - Commit-time synchronizing is restored as Michael's patch.
> > >
> > > - If relfilenode is replaced, pending_sync for the old node is
> > > removed. Anyway this is ignored on abort and meaningless on
> > > commit.
> > >
> > > - TAP test is renamed to 012 since some new files have been added.
> > >
> > > Accessing pending sync hash occured on every calling of
> > > HeapNeedsWAL() (per insertion/update/freeze of a tuple) if any of
> > > accessing relations has pending sync. Almost of them are
> > > eliminated as the result.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-09-13 03:05:31
Message-ID: CAEepm=0x7CGYmNM5q7TKzz_KrD+Pr7jbFzD8UZad_+=4PG1PyA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 13, 2017 at 1:04 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> The CF status of this patch turned into "Waiting on Author" by
> automated CI checking. However, I still don't get any error even
> on the current master (69835bc) after make distclean. Also I
> don't see any difference between the "problematic" patch and my
> working branch has nothing different other than patching line
> shifts. (So I haven't post a new one.)
>
> I looked on the location heapam.c:2502 where the CI complains at
> in my working branch and I found a different code with the
> complaint.
>
> https://travis-ci.org/postgresql-cfbot/postgresql/builds/274777750
>
> 1363 heapam.c:2502:18: error: ‘HEAP_INSERT_SKIP_WAL’ undeclared (first use in this function)
> 1364 if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
>
> heapam.c:2502(at)work branch
> 2502: /* XLOG stuff */
> 2503: if (BufferNeedsWAL(relation, buffer))
>
> So I conclude that the CI mechinery failed to applly the patch
> correctly.

Hi Horiguchi-san,

Hmm. Here is that line in heapam.c in unpatched master:

https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/heapam.c;h=d20f0381f3bc23f99c505ef8609d63240ac5d44b;hb=HEAD#l2485

It says:

2485 if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))

After applying fix-wal-level-minimal-michael-horiguchi-3.patch from
this message:

https://www.postgresql.org/message-id/20170912.131441.20602611.horiguchi.kyotaro%40lab.ntt.co.jp

... that line is unchanged, although it has moved to line number 2502.
It doesn't compile for me, because your patch removed the definition
of HEAP_INSERT_SKIP_WAL but hasn't removed that reference to it.

I'm not sure what happened. Is it possible that your patch was not
created by diffing against master?

--
Thomas Munro
http://www.enterprisedb.com


From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, michael(dot)paquier(at)gmail(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-09-13 08:03:48
Message-ID: 20170913080348.vonuawgb7xojwq23@alvherre.pgsql
Lists: pgsql-hackers

Kyotaro HORIGUCHI wrote:

> The CF status of this patch turned into "Waiting on Author" by
> automated CI checking.

I object to patches being turned to Waiting on Author by
automated machinery. Sending occasional reminder messages letting
authors know about outdated patches seems acceptable to me at
this stage.

It'll take some time for this machinery to be perfected; only
when it is beyond the experimental stage will it be acceptable to
change patches' status in an automated fashion.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: thomas(dot)munro(at)enterprisedb(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)2ndquadrant(dot)com, michael(dot)paquier(at)gmail(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-09-13 08:42:39
Message-ID: 20170913.174239.25978735.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

At Wed, 13 Sep 2017 15:05:31 +1200, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote in <CAEepm=0x7CGYmNM5q7TKzz_KrD+Pr7jbFzD8UZad_+=4PG1PyA(at)mail(dot)gmail(dot)com>
> It doesn't compile for me, because your patch removed the definition
> of HEAP_INSERT_SKIP_WAL but hasn't removed that reference to it.
>
> I'm not sure what happened. Is it possible that your patch was not
> created by diffing against master?

It was created using filterdiff.

> git diff master --patience | grep options
...
> - if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))

but the line disappears from the output of the following command:

> git diff master --patience | filterdiff --format=context | grep options

filterdiff seems to did something wrong..

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: thomas(dot)munro(at)enterprisedb(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)2ndquadrant(dot)com, michael(dot)paquier(at)gmail(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: WAL logging problem in 9.4.3?
Date: 2017-09-14 06:34:59
Message-ID: 20170914.153459.94374240.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170913(dot)174239(dot)25978735(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> filterdiff seems to did something wrong..

# to did...

The patch was broken by filterdiff, so I'm sending a new patch
made directly with git format-patch. I confirmed that a build
completes with this applied.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
fix-wal-level-minimal-michael-horiguchi-5.patch text/x-patch 38.7 KB

From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, kleptog(at)svana(dot)org, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2017-11-28 01:36:39
Message-ID: CAB7nPqSqukqS5Xx6_6KEk53eRy5ObdvaNG-5aN_4cE8=gTeOdg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 14, 2017 at 3:34 PM, Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170913(dot)174239(dot)25978735(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
>> filterdiff seems to did something wrong..
>
> # to did...
>
> The patch is broken by filterdiff so I send a new patch made
> directly by git format-patch. I confirmed that a build completes
> with applying this.

To my surprise this patch still applies but fails recovery tests. I am
bumping it to next CF, for what will be its 8th registration as it is
for a bug fix, switching the status to "waiting on author".
--
Michael


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(dot)paquier(at)gmail(dot)com
Cc: thomas(dot)munro(at)enterprisedb(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)2ndquadrant(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2017-12-11 08:54:24
Message-ID: 20171211.175424.09346818.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

At Tue, 28 Nov 2017 10:36:39 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqSqukqS5Xx6_6KEk53eRy5ObdvaNG-5aN_4cE8=gTeOdg(at)mail(dot)gmail(dot)com>
> On Thu, Sep 14, 2017 at 3:34 PM, Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170913(dot)174239(dot)25978735(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> >> filterdiff seems to did something wrong..
> >
> > # to did...

It's horrid to see that:p

> > The patch is broken by filterdiff so I send a new patch made
> > directly by git format-patch. I confirmed that a build completes
> > with applying this.
>
> To my surprise this patch still applies but fails recovery tests. I am
> bumping it to next CF, for what will be its 8th registration as it is
> for a bug fix, switching the status to "waiting on author".

Thank you for checking that. I saw what was maybe the same
failure. It occurred when visibilitymap_set() is called with
heapBuf = InvalidBuffer during recovery. Checking pendingSyncs
and no_pending_sync before the elog fixes it. Anyway, the DEBUG2
elogs are to be removed before committing; they are only there to
show how it works.

The attached patch applies on the current HEAD and passes all
recovery tests.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Fix-WAL-logging-problem.patch text/x-patch 38.8 KB

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: michael(dot)paquier(at)gmail(dot)com, thomas(dot)munro(at)enterprisedb(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)2ndquadrant(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-01-05 04:10:40
Message-ID: 20180105041040.GI2416@tamriel.snowman.net
Lists: pgsql-hackers

Greetings,

* Kyotaro HORIGUCHI (horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp) wrote:
> At Tue, 28 Nov 2017 10:36:39 +0900, Michael Paquier <michael(dot)paquier(at)gmail(dot)com> wrote in <CAB7nPqSqukqS5Xx6_6KEk53eRy5ObdvaNG-5aN_4cE8=gTeOdg(at)mail(dot)gmail(dot)com>
> > On Thu, Sep 14, 2017 at 3:34 PM, Kyotaro HORIGUCHI
> > <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > > At Wed, 13 Sep 2017 17:42:39 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20170913(dot)174239(dot)25978735(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > >> filterdiff seems to did something wrong..
> > >
> > > # to did...
>
> It's horrid to see that:p
>
> > > The patch is broken by filterdiff so I send a new patch made
> > > directly by git format-patch. I confirmed that a build completes
> > > with applying this.
> >
> > To my surprise this patch still applies but fails recovery tests. I am
> > bumping it to next CF, for what will be its 8th registration as it is
> > for a bug fix, switching the status to "waiting on author".
>
> Thank you for checking that. I saw maybe the same failure. It
> occurred when visibilitymap_set() is called with heapBuf =
> InvalidBuffer during recovery. Checking pendingSyncs and
> no_pending_sync before the elog fixes it. Anyway the DEBUG2 elogs
> are to removed before committing. They are just to look how it
> works.
>
> The attached patch applies on the current HEAD and passes all
> recovery tests.

This is currently marked as 'waiting on author' in the CF app, but it
sounds like it should be 'Needs review'. If that's the case, please
update the CF app accordingly. If you run into any issues with that,
let me know.

Thanks!

Stephen


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: sfrost(at)snowman(dot)net
Cc: michael(dot)paquier(at)gmail(dot)com, thomas(dot)munro(at)enterprisedb(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)2ndquadrant(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-01-11 08:03:55
Message-ID: 20180111.170355.123986255.horiguchi.kyotaro@lab.ntt.co.jp

Hello,

At Thu, 4 Jan 2018 23:10:40 -0500, Stephen Frost <sfrost(at)snowman(dot)net> wrote in <20180105041040(dot)GI2416(at)tamriel(dot)snowman(dot)net>
> > The attached patch applies on the current HEAD and passes all
> > recovery tests.
>
> This is currently marked as 'waiting on author' in the CF app, but it
> sounds like it should be 'Needs review'. If that's the case, please
> update the CF app accordingly. If you run into any issues with that,
> let me know.
>
> Thanks!

Thank you for letting me know. The attached is the rebased
patch (although the previous version did not actually conflict with
the current master), and I have changed the status to "Needs Review".

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Fix-WAL-logging-problem.patch text/x-patch 38.8 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: sfrost(at)snowman(dot)net
Cc: michael(dot)paquier(at)gmail(dot)com, thomas(dot)munro(at)enterprisedb(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)2ndquadrant(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-03-30 01:06:46
Message-ID: 20180330.100646.86008470.horiguchi.kyotaro@lab.ntt.co.jp

Hello. I found that commit c203d6cf81 conflicted with this patch;
attached is the version rebased onto the current master.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Fix-WAL-logging-problem.patch text/x-patch 38.8 KB

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: sfrost(at)snowman(dot)net, michael(dot)paquier(at)gmail(dot)com, thomas(dot)munro(at)enterprisedb(dot)com, tgl(at)sss(dot)pgh(dot)pa(dot)us, alvherre(at)2ndquadrant(dot)com, david(at)pgmasters(dot)net, hlinnaka(at)iki(dot)fi, simon(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, masao(dot)fujii(at)gmail(dot)com, kleptog(at)svana(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-04 04:59:12
Message-ID: 20180704045912.GG1672@paquier.xyz

On Fri, Mar 30, 2018 at 10:06:46AM +0900, Kyotaro HORIGUCHI wrote:
> Hello. I found that c203d6cf81 hit this and this is the rebased
> version on the current master.

Okay, as this is visibly the oldest item in this commit fest, Andrew has
asked me to look at a solution which would allow us to definitively close
the loop for all maintained branches. In consequence, I have been
looking at this problem. Here are my thoughts:
- The set of errors reported on this thread is alarming: depending on
the scenario used, we can get "could not read file" errors, or even
data loss after WAL replay comes and wipes out everything.
- Completely disabling the TRUNCATE optimization is definitely not cool,
as there could be an impact for users.
- Removing wal_level = minimal is not acceptable either, as some people
rely on this feature.
- Rewriting the sync handling of heap relation files in an invasive way
may be something to investigate and improve on HEAD (I am not really
convinced about that, actually, for the optimizations discussed on this
thread, as it may result in more bugs than actual fixes), but that
would do nothing for the back-branches.

Hence I propose the attached patch, which disables the TRUNCATE and COPY
optimizations for the two cases actually causing problems. One solution
has been presented by Simon here for COPY, which is to disable the
optimization when there are no blocks in the relation with
wal_level = minimal:
https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=YETxvmuu1YK4UMXQY2gaGw@mail.gmail.com
For back-patching, I find that really appealing.

The second thing that the attached patch does is to tweak
ExecuteTruncateGuts so that the TRUNCATE optimization never runs for
wal_level = minimal.

Another thing that this patch adds is a set of regression tests to
stress all the various scenarios presented on this thread, with table
creation, INSERT, COPY and TRUNCATE running in the same transaction, for
both wal_level = minimal and replica, making sure that there are no
failures and no actual data loss. The tests are useful anyway, as no
patch presented so far came with an easy way to test all the scenarios,
except for a bash script present upthread, which missed some of
the cases.

I would propose that for back-patching; the tests should go back easily
to 9.6, though I have not verified that yet.

Thoughts?
--
Michael

Attachment Content-Type Size
wal-minimal-copy-truncate.patch text/x-diff 6.3 KB

From: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, kleptog(at)svana(dot)org, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-04 11:55:53
Message-ID: CAA8=A7_76dy-TWU1R7waJ0XogP5EDkQaiXtZ4=V_d_Di3HZqGg@mail.gmail.com

On Wed, Jul 4, 2018 at 12:59 AM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Fri, Mar 30, 2018 at 10:06:46AM +0900, Kyotaro HORIGUCHI wrote:
>> Hello. I found that c203d6cf81 hit this and this is the rebased
>> version on the current master.
>
> Okay, as this is visibly the oldest item in this commit fest, Andrew has
> asked me to look at a solution which would allow us to definitely close
> the loop for all maintained branches. In consequence, I have been
> looking at this problem. Here are my thoughts:
> - The set of errors reported on this thread are alarming, depending on
> the scenarios used, we could have "could not read file" stuff, or even
> data loss after WAL replay comes and wipes out everything.
> - Disabling completely the TRUNCATE optimization is definitely not cool,
> as there could be an impact for users.
> - Removing wal_level = minimal is not acceptable as well, as some people
> rely on this feature.
> - Rewriting the sync handling of heap relation files in an invasive way
> may be something to investigate and improve on HEAD (I am not really
> convinced about that actually for the optimizations discussed on this
> thread as this may result in more bugs than actual fixes), but that
> would do nothing for back-branches.
>
> Hence I propose the patch attached which disables the TRUNCATE and COPY
> optimizations for two cases, which are the ones actually causing
> problems. One solution has been presented by Simon here for COPY, which
> is to disable the optimization when there are no blocks on a relation
> with wal_level = minimal:
> https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=YETxvmuu1YK4UMXQY2gaGw@mail.gmail.com
> For back-patching, I find that really appealing.
>
> The second thing that the patch attached does is to tweak
> ExecuteTruncateGuts so as the TRUNCATE optimization never runs for
> wal_level = minimal.
>
> Another thing that this patch adds is a set of regression tests to
> stress all the various scenarios presented on this thread with table
> creation, INSERT, COPY and TRUNCATE running in the same transactions for
> both wal_level = minimal and replica, which make sure that there are no
> failures and no actual data loss. The test is useful anyway, as any
> patch presented did not present a way to test easily all the scenarios,
> except for a bash script present upthread, but this discarded some of
> the cases.
>
> I would propose that for a back-patch, except for the test which can go
> down easily to 9.6 but I have not tested that yet.
>

Many thanks for working on this.

+1 for these changes, even though the TRUNCATE fix looks perverse. If
anyone wants to propose further optimizations in this area, this would
at least give us a starting point which is correct.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
Cc: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Stephen Frost <sfrost(at)snowman(dot)net>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, David Steele <david(at)pgmasters(dot)net>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, kleptog(at)svana(dot)org, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-05 07:11:53
Message-ID: 20180705071152.GA23405@paquier.xyz

On Wed, Jul 04, 2018 at 07:55:53AM -0400, Andrew Dunstan wrote:
> Many thanks for working on this.

No problem. Thanks for the lookup.

> +1 for these changes, even though the TRUNCATE fix looks perverse. If
> anyone wants to propose further optimizations in this area this would
> at least give us a startpoint which is correct.

Yes, that's exactly what I am getting at. The optimizations which are
currently broken just cannot and should not be used. If anybody wishes
to improve the current set of optimizations in place for wal_level =
minimal, let's also consider the other patch. By the way, based on the
tests I sent in the previous patch, I have compiled five scenarios:
1) BEGIN -> CREATE TABLE -> TRUNCATE -> COMMIT.
With wal_level = minimal, this fails hard with "could not read block 0"
when trying to read the data after commit.
2) BEGIN -> CREATE -> INSERT -> TRUNCATE -> INSERT -> COMMIT. This
one reports an empty table, without failing, but there should be tuples
from the second INSERT.
3) BEGIN -> CREATE -> INSERT -> TRUNCATE -> COPY -> COMMIT, which also
reports an empty table while there should be tuples from the COPY.
4) BEGIN -> CREATE -> INSERT -> TRUNCATE -> INSERT -> COPY -> INSERT ->
COMMIT, which fails at WAL replay with "PANIC: invalid max offset
number".
5) BEGIN -> CREATE -> INSERT -> COPY -> COMMIT, which sees only the
tuple from the INSERT, giving an incorrect number of tuples. If you
reverse the COPY and INSERT, then this scenario passes.

This stuff really generates a good number of different failures. So
many people have participated in this thread that discussing this
approach further would surely be a good step forward, and it summarizes
quite nicely the set of failures discussed here until now. I would be
happy to push forward with this patch to close all the holes
mentioned.
--
Michael


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-10 14:35:58
Message-ID: df32e286-ae2b-f45a-8f2e-4fa02684300b@iki.fi

Thanks for picking this up!

(I hope this gets through the email filters this time, sending a shell
script seems to be difficult. I also trimmed the CC list, if that helps.)

On 04/07/18 07:59, Michael Paquier wrote:
> Hence I propose the patch attached which disables the TRUNCATE and COPY
> optimizations for two cases, which are the ones actually causing
> problems. One solution has been presented by Simon here for COPY, which
> is to disable the optimization when there are no blocks on a relation
> with wal_level = minimal:
> https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=YETxvmuu1YK4UMXQY2gaGw@mail.gmail.com
> For back-patching, I find that really appealing.

This fails in the case where there are any WAL-logged changes to the
table while the COPY is running. That can happen at least if the table
has an INSERT trigger that performs operations on the same table, and
the COPY fires the trigger. That scenario is covered by the little bash
script I posted earlier in this thread
(https://www.postgresql.org/message-id/55AFC302.1060805%40iki.fi).
Attached is a new version of that script, updated to make it work with v11.

> The second thing that the patch attached does is to tweak
> ExecuteTruncateGuts so as the TRUNCATE optimization never runs for
> wal_level = minimal.

If we go down that route, let's at least keep the TRUNCATE optimization
for temporary and unlogged tables.

- Heikki

Attachment Content-Type Size
test-wal-minimal-2-bash-script text/plain 2.6 KB

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-11 03:32:41
Message-ID: 20180711033241.GQ1661@paquier.xyz

On Tue, Jul 10, 2018 at 05:35:58PM +0300, Heikki Linnakangas wrote:
> Thanks for picking this up!
>
> (I hope this gets through the email filters this time, sending a shell
> script seems to be difficult. I also trimmed the CC list, if that helps.)
>
> On 04/07/18 07:59, Michael Paquier wrote:
>> Hence I propose the patch attached which disables the TRUNCATE and COPY
>> optimizations for two cases, which are the ones actually causing
>> problems. One solution has been presented by Simon here for COPY, which
>> is to disable the optimization when there are no blocks on a relation
>> with wal_level = minimal:
>> https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=YETxvmuu1YK4UMXQY2gaGw@mail.gmail.com
>> For back-patching, I find that really appealing.
>
> This fails in the case that there are any WAL-logged changes to the table
> while the COPY is running. That can happen at least if the table has an
> INSERT trigger, that performs operations on the same table, and the COPY
> fires the trigger. That scenario is covered by the little bash script I
> posted earlier in this thread
> (https://www.postgresql.org/message-id/55AFC302.1060805%40iki.fi). Attached
> is a new version of that script, updated to make it work with v11.

Thanks for the pointer. My TAP test has been covering two out of the
three scenarios you have in your script. I have been able to convert
the remaining one as attached, and I have also added an extra test with
TRUNCATE triggers. So it seems to me that we want to disable the
optimization if any type of trigger is defined on the relation being
copied to, as it could be possible that these triggers work on the
copied blocks as well, for any BEFORE/AFTER and STATEMENT/ROW
trigger. What do you think?

>> The second thing that the patch attached does is to tweak
>> ExecuteTruncateGuts so as the TRUNCATE optimization never runs for
>> wal_level = minimal.
>
> If we go down that route, let's at least keep the TRUNCATE optimization for
> temporary and unlogged tables.

Yes, that sounds right. Fixed as well. I have additionally done more
work on the comments.

Thoughts?
--
Michael

Attachment Content-Type Size
wal-minimal-copy-truncate-v2.patch text/x-diff 10.2 KB

From: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-12 13:51:33
Message-ID: aac8e19b-9159-e473-77be-53f80b658190@2ndQuadrant.com

On 07/10/2018 11:32 PM, Michael Paquier wrote:
> On Tue, Jul 10, 2018 at 05:35:58PM +0300, Heikki Linnakangas wrote:
>> Thanks for picking this up!
>>
>> (I hope this gets through the email filters this time, sending a shell
>> script seems to be difficult. I also trimmed the CC list, if that helps.)
>>
>> On 04/07/18 07:59, Michael Paquier wrote:
>>> Hence I propose the patch attached which disables the TRUNCATE and COPY
>>> optimizations for two cases, which are the ones actually causing
>>> problems. One solution has been presented by Simon here for COPY, which
>>> is to disable the optimization when there are no blocks on a relation
>>> with wal_level = minimal:
>>> https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=YETxvmuu1YK4UMXQY2gaGw@mail.gmail.com
>>> For back-patching, I find that really appealing.
>> This fails in the case that there are any WAL-logged changes to the table
>> while the COPY is running. That can happen at least if the table has an
>> INSERT trigger, that performs operations on the same table, and the COPY
>> fires the trigger. That scenario is covered by the little bash script I
>> posted earlier in this thread
>> (https://www.postgresql.org/message-id/55AFC302.1060805%40iki.fi). Attached
>> is a new version of that script, updated to make it work with v11.
> Thanks for the pointer. My tap test has been covering two out of the
> three scenarios you have in your script. I have been able to convert
> the extra as the attached, and I have added as well an extra test with
> TRUNCATE triggers. So it seems to me that we want to disable the
> optimization if any type of trigger are defined on the relation copied
> to as it could be possible that these triggers work on the blocks copied
> as well, for any BEFORE/AFTER and STATEMENT/ROW triggers. What do you
> think?
>

Yeah, this seems like the only sane approach.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-12 14:12:21
Message-ID: 08b11907-d0b2-c396-2978-ba5aac1972df@iki.fi

On 12/07/18 16:51, Andrew Dunstan wrote:
>
>
> On 07/10/2018 11:32 PM, Michael Paquier wrote:
>> On Tue, Jul 10, 2018 at 05:35:58PM +0300, Heikki Linnakangas wrote:
>>> Thanks for picking this up!
>>>
>>> (I hope this gets through the email filters this time, sending a shell
>>> script seems to be difficult. I also trimmed the CC list, if that helps.)
>>>
>>> On 04/07/18 07:59, Michael Paquier wrote:
>>>> Hence I propose the patch attached which disables the TRUNCATE and COPY
>>>> optimizations for two cases, which are the ones actually causing
>>>> problems. One solution has been presented by Simon here for COPY, which
>>>> is to disable the optimization when there are no blocks on a relation
>>>> with wal_level = minimal:
>>>> https://www.postgresql.org/message-id/CANP8+jKN4V4MJEzFN_iEtdZ+1oM=YETxvmuu1YK4UMXQY2gaGw@mail.gmail.com
>>>> For back-patching, I find that really appealing.
>>> This fails in the case that there are any WAL-logged changes to the table
>>> while the COPY is running. That can happen at least if the table has an
>>> INSERT trigger, that performs operations on the same table, and the COPY
>>> fires the trigger. That scenario is covered by the little bash script I
>>> posted earlier in this thread
>>> (https://www.postgresql.org/message-id/55AFC302.1060805%40iki.fi). Attached
>>> is a new version of that script, updated to make it work with v11.
>> Thanks for the pointer. My tap test has been covering two out of the
>> three scenarios you have in your script. I have been able to convert
>> the extra as the attached, and I have added as well an extra test with
>> TRUNCATE triggers. So it seems to me that we want to disable the
>> optimization if any type of trigger are defined on the relation copied
>> to as it could be possible that these triggers work on the blocks copied
>> as well, for any BEFORE/AFTER and STATEMENT/ROW triggers. What do you
>> think?
>
> Yeah, this seems like the only sane approach.

Doesn't have to be a trigger, could be a CHECK constraint, datatype
input function, etc. Admittedly, having a datatype input function that
inserts to the table is worth a "huh?", but I'm feeling very confident
that we can catch all such cases, and some of them might even be sensible.

- Heikki


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-13 01:47:38
Message-ID: 20180713014738.GK1167@paquier.xyz

On Thu, Jul 12, 2018 at 05:12:21PM +0300, Heikki Linnakangas wrote:
> Doesn't have to be a trigger, could be a CHECK constraint, datatype input
> function, etc. Admittedly, having a datatype input function that inserts to
> the table is worth a "huh?", but I'm feeling very confident that we can
> catch all such cases, and some of them might even be sensible.

Sure, but do we want to be that invasive? Triggers are easy enough to
block because they are available directly within cstate, so you know
whether they would fire. CHECK constraints can also easily be checked
by looking at the Relation information, and since DEFAULT values can
carry an expression, we would want to block them too, no? The input
datatype is, well, trickier to deal with, as there is no actual way to
know whether an INSERT is happening within the context of a COPY, and
this could be arbitrary C code. One way to tackle that would be to
refuse the optimization if a non-system data type is used when doing
COPY...

Disabling the optimization entirely for any relation which has a CHECK
constraint or DEFAULT expression would apply to a hell of a lot of
them, which makes the optimization, at least it seems to me, useless,
because it would never apply to most real-world cases.
--
Michael


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-16 18:38:39
Message-ID: CA+TgmoZGn7MmMGRu4NkfxyXKSCzmvq1JvqsWm=hN=GJDMTfTKg@mail.gmail.com

On Thu, Jul 12, 2018 at 10:12 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> Doesn't have to be a trigger, could be a CHECK constraint, datatype input
> function, etc. Admittedly, having a datatype input function that inserts to
> the table is worth a "huh?", but I'm feeling very confident that we can
> catch all such cases, and some of them might even be sensible.

Is this sentence missing a "not"? i.e. "I'm not feeling very confident"?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>,Michael Paquier <michael(at)paquier(dot)xyz>,"pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-16 18:41:51
Message-ID: 026AD6A1-3396-47F4-B87E-77E3F139D166@iki.fi

On 16 July 2018 21:38:39 EEST, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>On Thu, Jul 12, 2018 at 10:12 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
>wrote:
>> Doesn't have to be a trigger, could be a CHECK constraint, datatype
>input
>> function, etc. Admittedly, having a datatype input function that
>inserts to
>> the table is worth a "huh?", but I'm feeling very confident that we
>can
>> catch all such cases, and some of them might even be sensible.
>
>Is this sentence missing a "not"? i.e. "I'm not feeling very
>confident"?

Yes, sorry.

- Heikki


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-16 20:14:09
Message-ID: 20180716201409.2qfcneo4qkdwjvpv@alvherre.pgsql

On 2018-Jul-12, Heikki Linnakangas wrote:

> > > Thanks for the pointer. My tap test has been covering two out of
> > > the three scenarios you have in your script. I have been able to
> > > convert the extra as the attached, and I have added as well an
> > > extra test with TRUNCATE triggers. So it seems to me that we want
> > > to disable the optimization if any type of trigger are defined on
> > > the relation copied to as it could be possible that these triggers
> > > work on the blocks copied as well, for any BEFORE/AFTER and
> > > STATEMENT/ROW triggers. What do you think?
> >
> > Yeah, this seems like the only sane approach.
>
> Doesn't have to be a trigger, could be a CHECK constraint, datatype
> input function, etc. Admittedly, having a datatype input function that
> inserts to the table is worth a "huh?", but I'm feeling very confident
> that we can catch all such cases, and some of them might even be
> sensible.

A counterexample could be a JSON compression scheme that uses a catalog
for a dictionary of keys. Hasn't this been described already? It is
also not completely out of the question for GIS data, I think. (Not
sure whether PostGIS does this already.)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-17 00:01:29
Message-ID: 20180717000129.GA3388@paquier.xyz

On Mon, Jul 16, 2018 at 09:41:51PM +0300, Heikki Linnakangas wrote:
> On 16 July 2018 21:38:39 EEST, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>On Thu, Jul 12, 2018 at 10:12 AM, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
>>wrote:
>>> Doesn't have to be a trigger, could be a CHECK constraint, datatype
>>input
>>> function, etc. Admittedly, having a datatype input function that
>>inserts to
>>> the table is worth a "huh?", but I'm feeling very confident that we
>>can
>>> catch all such cases, and some of them might even be sensible.
>>
>>Is this sentence missing a "not"? i.e. "I'm not feeling very
>>confident"?
>
> Yes, sorry.

This explains a lot :p

I also doubt that we'd be able to catch all the holes, as the
conditions under which the optimization could run safely are basically
impossible to determine beforehand. I'd like to vote for getting
rid of this optimization for COPY; it can hurt more than it
helps. Given the lack of complaints, perhaps this should happen only on HEAD?
--
Michael


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: alvherre(at)2ndquadrant(dot)com
Cc: hlinnaka(at)iki(dot)fi, andrew(dot)dunstan(at)2ndquadrant(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-17 09:24:22
Message-ID: 20180717.182422.96244544.horiguchi.kyotaro@lab.ntt.co.jp

Hello.

At Mon, 16 Jul 2018 16:14:09 -0400, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com> wrote in <20180716201409(dot)2qfcneo4qkdwjvpv(at)alvherre(dot)pgsql>
> On 2018-Jul-12, Heikki Linnakangas wrote:
>
> > > > Thanks for the pointer. My tap test has been covering two out of
> > > > the three scenarios you have in your script. I have been able to
> > > > convert the extra as the attached, and I have added as well an
> > > > extra test with TRUNCATE triggers. So it seems to me that we want
> > > > to disable the optimization if any type of trigger are defined on
> > > > the relation copied to as it could be possible that these triggers
> > > > work on the blocks copied as well, for any BEFORE/AFTER and
> > > > STATEMENT/ROW triggers. What do you think?
> > >
> > > Yeah, this seems like the only sane approach.
> >
> > Doesn't have to be a trigger, could be a CHECK constraint, datatype
> > input function, etc. Admittedly, having a datatype input function that
> > inserts to the table is worth a "huh?", but I'm feeling very confident
> > that we can catch all such cases, and some of them might even be
> > sensible.
>
> A counterexample could be a JSON compression scheme that uses a catalog
> for a dictionary of keys. Hasn't this been described already? Also not
> completely out of the question for GIS data, I think. (Not sure if
> PostGIS does this already.)

In the third case, IIUC, disabling bulk-insertion after any WAL-logged
insertion has happened seems to work. The attached diff against the v2
patch makes the three TAP tests pass. It uses the relcache to store the
XID of the last WAL-logged insertion, but the entry will not be
invalidated during a COPY operation.
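The XID-based guard described above could be sketched, outside of PostgreSQL, roughly as follows. This is a standalone toy model: the struct and function names are invented for illustration and are not the actual relcache fields or backend functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId ((TransactionId) 0)

/* Toy stand-in for a relcache entry; field names are hypothetical. */
typedef struct RelModel
{
    TransactionId rd_createXid;       /* xact that created the relfilenode */
    TransactionId rd_lastLoggedXid;   /* xact of last WAL-logged insertion */
} RelModel;

/* Called whenever an insertion is WAL-logged, e.g. one fired indirectly
 * by a trigger or CHECK constraint during COPY. */
static void
note_logged_insertion(RelModel *rel, TransactionId xid)
{
    rel->rd_lastLoggedXid = xid;
}

/* The WAL-skipping bulk-insert path is only taken when the relation was
 * created in the current transaction and no WAL-logged insertion has
 * happened in it since. */
static bool
can_use_wal_skipping(const RelModel *rel, TransactionId xid)
{
    return rel->rd_createXid == xid && rel->rd_lastLoggedXid != xid;
}
```

The point of the second condition is exactly the scenario discussed upthread: once any side-channel insertion has been WAL-logged in the same transaction, the optimization must stay off for the rest of the COPY.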

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
wal-minimal-copy-truncate-v2-v3.diff text/x-patch 3.2 KB

From: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-17 12:28:47
Message-ID: 3c1cf991-8846-9a6f-c669-e21f8ea9a6a4@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 07/16/2018 08:01 PM, Michael Paquier wrote:
>
>
> I doubt as well that we'd be able to catch all the holes, as the
> conditions under which the optimization can run safely are basically
> impossible to determine beforehand. I'd like to vote for getting rid of
> this optimization for COPY; it can hurt more than it helps. Per the
> lack of complaints, perhaps this should happen only in HEAD?

Well, we'd be getting rid of it because of a danger of data loss which
we can't otherwise mitigate. Maybe it does need to be backpatched, even
if we haven't had complaints.

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-18 10:42:10
Message-ID: CA+TgmoZU9nBd6m3NQohzjpdvBNtw+6UGoz+uqoaawkXy76gYSg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Jul 17, 2018 at 8:28 AM, Andrew Dunstan
<andrew(dot)dunstan(at)2ndquadrant(dot)com> wrote:
> Well, we'd be getting rid of it because of a danger of data loss which we
> can't otherwise mitigate. Maybe it does need to be backpatched, even if we
> haven't had complaints.

What's wrong with the approach proposed in
http://postgr.es/m/55AFC302.1060805@iki.fi ?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-18 13:06:22
Message-ID: 20180718130622.GI8565@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jul 18, 2018 at 06:42:10AM -0400, Robert Haas wrote:
> On Tue, Jul 17, 2018 at 8:28 AM, Andrew Dunstan
> <andrew(dot)dunstan(at)2ndquadrant(dot)com> wrote:
>> Well, we'd be getting rid of it because of a danger of data loss which we
>> can't otherwise mitigate. Maybe it does need to be backpatched, even if we
>> haven't had complaints.
>
> What's wrong with the approach proposed in
> http://postgr.es/m/55AFC302.1060805@iki.fi ?

For back-branches that's very invasive so that seems risky to me
particularly seeing the low number of complaints on the matter.
--
Michael


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-18 13:29:01
Message-ID: CA+Tgmob_aF_rQmczNDUXZz2z+MksFmCyP7-3KvuauuWoS9400g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>> What's wrong with the approach proposed in
>> http://postgr.es/m/55AFC302.1060805@iki.fi ?
>
> For back-branches that's very invasive so that seems risky to me
> particularly seeing the low number of complaints on the matter.

Hmm. I think that if you disable the optimization, you're betting that
people won't mind losing performance in this case in a maintenance
release. If you back-patch Heikki's approach, you're betting that the
committed version doesn't have any bugs that are worse than the status
quo. Personally, I'd rather take the latter bet. Maybe the patch
isn't all there yet, but that seems like something we can work
towards. If we just give up and disable the optimization, we won't
know how many people we ticked off or how badly until after we've done
it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-18 14:58:16
Message-ID: c7d2ca8a-d376-f19b-e95e-b879efc3b860@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 18/07/18 16:29, Robert Haas wrote:
> On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>>> What's wrong with the approach proposed in
>>> http://postgr.es/m/55AFC302.1060805@iki.fi ?
>>
>> For back-branches that's very invasive so that seems risky to me
>> particularly seeing the low number of complaints on the matter.
>
> Hmm. I think that if you disable the optimization, you're betting that
> people won't mind losing performance in this case in a maintenance
> release. If you back-patch Heikki's approach, you're betting that the
> committed version doesn't have any bugs that are worse than the status
> quo. Personally, I'd rather take the latter bet. Maybe the patch
> isn't all there yet, but that seems like something we can work
> towards. If we just give up and disable the optimization, we won't
> know how many people we ticked off or how badly until after we've done
> it.

Yeah. I'm not happy about backpatching a big patch like what I proposed,
and Kyotaro developed further. But I think it's the least bad option we
have, the other options discussed seem even worse.

One way to review the patch is to look at what it changes, when
wal_level is *not* set to minimal, i.e. what risk or overhead does it
pose to users who are not affected by this bug? It seems pretty safe to me.

The other aspect is, how confident are we that this actually fixes the
bug, with least impact to users using wal_level='minimal'? I think it's
the best shot we have so far. All the other proposals either don't fully
fix the bug, or hurt performance in some legit cases.

I'd suggest that we continue based on the patch that Kyotaro posted at
https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.

- Heikki


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-25 14:08:33
Message-ID: 20180725140833.GC6660@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Jul 18, 2018 at 05:58:16PM +0300, Heikki Linnakangas wrote:
> I'd suggest that we continue based on the patch that Kyotaro posted at
> https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.

Whatever happens here, perhaps one way to move on would be to first
commit the TAP test that I proposed upthread. It would not work for
wal_level=minimal, so that part should be commented out, but this way
it is easier to test basically all the cases we have talked about,
whichever approach is taken.
--
Michael


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: michael(at)paquier(dot)xyz
Cc: hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-26 05:50:11
Message-ID: 20180726.145011.120625137.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello.

At Wed, 25 Jul 2018 23:08:33 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in <20180725140833(dot)GC6660(at)paquier(dot)xyz>
> On Wed, Jul 18, 2018 at 05:58:16PM +0300, Heikki Linnakangas wrote:
> > I'd suggest that we continue based on the patch that Kyotaro posted at
> > https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.
>
> Whatever happens here, perhaps one way to move on would be to first
> commit the TAP test that I proposed upthread. It would not work for
> wal_level=minimal, so that part should be commented out, but this way
> it is easier to test basically all the cases we have talked about,
> whichever approach is taken.

https://www.postgresql.org/message-id/20180704045912.GG1672@paquier.xyz

However, while I'm not sure the policy (if any) allows us to add a test
that should succeed, I'm not opposed to doing that. But even if we did,
it would be visible only to those of us in this thread. It seems to me
more or less similar to pasting a boilerplate that points to the above
message in this thread, or just writing "this patch passes the test".

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-07-27 19:26:24
Message-ID: d0c9e197-5219-c094-418a-e5a6fbd8cdda@2ndQuadrant.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 07/18/2018 10:58 AM, Heikki Linnakangas wrote:
> On 18/07/18 16:29, Robert Haas wrote:
>> On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier
>> <michael(at)paquier(dot)xyz> wrote:
>>>> What's wrong with the approach proposed in
>>>> http://postgr.es/m/55AFC302.1060805@iki.fi ?
>>>
>>> For back-branches that's very invasive so that seems risky to me
>>> particularly seeing the low number of complaints on the matter.
>>
>> Hmm. I think that if you disable the optimization, you're betting that
>> people won't mind losing performance in this case in a maintenance
>> release.  If you back-patch Heikki's approach, you're betting that the
>> committed version doesn't have any bugs that are worse than the status
>> quo.  Personally, I'd rather take the latter bet.  Maybe the patch
>> isn't all there yet, but that seems like something we can work
>> towards.  If we just give up and disable the optimization, we won't
>> know how many people we ticked off or how badly until after we've done
>> it.
>
> Yeah. I'm not happy about backpatching a big patch like what I
> proposed, and Kyotaro developed further. But I think it's the least
> bad option we have, the other options discussed seem even worse.
>
> One way to review the patch is to look at what it changes, when
> wal_level is *not* set to minimal, i.e. what risk or overhead does it
> pose to users who are not affected by this bug? It seems pretty safe
> to me.
>
> The other aspect is, how confident are we that this actually fixes the
> bug, with least impact to users using wal_level='minimal'? I think
> it's the best shot we have so far. All the other proposals either
> don't fully fix the bug, or hurt performance in some legit cases.
>
> I'd suggest that we continue based on the patch that Kyotaro posted at
> https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.
>

I have just spent some time reviewing Kyotaro's patch. I'm a bit
nervous, too, given the size. But I'm also nervous about leaving things
as they are. I suspect the reason we haven't heard more about this is
that these days use of "wal_level = minimal" is relatively rare.

I like the fact that this is closer to being a real fix rather than just
throwing out the optimization. Like Heikki I've come round to the view
that something like this is the least bad option.

The code looks good to me - some comments might be helpful in
heap_xlog_update()

Do we want to try this on HEAD and then backpatch it? Do we want to add
some testing along the lines Michael suggested?

cheers

andrew

--
Andrew Dunstan https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: andrew(dot)dunstan(at)2ndquadrant(dot)com
Cc: hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-10-11 04:42:35
Message-ID: 20181011.134235.218062184.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello.

At Fri, 27 Jul 2018 15:26:24 -0400, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com> wrote in <d0c9e197-5219-c094-418a-e5a6fbd8cdda(at)2ndQuadrant(dot)com>
>
>
> On 07/18/2018 10:58 AM, Heikki Linnakangas wrote:
> > On 18/07/18 16:29, Robert Haas wrote:
> >> On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier <michael(at)paquier(dot)xyz>
> >> wrote:
> >>>> What's wrong with the approach proposed in
> >>>> http://postgr.es/m/55AFC302.1060805@iki.fi ?
> >>>
> >>> For back-branches that's very invasive so that seems risky to me
> >>> particularly seeing the low number of complaints on the matter.
> >>
> >> Hmm. I think that if you disable the optimization, you're betting that
> >> people won't mind losing performance in this case in a maintenance
> >> release.  If you back-patch Heikki's approach, you're betting that the
> >> committed version doesn't have any bugs that are worse than the status
> >> quo.  Personally, I'd rather take the latter bet.  Maybe the patch
> >> isn't all there yet, but that seems like something we can work
> >> towards.  If we just give up and disable the optimization, we won't
> >> know how many people we ticked off or how badly until after we've done
> >> it.
> >
> > Yeah. I'm not happy about backpatching a big patch like what I
> > proposed, and Kyotaro developed further. But I think it's the least
> > bad option we have, the other options discussed seem even worse.
> >
> > One way to review the patch is to look at what it changes, when
> > wal_level is *not* set to minimal, i.e. what risk or overhead does it
> > pose to users who are not affected by this bug? It seems pretty safe
> > to me.
> >
> > The other aspect is, how confident are we that this actually fixes the
> > bug, with least impact to users using wal_level='minimal'? I think
> > it's the best shot we have so far. All the other proposals either
> > don't fully fix the bug, or hurt performance in some legit cases.
> >
> > I'd suggest that we continue based on the patch that Kyotaro posted at
> > https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.
> >
>
>
>
> I have just spent some time reviewing Kyotaro's patch. I'm a bit
> nervous, too, given the size. But I'm also nervous about leaving
> things as they are. I suspect the reason we haven't heard more about
> this is that these days use of "wal_level = minimal" is relatively
> rare.

Thank you for looking at this (and sorry for the late response).

> I like the fact that this is closer to being a real fix rather than
> just throwing out the optimization. Like Heikki I've come round to the
> view that something like this is the least bad option.
>
> The code looks good to me - some comments might be helpful in
> heap_xlog_update()

Thanks. That part was intended to avoid a PANIC on a broken record. I
have reverted it, since a PANIC would be preferable in that case.

> Do we want to try this on HEAD and then backpatch it? Do we want to
> add some testing along the lines Michael suggested?

Commit 44cac93464 conflicted with this, so I rebased. I also added
Michael's TAP test from [1] as patch 0001.

I regard [2] as an orthogonal issue.

The previous patch didn't handle the BEGIN;CREATE;TRUNCATE;COMMIT case.
This version contains a "fix" for nbtree (patch 0003) so that an FPI of
the metapage is always emitted when building an empty index. On the
other hand this emits one or two useless FPIs (136 bytes each) on
TRUNCATE in a separate transaction, but that shouldn't matter much.
Other index methods don't have this problem; some other AMs emit
initialization WAL even in minimal mode.
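The failure mode this addresses is the one from the start of the thread: at wal_level=minimal, the WAL stream for BEGIN; CREATE TABLE; TRUNCATE; COMMIT contains a truncate record for the index but no image of the rebuilt metapage, so replay leaves a zero-length index file. A toy replay model (plain C, not PostgreSQL code; the record names are invented) illustrates why always emitting the metapage FPI repairs this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record types for the toy model. */
typedef enum { REC_TRUNCATE, REC_METAPAGE_FPI } RecType;

#define BLCKSZ 8192

/* Replay a WAL stream against an index file whose pre-crash writes were
 * skipped from WAL and never fsync'ed (so they did not survive the
 * crash); return the resulting file size.  A truncate record empties
 * the file; a full-page image of the metapage restores one page. */
static size_t
replay_index_size(const RecType *recs, int nrecs)
{
    size_t size = 0;    /* unsynced, unlogged writes are lost */
    for (int i = 0; i < nrecs; i++)
    {
        if (recs[i] == REC_TRUNCATE)
            size = 0;
        else if (recs[i] == REC_METAPAGE_FPI && size < BLCKSZ)
            size = BLCKSZ;
    }
    return size;
}
```

With only the truncate record (the buggy stream), replay produces a 0-byte index file, matching the broken index seen in the original report; appending the metapage FPI leaves one valid 8 kB page.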

This still has a few too many elog(DEBUG2) calls, left in to show how
it is working. I'm going to remove most of them in the final version.

I started to prefix the file names with version 2.

regards.

[1] https://www.postgresql.org/message-id/20180711033241.GQ1661@paquier.xyz

[2] https://www.postgresql.org/message-id/CAKJS1f9iF55cwx-LUOreRokyi9UZESXOLHuFDkt0wksZN+KqWw@mail.gmail.com

or

https://commitfest.postgresql.org/20/1811/

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v2-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 7.2 KB
v2-0002-Fix-WAL-logging-problem.patch text/x-patch 38.2 KB
v2-0003-Write-WAL-for-empty-nbtree-index-build.patch text/x-patch 2.1 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: andrew(dot)dunstan(at)2ndquadrant(dot)com
Cc: hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-10-11 08:04:53
Message-ID: 20181011.170453.123148806.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Thu, 11 Oct 2018 13:42:35 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20181011(dot)134235(dot)218062184(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> Hello.
>
> At Fri, 27 Jul 2018 15:26:24 -0400, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com> wrote in <d0c9e197-5219-c094-418a-e5a6fbd8cdda(at)2ndQuadrant(dot)com>
> >
> >
> > On 07/18/2018 10:58 AM, Heikki Linnakangas wrote:
> > > On 18/07/18 16:29, Robert Haas wrote:
> > >> On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier <michael(at)paquier(dot)xyz>
> > >> wrote:
> > >>>> What's wrong with the approach proposed in
> > >>>> http://postgr.es/m/55AFC302.1060805@iki.fi ?
> > >>>
> > >>> For back-branches that's very invasive so that seems risky to me
> > >>> particularly seeing the low number of complaints on the matter.
> > >>
> > >> Hmm. I think that if you disable the optimization, you're betting that
> > >> people won't mind losing performance in this case in a maintenance
> > >> release.  If you back-patch Heikki's approach, you're betting that the
> > >> committed version doesn't have any bugs that are worse than the status
> > >> quo.  Personally, I'd rather take the latter bet.  Maybe the patch
> > >> isn't all there yet, but that seems like something we can work
> > >> towards.  If we just give up and disable the optimization, we won't
> > >> know how many people we ticked off or how badly until after we've done
> > >> it.
> > >
> > > Yeah. I'm not happy about backpatching a big patch like what I
> > > proposed, and Kyotaro developed further. But I think it's the least
> > > bad option we have, the other options discussed seem even worse.
> > >
> > > One way to review the patch is to look at what it changes, when
> > > wal_level is *not* set to minimal, i.e. what risk or overhead does it
> > > pose to users who are not affected by this bug? It seems pretty safe
> > > to me.
> > >
> > > The other aspect is, how confident are we that this actually fixes the
> > > bug, with least impact to users using wal_level='minimal'? I think
> > > it's the best shot we have so far. All the other proposals either
> > > don't fully fix the bug, or hurt performance in some legit cases.
> > >
> > > I'd suggest that we continue based on the patch that Kyotaro posted at
> > > https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.
> > >
> >
> >
> >
> > I have just spent some time reviewing Kyotaro's patch. I'm a bit
> > nervous, too, given the size. But I'm also nervous about leaving
> > things as they are. I suspect the reason we haven't heard more about
> > this is that these days use of "wal_level = minimal" is relatively
> > rare.
>
> Thank you for looking at this (and sorry for the late response).
>
> > I like the fact that this is closer to being a real fix rather than
> > just throwing out the optimization. Like Heikki I've come round to the
> > view that something like this is the least bad option.
> >
> > The code looks good to me - some comments might be helpful in
> > heap_xlog_update()
>
> Thanks. It is intending to avoid PANIC for a broken record. I
> reverted the part since PANIC would be preferable in the case.
>
> > Do we want to try this on HEAD and then backpatch it? Do we want to
> > add some testing along the lines Michael suggested?
>
> 44cac93464 hit this, rebased. And added Michael's TAP test
> contained in [1] as patch 0001.
>
> I regard [2] as an orthogonal issue.
>
> The previous patch didn't care of the case of
> BEGIN;CREATE;TRUNCATE;COMMIT case. This version contains a "fix"
> of nbtree (patch 0003) so that FPI of the metapage is always
> emitted when building an empty index. On the other hand this
> emits useless one or two FPIs (136 bytes each) on TRUNCATE in a
> separate transaction, but it won't matter so much.. Other index
> methods don't have this problem. Some other AMs emits initialize
> WALs even in minimal mode.
>
> This still has a bit too many elog(DEBUG2)s to see how it is
> working. I'm going to remove most of them in the final version.
>
> I started to prefix the file names with version 2.
>
> regards.
>
> [1] https://www.postgresql.org/message-id/20180711033241.GQ1661@paquier.xyz
>
> [2] https://www.postgresql.org/message-id/CAKJS1f9iF55cwx-LUOreRokyi9UZESXOLHuFDkt0wksZN+KqWw@mail.gmail.com
>
> or
>
> https://commitfest.postgresql.org/20/1811/

I refactored getPendingSyncEntry out of RecordPendingSync,
BufferNeedsWAL and RelationTruncate, and split the second patch into
infrastructure-side and user-side parts. I expect this makes reviewing
far easier.

I replaced RelationNeedsWAL in a part of the code added to
heap_update() by bfa2ab56bb.

- v3-0001-TAP-test-for-copy-truncation-optimization.patch

TAP test

-v3-0002-Write-WAL-for-empty-nbtree-index-build.patch

nbtree "fix"

- v3-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch

Pending-sync infrastructure.

- v3-0004-Fix-WAL-skipping-feature.patch

Actual fix of WAL skipping feature.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v3-0004-Fix-WAL-skipping-feature.patch text/x-patch 16.5 KB
v3-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch text/x-patch 22.6 KB
v3-0002-Write-WAL-for-empty-nbtree-index-build.patch text/x-patch 2.1 KB
v3-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 7.2 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: andrew(dot)dunstan(at)2ndquadrant(dot)com
Cc: hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-11-14 03:47:36
Message-ID: 20181114.124736.206988673.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello.

At Thu, 11 Oct 2018 17:04:53 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20181011(dot)170453(dot)123148806(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> At Thu, 11 Oct 2018 13:42:35 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20181011(dot)134235(dot)218062184(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> I refactored getPendingSyncEntry out of RecordPendingSync,
> BufferNeedsWAL and RelationTruncate, and split the second patch
> into infrastructure-side and user-side parts. I expect this makes
> reviewing far easier.
>
> I replaced RelationNeedsWAL in a part of the code added to
> heap_update() by bfa2ab56bb.
>
> - v3-0001-TAP-test-for-copy-truncation-optimization.patch
>
> TAP test
>
> -v3-0002-Write-WAL-for-empty-nbtree-index-build.patch
>
> nbtree "fix"
>
> - v3-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch
>
> Pending-sync infrastructure.
>
> - v3-0004-Fix-WAL-skipping-feature.patch
>
> Actual fix of WAL skipping feature.

Patch 0004 was broken by commit e9edc1ba0b; rebased to the current
HEAD. Successfully built and passed all regression/recovery tests,
including the additional recovery/t/016_wal_optimize.pl.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v4-0004-Fix-WAL-skipping-feature.patch text/x-patch 16.7 KB
v4-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch text/x-patch 22.6 KB
v4-0002-Write-WAL-for-empty-nbtree-index-build.patch text/x-patch 2.1 KB
v4-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 7.2 KB

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp
Cc: Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-11-30 17:27:05
Message-ID: CA+q6zcV6MUg1BEoQUywX917Oiz6JoMdoZ1Vu3RT5GgBb-yPszg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Wed, Nov 14, 2018 at 4:48 AM Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>
> 0004 was shot by e9edc1ba0b. Rebased to the current HEAD.
> Successfully built and passed all regression/recovery tests
> including additional recovery/t/016_wal_optimize.pl.

Thank you for working on this patch. Unfortunately, cfbot complains that
v4-0004-Fix-WAL-skipping-feature.patch could not be applied without conflicts.
Could you please post a rebased version one more time?

> On Fri, Jul 27, 2018 at 9:26 PM Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com> wrote:
>
> On 07/18/2018 10:58 AM, Heikki Linnakangas wrote:
> > On 18/07/18 16:29, Robert Haas wrote:
> >> On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier
> >> <michael(at)paquier(dot)xyz> wrote:
> >>>> What's wrong with the approach proposed in
> >>>> http://postgr.es/m/55AFC302.1060805@iki.fi ?
> >>>
> >>> For back-branches that's very invasive so that seems risky to me
> >>> particularly seeing the low number of complaints on the matter.
> >>
> >> Hmm. I think that if you disable the optimization, you're betting that
> >> people won't mind losing performance in this case in a maintenance
> >> release. If you back-patch Heikki's approach, you're betting that the
> >> committed version doesn't have any bugs that are worse than the status
> >> quo. Personally, I'd rather take the latter bet. Maybe the patch
> >> isn't all there yet, but that seems like something we can work
> >> towards. If we just give up and disable the optimization, we won't
> >> know how many people we ticked off or how badly until after we've done
> >> it.
> >
> > Yeah. I'm not happy about backpatching a big patch like what I
> > proposed, and Kyotaro developed further. But I think it's the least
> > bad option we have, the other options discussed seem even worse.
> >
> > One way to review the patch is to look at what it changes, when
> > wal_level is *not* set to minimal, i.e. what risk or overhead does it
> > pose to users who are not affected by this bug? It seems pretty safe
> > to me.
> >
> > The other aspect is, how confident are we that this actually fixes the
> > bug, with least impact to users using wal_level='minimal'? I think
> > it's the best shot we have so far. All the other proposals either
> > don't fully fix the bug, or hurt performance in some legit cases.
> >
> > I'd suggest that we continue based on the patch that Kyotaro posted at
> > https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.
> >
> I have just spent some time reviewing Kyotaro's patch. I'm a bit
> nervous, too, given the size. But I'm also nervous about leaving things
> as they are. I suspect the reason we haven't heard more about this is
> that these days use of "wal_level = minimal" is relatively rare.

I'm totally out of context of this patch, but reading this makes me nervous
too. Taking into account that the problem now is lack of review, do you have
plans to spend more time reviewing this patch?


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: 9erthalion6(at)gmail(dot)com
Cc: andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2018-12-20 08:32:25
Message-ID: 20181220.173225.94657882.horiguchi.kyotaro@lab.ntt.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello.

At Fri, 30 Nov 2018 18:27:05 +0100, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote in <CA+q6zcV6MUg1BEoQUywX917Oiz6JoMdoZ1Vu3RT5GgBb-yPszg(at)mail(dot)gmail(dot)com>
> > On Wed, Nov 14, 2018 at 4:48 AM Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> >
> > 0004 was shot by e9edc1ba0b. Rebased to the current HEAD.
> > Successfully built and passed all regression/recovery tests
> > including additional recovery/t/016_wal_optimize.pl.
>
> Thank you for working on this patch. Unfortunately, cfbot complains that
> v4-0004-Fix-WAL-skipping-feature.patch could not be applied without conflicts.
> Could you please post a rebased version one more time?

Thanks. Here's the rebased version. I found no amendment required other
than resolving the apparent conflict.

> > On Fri, Jul 27, 2018 at 9:26 PM Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com> wrote:
> >
> > On 07/18/2018 10:58 AM, Heikki Linnakangas wrote:
> > > On 18/07/18 16:29, Robert Haas wrote:
> > >> On Wed, Jul 18, 2018 at 9:06 AM, Michael Paquier
> > >> <michael(at)paquier(dot)xyz> wrote:
> > >>>> What's wrong with the approach proposed in
> > >>>> http://postgr.es/m/55AFC302.1060805@iki.fi ?
> > >>>
> > >>> For back-branches that's very invasive so that seems risky to me
> > >>> particularly seeing the low number of complaints on the matter.
> > >>
> > >> Hmm. I think that if you disable the optimization, you're betting that
> > >> people won't mind losing performance in this case in a maintenance
> > >> release. If you back-patch Heikki's approach, you're betting that the
> > >> committed version doesn't have any bugs that are worse than the status
> > >> quo. Personally, I'd rather take the latter bet. Maybe the patch
> > >> isn't all there yet, but that seems like something we can work
> > >> towards. If we just give up and disable the optimization, we won't
> > >> know how many people we ticked off or how badly until after we've done
> > >> it.
> > >
> > > Yeah. I'm not happy about backpatching a big patch like what I
> > > proposed, and Kyotaro developed further. But I think it's the least
> > > bad option we have, the other options discussed seem even worse.
> > >
> > > One way to review the patch is to look at what it changes, when
> > > wal_level is *not* set to minimal, i.e. what risk or overhead does it
> > > pose to users who are not affected by this bug? It seems pretty safe
> > > to me.
> > >
> > > The other aspect is, how confident are we that this actually fixes the
> > > bug, with least impact to users using wal_level='minimal'? I think
> > > it's the best shot we have so far. All the other proposals either
> > > don't fully fix the bug, or hurt performance in some legit cases.
> > >
> > > I'd suggest that we continue based on the patch that Kyotaro posted at
> > > https://www.postgresql.org/message-id/20180330.100646.86008470.horiguchi.kyotaro%40lab.ntt.co.jp.
> > >
> > I have just spent some time reviewing Kyotaro's patch. I'm a bit
> > nervous, too, given the size. But I'm also nervous about leaving things
> > as they are. I suspect the reason we haven't heard more about this is
> > that these days use of "wal_level = minimal" is relatively rare.
>
> I'm completely out of context on this patch, but reading this makes me nervous
> too. Taking into account that the problem now is a lack of review, do you have
> plans to spend more time reviewing this patch?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v5-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 7.2 KB
v5-0002-Write-WAL-for-empty-nbtree-index-build.patch text/x-patch 2.1 KB
v5-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch text/x-patch 22.6 KB
v5-0004-Fix-WAL-skipping-feature.patch text/x-patch 16.4 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: 9erthalion6(at)gmail(dot)com
Cc: andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-01-30 01:26:34
Message-ID: 20190130.102634.246232856.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Rebased.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v6-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 7.2 KB
v6-0002-Write-WAL-for-empty-nbtree-index-build.patch text/x-patch 2.1 KB
v6-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch text/x-patch 22.6 KB
v6-0004-Fix-WAL-skipping-feature.patch text/x-patch 16.4 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: 9erthalion6(at)gmail(dot)com
Cc: andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-03-04 03:24:48
Message-ID: 20190304.122448.177167234.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Rebased.

No commit touched this, but I fixed one whitespace error.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v7-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 7.2 KB
v7-0002-Write-WAL-for-empty-nbtree-index-build.patch text/x-patch 2.1 KB
v7-0003-Add-infrastructure-to-WAL-logging-skip-feature.patch text/x-patch 22.6 KB
v7-0004-Fix-WAL-skipping-feature.patch text/x-patch 16.4 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-03-11 02:27:08
Message-ID: 20190311022708.GA2189728@rfd.leadboat.com
Lists: pgsql-hackers

This has been waiting for a review since October, so I reviewed it. The code
comment at PendingRelSync summarizes the design well, and I like that design.
I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
last paragraph, and I suspect it would have been no harder to back-patch. I
wonder if it would have been simpler and better, but I'm not asking anyone to
investigate that. Let's keep pursuing your current design.

This moves a shared_buffers scan and smgrimmedsync() from commands like COPY
to COMMIT. Users setting a timeout on COMMIT may need to adjust, and
log_min_duration_statement analysis will reflect the change. I feel that's
fine. (There already exist ways for COMMIT to be slow.)

On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> --- a/src/backend/access/nbtree/nbtsort.c
> +++ b/src/backend/access/nbtree/nbtsort.c
> @@ -611,8 +611,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
> RelationOpenSmgr(wstate->index);
>
> - /* XLOG stuff */
> - if (wstate->btws_use_wal)
> + /* XLOG stuff
> + *
> + * Even if minimal mode, WAL is required here if truncation happened after
> + * being created in the same transaction. It is not needed otherwise but
> + * we don't bother identifying the case precisely.
> + */
> + if (wstate->btws_use_wal ||
> + (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))

We initialized "btws_use_wal" like this:

#define XLogIsNeeded() (wal_level >= WAL_LEVEL_REPLICA)
#define RelationNeedsWAL(relation) \
((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);

Hence, this change causes us to emit WAL for the metapage of a
RELPERSISTENCE_UNLOGGED or RELPERSISTENCE_TEMP relation. We should never do
that. If we do that for RELPERSISTENCE_TEMP, redo will write to a permanent
relfilenode. I've attached a test case for this; it is a patch that applies
on top of your v7 patches. The test checks for orphaned files after redo.

> + * If no tuple was inserted, it's possible that we are truncating a
> + * relation. We need to emit WAL for the metapage in the case. However it
> + * is not required elsewise,

Did you mean to write more words after that comma?

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c

> + * NB: after WAL-logging has been skipped for a block, we must not WAL-log
> + * any subsequent actions on the same block either. Replaying the WAL record
> + * of the subsequent action might fail otherwise, as the "before" state of
> + * the block might not match, as the earlier actions were not WAL-logged.

Good point. To participate in WAL redo properly, each "before" state must
have a distinct pd_lsn. In CREATE INDEX USING btree, the initial index build
skips WAL, but an INSERT later in the same transaction writes WAL. There,
however, each "before" state does have a distinct pd_lsn; the initial build
has pd_lsn==0, and each subsequent state has a pd_lsn driven by WAL position.
Hence, I think the CREATE INDEX USING btree behavior is fine, even though it
doesn't conform to this code comment.

I think this restriction applies only to full_page_writes=off. Otherwise, the
first WAL-logged change will find pd_lsn==0 and emit a full-page image. With
a full-page image in the record, the block's "before" state doesn't matter.
Also, one could make it safe to write WAL for a particular block by issuing
heap_sync() for the block's relation.

> +/*
> + * RelationRemovePendingSync() -- remove pendingSync entry for a relation
> + */
> +void
> +RelationRemovePendingSync(Relation rel)

What is the coding rule for deciding when to call this? Currently, only
ATExecSetTableSpace() calls this. CLUSTER doesn't call it, despite behaving
much like ALTER TABLE SET TABLESPACE behaves.

> +{
> + bool found;
> +
> + rel->pending_sync = NULL;
> + rel->no_pending_sync = true;
> + if (pendingSyncs)
> + {
> + elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
> + hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
> + }
> +}

We'd need a mechanism to un-remove the sync at subtransaction abort. My
attachment includes a test case demonstrating the consequences of that defect.
Please look for other areas that need to know about subtransactions; patch v7
had no code pertaining to subtransactions.

> + elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",

As you mention upthread, you have many debugging elog()s. These are too
detailed to include in every binary, but I do want them in the code. See
CACHE_elog() for a good example of achieving that.

> +/*
> + * Sync to disk any relations that we skipped WAL-logging for earlier.
> + */
> +void
> +smgrDoPendingSyncs(bool isCommit)
> +{
> + if (!pendingSyncs)
> + return;
> +
> + if (isCommit)
> + {
> + HASH_SEQ_STATUS status;
> + PendingRelSync *pending;
> +
> + hash_seq_init(&status, pendingSyncs);
> +
> + while ((pending = hash_seq_search(&status)) != NULL)
> + {
> + if (pending->sync_above != InvalidBlockNumber)

I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
InvalidBlockNumber" are not sync requests at all. Those just record the fact
of a RelationTruncate() happening. If you can think of a way to improve that,
please do so. If not, it's okay.

> --- a/src/backend/utils/cache/relcache.c
> +++ b/src/backend/utils/cache/relcache.c

> @@ -412,6 +413,10 @@ AllocateRelationDesc(Form_pg_class relp)
> /* which we mark as a reference-counted tupdesc */
> relation->rd_att->tdrefcount = 1;
>
> + /* We don't know if pending sync for this relation exists so far */
> + relation->pending_sync = NULL;
> + relation->no_pending_sync = false;

RelationData fields other than "pgstat_info" have "rd_" prefixes; add that
prefix to these fields.

This is a nonstandard place to clear fields. Clear them in
load_relcache_init_file() only, like we do for rd_statvalid. (Other paths
will then rely on palloc0() for implicit initialization.)

> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c

> @@ -3991,7 +4007,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
> MarkBufferDirty(buffer);
>
> /* XLOG stuff */
> - if (RelationNeedsWAL(relation))
> + if (BufferNeedsWAL(relation, buffer) ||
> + BufferNeedsWAL(relation, newbuf))

This is fine if both buffers need WAL or neither buffer needs WAL. It is not
fine when one buffer needs WAL and the other buffer does not. My attachment
includes a test case. Of the bugs I'm reporting, this one seems most
difficult to solve well.

> @@ -8961,9 +8978,16 @@ heap2_redo(XLogReaderState *record)
> * heap_sync - sync a heap, for use when no WAL has been written
> *
> * This forces the heap contents (including TOAST heap if any) down to disk.
> - * If we skipped using WAL, and WAL is otherwise needed, we must force the
> - * relation down to disk before it's safe to commit the transaction. This
> - * requires writing out any dirty buffers and then doing a forced fsync.
> + * If we did any changes to the heap bypassing the buffer manager, we must
> + * force the relation down to disk before it's safe to commit the
> + * transaction, because the direct modifications will not be flushed by
> + * the next checkpoint.
> + *
> + * We used to also use this after batch operations like COPY and CLUSTER,
> + * if we skipped using WAL and WAL is otherwise needed, but there were
> + * corner-cases involving other WAL-logged operations to the same
> + * relation, where that was not enough. heap_register_sync() should be
> + * used for that purpose instead.

We still use heap_sync() in CLUSTER. Can we migrate CLUSTER to the newer
heap_register_sync()? Patch v7 makes some commands use the new way (COPY,
CREATE TABLE AS, REFRESH MATERIALIZED VIEW, ALTER TABLE) and leaves other
commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
TABLESPACE, CLUSTER). It would make the system simpler to understand if we
eliminated the old way. If that creates more problems than it solves, please
at least write down a coding rule to explain why certain commands shouldn't
use the old way.

Thanks,
nm

Attachment Content-Type Size
wal-optimize-noah-tests-v1.patch text/plain 3.2 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: noah(at)leadboat(dot)com
Cc: 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-03-20 08:17:54
Message-ID: 20190320.171754.171896368.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Thank you for reviewing!

At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190311022708(dot)GA2189728(at)rfd(dot)leadboat(dot)com>
> This has been waiting for a review since October, so I reviewed it. The code
> comment at PendingRelSync summarizes the design well, and I like that design.

It is Michael's work.

> I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
> last paragraph, and I suspect it would have been no harder to back-patch. I
> wonder if it would have been simpler and better, but I'm not asking anyone to
> investigate that. Let's keep pursuing your current design.

I must admit that this is complex..

> This moves a shared_buffers scan and smgrimmedsync() from commands like COPY
> to COMMIT. Users setting a timeout on COMMIT may need to adjust, and
> log_min_duration_statement analysis will reflect the change. I feel that's
> fine. (There already exist ways for COMMIT to be slow.)
>
> On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > --- a/src/backend/access/nbtree/nbtsort.c
> > +++ b/src/backend/access/nbtree/nbtsort.c
> > @@ -611,8 +611,14 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
> > /* Ensure rd_smgr is open (could have been closed by relcache flush!) */
> > RelationOpenSmgr(wstate->index);
> >
> > - /* XLOG stuff */
> > - if (wstate->btws_use_wal)
> > + /* XLOG stuff
> > + *
> > + * Even if minimal mode, WAL is required here if truncation happened after
> > + * being created in the same transaction. It is not needed otherwise but
> > + * we don't bother identifying the case precisely.
> > + */
> > + if (wstate->btws_use_wal ||
> > + (blkno == BTREE_METAPAGE && BTPageGetMeta(page)->btm_root == 0))
>
> We initialized "btws_use_wal" like this:
>
> #define XLogIsNeeded() (wal_level >= WAL_LEVEL_REPLICA)
> #define RelationNeedsWAL(relation) \
> ((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
> wstate.btws_use_wal = XLogIsNeeded() && RelationNeedsWAL(wstate.index);
>
> Hence, this change causes us to emit WAL for the metapage of a
> RELPERSISTENCE_UNLOGGED or RELPERSISTENCE_TEMP relation. We should never do
> that. If we do that for RELPERSISTENCE_TEMP, redo will write to a permanent
> relfilenode. I've attached a test case for this; it is a patch that applies
> on top of your v7 patches. The test checks for orphaned files after redo.

Oops! Added RelationNeedsWAL(index) there. (Attached 1st patch on
top of this patchset)

> > + * If no tuple was inserted, it's possible that we are truncating a
> > + * relation. We need to emit WAL for the metapage in the case. However it
> > + * is not required elsewise,
>
> Did you mean to write more words after that comma?

Sorry, it is just garbage. The required work is done in
_bt_blwritepage.

> > --- a/src/backend/catalog/storage.c
> > +++ b/src/backend/catalog/storage.c
>
> > + * NB: after WAL-logging has been skipped for a block, we must not WAL-log
> > + * any subsequent actions on the same block either. Replaying the WAL record
> > + * of the subsequent action might fail otherwise, as the "before" state of
> > + * the block might not match, as the earlier actions were not WAL-logged.
>
> Good point. To participate in WAL redo properly, each "before" state must
> have a distinct pd_lsn. In CREATE INDEX USING btree, the initial index build
> skips WAL, but an INSERT later in the same transaction writes WAL. There,
> however, each "before" state does have a distinct pd_lsn; the initial build
> has pd_lsn==0, and each subsequent state has a pd_lsn driven by WAL position.
> Hence, I think the CREATE INDEX USING btree behavior is fine, even though it
> doesn't conform to this code comment.

(The NB is Michael's work.)
Yes. Btree works differently from heap. Thank you for the confirmation.

> I think this restriction applies only to full_page_writes=off. Otherwise, the
> first WAL-logged change will find pd_lsn==0 and emit a full-page image. With
> a full-page image in the record, the block's "before" state doesn't matter.
> Also, one could make it safe to write WAL for a particular block by issuing
> heap_sync() for the block's relation.

Umm... Once a truncation happens, WAL is emitted for all pages. If we
decide to skip WAL for COPY or similar bulk operations, no WAL is
emitted at all, not even XLOG_HEAP_INIT_PAGE, so that situation
doesn't happen. The unlogged data is synced at commit time.

> > +/*
> > + * RelationRemovePendingSync() -- remove pendingSync entry for a relation
> > + */
> > +void
> > +RelationRemovePendingSync(Relation rel)
>
> What is the coding rule for deciding when to call this? Currently, only
> ATExecSetTableSpace() calls this. CLUSTER doesn't call it, despite behaving
> much like ALTER TABLE SET TABLESPACE behaves.
> > +{
> > + bool found;
> > +
> > + rel->pending_sync = NULL;
> > + rel->no_pending_sync = true;
> > + if (pendingSyncs)
> > + {
> > + elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
> > + hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
> > + }
> > +}
>
> We'd need a mechanism to un-remove the sync at subtransaction abort. My
> attachment includes a test case demonstrating the consequences of that defect.
> Please look for other areas that need to know about subtransactions; patch v7
> had no code pertaining to subtransactions.

Agreed. It forgets about subtransaction rollbacks. I'll make
RelationRemovePendingSync just mark the entry as "removed" and have
ROLLBACK TO and RELEASE process the flag to make it work. (Attached
2nd patch on top of this patchset)

>
> > + elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
>
> As you mention upthread, you have many debugging elog()s. These are too
> detailed to include in every binary, but I do want them in the code. See
> CACHE_elog() for a good example of achieving that.

Agreed, will do. They were needed to check the behavior precisely,
but usually they are not needed.

> > +/*
> > + * Sync to disk any relations that we skipped WAL-logging for earlier.
> > + */
> > +void
> > +smgrDoPendingSyncs(bool isCommit)
> > +{
> > + if (!pendingSyncs)
> > + return;
> > +
> > + if (isCommit)
> > + {
> > + HASH_SEQ_STATUS status;
> > + PendingRelSync *pending;
> > +
> > + hash_seq_init(&status, pendingSyncs);
> > +
> > + while ((pending = hash_seq_search(&status)) != NULL)
> > + {
> > + if (pending->sync_above != InvalidBlockNumber)
>
> I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
> InvalidBlockNumber" are not sync requests at all. Those just record the fact
> of a RelationTruncate() happening. If you can think of a way to improve that,
> please do so. If not, it's okay.

After a truncation, required WAL records are emitted for the
truncated pages, so there is no need to sync. Does this make sense to
you? (Maybe a comment is needed there)

> > --- a/src/backend/utils/cache/relcache.c
> > +++ b/src/backend/utils/cache/relcache.c
>
> > @@ -412,6 +413,10 @@ AllocateRelationDesc(Form_pg_class relp)
> > /* which we mark as a reference-counted tupdesc */
> > relation->rd_att->tdrefcount = 1;
> >
> > + /* We don't know if pending sync for this relation exists so far */
> > + relation->pending_sync = NULL;
> > + relation->no_pending_sync = false;
>
> RelationData fields other than "pgstat_info" have "rd_" prefixes; add that
> prefix to these fields.
> This is a nonstandard place to clear fields. Clear them in
> load_relcache_init_file() only, like we do for rd_statvalid. (Other paths
> will then rely on palloc0() for implicit initialization.)

Agreed, will do in the next version.

> > --- a/src/backend/access/heap/heapam.c
> > +++ b/src/backend/access/heap/heapam.c
>
> > @@ -3991,7 +4007,8 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
> > MarkBufferDirty(buffer);
> >
> > /* XLOG stuff */
> > - if (RelationNeedsWAL(relation))
> > + if (BufferNeedsWAL(relation, buffer) ||
> > + BufferNeedsWAL(relation, newbuf))
>
> This is fine if both buffers need WAL or neither buffer needs WAL. It is not
> fine when one buffer needs WAL and the other buffer does not. My attachment
> includes a test case. Of the bugs I'm reporting, this one seems most
> difficult to solve well.

Yeah, it is right (and it's rather silly). Thank you for
pointing it out. Will fix.

> > @@ -8961,9 +8978,16 @@ heap2_redo(XLogReaderState *record)
> > * heap_sync - sync a heap, for use when no WAL has been written
> > *
> > * This forces the heap contents (including TOAST heap if any) down to disk.
> > - * If we skipped using WAL, and WAL is otherwise needed, we must force the
> > - * relation down to disk before it's safe to commit the transaction. This
> > - * requires writing out any dirty buffers and then doing a forced fsync.
> > + * If we did any changes to the heap bypassing the buffer manager, we must
> > + * force the relation down to disk before it's safe to commit the
> > + * transaction, because the direct modifications will not be flushed by
> > + * the next checkpoint.
> > + *
> > + * We used to also use this after batch operations like COPY and CLUSTER,
> > + * if we skipped using WAL and WAL is otherwise needed, but there were
> > + * corner-cases involving other WAL-logged operations to the same
> > + * relation, where that was not enough. heap_register_sync() should be
> > + * used for that purpose instead.
>
> We still use heap_sync() in CLUSTER. Can we migrate CLUSTER to the newer
> heap_register_sync()? Patch v7 makes some commands use the new way (COPY,
> CREATE TABLE AS, REFRESH MATERIALIZED VIEW, ALTER TABLE) and leaves other
> commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
> TABLESPACE, CLUSTER). It would make the system simpler to understand if we
> eliminated the old way. If that creates more problems than it solves, please
> at least write down a coding rule to explain why certain commands shouldn't
> use the old way.

Perhaps doable for TABLESPACE and CLUSTER. I'm not sure about
CREATE INDEX. I'll consider them.

I don't have enough time for now so the new version will be
posted early next week.

Thank you for the review!

regards.

Attachment Content-Type Size
pending_sync_nbtsort_fix.patch text/x-patch 1.2 KB
pending_sync_fix_tblsp_subxact.patch text/x-patch 2.9 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-03-21 05:48:35
Message-ID: 20190321054835.GB3842129@rfd.leadboat.com
Lists: pgsql-hackers

On Wed, Mar 20, 2019 at 05:17:54PM +0900, Kyotaro HORIGUCHI wrote:
> At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190311022708(dot)GA2189728(at)rfd(dot)leadboat(dot)com>
> > On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > +/*
> > > + * Sync to disk any relations that we skipped WAL-logging for earlier.
> > > + */
> > > +void
> > > +smgrDoPendingSyncs(bool isCommit)
> > > +{
> > > + if (!pendingSyncs)
> > > + return;
> > > +
> > > + if (isCommit)
> > > + {
> > > + HASH_SEQ_STATUS status;
> > > + PendingRelSync *pending;
> > > +
> > > + hash_seq_init(&status, pendingSyncs);
> > > +
> > > + while ((pending = hash_seq_search(&status)) != NULL)
> > > + {
> > > + if (pending->sync_above != InvalidBlockNumber)
> >
> > I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
> > InvalidBlockNumber" are not sync requests at all. Those just record the fact
> > of a RelationTruncate() happening. If you can think of a way to improve that,
> > please do so. If not, it's okay.
>
> After a truncation, required WAL records are emitted for the
> truncated pages, so there is no need to sync. Does this make sense to
> you? (Maybe a comment is needed there)

Yes, the behavior makes sense. I wasn't saying the quoted code had the wrong
behavior. I was saying that the data structure called "pendingSyncs" is
actually "pending syncs and past truncates". It's not ideal that the variable
name differs from the variable purpose in this way. However, it's okay if you
don't find a way to improve that.

> I don't have enough time for now so the new version will be
> posted early next week.

I'll wait for that version.


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: noah(at)leadboat(dot)com
Cc: 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-03-25 12:32:04
Message-ID: 20190325.213204.236581069.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Hello. This is a revised version.

At Wed, 20 Mar 2019 22:48:35 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190321054835(dot)GB3842129(at)rfd(dot)leadboat(dot)com>
> On Wed, Mar 20, 2019 at 05:17:54PM +0900, Kyotaro HORIGUCHI wrote:
> > At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190311022708(dot)GA2189728(at)rfd(dot)leadboat(dot)com>
> > > On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > I'm mildly unhappy that pendingSyncs entries with "pending->sync_above ==
> > > InvalidBlockNumber" are not sync requests at all. Those just record the fact
> > > of a RelationTruncate() happening. If you can think of a way to improve that,
> > > please do so. If not, it's okay.
> >
> > After a truncation, required WAL records are emitted for the
> > truncated pages, so there is no need to sync. Does this make sense to
> > you? (Maybe a comment is needed there)
>
> Yes, the behavior makes sense. I wasn't saying the quoted code had the wrong
> behavior. I was saying that the data structure called "pendingSyncs" is
> actually "pending syncs and past truncates". It's not ideal that the variable
> name differs from the variable purpose in this way. However, it's okay if you
> don't find a way to improve that.

That is convincing. The current member names "sync_above" and
"truncated_to" are based on the operations that happened on the
relation. I changed them to names based on what is to be done with
the relation: they are now skip_wal_min_blk and wal_log_min_blk.

> > I don't have enough time for now so the new version will be
> > posted early next week.
>
> I'll wait for that version.

At Wed, 20 Mar 2019 17:17:54 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20190320(dot)171754(dot)171896368(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > Hence, this change causes us to emit WAL for the metapage of a
> > RELPERSISTENCE_UNLOGGED or RELPERSISTENCE_TEMP relation. We should never do
> > that. If we do that for RELPERSISTENCE_TEMP, redo will write to a permanent
> > relfilenode. I've attached a test case for this; it is a patch that applies
> > on top of your v7 patches. The test checks for orphaned files after redo.
>
> Oops! Added RelationNeedsWAL(index) there. (Attched 1st patch on
> top of this patchset)

Done in the attached patch. But the orphan-file check in the TAP
diff was wrong: it detected an orphaned pg_class entry for temporary
tables, which disappears after the first autovacuum. The revised
TAP test (check_orphan_relfilenodes) doesn't falsely fail and
catches the bug in the previous patch.

> > > + * If no tuple was inserted, it's possible that we are truncating a
> > > + * relation. We need to emit WAL for the metapage in the case. However it
> > > + * is not required elsewise,
> >
> > Did you mean to write more words after that comma?
>
> Sorry, it is just a garbage. Required work is done in
> _bt_blwritepage.

Removed.

> > We'd need a mechanism to un-remove the sync at subtransaction abort. My
> > attachment includes a test case demonstrating the consequences of that defect.
> > Please look for other areas that need to know about subtransactions; patch v7
> > had no code pertaining to subtransactions.

Added. Passed the new tests.

> > > + elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
> >
> > As you mention upthread, you have many debugging elog()s. These are too
> > detailed to include in every binary, but I do want them in the code. See
> > CACHE_elog() for a good example of achieving that.
>
> Agreed will do. They were need to check the behavior precisely
> but usually not needed.

I removed all such elog()s.

> > RelationData fields other than "pgstat_info" have "rd_" prefixes; add that
> > prefix to these fields.
> > This is a nonstandard place to clear fields. Clear them in
> > load_relcache_init_file() only, like we do for rd_statvalid. (Other paths
> > will then rely on palloc0() for implicit initialization.)

Both are done.

> > > - if (RelationNeedsWAL(relation))
> > > + if (BufferNeedsWAL(relation, buffer) ||
> > > + BufferNeedsWAL(relation, newbuf))
> >
> > This is fine if both buffers need WAL or neither buffer needs WAL. It is not
> > fine when one buffer needs WAL and the other buffer does not. My attachment
> > includes a test case. Of the bugs I'm reporting, this one seems most
> > difficult to solve well.

I refactored heap_insert/delete so that the XLOG stuff can be
used from heap_update, then modified heap_update so that it emits
XLOG_INSERT and XLOG_DELETE in addition to XLOG_UPDATE.

> > We still use heap_sync() in CLUSTER. Can we migrate CLUSTER to the newer
> > heap_register_sync()? Patch v7 makes some commands use the new way (COPY,
> > CREATE TABLE AS, REFRESH MATERIALIZED VIEW, ALTER TABLE) and leaves other
> > commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
> > TABLESPACE, CLUSTER). It would make the system simpler to understand if we
> > eliminated the old way. If that creates more problems than it solves, please
> > at least write down a coding rule to explain why certain commands shouldn't
> > use the old way.
>
> Perhaps doable for TABLESPACE and CLUSTER. I'm not sure about
> CREATE INDEX. I'll consider them.

I added the CLUSTER case in the new patchset. For the SET
TABLESPACE case, it works at the SMGR layer and manipulates fork
files explicitly, but this stuff is Relation-based and doesn't
distinguish forks. We could modify this stuff to work on smgr and
make it fork-aware, but I don't think it is worth doing.

CREATE INDEX is not changed in this version. I continue to
consider it.

The attached is the new patchset.

v8-0001-TAP-test-for-copy-truncation-optimization.patch
- Revised version of test.

v8-0002-Write-WAL-for-empty-nbtree-index-build.patch
- Fixed version of v7

v8-0003-Move-XLOG-stuff-from-heap_insert-and-heap_delete.patch
- New file, moves xlog stuff of heap_insert and heap_delete out
of the functions so that heap_update can use them.

v8-0004-Add-infrastructure-to-WAL-logging-skip-feature.patch
- Renamed variables, functions. Removed elogs.

v8-0005-Fix-WAL-skipping-feature.patch
- Fixed heap_update.

v8-0006-Change-cluster-to-use-the-new-pending-sync-infrastru.patch
- New file, modifies CLUSTER to use this feature.

v8-0007-Add-a-comment-to-ATExecSetTableSpace.patch
- New file, adds a comment explaining why this stuff is not used.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v8-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 9.5 KB
v8-0002-Write-WAL-for-empty-nbtree-index-build.patch text/x-patch 1.6 KB
v8-0003-Move-XLOG-stuff-from-heap_insert-and-heap_delete.patch text/x-patch 11.5 KB
v8-0004-Add-infrastructure-to-WAL-logging-skip-feature.patch text/x-patch 23.4 KB
v8-0005-Fix-WAL-skipping-feature.patch text/x-patch 18.7 KB
v8-0006-Change-cluster-to-use-the-new-pending-sync-infrastru.patch text/x-patch 8.2 KB
v8-0007-Add-a-comment-to-ATExecSetTableSpace.patch text/x-patch 1.3 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: noah(at)leadboat(dot)com
Cc: 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-03-26 07:35:07
Message-ID: 20190326.163507.239339952.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Hello. I revised the patch I think addressing all your comments.

Differences from v7 patch are:

v9-0001:

- Renamed the script from 016_ to 017_.

- Added some additional tests.

v9-0002:
- Fixed _bt_blwritepage().
It is re-modified by v9-0008.

v9-0003: New patch.
- Refactors out xlog stuff from heap_insert/delete.
(log_heap_insert(), log_heap_update())

v9-0004: (v7-0003, v8-0004)
- Renamed some struct names and member names.
(PendingRelSync -> RelWalRequirement
.sync_above -> skip_wal_min_blk, .truncated_to -> wal_log_min_blk)

- Renamed the additional members in RelationData to rd_*.

- Explicitly initialize the additional members only in
load_relcache_init_file().

- Added new interface functions that accept block number and
SMgrRelation.
(BlockNeedsWAL(), RecordPendingSync())

- Support subtransactions (or invalidation).
(RelWalRequirement.create_sxid, invalidate_sxid,
RelationInvalidateWALRequirements(), smgrDoPendingSyncs())

- Support forks.
(RelWalRequirement.forks, smgrDoPendingSyncs(), RecordPendingSync())

- Removed elog(LOG)s and a leftover comment.

v9-0005: (v7-0004, v8-0005)

- Fixed heap_update().
(heap_update())

v9-0006: New patch.

- Modifies CLUSTER to skip WAL logging.

v9-0007: New patch.

- Modifies ALTER TABLE SET TABLESPACE to skip WAL logging.

v9-0008: New patch.

- Modifies btbuild to skip WAL logging.

- Modifies btinsertonpg to skip WAL logging after truncation.

- Overwrites v9-0002's change.

ALL:

- Rebased.

- Fixed typos and mistakes in comments.

> At Wed, 20 Mar 2019 17:17:54 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20190320(dot)171754(dot)171896368(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > > We still use heap_sync() in CLUSTER. Can we migrate CLUSTER to the newer
> > > heap_register_sync()? Patch v7 makes some commands use the new way (COPY,
> > > CREATE TABLE AS, REFRESH MATERIALIZED VIEW, ALTER TABLE) and leaves other
> > > commands using the old way (CREATE INDEX USING btree, ALTER TABLE SET
> > > TABLESPACE, CLUSTER). It would make the system simpler to understand if we
> > > eliminated the old way. If that creates more problems than it solves, please
> > > at least write down a coding rule to explain why certain commands shouldn't
> > > use the old way.
> >
> > Perhaps doable for TABLESPACE and CLUSTER. I'm not sure about
> > CREATE INDEX. I'll consider them.
>
> I added the CLUSTER case in the new patchset. For the SET
> TABLESPACE case, it works on SMGR layer and manipulates fork
> files explicitly but this stuff is Relation based and doesn't
> distinguish forks. We can modify this stuff to work on smgr and
> make it fork-aware but I don't think it is worth doing.
>
> CREATE INDEX is not changed in this version. I continue to
> consider it.

I managed to simplify the change. Please look at v9-0008.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v9-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 9.5 KB
v9-0002-Write-WAL-for-empty-nbtree-index-build.patch text/x-patch 1.6 KB
v9-0003-Move-XLOG-stuff-from-heap_insert-and-heap_delete.patch text/x-patch 11.5 KB
v9-0004-Add-infrastructure-to-WAL-logging-skip-feature.patch text/x-patch 25.4 KB
v9-0005-Fix-WAL-skipping-feature.patch text/x-patch 19.0 KB
v9-0006-Change-cluster-to-use-the-new-pending-sync-infrastru.patch text/x-patch 6.1 KB
v9-0007-Change-ALTER-TABLESPACE-to-use-the-pending-sync-infr.patch text/x-patch 3.7 KB
v9-0008-Optimize-WAL-logging-on-btree-bulk-insertion.patch text/x-patch 4.6 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-03-31 22:31:58
Message-ID: 20190331223158.GB891537@rfd.leadboat.com
Lists: pgsql-hackers

On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
> I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
> last paragraph, and I suspect it would have been no harder to back-patch. I
> wonder if it would have been simpler and better, but I'm not asking anyone to
> investigate that.

Now I am asking for that. Would anyone like to try implementing that other
design, to see how much simpler it would be? I now expect the already-drafted
design to need several more iterations before it reaches a finished patch.

Separately, I reviewed v9 of the already-drafted design:

> On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > +/*
> > + * RelationRemovePendingSync() -- remove pendingSync entry for a relation
> > + */
> > +void
> > +RelationRemovePendingSync(Relation rel)
>
> What is the coding rule for deciding when to call this? Currently, only
> ATExecSetTableSpace() calls this. CLUSTER doesn't call it, despite behaving
> much like ALTER TABLE SET TABLESPACE behaves.

This question still applies. (The function name did change from
RelationRemovePendingSync() to RelationInvalidateWALRequirements().)

On Mon, Mar 25, 2019 at 09:32:04PM +0900, Kyotaro HORIGUCHI wrote:
> At Wed, 20 Mar 2019 22:48:35 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190321054835(dot)GB3842129(at)rfd(dot)leadboat(dot)com>
> > On Wed, Mar 20, 2019 at 05:17:54PM +0900, Kyotaro HORIGUCHI wrote:
> > > At Sun, 10 Mar 2019 19:27:08 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190311022708(dot)GA2189728(at)rfd(dot)leadboat(dot)com>
> > > > On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > > + elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
> > >
> > > As you mention upthread, you have many debugging elog()s. These are too
> > > detailed to include in every binary, but I do want them in the code. See
> > > CACHE_elog() for a good example of achieving that.
> >
> > Agreed, will do. They were needed to check the behavior precisely
> > but are usually not needed.
>
> I removed all such elog()s.

Again, I do want them in the code. Please restore them, but use a mechanism
like CACHE_elog() so they're built only if one defines a preprocessor symbol.

On Tue, Mar 26, 2019 at 04:35:07PM +0900, Kyotaro HORIGUCHI wrote:
> @@ -4097,6 +4104,8 @@ ReleaseSavepoint(const char *name)
> (errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
> errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
>
> + smgrProcessWALRequirementInval(s->subTransactionId, true);
> +
> /*
> * Mark "commit pending" all subtransactions up to the target
> * subtransaction. The actual commits will happen when control gets to
> @@ -4206,6 +4215,8 @@ RollbackToSavepoint(const char *name)
> (errcode(ERRCODE_S_E_INVALID_SPECIFICATION),
> errmsg("savepoint \"%s\" does not exist within current savepoint level", name)));
>
> + smgrProcessWALRequirementInval(s->subTransactionId, false);

The smgrProcessWALRequirementInval() calls almost certainly belong in
CommitSubTransaction() and AbortSubTransaction(), not in these functions. By
doing it here, you'd get the wrong behavior in a subtransaction created via a
plpgsql "BEGIN ... EXCEPTION WHEN OTHERS THEN" block.

> +/*
> + * Process pending invalidation of WAL requirements happened in the
> + * subtransaction
> + */
> +void
> +smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit)
> +{
> + HASH_SEQ_STATUS status;
> + RelWalRequirement *walreq;
> +
> + if (!walRequirements)
> + return;
> +
> + /* We expect that we don't have walRequirements in almost all cases */
> + hash_seq_init(&status, walRequirements);
> +
> + while ((walreq = hash_seq_search(&status)) != NULL)
> + {
> + /* remove useless entry */
> + if (isCommit ?
> + walreq->invalidate_sxid == sxid :
> + walreq->create_sxid == sxid)
> + hash_search(walRequirements, &walreq->relnode, HASH_REMOVE, NULL);

Do not remove entries during subtransaction commit, because a parent
subtransaction might still abort. See other CommitSubTransaction() callees
for examples of correct subtransaction handling. AtEOSubXact_Files() is one
simple example.

> @@ -3567,15 +3602,26 @@ heap_update
> */
> if (RelationIsAccessibleInLogicalDecoding(relation))
> {
> - log_heap_new_cid(relation, &oldtup);
> - log_heap_new_cid(relation, heaptup);
> + if (oldbuf_needs_wal)
> + log_heap_new_cid(relation, &oldtup);
> + if (newbuf_needs_wal)
> + log_heap_new_cid(relation, heaptup);

These if(...) conditions are always true, since they're redundant with
RelationIsAccessibleInLogicalDecoding(relation). Remove the conditions or
replace them with asserts.

> }
>
> - recptr = log_heap_update(relation, buffer,
> - newbuf, &oldtup, heaptup,
> - old_key_tuple,
> - all_visible_cleared,
> - all_visible_cleared_new);
> + if (oldbuf_needs_wal && newbuf_needs_wal)
> + recptr = log_heap_update(relation, buffer, newbuf,
> + &oldtup, heaptup,
> + old_key_tuple,
> + all_visible_cleared,
> + all_visible_cleared_new);
> + else if (oldbuf_needs_wal)
> + recptr = log_heap_delete(relation, buffer, &oldtup, old_key_tuple,
> + xmax_old_tuple, false,
> + all_visible_cleared);
> + else
> + recptr = log_heap_insert(relation, buffer, newtup,
> + 0, all_visible_cleared_new);

By using DELETE and INSERT records to implement an UPDATE, you lose the ctid
chain and infomask bits that were present before crash recovery. If that's
okay in these circumstances, please write a comment explaining why.

> @@ -1096,7 +1097,9 @@ _bt_insertonpg(Relation rel,
> cachedBlock = BufferGetBlockNumber(buf);
>
> /* XLOG stuff */
> - if (RelationNeedsWAL(rel))
> + if (BufferNeedsWAL(rel, buf) ||
> + (!P_ISLEAF(lpageop) && BufferNeedsWAL(rel, cbuf)) ||
> + (BufferIsValid(metabuf) && BufferNeedsWAL(rel, metabuf)))

This appears to have the same problem that heap_update() had in v7; if
BufferNeedsWAL(rel, buf) is false and BufferNeedsWAL(rel, metabuf) is true, we
emit WAL for both buffers. If that can't actually happen today, use asserts.

I don't want the btree code to get significantly more complicated in order to
participate in the RelWalRequirement system. If btree code would get more
complicated, it's better to have btree continue using the old system. If
btree's complexity would be essentially unchanged, it's still good to use the
new system.

> @@ -334,6 +334,10 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
>
> reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
>
> + /* Skip WAL-logging if wal_level = minimal */
> + if (!XLogIsNeeded())
> + RecordWALSkipping(index);

_bt_load() still has an smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM),
which should be unnecessary after you add this end-of-transaction sync. Also,
this code can reach an assertion failure at wal_level=minimal:

910024 2019-03-31 19:12:13.728 GMT LOG: statement: create temp table x (c int primary key)
910024 2019-03-31 19:12:13.729 GMT DEBUG: CREATE TABLE / PRIMARY KEY will create implicit index "x_pkey" for table "x"
910024 2019-03-31 19:12:13.730 GMT DEBUG: building index "x_pkey" on table "x" serially
TRAP: FailedAssertion("!(((rel)->rd_rel->relpersistence == 'p'))", File: "storage.c", Line: 460)

Also, please fix whitespace problems that "git diff --check master" reports.

nm


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: noah(at)leadboat(dot)com
Cc: 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-04-02 10:54:06
Message-ID: 20190402.195406.20162559.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Thank you for reviewing.

At Sun, 31 Mar 2019 15:31:58 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190331223158(dot)GB891537(at)rfd(dot)leadboat(dot)com>
> On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
> > On Mon, Mar 04, 2019 at 12:24:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > +/*
> > > + * RelationRemovePendingSync() -- remove pendingSync entry for a relation
> > > + */
> > > +void
> > > +RelationRemovePendingSync(Relation rel)
> >
> > What is the coding rule for deciding when to call this? Currently, only
> > ATExecSetTableSpace() calls this. CLUSTER doesn't call it, despite behaving
> > much like ALTER TABLE SET TABLESPACE behaves.
>
> This question still applies. (The function name did change from
> RelationRemovePendingSync() to RelationInvalidateWALRequirements().)

It is called for heap_register_sync()'ed relations to avoid
useless syncing, or trying to sync nonexistent files. I modified
CLUSTER, COPY FROM, CREATE TABLE AS, REFRESH MATERIALIZED VIEW and
SET TABLESPACE to all use the function. (The function is renamed
to table_relation_invalidate_walskip().)

I noticed that heap_register_sync and friends are now a kind of
table-AM function. So I added .relation_register_walskip and
.relation_invalidate_walskip in TableAMRoutine and moved the
heap_register_sync stuff to heapam_relation_register_walskip and
friends. .finish_bulk_insert() is modified to be used only when
WAL-skip is active on the relation. (0004, 0005) But I'm not sure
that is the right direction.

(RelWALRequirements is renamed to RelWALSkip)

The change makes smgrFinishBulkInsert (formerly
smgrDoPendingSync) need to call a table-AM interface. Calling it
in the designed way requires a Relation, but the relcache entry
cannot be assumed to live until then. In the attached patch 0005,
a new member TableAmRoutine *tableam is added to RelWalSkip, and
finish_bulk_insert() is called via that tableam. But I'm quite
uneasy with that...

> On Mon, Mar 25, 2019 at 09:32:04PM +0900, Kyotaro HORIGUCHI wrote:
> > At Wed, 20 Mar 2019 22:48:35 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190321054835(dot)GB3842129(at)rfd(dot)leadboat(dot)com>
> Again, I do want them in the code. Please restore them, but use a mechanism
> like CACHE_elog() so they're built only if one defines a preprocessor symbol.

Ah, sorry. I restored the messages using STORAGE_elog(). I also
needed this. (SMGR_ might be better but I'm not sure.)

> On Tue, Mar 26, 2019 at 04:35:07PM +0900, Kyotaro HORIGUCHI wrote:
> > + smgrProcessWALRequirementInval(s->subTransactionId, false);
>
> The smgrProcessWALRequirementInval() calls almost certainly belong in
> CommitSubTransaction() and AbortSubTransaction(), not in these functions. By
> doing it here, you'd get the wrong behavior in a subtransaction created via a
> plpgsql "BEGIN ... EXCEPTION WHEN OTHERS THEN" block.

Thanks. Moved it to AtSubAbort_smgr() and AtSubCommit_smgr(). (0005)

> > +/*
> > + * Process pending invalidation of WAL requirements happened in the
> > + * subtransaction
> > + */
> > +void
> > +smgrProcessWALRequirementInval(SubTransactionId sxid, bool isCommit)
> > +{
> > + HASH_SEQ_STATUS status;
> > + RelWalRequirement *walreq;
> > +
> > + if (!walRequirements)
> > + return;
> > +
> > + /* We expect that we don't have walRequirements in almost all cases */
> > + hash_seq_init(&status, walRequirements);
> > +
> > + while ((walreq = hash_seq_search(&status)) != NULL)
> > + {
> > + /* remove useless entry */
> > + if (isCommit ?
> > + walreq->invalidate_sxid == sxid :
> > + walreq->create_sxid == sxid)
> > + hash_search(walRequirements, &walreq->relnode, HASH_REMOVE, NULL);
>
> Do not remove entries during subtransaction commit, because a parent
> subtransaction might still abort. See other CommitSubTransaction() callees
> for examples of correct subtransaction handling. AtEOSubXact_Files() is one
> simple example.

Thanks. smgrProcessWALSkipInval() (0005) is changed so that:

- If a RelWalSkip entry is created in aborted subtransaction,
remove it.

- If a RelWalSkip entry is created then invalidated in committed
subtransaction, remove it.

- If a RelWalSkip entry is created and committed, change the
creator subtransaction to the parent subtransaction.

- If a RelWalSkip entry is created elsewhere and invalidated in
committed subtransaction, move the invalidation to the parent
subtransaction.

- If a RelWalSkip entry is created elsewhere and invalidated in
aborted subtransaction, cancel the invalidation.

Test is added as test3a2 and test3a3. (0001)

> > @@ -3567,15 +3602,26 @@ heap_update
> > */
> > if (RelationIsAccessibleInLogicalDecoding(relation))
> > {
> > - log_heap_new_cid(relation, &oldtup);
> > - log_heap_new_cid(relation, heaptup);
> > + if (oldbuf_needs_wal)
> > + log_heap_new_cid(relation, &oldtup);
> > + if (newbuf_needs_wal)
> > + log_heap_new_cid(relation, heaptup);
>
> These if(...) conditions are always true, since they're redundant with
> RelationIsAccessibleInLogicalDecoding(relation). Remove the conditions or
> replace them with asserts.

Ah.. I see. That cannot happen in the wal_level = minimal case.
Added a comment and an assertion. (0006)

+ * catalog. Both oldbuf_needs_wal and newbuf_needs_wal must be true
+ * when logical decoding is active.

> By using DELETE and INSERT records to implement an UPDATE, you lose the ctid
> chain and infomask bits that were present before crash recovery. If that's
> okay in these circumstances, please write a comment explaining why.

Sounds reasonable. Added a comment. (Honestly I completely forgot
about that.. Thanks!) (0006)

+ * Insert log record. Using delete or insert log loses HOT chain
+ * information but that happens only when newbuf is different from
+ * buffer, where HOT cannot happen.

> > @@ -1096,7 +1097,9 @@ _bt_insertonpg(Relation rel,
> > | | | cachedBlock = BufferGetBlockNumber(buf);
> >
> > | | /* XLOG stuff */
> > - | | if (RelationNeedsWAL(rel))
> > + | | if (BufferNeedsWAL(rel, buf) ||
> > + | | | (!P_ISLEAF(lpageop) && BufferNeedsWAL(rel, cbuf)) ||
> > + | | | (BufferIsValid(metabuf) && BufferNeedsWAL(rel, metabuf)))
>
> This appears to have the same problem that heap_update() had in v7; if
> BufferNeedsWAL(rel, buf) is false and BufferNeedsWAL(rel, metabuf) is true, we
> emit WAL for both buffers. If that can't actually happen today, use asserts.
>
> I don't want the btree code to get significantly more complicated in order to
> participate in the RelWalRequirement system. If btree code would get more
> complicated, it's better to have btree continue using the old system. If
> btree's complexity would be essentially unchanged, it's still good to use the
> new system.

It was broken. I tried to fix it but the page-split case baffled
me. I reverted it and added a comment there explaining the reason
for not applying the BufferNeedsWAL stuff to nbtree. The
WAL-logging skip feature is now restricted to work only on
non-index heaps. (getWalSkipEntry and RecordPendingSync in 0005)

> > @@ -334,6 +334,10 @@ btbuild(Relation heap, Relation index, IndexInfo *indexInfo)
> >
> > | reltuples = _bt_spools_heapscan(heap, index, &buildstate, indexInfo);
> >
> > + | /* Skip WAL-logging if wal_level = minimal */
> > + | if (!XLogIsNeeded())
> > + | | RecordWALSkipping(index);
>
> _bt_load() still has an smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM),
> which should be unnecessary after you add this end-of-transaction sync. Also,
> this code can reach an assertion failure at wal_level=minimal:
>
> 910024 2019-03-31 19:12:13.728 GMT LOG: statement: create temp table x (c int primary key)
> 910024 2019-03-31 19:12:13.729 GMT DEBUG: CREATE TABLE / PRIMARY KEY will create implicit index "x_pkey" for table "x"
> 910024 2019-03-31 19:12:13.730 GMT DEBUG: building index "x_pkey" on table "x" serially
> TRAP: FailedAssertion("!(((rel)->rd_rel->relpersistence == 'p'))", File: "storage.c", Line: 460)

This is what I mentioned as "broken" above. Sorry for the
silly mistake.

> Also, please fix whitespace problems that "git diff --check master" reports.

Thanks. Good to know the command.

After all, this patch set contains the following files.

v10-0001-TAP-test-for-copy-truncation-optimization.patch

Tap test script. Multi-level subtransaction case is added.

v10-0002-Write-WAL-for-empty-nbtree-index-build.patch

As mentioned above, the nbtree patch has been shrunk back to the
initial workaround state. The comment is rewritten. (v9-0002 +
v9-0008)

v10-0003-Move-XLOG-stuff-from-heap_insert-and-heap_delete.patch

Not substantially changed.

v10-0004-Add-new-interface-to-TableAmRoutine.patch

New file. Adds two new interfaces to TableAmRoutine and modified
one interface.

v10-0005-Add-infrastructure-to-WAL-logging-skip-feature.patch

Heavily revised version of v9-0004.
Some functions are renamed.
Fixed subtransaction handling.
Added STORAGE_elog() stuff.
Uses table-am functions.
Changes heapam stuff.

v10-0006-Fix-WAL-skipping-feature.patch

Revised version of v9-0005 + v9-0006 + v9-0007.

Added comment and assertion in heap_insert().

v10-0007-Remove-TABLE-HEAP_INSERT_SKIP_WAL.patch

Separated from v9-0005 so that subsequent patches are sane.

Removes TABLE/HEAP_INSERT_SKIP_WAL.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v10-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 10.7 KB
v10-0002-Write-WAL-for-empty-nbtree-index-build.patch text/x-patch 1.6 KB
v10-0003-Move-XLOG-stuff-from-heap_insert-and-heap_delete.patch text/x-patch 11.5 KB
v10-0004-Add-new-interface-to-TableAmRoutine.patch text/x-patch 5.6 KB
v10-0005-Add-infrastructure-to-WAL-logging-skip-feature.patch text/x-patch 32.2 KB
v10-0006-Fix-WAL-skipping-feature.patch text/x-patch 28.5 KB
v10-0007-Remove-TABLE-HEAP_INSERT_SKIP_WAL.patch text/x-patch 2.3 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(at)paquier(dot)xyz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-04-03 14:16:02
Message-ID: CA+TgmoYEST4xYaU10gM=XXeA-oxbFh=qSfy0X4PXDCWubcgj=g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 2, 2019 at 6:54 AM Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > By using DELETE and INSERT records to implement an UPDATE, you lose the ctid
> > chain and infomask bits that were present before crash recovery. If that's
> > okay in these circumstances, please write a comment explaining why.
>
> Sounds reasonable. Added a comment. (Honestly I completely forgot
> about that.. Thanks!) (0006)

If you haven't already, I think you should set up a master and a
standby and wal_consistency_checking=all and run tests of this feature
on the master and see if anything breaks on the master or the standby.
I'm not sure that emitting an insert or delete record is going to
reproduce the exact same state on the standby that exists on the
master.

+ * Insert log record. Using delete or insert log loses HOT chain
+ * information but that happens only when newbuf is different from
+ * buffer, where HOT cannot happen.

"HOT chain information" seems pretty vague.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: robertmhaas(at)gmail(dot)com
Cc: noah(at)leadboat(dot)com, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-04-04 02:03:20
Message-ID: 20190404.110320.84838706.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

Thank you for looking this.

At Wed, 3 Apr 2019 10:16:02 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in <CA+TgmoYEST4xYaU10gM=XXeA-oxbFh=qSfy0X4PXDCWubcgj=g(at)mail(dot)gmail(dot)com>
> On Tue, Apr 2, 2019 at 6:54 AM Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > > By using DELETE and INSERT records to implement an UPDATE, you lose the ctid
> > > chain and infomask bits that were present before crash recovery. If that's
> > > okay in these circumstances, please write a comment explaining why.
> >
> > Sounds reasonable. Added a comment. (Honestly I completely forgot
> > about that.. Thanks!) (0006)
>
> If you haven't already, I think you should set up a master and a
> standby and wal_consistency_checking=all and run tests of this feature
> on the master and see if anything breaks on the master or the standby.
> I'm not sure that emitting an insert or delete record is going to
> reproduce the exact same state on the standby that exists on the
> master.

All of this patch applies only at wal_level = minimal; it makes
no changes in other cases. Updates are always replicated as
XLOG_HEAP_(HOT_)UPDATE. Crash recovery cases involving log_insert
or log_update are exercised by the TAP test.

> + * Insert log record. Using delete or insert log loses HOT chain
> + * information but that happens only when newbuf is different from
> + * buffer, where HOT cannot happen.
>
> "HOT chain information" seems pretty vague.

Thanks. Actually I was a bit uneasy with "information". Does the
following make sense?

> * Insert log record, using delete or insert instead of update log
> * when only one of the two buffers needs WAL-logging. If this were a
> * HOT-update, redoing the WAL record would result in a broken
> * hot-chain. However, that never happens because updates complete on
> * a single page always use log_update.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(at)paquier(dot)xyz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-04-04 14:52:59
Message-ID: CA+TgmoZE0jW0jbQxAtoJgJNwrR1hyx3x8pUjQr=ggenLxnPoEQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 3, 2019 at 10:03 PM Kyotaro HORIGUCHI
<horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > * Insert log record, using delete or insert instead of update log
> > * when only one of the two buffers needs WAL-logging. If this were a
> > * HOT-update, redoing the WAL record would result in a broken
> > * hot-chain. However, that never happens because updates complete on
> > * a single page always use log_update.

It makes sense grammatically, but I'm not sure I believe that it's
sound technically. Even though it's only used in the non-HOT case,
it's still important that the CTID, XMIN, and XMAX fields are set
correctly during both normal operation and recovery.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: robertmhaas(at)gmail(dot)com
Cc: noah(at)leadboat(dot)com, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-04-05 03:55:20
Message-ID: 20190405.125520.164840258.horiguchi.kyotaro@lab.ntt.co.jp
Lists: pgsql-hackers

At Thu, 4 Apr 2019 10:52:59 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in <CA+TgmoZE0jW0jbQxAtoJgJNwrR1hyx3x8pUjQr=ggenLxnPoEQ(at)mail(dot)gmail(dot)com>
> On Wed, Apr 3, 2019 at 10:03 PM Kyotaro HORIGUCHI
> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
> > > * Insert log record, using delete or insert instead of update log
> > > * when only one of the two buffers needs WAL-logging. If this were a
> > > * HOT-update, redoing the WAL record would result in a broken
> > > * hot-chain. However, that never happens because updates complete on
> > > * a single page always use log_update.
>
> It makes sense grammatically, but I'm not sure I believe that it's

Great to hear that! I rewrote it as follows.

+ * Insert log record. When we are not running WAL-skipping, always use
+ * an update record. Otherwise, use a delete or insert record instead
+ * when only one of the two buffers needs WAL-logging. If this were a
+ * HOT-update, redoing the WAL record would result in a broken
+ * hot-chain. However, that never happens, because updates that
+ * complete on a single page always use log_update.
+ *
+ * Using a delete or insert record in place of an update record leads
+ * to an inconsistent series of WAL records. But note that WAL-skipping
+ * happens only when we are updating a tuple in a relation that has
+ * been created in the same transaction. Once committed, the WAL
+ * records recover the same state of the relation as the synced state
+ * at commit. Otherwise, the maybe-broken relation left by a crash
+ * before commit will be removed during recovery.
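As an editor's illustration (not part of the patch), the record-type choice described in the comment above can be sketched in C with hypothetical names:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { REC_UPDATE, REC_DELETE, REC_INSERT, REC_NONE } UpdateWalRecord;

/* Hypothetical sketch of the rule in the comment above: under WAL
 * skipping, when only one of the two buffers touched by an UPDATE
 * still needs WAL, emit a delete or insert record for that buffer
 * instead of an update record spanning both buffers. */
static UpdateWalRecord
choose_update_record(bool old_buf_needs_wal, bool new_buf_needs_wal)
{
    if (old_buf_needs_wal && new_buf_needs_wal)
        return REC_UPDATE;      /* normal case: one update record */
    if (old_buf_needs_wal)
        return REC_DELETE;      /* only the old page is logged */
    if (new_buf_needs_wal)
        return REC_INSERT;      /* only the new page is logged */
    return REC_NONE;            /* both pages are WAL-skipped */
}
```

This is only a toy model of the decision, not the actual heapam code.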

> sound technically. Even though it's only used in the non-HOT case,
> it's still important that the CTID, XMIN, and XMAX fields are set
> correctly during both normal operation and recovery.

log_heap_delete()/log_heap_update() record the infomasks of the
deleted tuple as is. Xmax is stored from the same variable. The
offnum is taken from the deleted tuple, the buffer is registered,
and xlrec.flags is set to the same value. As a result, Xmax, the
infomasks, and the ctid are restored to the same state by
heap_xlog_delete(). I didn't add a comment about that.

log_heap_insert()/log_heap_update() record the infomasks of the
inserted tuple as is. Xmin/Cmin and the ctid-related info are handled
the same way. But log_heap_insert() assumes that Xmax is invalid,
which can be violated only when another transaction can see the
tuple, and that is not the case here. I added a comment and an
assertion before the call to log_heap_insert().

+ * Coming here means that the old tuple is invisible and
+ * inoperable to another transaction. So xmax_new_tuple is
+ * expected to be InvalidTransactionId here.
+ */
+ Assert (xmax_new_tuple == InvalidTransactionId);
+ recptr = log_heap_insert(relation, buffer, newtup,

I noticed that I had accidentally moved the log_heap_new_cid() calls
into log_heap_insert()/log_heap_delete(). I restored them.

The attached v11 is the new version, addressing the above points and
rebased.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v11-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 10.7 KB
v11-0002-Write-WAL-for-empty-nbtree-index-build.patch text/x-patch 1.6 KB
v11-0003-Move-XLOG-stuff-from-heap_insert-and-heap_delete.patch text/x-patch 11.1 KB
v11-0004-Add-new-interface-to-TableAmRoutine.patch text/x-patch 5.6 KB
v11-0005-Add-infrastructure-to-WAL-logging-skip-feature.patch text/x-patch 32.2 KB
v11-0006-Fix-WAL-skipping-feature.patch text/x-patch 29.5 KB
v11-0007-Remove-TABLE-HEAP_INSERT_SKIP_WAL.patch text/x-patch 2.3 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-05-13 00:37:05
Message-ID: 20190513003705.GA1202614@rfd.leadboat.com

On Sun, Mar 31, 2019 at 03:31:58PM -0700, Noah Misch wrote:
> On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
> > I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
> > last paragraph, and I suspect it would have been no harder to back-patch. I
> > wonder if it would have been simpler and better, but I'm not asking anyone to
> > investigate that.
>
> Now I am asking for that. Would anyone like to try implementing that other
> design, to see how much simpler it would be?

Anyone? I've been deferring review of v10 and v11 in hopes of seeing the
above-described patch first.


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: noah(at)leadboat(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-05-14 04:59:10
Message-ID: 20190514.135910.258194307.horiguchi.kyotaro@lab.ntt.co.jp

Hello.

At Sun, 12 May 2019 17:37:05 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190513003705(dot)GA1202614(at)rfd(dot)leadboat(dot)com>
> On Sun, Mar 31, 2019 at 03:31:58PM -0700, Noah Misch wrote:
> > On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
> > > I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
> > > last paragraph, and I suspect it would have been no harder to back-patch. I
> > > wonder if it would have been simpler and better, but I'm not asking anyone to
> > > investigate that.
> >
> > Now I am asking for that. Would anyone like to try implementing that other
> > design, to see how much simpler it would be?

Yeah, I think it is a bit too complex for the value. But I think
it is the best way as long as we keep reusing a file on
truncation of the whole file.

> Anyone? I've been deferring review of v10 and v11 in hopes of seeing the
> above-described patch first.

The significant portion of the complexity in this patch comes
from the need to behave differently per block according to the
remembered logged and truncated block numbers.

0005:
+ * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+ * any subsequent actions on the same block either. Replaying the WAL record
+ * of the subsequent action might fail otherwise, as the "before" state of
+ * the block might not match, as the earlier actions were not WAL-logged.
+ * Likewise, after we have WAL-logged an operation for a block, we must
+ * WAL-log any subsequent operations on the same page as well. Replaying
+ * a possible full-page-image from the earlier WAL record would otherwise
+ * revert the page to the old state, even if we sync the relation at end
+ * of transaction.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we can't skip WAL-logging for any
+ * of the truncated blocks anymore, as replaying the truncation record will
+ * destroy all the data inserted after that. But if we have already decided
+ * to skip WAL-logging changes to a relation, and the relation is truncated,
+ * we don't need to WAL-log the truncation either.

If this consideration holds, then given the optimizations on
WAL-skip and truncation, there is no way to avoid the per-block
behavior as long as we allow a mixture of logged modifications
and WAL-skipped COPY on the same relation within a transaction.

We could avoid the per-block behavior change by making the WAL
inhibition per-relation. That would reduce the patch size by the
amount of the BufferNeedsWAL and log_heap_update changes, but not
by much.

- inhibit WAL-skipping after any WAL-logged modification of the relation.
- inhibit WAL-logging after any WAL-skipped modification of the relation.
- WAL-skipped relations are synced at commit time.
- truncation of a WAL-skipped relation creates a new relfilenode.
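A minimal sketch in C of those per-relation rules, with hypothetical names (a toy state machine, not actual PostgreSQL structures):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy per-relation WAL state, modeling the four rules listed above. */
typedef enum { WS_NONE, WS_LOGGED, WS_SKIPPED } WalSkipState;

typedef struct RelWalState
{
    WalSkipState state;
    bool needs_commit_sync;     /* must sync the file at commit? */
    bool truncate_new_filenode; /* must truncation assign a new relfilenode? */
} RelWalState;

/* Record a WAL-logged modification; returns false if forbidden. */
static bool
rel_log_modification(RelWalState *r)
{
    if (r->state == WS_SKIPPED)
        return false;           /* rule: no logging after a skipped change */
    r->state = WS_LOGGED;
    return true;
}

/* Record a WAL-skipped modification; returns false if forbidden. */
static bool
rel_skip_modification(RelWalState *r)
{
    if (r->state == WS_LOGGED)
        return false;           /* rule: no skipping after a logged change */
    r->state = WS_SKIPPED;
    r->needs_commit_sync = true;        /* rule: sync at commit */
    r->truncate_new_filenode = true;    /* rule: truncate => new relfilenode */
    return true;
}
```

This only illustrates how the per-relation scheme removes the per-block bookkeeping; the real patch tracks this state in the relcache.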

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-05-17 06:50:50
Message-ID: 20190517065050.GA1298884@rfd.leadboat.com

On Tue, May 14, 2019 at 01:59:10PM +0900, Kyotaro HORIGUCHI wrote:
> At Sun, 12 May 2019 17:37:05 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190513003705(dot)GA1202614(at)rfd(dot)leadboat(dot)com>
> > On Sun, Mar 31, 2019 at 03:31:58PM -0700, Noah Misch wrote:
> > > On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
> > > > I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
> > > > last paragraph, and I suspect it would have been no harder to back-patch. I
> > > > wonder if it would have been simpler and better, but I'm not asking anyone to
> > > > investigate that.
> > >
> > > Now I am asking for that. Would anyone like to try implementing that other
> > > design, to see how much simpler it would be?
>
> Yeah, I think it is a bit too-complex for the value. But I think
> it is the best way as far as we keep reusing a file on
> truncation of the whole file.

The design of v11-0006-Fix-WAL-skipping-feature.patch doesn't, in general,
work for WAL records touching more than one buffer. For heapam, that patch
works around this problem by emitting XLOG_HEAP_INSERT or XLOG_HEAP_DELETE
when we'd normally emit XLOG_HEAP_UPDATE. As a result, post-crash-recovery
heap page bits differ from the bits present when we don't crash. Though I'm
85% confident this does not introduce a bug today, this is fragile. That is
the main complexity I wish to avoid.

I suspect the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi last
paragraph will be simpler, not more complex. In the implementation I'm
envisioning, smgrDoPendingDeletes() would change name, perhaps to
AtEOXact_Storage(). For every relfilenode it does not delete, it would ensure
durability by syncing (for large nodes) or by WAL-logging each page (for small
nodes). RelationNeedsWAL() would return false whenever the applicable
relfilenode appears in pendingDeletes. Access methods would remove their
smgrimmedsync() calls, but they would otherwise not change. Would anyone like
to try implementing that?
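As a rough, hedged illustration of the envisioned commit-time decision (the threshold value and all names here are invented for the sketch, not taken from PostgreSQL):

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192
/* Hypothetical cutoff: below it, WAL-logging each page is assumed
 * cheaper than an fsync; at or above it, syncing the file wins. */
#define WAL_SKIP_SYNC_THRESHOLD_BLOCKS 64

typedef enum { DURABLE_BY_FSYNC, DURABLE_BY_PAGE_WAL } DurabilityMethod;

/* Sketch of the commit-time choice described above: for every
 * relfilenode we keep, ensure durability by syncing large files or
 * by WAL-logging each page of small ones. */
static DurabilityMethod
choose_durability(uint64_t file_size_bytes)
{
    uint64_t nblocks = (file_size_bytes + BLCKSZ - 1) / BLCKSZ;

    return (nblocks >= WAL_SKIP_SYNC_THRESHOLD_BLOCKS)
        ? DURABLE_BY_FSYNC
        : DURABLE_BY_PAGE_WAL;
}
```

The real implementation would make this choice inside the renamed smgrDoPendingDeletes() path; the sketch only shows the size-based trade-off.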


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: noah(at)leadboat(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-05-20 06:54:30
Message-ID: 20190520.155430.215084510.horiguchi.kyotaro@lab.ntt.co.jp

Hello.

At Thu, 16 May 2019 23:50:50 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190517065050(dot)GA1298884(at)rfd(dot)leadboat(dot)com>
> On Tue, May 14, 2019 at 01:59:10PM +0900, Kyotaro HORIGUCHI wrote:
> > At Sun, 12 May 2019 17:37:05 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190513003705(dot)GA1202614(at)rfd(dot)leadboat(dot)com>
> > > On Sun, Mar 31, 2019 at 03:31:58PM -0700, Noah Misch wrote:
> > > > On Sun, Mar 10, 2019 at 07:27:08PM -0700, Noah Misch wrote:
> > > > > I also liked the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi
> > > > > last paragraph, and I suspect it would have been no harder to back-patch. I
> > > > > wonder if it would have been simpler and better, but I'm not asking anyone to
> > > > > investigate that.
> > > >
> > > > Now I am asking for that. Would anyone like to try implementing that other
> > > > design, to see how much simpler it would be?
> >
> > Yeah, I think it is a bit too-complex for the value. But I think
> > it is the best way as far as we keep reusing a file on
> > truncation of the whole file.
>
> The design of v11-0006-Fix-WAL-skipping-feature.patch doesn't, in general,
> work for WAL records touching more than one buffer. For heapam, that patch
> works around this problem by emitting XLOG_HEAP_INSERT or XLOG_HEAP_DELETE
> when we'd normally emit XLOG_HEAP_UPDATE. As a result, post-crash-recovery
> heap page bits differ from the bits present when we don't crash. Though I'm
> 85% confident this does not introduce a bug today, this is fragile. That is
> the main complexity I wish to avoid.

Ok, I see your point. The same issue happens even more aggressively
on index pages. I didn't allow WAL-skipping on indexes for that
reason.

> I suspect the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi last
> paragraph will be simpler, not more complex. In the implementation I'm
> envisioning, smgrDoPendingDeletes() would change name, perhaps to
> AtEOXact_Storage(). For every relfilenode it does not delete, it would ensure
> durability by syncing (for large nodes) or by WAL-logging each page (for small
> nodes). RelationNeedsWAL() would return false whenever the applicable
> relfilenode appears in pendingDeletes. Access methods would remove their
> smgrimmedsync() calls, but they would otherwise not change. Would anyone like
> to try implementing that?

Following this direction, the attached PoC works, *at least for*
the wal_optimization TAP tests, but it does the pending flush in
relcache rather than in smgr. It also extends the WAL-skip feature
to indexes, which makes the old 0002 patch for nbtree unnecessary.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v12-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 10.7 KB
v12-0002-Fix-WAL-skipping-feature.patch text/x-patch 7.7 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: noah(at)leadboat(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-05-21 12:29:48
Message-ID: 20190521.212948.34357392.horiguchi.kyotaro@lab.ntt.co.jp

Hello.

At Mon, 20 May 2019 15:54:30 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20190520(dot)155430(dot)215084510(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > I suspect the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi last
> > paragraph will be simpler, not more complex. In the implementation I'm
> > envisioning, smgrDoPendingDeletes() would change name, perhaps to
> > AtEOXact_Storage(). For every relfilenode it does not delete, it would ensure
> > durability by syncing (for large nodes) or by WAL-logging each page (for small
> > nodes). RelationNeedsWAL() would return false whenever the applicable
> > relfilenode appears in pendingDeletes. Access methods would remove their
> > smgrimmedsync() calls, but they would otherwise not change. Would anyone like
> > to try implementing that?
>
> Following this direction, the attached PoC works *at least for*
> the wal_optimization TAP tests, but doing pending flush not in
> smgr but in relcache. This is extending skip-wal feature to
> indexes. And makes the old 0002 patch on nbtree useless.

This is a tidier version of the patch.

- Passes regression tests including 018_wal_optimize.pl

- Move the substantial work to table/index AMs.

Each AM can decide whether to support WAL skip or not.
Currently heap and nbtree support it.

- The timing of sync is moved from AtEOXact to PreCommit. This is
because heap_sync() needs xact state = INPROGRESS.

- matview and cluster are broken, since swapping to a new
relfilenode doesn't change rd_newRelfilenodeSubid. I'll address
that.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v13-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 10.7 KB
v13-0002-Fix-WAL-skipping-feature.patch text/x-patch 22.4 KB

From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: noah(at)leadboat(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-05-23 07:10:35
Message-ID: 20190523.161035.171704812.horiguchi.kyotaro@lab.ntt.co.jp

Attached is a new version.

At Tue, 21 May 2019 21:29:48 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20190521(dot)212948(dot)34357392(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>

> At Mon, 20 May 2019 15:54:30 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote in <20190520(dot)155430(dot)215084510(dot)horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
> > > I suspect the design in the https://postgr.es/m/559FA0BA.3080808@iki.fi last
> > > paragraph will be simpler, not more complex. In the implementation I'm
> > > envisioning, smgrDoPendingDeletes() would change name, perhaps to
> > > AtEOXact_Storage(). For every relfilenode it does not delete, it would ensure
> > > durability by syncing (for large nodes) or by WAL-logging each page (for small
> > > nodes). RelationNeedsWAL() would return false whenever the applicable
> > > relfilenode appears in pendingDeletes. Access methods would remove their
> > > smgrimmedsync() calls, but they would otherwise not change. Would anyone like
> > > to try implementing that?
> >
> > Following this direction, the attached PoC works *at least for*
> > the wal_optimization TAP tests, but doing pending flush not in
> > smgr but in relcache. This is extending skip-wal feature to
> > indexes. And makes the old 0002 patch on nbtree useless.
>
> This is a tidier version of the patch.
>
> - Passes regression tests including 018_wal_optimize.pl
>
> - Move the substantial work to table/index AMs.
>
> Each AM can decide whether to support WAL skip or not.
> Currently heap and nbtree support it.
>
> - The timing of sync is moved from AtEOXact to PreCommit. This is
> because heap_sync() needs xact state = INPROGRESS.
>
> - matview and cluster is broken, since swapping to new
> relfilenode doesn't change rd_newRelfilenodeSubid. I'll address
> that.

cluster/matview are fixed.

An obstacle to fixing them was the unreliability of
newRelfilenodeSubid. As mentioned in the comment for RelationData,
newRelfilenodeSubid may disappear after certain sequences of
commands.

In the attached v14, I added "rd_firstRelfilenodeSubid", which
stores the subtransaction id of the first relfilenode replacement
in the current transaction. It survives any sequence of commands,
including the one mentioned in CopyFrom's comment (which I removed
in this patch).

With the attached patch, on relations based on table/index AMs
that support WAL-skipping, WAL-logging is eliminated if the
relation was created, or its relfilenode replaced, in the current
transaction. An at-commit file sync is always performed. (Only
heap and btree support it.)
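The behavior of the new field can be modeled with a small hypothetical sketch (toy types and names, not the real relcache code):

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t SubTransactionId;
#define InvalidSubTransactionId ((SubTransactionId) 0)

/* Toy model of the relcache fields discussed above:
 * newRelfilenodeSubid tracks the latest in-transaction relfilenode
 * assignment and may be cleared by later commands, while
 * firstRelfilenodeSubid remembers the first assignment and survives
 * for the whole transaction. */
typedef struct RelFilenodeSubids
{
    SubTransactionId newRelfilenodeSubid;
    SubTransactionId firstRelfilenodeSubid;
} RelFilenodeSubids;

static void
assign_new_relfilenode(RelFilenodeSubids *r, SubTransactionId sub)
{
    r->newRelfilenodeSubid = sub;
    if (r->firstRelfilenodeSubid == InvalidSubTransactionId)
        r->firstRelfilenodeSubid = sub; /* set only once per transaction */
}

/* Some command sequences clear newRelfilenodeSubid; the record of the
 * first assignment is deliberately left untouched. */
static void
forget_new_relfilenode(RelFilenodeSubids *r)
{
    r->newRelfilenodeSubid = InvalidSubTransactionId;
}
```

The sketch shows why a WAL-skip decision keyed on the first assignment is stable even when later commands reset the "new" field.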

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v14-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 10.7 KB
v14-0002-Fix-WAL-skipping-feature.patch text/x-patch 37.3 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-05-25 02:33:32
Message-ID: 20190525023332.GE1624191@rfd.leadboat.com

On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote:
> Following this direction, the attached PoC works *at least for*
> the wal_optimization TAP tests, but doing pending flush not in
> smgr but in relcache.

This task, syncing files created in the current transaction, is not the kind
of task normally assigned to a cache. We already have a module, storage.c,
that maintains state about files created in the current transaction. Why did
you use relcache instead of storage.c?

On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote:
> This is a tidier version of the patch.

> - Move the substantial work to table/index AMs.
>
> Each AM can decide whether to support WAL skip or not.
> Currently heap and nbtree support it.

Why would an AM find it important to disable WAL skip?


From: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
To: noah(at)leadboat(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-05-27 05:08:26
Message-ID: 20190527.140826.258215605.horiguchi.kyotaro@lab.ntt.co.jp

Thanks for the comment!

At Fri, 24 May 2019 19:33:32 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190525023332(dot)GE1624191(at)rfd(dot)leadboat(dot)com>
> On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote:
> > Following this direction, the attached PoC works *at least for*
> > the wal_optimization TAP tests, but doing pending flush not in
> > smgr but in relcache.
>
> This task, syncing files created in the current transaction, is not the kind
> of task normally assigned to a cache. We already have a module, storage.c,
> that maintains state about files created in the current transaction. Why did
> you use relcache instead of storage.c?

The reason was that at-commit sync needs a buffer flush
beforehand. But FlushRelationBufferWithoutRelCache() in v11 can do
that, so storage.c is a reasonable place.

> On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote:
> > This is a tidier version of the patch.
>
> > - Move the substantial work to table/index AMs.
> >
> > Each AM can decide whether to support WAL skip or not.
> > Currently heap and nbtree support it.
>
> Why would an AM find it important to disable WAL skip?

The reason is that currently it is the AM's responsibility to
decide whether to skip WAL or not.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-05-27 23:02:25
Message-ID: 20190527230225.GA59385@gust.leadboat.com

On Mon, May 27, 2019 at 02:08:26PM +0900, Kyotaro HORIGUCHI wrote:
> At Fri, 24 May 2019 19:33:32 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190525023332(dot)GE1624191(at)rfd(dot)leadboat(dot)com>
> > On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote:
> > > Following this direction, the attached PoC works *at least for*
> > > the wal_optimization TAP tests, but doing pending flush not in
> > > smgr but in relcache.
> >
> > This task, syncing files created in the current transaction, is not the kind
> > of task normally assigned to a cache. We already have a module, storage.c,
> > that maintains state about files created in the current transaction. Why did
> > you use relcache instead of storage.c?
>
> The reason was at-commit sync needs buffer flush beforehand. But
> FlushRelationBufferWithoutRelCache() in v11 can do
> that. storage.c is reasonable as the place.

Okay. I do want this to work in 9.5 and later, but I'm not aware of a reason
relcache.c would be a better code location in older branches. Unless you
think of a reason to prefer relcache.c, please use storage.c.

> > On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > This is a tidier version of the patch.
> >
> > > - Move the substantial work to table/index AMs.
> > >
> > > Each AM can decide whether to support WAL skip or not.
> > > Currently heap and nbtree support it.
> >
> > Why would an AM find it important to disable WAL skip?
>
> The reason is currently it's AM's responsibility to decide
> whether to skip WAL or not.

I see. Skipping the sync would be a mere optimization; no AM would require it
for correctness. An AM might want RelationNeedsWAL() to keep returning true
despite the sync happening, perhaps because it persists data somewhere other
than the forks of pg_class.relfilenode. Since the index and table APIs
already assume one relfilenode captures all persistent data, I'm not seeing a
use case for an AM overriding this behavior. Let's take away the AM's
responsibility for this decision, making the system simpler. A future patch
could let AM code decide, if someone finds a real-world use case for
AM-specific logic around when to skip WAL.


From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Noah Misch <noah(at)leadboat(dot)com>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-06-28 23:46:34
Message-ID: CAA4eK1KD0F6gHwvFdOjU_1hjA=pBChUmuo=v+dxe7Q_HcAyXWQ@mail.gmail.com

On Tue, May 28, 2019 at 4:33 AM Noah Misch <noah(at)leadboat(dot)com> wrote:
>
> On Mon, May 27, 2019 at 02:08:26PM +0900, Kyotaro HORIGUCHI wrote:
> > At Fri, 24 May 2019 19:33:32 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190525023332(dot)GE1624191(at)rfd(dot)leadboat(dot)com>
> > > On Mon, May 20, 2019 at 03:54:30PM +0900, Kyotaro HORIGUCHI wrote:
> > > > Following this direction, the attached PoC works *at least for*
> > > > the wal_optimization TAP tests, but doing pending flush not in
> > > > smgr but in relcache.
> > >
> > > This task, syncing files created in the current transaction, is not the kind
> > > of task normally assigned to a cache. We already have a module, storage.c,
> > > that maintains state about files created in the current transaction. Why did
> > > you use relcache instead of storage.c?
> >
> > The reason was at-commit sync needs buffer flush beforehand. But
> > FlushRelationBufferWithoutRelCache() in v11 can do
> > that. storage.c is reasonable as the place.
>
> Okay. I do want this to work in 9.5 and later, but I'm not aware of a reason
> relcache.c would be a better code location in older branches. Unless you
> think of a reason to prefer relcache.c, please use storage.c.
>
> > > On Tue, May 21, 2019 at 09:29:48PM +0900, Kyotaro HORIGUCHI wrote:
> > > > This is a tidier version of the patch.
> > >
> > > > - Move the substantial work to table/index AMs.
> > > >
> > > > Each AM can decide whether to support WAL skip or not.
> > > > Currently heap and nbtree support it.
> > >
> > > Why would an AM find it important to disable WAL skip?
> >
> > The reason is currently it's AM's responsibility to decide
> > whether to skip WAL or not.
>
> I see. Skipping the sync would be a mere optimization; no AM would require it
> for correctness. An AM might want RelationNeedsWAL() to keep returning true
> despite the sync happening, perhaps because it persists data somewhere other
> than the forks of pg_class.relfilenode. Since the index and table APIs
> already assume one relfilenode captures all persistent data, I'm not seeing a
> use case for an AM overriding this behavior. Let's take away the AM's
> responsibility for this decision, making the system simpler. A future patch
> could let AM code decide, if someone find a real-world use case for
> AM-specific logic around when to skip WAL.
>

It seems there is some feedback on this patch, and the CF is going to
start in 2 days. Are you planning to work on this patch for the next
CF? If not, then it is better to bump it. It is not a good idea to
see a patch in "Waiting on Author" at the beginning of a CF unless
the author is actively working on the patch and is going to produce a
version in the next few days.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-07-10 04:19:14
Message-ID: 20190710.131914.109303138.horikyota.ntt@gmail.com

Hello. Rebased the patch to master (bd56cd75d2).

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v16-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 10.7 KB
v16-0002-Fix-WAL-skipping-feature.patch text/x-patch 29.8 KB
v16-0003-Rename-smgrDoPendingDeletes-to-smgrDoPendingOperatio.patch text/x-patch 7.4 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-07-12 01:03:35
Message-ID: 20190712010335.GB1610889@rfd.leadboat.com

On Wed, Jul 10, 2019 at 01:19:14PM +0900, Kyotaro Horiguchi wrote:
> Hello. Rebased the patch to master(bd56cd75d2).

It looks like you did more than just a rebase, because this v16 no longer
modifies many files that v14 did modify. (That's probably good, since you had
pending review comments.) What other changes did you make?


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-07-12 08:30:41
Message-ID: 20190712.173041.236938840.horikyota.ntt@gmail.com

Many messages seem to have been lost during the move to the new
environment. I'm digging through the archive but couldn't find the
message for v15.

At Thu, 11 Jul 2019 18:03:35 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190712010335(dot)GB1610889(at)rfd(dot)leadboat(dot)com>
> On Wed, Jul 10, 2019 at 01:19:14PM +0900, Kyotaro Horiguchi wrote:
> > Hello. Rebased the patch to master(bd56cd75d2).
>
> It looks like you did more than just a rebase, because this v16 no longer
> modifies many files that v14 did modify. (That's probably good, since you had
> pending review comments.) What other changes did you make?

Yeah. Maybe I forgot to send the pre-v15 or v16 versions before rebasing.

v14: WAL-logging is controlled by the AMs, and syncing at commit is
controlled according to that behavior. At-commit sync is still
controlled on a per-relation basis, which means it must be
processed before the transaction state becomes TRANS_COMMIT. So
it needed to be separated out of AtEOXact_RelationCache() into
PreCommit_RelationSync().

v15: The biggest change is that at-commit sync moved to the smgr
level. The at-commit sync is scheduled at creation of a storage
file (RelationCreateStorage), and smgrDoPendingDeletes (or
smgrDoPendingOperations after the rename) runs the syncs. AMs
are no longer involved, and all permanent relations are
WAL-skipped in the creating transaction while wal_level=minimal.

All storage files created for a relation are synced once, then
removed from the pending list at commit.

v16: rebased.

The v16 patch no longer seems to work, so I'll send a further
rebased version.

Sorry for the late reply and the confusion.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-07-12 08:37:25
Message-ID: 20190712.173725.146281273.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Fri, 12 Jul 2019 17:30:41 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in <20190712(dot)173041(dot)236938840(dot)horikyota(dot)ntt(at)gmail(dot)com>
> v16 no longer seems to work, so I'll send a further rebased version.

It was just due to the renaming of TestLib::real_dir to perl2host.
Here is the rebased version, v17.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v17-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 10.7 KB
v17-0002-Fix-WAL-skipping-feature.patch text/x-patch 30.0 KB
v17-0003-Rename-smgrDoPendingDeletes-to-smgrDoPendingOperatio.patch text/x-patch 7.4 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-07-25 01:39:36
Message-ID: CAKPRHz+Yi-PXGdqcJ0gsEf9=Nx=p_-4MnAgXCVUh=STQZiH5+Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I found the CF bot complaining about this.

It seems that some comment fixes in the recent commit 21039555cd
are the cause.

No substantial changes have been made by this rebasing.

regards.

On Fri, Jul 12, 2019 at 5:37 PM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
>
> At Fri, 12 Jul 2019 17:30:41 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in <20190712(dot)173041(dot)236938840(dot)horikyota(dot)ntt(at)gmail(dot)com>
> > v16 no longer seems to work, so I'll send a further rebased version.
>
> It was just due to the renaming of TestLib::real_dir to perl2host.
> Here is the rebased version, v17.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v18-0001-TAP-test-for-copy-truncation-optimization.patch application/octet-stream 10.7 KB
v18-0002-Fix-WAL-skipping-feature.patch application/octet-stream 29.9 KB
v18-0003-Rename-smgrDoPendingDeletes-to-smgrDoPendingOperatio.patch application/octet-stream 7.4 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-07-27 06:26:07
Message-ID: 20190727062607.GB2294302@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> No substantial changes have been made by this rebasing.

Thanks. I'll likely review this on 2019-08-20. If someone opts to review it
earlier, I welcome that.


From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Robert Haas <robertmhaas(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-08-01 23:35:06
Message-ID: CA+hUKGJKcMFocY71nV3XM-8U=+0T278h0DQ8CPOcO_uzERZ8Og@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jul 27, 2019 at 6:26 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > No substantial changes have been made by this rebasing.
>
> Thanks. I'll likely review this on 2019-08-20. If someone opts to review it
> earlier, I welcome that.

Cool. That'll be in time to be marked committed in the September CF,
this patch's 16th.

--
Thomas Munro
https://enterprisedb.com


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: thomas(dot)munro(at)gmail(dot)com
Cc: noah(at)leadboat(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-08-02 06:32:19
Message-ID: 20190802.153219.207687418.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello.

At Fri, 2 Aug 2019 11:35:06 +1200, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> wrote in <CA+hUKGJKcMFocY71nV3XM-8U=+0T278h0DQ8CPOcO_uzERZ8Og(at)mail(dot)gmail(dot)com>
> On Sat, Jul 27, 2019 at 6:26 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > > No substantial changes have been made by this rebasing.
> >
> > Thanks. I'll likely review this on 2019-08-20. If someone opts to review it
> > earlier, I welcome that.
>
> Cool. That'll be in time to be marked committed in the September CF,
> this patch's 16th.

Yeah, this patch has been reborn far simpler and more generic (and
more robust), thanks to Noah.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-08-18 03:52:30
Message-ID: 20190818035230.GB3021338@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

For two-phase commit, PrepareTransaction() needs to execute pending syncs.

On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
> /* Remember if it's a system catalog */
> is_system_catalog = IsSystemRelation(OldHeap);
>
> - /*
> - * We need to log the copied data in WAL iff WAL archiving/streaming is
> - * enabled AND it's a WAL-logged rel.
> - */
> - use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
> -
> /* use_wal off requires smgr_targblock be initially invalid */
> Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);

Since you're deleting the use_wal variable, update that last comment.

> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
> {
> SMgrRelation srel;
>
> - srel = smgropen(pending->relnode, pending->backend);
> -
> - /* allocate the initial array, or extend it, if needed */
> - if (maxrels == 0)
> + if (pending->dosync)
> {
> - maxrels = 8;
> - srels = palloc(sizeof(SMgrRelation) * maxrels);
> + /* Perform pending sync of WAL-skipped relation */
> + FlushRelationBuffersWithoutRelcache(pending->relnode,
> + false);
> + srel = smgropen(pending->relnode, pending->backend);
> + smgrimmedsync(srel, MAIN_FORKNUM);

This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for
FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may be
no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
false due to this code, use RelationNeedsWAL() for multiple forks, and then
not actually sync all forks.
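As a standalone sketch of the shape of that fix: the sync loop would visit every fork rather than only MAIN_FORKNUM. The fork numbering below follows PostgreSQL's relpath.h, but the `toy_` stand-ins for the smgr calls and the `synced[]` bookkeeping are illustrative only, not code from any patch in this thread.

```c
#include <assert.h>

/* Fork numbers as in PostgreSQL's relpath.h. */
typedef enum ForkNumber
{
    MAIN_FORKNUM = 0,
    FSM_FORKNUM,
    VISIBILITYMAP_FORKNUM,
    INIT_FORKNUM
} ForkNumber;
#define MAX_FORKNUM INIT_FORKNUM

static int synced[MAX_FORKNUM + 1];     /* records which forks got synced */

/* toy stand-ins for smgrexists()/smgrimmedsync(); here the relation is
 * assumed to have main, FSM, and visibility-map forks but no init fork */
static int  toy_smgrexists(ForkNumber fork)    { return fork != INIT_FORKNUM; }
static void toy_smgrimmedsync(ForkNumber fork) { synced[fork] = 1; }

/* Sync every existing fork of the WAL-skipped relation, not just the
 * main fork, so the FSM and visibility map reach disk too. */
static void
sync_all_forks(void)
{
    for (int fork = 0; fork <= (int) MAX_FORKNUM; fork++)
        if (toy_smgrexists((ForkNumber) fork))
            toy_smgrimmedsync((ForkNumber) fork);
}
```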

The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
not appearing here. It said, "Instead, at COMMIT, we'd fsync() the relation,
or if it's smaller than some threshold, WAL-log the contents of the whole file
at that point." Please write the part to WAL-log the contents of small files
instead of syncing them.
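The quoted design can be modeled as a simple commit-time decision. This is a sketch only: the threshold name and value here are invented for illustration, not taken from any patch version in the thread.

```c
#include <assert.h>
#include <stddef.h>

#define BLCKSZ 8192                      /* PostgreSQL's default block size */
#define WAL_SKIP_THRESHOLD (64 * 1024)   /* assumed 64 kB cutoff, illustrative */

typedef enum CommitAction
{
    COMMIT_WAL_LOG_FILE,    /* emit new-page WAL records for the whole file */
    COMMIT_FSYNC_FILE       /* fsync() the relation file instead */
} CommitAction;

/* At COMMIT, a WAL-skipped relation is either copied into WAL (cheap for
 * small files) or fsync'd (cheap for large ones), per the quoted design. */
static CommitAction
choose_commit_action(size_t nblocks)
{
    return (nblocks * BLCKSZ <= WAL_SKIP_THRESHOLD)
        ? COMMIT_WAL_LOG_FILE
        : COMMIT_FSYNC_FILE;
}
```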

> --- a/src/backend/commands/copy.c
> +++ b/src/backend/commands/copy.c
> @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
> * If it does commit, we'll have done the table_finish_bulk_insert() at
> * the bottom of this routine first.
> *
> - * As mentioned in comments in utils/rel.h, the in-same-transaction test
> - * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
> - * can be cleared before the end of the transaction. The exact case is
> - * when a relation sets a new relfilenode twice in same transaction, yet
> - * the second one fails in an aborted subtransaction, e.g.
> - *
> - * BEGIN;
> - * TRUNCATE t;
> - * SAVEPOINT save;
> - * TRUNCATE t;
> - * ROLLBACK TO save;
> - * COPY ...

The comment material being deleted is still correct, so don't delete it.
Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The
attached patch adds an assertion that RelationNeedsWAL() and the
pendingDeletes array have the same opinion about the relfilenode, and it
expands a test case to fail that assertion.

> --- a/src/include/utils/rel.h
> +++ b/src/include/utils/rel.h
> @@ -74,11 +74,13 @@ typedef struct RelationData
> SubTransactionId rd_createSubid; /* rel was created in current xact */
> SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
> * current xact */
> + SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
> + * first in current xact */

In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
all the lines printed. Many bits of code need to look at all three,
e.g. RelationClose(). This field needs to be 100% reliable. In other words,
it must equal InvalidSubTransactionId if and only if the relfilenode matches
the relfilenode that would be in place if the top transaction rolled back.
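The invariant can be stated executably with a toy model. This is not PostgreSQL code; the struct and helpers are invented to illustrate the if-and-only-if condition on the field.

```c
#include <assert.h>

typedef unsigned int SubTransactionId;
#define InvalidSubTransactionId ((SubTransactionId) 0)

/* Toy relation: tracks the current relfilenode, the relfilenode that would
 * survive a top-transaction rollback, and the subxact of the first
 * in-transaction relfilenode assignment. */
typedef struct ToyRel
{
    unsigned int     relfilenode;
    unsigned int     rollback_relfilenode;
    SubTransactionId firstRelfilenodeSubid;
} ToyRel;

static void
toy_assign_relfilenode(ToyRel *rel, unsigned int newnode, SubTransactionId subid)
{
    rel->relfilenode = newnode;
    if (rel->firstRelfilenodeSubid == InvalidSubTransactionId)
        rel->firstRelfilenodeSubid = subid; /* only the first assignment counts */
}

static void
toy_abort_subxact(ToyRel *rel, SubTransactionId subid)
{
    /* aborting the subxact that made the first assignment (or one of its
     * children, modeled here as a larger subid) restores the
     * pre-transaction relfilenode, so the field must be cleared */
    if (rel->firstRelfilenodeSubid >= subid)
    {
        rel->firstRelfilenodeSubid = InvalidSubTransactionId;
        rel->relfilenode = rel->rollback_relfilenode;
    }
}

/* The stated condition: the field is invalid iff the relfilenode is the
 * one a top-transaction rollback would leave in place. */
static int
toy_invariant_holds(const ToyRel *rel)
{
    return (rel->firstRelfilenodeSubid == InvalidSubTransactionId)
        == (rel->relfilenode == rel->rollback_relfilenode);
}
```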

nm

Attachment Content-Type Size
wal-optimize-noah-tests-v2.patch text/plain 2.6 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-08-19 09:59:59
Message-ID: 20190819.185959.118543656.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thank you for taking time.

At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190818035230(dot)GB3021338(at)rfd(dot)leadboat(dot)com>
> For two-phase commit, PrepareTransaction() needs to execute pending syncs.

Now TwoPhaseFileHeader has two new members for (commit-time)
pending syncs. Pending syncs are useless during WAL replay, but
they are needed for COMMIT PREPARED.

> On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > --- a/src/backend/access/heap/heapam_handler.c
> > +++ b/src/backend/access/heap/heapam_handler.c
> > @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
...
> > - use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
> > -
> > /* use_wal off requires smgr_targblock be initially invalid */
> > Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
>
> Since you're deleting the use_wal variable, update that last comment.

Oops. Rewrote it.

> > --- a/src/backend/catalog/storage.c
> > +++ b/src/backend/catalog/storage.c
> > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
...
> > + smgrimmedsync(srel, MAIN_FORKNUM);
>
> This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for
> FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may be
> no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> false due to this code, use RelationNeedsWAL() for multiple forks, and then
> not actually sync all forks.

I agree that all forks need syncing, but FSM and VM check the
(modified) RelationNeedsWAL(). To make sure: are you suggesting
syncing all forks instead of emitting WAL for them, or suggesting
that VM and FSM emit WAL even when the modified
RelationNeedsWAL() returns false (plus syncing all forks)?

> The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
> not appearing here. It said, "Instead, at COMMIT, we'd fsync() the relation,
> or if it's smaller than some threshold, WAL-log the contents of the whole file
> at that point." Please write the part to WAL-log the contents of small files
> instead of syncing them.

I'm not sure of the point of that behavior. I suppose the "log"
is a sequence of new-page records. It also needs to be synced, and
it is always larger than the file to be synced. I can't think of
an appropriate threshold without knowing the point.

> > --- a/src/backend/commands/copy.c
> > +++ b/src/backend/commands/copy.c
> > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
> > * If it does commit, we'll have done the table_finish_bulk_insert() at
> > * the bottom of this routine first.
> > *
> > - * As mentioned in comments in utils/rel.h, the in-same-transaction test
> > - * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
> > - * can be cleared before the end of the transaction. The exact case is
> > - * when a relation sets a new relfilenode twice in same transaction, yet
> > - * the second one fails in an aborted subtransaction, e.g.
> > - *
> > - * BEGIN;
> > - * TRUNCATE t;
> > - * SAVEPOINT save;
> > - * TRUNCATE t;
> > - * ROLLBACK TO save;
> > - * COPY ...
>
> The comment material being deleted is still correct, so don't delete it.
> Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The
> attached patch adds an assertion that RelationNeedsWAL() and the
> pendingDeletes array have the same opinion about the relfilenode, and it
> expands a test case to fail that assertion.

(Un?)Fortunately, that doesn't fail (with the version rebased
onto recent master). I'll recheck it tomorrow.

> > --- a/src/include/utils/rel.h
> > +++ b/src/include/utils/rel.h
> > @@ -74,11 +74,13 @@ typedef struct RelationData
> > SubTransactionId rd_createSubid; /* rel was created in current xact */
> > SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
> > * current xact */
> > + SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
> > + * first in current xact */
>
> In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
> all the lines printed. Many bits of code need to look at all three,
> e.g. RelationClose().

Agreed. I'll recheck that.

> This field needs to be 100% reliable. In other words,
> it must equal InvalidSubTransactionId if and only if the relfilenode matches
> the relfilenode that would be in place if the top transaction rolled back.

I don't get this. I think the variable behaves as you suggested.
It is handled the same way as rd_new* in AtEOSubXact_cleanup, but
the difference is in assignment, not rollback. rd_first* won't
change after the first assignment, so rollback of the subid means
the relfilenode is also rolled back to the initial value at the
beginning of the top transaction.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-08-20 06:03:14
Message-ID: 20190820060314.GA3086296@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190818035230(dot)GB3021338(at)rfd(dot)leadboat(dot)com>
> > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
>
> Now TwoPhaseFileHeader has two new members for (commit-time)
> pending syncs. Pending syncs are useless during WAL replay, but
> they are needed for COMMIT PREPARED.

There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql
command, which is far too late to be syncing new relation files. (A crash may
have already destroyed their data.) PrepareTransaction(), which implements
the PREPARE TRANSACTION command, is the right place for these syncs.

A failure in these new syncs needs to prevent the transaction from being
marked committed. Hence, in CommitTransaction(), these new syncs need to
happen after the last step that could create or assign a new relfilenode and
before RecordTransactionCommit(). I suspect it's best to do it after
PreCommit_on_commit_actions() and before AtEOXact_LargeObject().
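The ordering constraint above can be written down as a checkable sketch. The step list and the name smgrDoPendingSyncs are assumptions for illustration; only the surrounding function names come from this mail.

```c
#include <assert.h>
#include <string.h>

/* Proposed ordering inside CommitTransaction(): pending syncs must run
 * after the last step that can assign a relfilenode and before the commit
 * record is written, so a sync failure still aborts the transaction. */
static const char *const commit_steps[] = {
    "PreCommit_on_commit_actions",  /* last step that may create storage */
    "smgrDoPendingSyncs",           /* assumed name for the new sync step */
    "AtEOXact_LargeObject",
    "RecordTransactionCommit",      /* transaction becomes durably committed */
};

/* position of a named step in the sequence, or -1 if unknown */
static int
step_index(const char *name)
{
    int n = (int) (sizeof(commit_steps) / sizeof(commit_steps[0]));

    for (int i = 0; i < n; i++)
        if (strcmp(commit_steps[i], name) == 0)
            return i;
    return -1;
}
```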

> > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > > --- a/src/backend/catalog/storage.c
> > > +++ b/src/backend/catalog/storage.c
> > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
> ...
> > > + smgrimmedsync(srel, MAIN_FORKNUM);
> >
> > This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for
> > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may be
> > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> > false due to this code, use RelationNeedsWAL() for multiple forks, and then
> > not actually sync all forks.
>
> I agree that all forks need syncing, but FSM and VM check the
> (modified) RelationNeedsWAL(). To make sure: are you suggesting
> syncing all forks instead of emitting WAL for them, or suggesting
> that VM and FSM emit WAL even when the modified
> RelationNeedsWAL() returns false (plus syncing all forks)?

I hadn't thought that far. What do you think is best?

> > The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
> > not appearing here. It said, "Instead, at COMMIT, we'd fsync() the relation,
> > or if it's smaller than some threshold, WAL-log the contents of the whole file
> > at that point." Please write the part to WAL-log the contents of small files
> > instead of syncing them.
>
> I'm not sure of the point of that behavior. I suppose the "log"
> is a sequence of new-page records. It also needs to be synced, and
> it is always larger than the file to be synced. I can't think of
> an appropriate threshold without knowing the point.

Yes, it would be a sequence of new-page records. FlushRelationBuffers() locks
every buffer header containing a buffer of the current database. The belief
has been that writing one page to xlog is cheaper than FlushRelationBuffers()
in a busy system with large shared_buffers.

> > > --- a/src/backend/commands/copy.c
> > > +++ b/src/backend/commands/copy.c
> > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
> > > * If it does commit, we'll have done the table_finish_bulk_insert() at
> > > * the bottom of this routine first.
> > > *
> > > - * As mentioned in comments in utils/rel.h, the in-same-transaction test
> > > - * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
> > > - * can be cleared before the end of the transaction. The exact case is
> > > - * when a relation sets a new relfilenode twice in same transaction, yet
> > > - * the second one fails in an aborted subtransaction, e.g.
> > > - *
> > > - * BEGIN;
> > > - * TRUNCATE t;
> > > - * SAVEPOINT save;
> > > - * TRUNCATE t;
> > > - * ROLLBACK TO save;
> > > - * COPY ...
> >
> > The comment material being deleted is still correct, so don't delete it.
> > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The
> > attached patch adds an assertion that RelationNeedsWAL() and the
> > pendingDeletes array have the same opinion about the relfilenode, and it
> > expands a test case to fail that assertion.
>
> (Un?)Fortunately, that doesn't fail (with the version rebased
> onto recent master). I'll recheck it tomorrow.

Did you build with --enable-cassert?

> > > --- a/src/include/utils/rel.h
> > > +++ b/src/include/utils/rel.h
> > > @@ -74,11 +74,13 @@ typedef struct RelationData
> > > SubTransactionId rd_createSubid; /* rel was created in current xact */
> > > SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
> > > * current xact */
> > > + SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
> > > + * first in current xact */

> > This field needs to be 100% reliable. In other words,
> > it must equal InvalidSubTransactionId if and only if the relfilenode matches
> > the relfilenode that would be in place if the top transaction rolled back.
>
> I don't get this. I think the variable behaves as you suggested.
> It is handled the same way as rd_new* in AtEOSubXact_cleanup, but
> the difference is in assignment, not rollback. rd_first* won't
> change after the first assignment, so rollback of the subid means
> the relfilenode is also rolled back to the initial value at the
> beginning of the top transaction.

$ git grep -n 'rd_firstRelfilenodeSubid = '
src/backend/commands/cluster.c:1061: rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
src/backend/utils/cache/relcache.c:3067: relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
src/backend/utils/cache/relcache.c:3173: relation->rd_firstRelfilenodeSubid = parentSubid;
src/backend/utils/cache/relcache.c:3175: relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;

swap_relation_files() is the only place initializing this field. Many paths
that assign a new relfilenode will never call swap_relation_files().


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-08-20 08:17:57
Message-ID: 20190820.171757.41796743.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello.

At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in <20190819(dot)185959(dot)118543656(dot)horikyota(dot)ntt(at)gmail(dot)com>
> > The comment material being deleted is still correct, so don't delete it.
> > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The
> > attached patch adds an assertion that RelationNeedsWAL() and the
> > pendingDeletes array have the same opinion about the relfilenode, and it
> > expands a test case to fail that assertion.
>
> (Un?)Fortunately, that doesn't fail (with the version rebased
> onto recent master). I'll recheck it tomorrow.

I saw the assertion failure. It's part of the intended behavior.
In this patch, the relcache doesn't hold the whole history of
relfilenodes, so we cannot remove useless pending syncs perfectly.
On the other hand, they are harmless except that they cause extra
syncs of files that are removed immediately afterward. So I chose
to keep pending syncs once they are registered.

If we want consistency here, we need to record the creator subxid
in the PendingRelOps (PendingRelDelete) struct and do a fair
amount of work at subtransaction end.

> > > --- a/src/include/utils/rel.h
> > > +++ b/src/include/utils/rel.h
> > > @@ -74,11 +74,13 @@ typedef struct RelationData
> > > SubTransactionId rd_createSubid; /* rel was created in current xact */
> > > SubTransactionId rd_newRelfilenodeSubid; /* new relfilenode assigned in
> > > * current xact */
> > > + SubTransactionId rd_firstRelfilenodeSubid; /* new relfilenode assigned
> > > + * first in current xact */
> >
> > In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
> > all the lines printed. Many bits of code need to look at all three,
> > e.g. RelationClose().
>
> Agreed. I'll recheck that.
>
> > This field needs to be 100% reliable. In other words,
> > it must equal InvalidSubTransactionId if and only if the relfilenode matches
> > the relfilenode that would be in place if the top transaction rolled back.
>
> I don't get this. I think the variable behaves as you suggested.
> It is handled the same way as rd_new* in AtEOSubXact_cleanup, but
> the difference is in assignment, not rollback. rd_first* won't
> change after the first assignment, so rollback of the subid means
> the relfilenode is also rolled back to the initial value at the
> beginning of the top transaction.

So I'll add this in the next version to see how it looks.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-08-21 07:32:38
Message-ID: 20190821.163238.176512239.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello. New version is attached.

At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in <20190819(dot)185959(dot)118543656(dot)horikyota(dot)ntt(at)gmail(dot)com>
> Thank you for taking time.
>
> At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190818035230(dot)GB3021338(at)rfd(dot)leadboat(dot)com>
> > For two-phase commit, PrepareTransaction() needs to execute pending syncs.

Now TwoPhaseFileHeader has two new members for pending syncs. They
are useless during WAL replay, but they are needed for COMMIT PREPARED.

> > On Thu, Jul 25, 2019 at 10:39:36AM +0900, Kyotaro Horiguchi wrote:
> > > --- a/src/backend/access/heap/heapam_handler.c
> > > +++ b/src/backend/access/heap/heapam_handler.c
> > > @@ -715,12 +702,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
> ...
> > > - use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
> > > -
> > > /* use_wal off requires smgr_targblock be initially invalid */
> > > Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
> >
> > Since you're deleting the use_wal variable, update that last comment.

Oops! Rewrote it.

> > > --- a/src/backend/catalog/storage.c
> > > +++ b/src/backend/catalog/storage.c
> > > @@ -428,21 +450,34 @@ smgrDoPendingDeletes(bool isCommit)
> ...
> > > + smgrimmedsync(srel, MAIN_FORKNUM);
> >
> > This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for
> > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may be
> > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> > false due to this code, use RelationNeedsWAL() for multiple forks, and then
> > not actually sync all forks.
>
> I agree that all forks need syncing, but FSM and VM check the
> (modified) RelationNeedsWAL(). To make sure: are you suggesting
> syncing all forks instead of emitting WAL for them, or suggesting
> that VM and FSM emit WAL even when the modified
> RelationNeedsWAL() returns false (plus syncing all forks)?

In the attached version 19, all forks are synced and no WAL is
emitted for them (as before). FSM and VM are not changed.

> > The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
> > not appearing here. It said, "Instead, at COMMIT, we'd fsync() the relation,
> > or if it's smaller than some threshold, WAL-log the contents of the whole file
> > at that point." Please write the part to WAL-log the contents of small files
> > instead of syncing them.
>
> I'm not sure of the point of that behavior. I suppose the "log"
> is a sequence of new-page records. It also needs to be synced, and
> it is always larger than the file to be synced. I can't think of
> an appropriate threshold without knowing the point.

This is not included in this version. I'll continue to consider
this.

> > > --- a/src/backend/commands/copy.c
> > > +++ b/src/backend/commands/copy.c
> > > @@ -2725,28 +2722,9 @@ CopyFrom(CopyState cstate)
> > > * If it does commit, we'll have done the table_finish_bulk_insert() at
> > > * the bottom of this routine first.
> > > *
> > > - * As mentioned in comments in utils/rel.h, the in-same-transaction test
> > > - * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
> > > - * can be cleared before the end of the transaction. The exact case is
> > > - * when a relation sets a new relfilenode twice in same transaction, yet
> > > - * the second one fails in an aborted subtransaction, e.g.
> > > - *
> > > - * BEGIN;
> > > - * TRUNCATE t;
> > > - * SAVEPOINT save;
> > > - * TRUNCATE t;
> > > - * ROLLBACK TO save;
> > > - * COPY ...
> >
> > The comment material being deleted is still correct, so don't delete it.

The code has been changed to use rd_firstRelfilenodeSubid instead
of rd_newRelfilenodeSubid, which has the issue mentioned in the
deleted section. So the deleted comment is correct but no longer
relevant to the code here. The same thing is written in the
comment in RelationData.

(In short, not reverted.)

> > Moreover, the code managing rd_firstRelfilenodeSubid has a similar bug. The
> > attached patch adds an assertion that RelationNeedsWAL() and the
> > pendingDeletes array have the same opinion about the relfilenode, and it
> > expands a test case to fail that assertion.
..
> > In general, to add a field like this, run "git grep -n 'rd_.*Subid'" and audit
> > all the lines printed. Many bits of code need to look at all three,
> > e.g. RelationClose().

I had forgotten to maintain rd_firstRelfilenodeSubid in many
places; the assertion failure no longer happens after I fixed
that. Contrary to my previous mail, useless pending entries are of
course removed at subtransaction abort, so no needless syncs
happen in that sense. But another type of useless sync was seen
with the previous version 18.

(In short, fixed.)

> > This field needs to be 100% reliable. In other words,
> > it must equal InvalidSubTransactionId if and only if the relfilenode matches
> > the relfilenode that would be in place if the top transaction rolled back.

Sorry, I confused this with another, similar behavior of the
previous version 18, where files were synced even if they were to
be removed immediately at commit. In this version,
smgrDoPendingOperations doesn't sync to-be-deleted files.

While checking this, I found that smgrDoPendingDeletes was making
an unnecessary call to smgrclose(), which led the server to crash
while deleting files. I removed it.

Please find the new version attached.

Changes:

- Rebased to f8cf524da1.

- Fixed PREPARE TRANSACTION; test2a catches this.
(twophase.c)

- Fixed a comment in heapam_relation_copy_for_cluster.

- All forks are synced. (smgrDoPendingDeletes/Operations, SyncRelationFiles)

- Fixed handling of rd_firstRelfilenodeSubid.
(RelationBuildLocalRelation, RelationSetNewRelfilenode,
load_relcache_init_file)

- Prevent to-be-deleted files from being synced. (smgrDoPendingDeletes/Operations)

- Fixed a crash bug caused by smgrclose() in smgrDoPendingOperations.

Minor changes:

- Renamed: PendingRelOps => PendingRelOp
- Type changed: bool PendingRelOp.dosync => PendingOpType PendingRelOp.op

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v19-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 11.3 KB
v19-0002-Fix-WAL-skipping-feature.patch text/x-patch 43.7 KB
v19-0003-Rename-smgrDoPendingDeletes-to-smgrDoPendingOperatio.patch text/x-patch 10.5 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-08-22 04:08:09
Message-ID: 20190822040809.GA3117472@rfd.leadboat.com

On Wed, Aug 21, 2019 at 04:32:38PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 19 Aug 2019 18:59:59 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in <20190819(dot)185959(dot)118543656(dot)horikyota(dot)ntt(at)gmail(dot)com>
> > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190818035230(dot)GB3021338(at)rfd(dot)leadboat(dot)com>
> > > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
>
> Now TwoPhaseFileHeader has two new members for pending syncs. It
> is useless on wal-replay, but that is needed for commit-prepared.

Syncs need to happen in PrepareTransaction(), not in commit-prepared. I wrote
about that in https://postgr.es/m/20190820060314.GA3086296@rfd.leadboat.com


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-08-22 12:06:06
Message-ID: 20190822.210606.07927021.horikyota.ntt@gmail.com

Hello.

At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190820060314(dot)GA3086296(at)rfd(dot)leadboat(dot)com>
> On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190818035230(dot)GB3021338(at)rfd(dot)leadboat(dot)com>
> > > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
> >
> > Now TwoPhaseFileHeader has two new members for (commit-time)
> > pending syncs. Pending-syncs are useless on wal-replay, but that
> > is needed for commit-prepared.
>
> There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql
> command, which is far too late to be syncing new relation files. (A crash may
> have already destroyed their data.) PrepareTransaction(), which implements
> the PREPARE TRANSACTION command, is the right place for these syncs.
>
> A failure in these new syncs needs to prevent the transaction from being
> marked committed. Hence, in CommitTransaction(), these new syncs need to

Agreed.

> happen after the last step that could assign a new relfilenode and
> before RecordTransactionCommit(). I suspect it's best to do it after
> PreCommit_on_commit_actions() and before AtEOXact_LargeObject().

I don't see an obvious problem there. Since pending deletes and
pending syncs are processed separately, I'm planning to keep syncs
in a list separate from deletes.

> > > This should sync all forks, not just MAIN_FORKNUM. Code that writes WAL for
> > > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL(). There may be
> > > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> > > false due to this code, use RelationNeedsWAL() for multiple forks, and then
> > > not actually sync all forks.
> >
> > I agree that all forks needs syncing, but FSM and VM are checking
> > RelationNeedsWAL(modified). To make sure, are you suggesting to
> > sync all forks instead of emitting WAL for them, or suggesting
> > that VM and FSM to emit WALs even when the modified
> > RelationNeedsWAL returns false (+ sync all forks)?
>
> I hadn't thought that far. What do you think is best?

As in the latest patch, sync ALL forks, then no WAL. We could skip
syncing the FSM, but I'm not sure it's worth doing.

> > > The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
> > > not appearing here. It said, "Instead, at COMMIT, we'd fsync() the relation,
> > > or if it's smaller than some threshold, WAL-log the contents of the whole file
> > > at that point." Please write the part to WAL-log the contents of small files
> > > instead of syncing them.
> >
> > I'm not sure the point of the behavior. I suppose that the "log"
> > is a sequence of new_page records. It also needs to be synced and
> > it is always larger than the file to be synced. I can't think of
> > an appropriate threshold without the point.
>
> Yes, it would be a sequence of new-page records. FlushRelationBuffers() locks
> every buffer header containing a buffer of the current database. The belief
> has been that writing one page to xlog is cheaper than FlushRelationBuffers()
> in a busy system with large shared_buffers.

I'm at a loss. The decision between WAL and sync is made at commit
time, when we no longer have a pin on a buffer. When emitting WAL,
contrary to the assumption, the lock needs to be re-acquired for
every page to emit log_newpage. What is worse, we may need to reload
evicted buffers. If the file has been CopyFrom'ed, the ring buffer
strategy makes the situation even worse. That doesn't seem cheap at
all.

If WAL-logging has any chance of winning for smaller files here, it
would be for files smaller than the ring size of the bulk-write
strategy (16MB).

If we pick up every buffer page of the file individually instead of
scanning through all buffers, that makes things worse due to
conflicts on the partition locks.

Any thoughts?

# Sorry time's up today.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-08-26 05:08:43
Message-ID: 20190826050843.GB3153606@rfd.leadboat.com

On Thu, Aug 22, 2019 at 09:06:06PM +0900, Kyotaro Horiguchi wrote:
> At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190820060314(dot)GA3086296(at)rfd(dot)leadboat(dot)com>
> > On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190818035230(dot)GB3021338(at)rfd(dot)leadboat(dot)com>
> > > > The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
> > > > not appearing here. It said, "Instead, at COMMIT, we'd fsync() the relation,
> > > > or if it's smaller than some threshold, WAL-log the contents of the whole file
> > > > at that point." Please write the part to WAL-log the contents of small files
> > > > instead of syncing them.
> > >
> > > I'm not sure the point of the behavior. I suppose that the "log"
> > > is a sequence of new_page records. It also needs to be synced and
> > > it is always larger than the file to be synced. I can't think of
> > > an appropriate threshold without the point.
> >
> > Yes, it would be a sequence of new-page records. FlushRelationBuffers() locks
> > every buffer header containing a buffer of the current database. The belief
> > has been that writing one page to xlog is cheaper than FlushRelationBuffers()
> > in a busy system with large shared_buffers.
>
> I'm at a loss.. The decision between WAL and sync is made at
> commit time, when we no longer have a pin on a buffer. When
> emitting WAL, opposite to the assumption, lock needs to be
> re-acquired for every page to emit log_new_page. What is worse,
> we may need to reload evicted buffers. If the file has been
> CopyFrom'ed, ring buffer strategy makes the situnation farther
> worse. That doesn't seem cheap at all..

Consider a one-page relfilenode. Doing all the things you list for a single
page may be cheaper than locking millions of buffer headers.

> If there were any chance on WAL for smaller files here, it would
> be on the files smaller than the ring size of bulk-write
> strategy(16MB).

Like you, I expect the optimal threshold is less than 16MB, though you should
benchmark to see. Under the ideal threshold, when a transaction creates a new
relfilenode just smaller than the threshold, that transaction will be somewhat
slower than it would be if the threshold were zero. Locking every buffer
header causes a distributed slow-down for other queries, and protecting the
latency of non-DDL queries is typically more useful than accelerating
TRUNCATE, CREATE TABLE, etc. Writing more WAL also slows down other queries;
beyond a certain relfilenode size, the extra WAL harms non-DDL queries more
than the buffer scan harms them. That's about where the threshold should be.

This should be GUC-controlled, especially since this is back-patch material.
We won't necessarily pick the best value on the first attempt, and the best
value could depend on factors like the filesystem, the storage hardware, and
the database's latency goals. One could define the GUC as an absolute size
(e.g. 1MB) or as a ratio of shared_buffers (e.g. GUC value of 0.001 means the
threshold is 1MB when shared_buffers is 1GB). I'm not sure which is better.


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-08-27 06:49:32
Message-ID: 20190827.154932.250364935.horikyota.ntt@gmail.com

Hello.

At Sun, 25 Aug 2019 22:08:43 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190826050843(dot)GB3153606(at)rfd(dot)leadboat(dot)com>
noah> On Thu, Aug 22, 2019 at 09:06:06PM +0900, Kyotaro Horiguchi wrote:
noah> > At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190820060314(dot)GA3086296(at)rfd(dot)leadboat(dot)com>
> > > On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > > > I'm not sure the point of the behavior. I suppose that the "log"
> > > > is a sequence of new_page records. It also needs to be synced and
> > > > it is always larger than the file to be synced. I can't think of
> > > > an appropriate threshold without the point.
> > >
> > > Yes, it would be a sequence of new-page records. FlushRelationBuffers() locks
> > > every buffer header containing a buffer of the current database. The belief
> > > has been that writing one page to xlog is cheaper than FlushRelationBuffers()
> > > in a busy system with large shared_buffers.
> >
> > I'm at a loss.. The decision between WAL and sync is made at
> > commit time, when we no longer have a pin on a buffer. When
> > emitting WAL, opposite to the assumption, lock needs to be
> > re-acquired for every page to emit log_new_page. What is worse,
> > we may need to reload evicted buffers. If the file has been
> > CopyFrom'ed, ring buffer strategy makes the situnation farther
> > worse. That doesn't seem cheap at all..
>
> Consider a one-page relfilenode. Doing all the things you list for a single
> page may be cheaper than locking millions of buffer headers.

If I understand you correctly, I would say that *all* buffers
that don't belong to in-transaction-created files are skipped
before taking locks. No lock conflict happens with other
backends.

FlushRelationBuffers uses double-checked locking as follows:

FlushRelationBuffers_common():
  ...
  if (!islocal)
  {
      for (i for all buffers)
      {
          if (RelFileNodeEquals(bufHdr->tag.rnode, rnode))
          {
              LockBufHdr(bufHdr);
              if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
                  valid && dirty)
              {
                  PinBuffer_Locked(bufHdr);
                  LWLockAcquire();
                  FlushBuffer();
128GB of shared buffers contains 16M buffers. On my
perhaps-Windows-Vista-era box, such a loop takes 15 ms. (Since the
box has only 6GB, the test ignores the caching effect that comes
from the difference in buffer-pool size.) (attached 1)

With WAL emission we would instead find every buffer of the file via
the buffer hash, suffering partition locks rather than the 15 ms of
purely local latency. That seems worse.

> > If there were any chance on WAL for smaller files here, it would
> > be on the files smaller than the ring size of bulk-write
> > strategy(16MB).
>
> Like you, I expect the optimal threshold is less than 16MB, though you should
> benchmark to see. Under the ideal threshold, when a transaction creates a new
> relfilenode just smaller than the threshold, that transaction will be somewhat
> slower than it would be if the threshold were zero. Locking every buffer

I looked closer at this.

For a 16MB file, the write-fsync cost is almost the same as the
WAL-emitting cost. It was about 200 ms on the Vista-era machine with
slow rotating magnetic disks and xfs. (attached 2, 3) Although
write-fsyncing the relation file causes no lock conflicts with other
backends, WAL emission delays other backends' commits by up to that
many milliseconds.

In summary, the characteristics of the two methods on a 16MB file
are as follows.

File write:
- 15ms of buffer scan without locks (@128GB shared buffer)

+ no hash search for a buffer

= take locks on all buffers only of the file one by one (to write)

+ plus 200ms of write-fdatasync (of the whole relation file),
  which doesn't conflict with other backends (except via CPU
  time slots and IO bandwidth).

WAL write :
+ no buffer scan

- 2048 times (16M/8k) of partition lock on finding every buffer
for the target file, which can conflict with other backends.

= take locks on all buffers only of the file one by one (to take FPW)

- plus 200ms of open(create)-write-fdatasync (of a WAL file (of
default size)), which can delay commits on other backends at
most by that duration.

> header causes a distributed slow-down for other queries, and protecting the
> latency of non-DDL queries is typically more useful than accelerating
> TRUNCATE, CREATE TABLE, etc. Writing more WAL also slows down other queries;
> beyond a certain relfilenode size, the extra WAL harms non-DDL queries more
> than the buffer scan harms them. That's about where the threshold should be.

If the discussion above is correct, we shouldn't use WAL-write
even for files around 16MB. For smaller shared_buffers and file
size, the delays are:

Scan all buffers takes:
15 ms for 128GB shared_buffers
4.5ms for 32GB shared_buffers

fdatasync takes:
200 ms for 16MB/sync
51 ms for 1MB/sync
46 ms for 512kB/sync
40 ms for 256kB/sync
37 ms for 128kB/sync
35 ms for <64kB/sync

That seems reasonable for 5400rpm disks. The threshold seems to be
64kB on my configuration. It may differ by configuration, but I
think not by much. (I'm not sure about SSDs or in-memory
filesystems.)

So for smaller than 64kB files:

File write:
-- 15ms of buffer scan without locks
+ no hash search for a buffer
= plus 35 ms of write-fdatasync

WAL write :
++ no buffer scan
- one partition lock on finding every buffer for the target
file, which can conflict with other backends. (but ignorable.)
= plus 35 ms of (open(create)-)write-fdatasync

It's possible that sufficiently small WAL records need no time at
all for their own sync; that is the most obvious gain from WAL
emission. Considering the 5-15 ms of buffer scanning time, 256 or
512 kilobytes are candidate default thresholds, but it would be safe
to use 64kB.

> This should be GUC-controlled, especially since this is back-patch material.

Is a patch of this size back-patchable?

> We won't necessarily pick the best value on the first attempt, and the best
> value could depend on factors like the filesystem, the storage hardware, and
> the database's latency goals. One could define the GUC as an absolute size
> (e.g. 1MB) or as a ratio of shared_buffers (e.g. GUC value of 0.001 means the
> threshold is 1MB when shared_buffers is 1GB). I'm not sure which is better.

I'm not sure whether the knob will show a visible performance gain,
or whether we can offer criteria for identifying the proper value.
But I'll add this feature in the next version, with a GUC
effective_io_block_size defaulting to 64kB as the threshold. (The
name and default value are arguable, of course.)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-08-27 06:59:11
Message-ID: 20190827.155911.17794108.horikyota.ntt@gmail.com

At Tue, 27 Aug 2019 15:49:32 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in <20190827(dot)154932(dot)250364935(dot)horikyota(dot)ntt(at)gmail(dot)com>
> 128GB shared buffers contain 16M buffers. On my
> perhaps-Windows-Vista-era box, such loop takes 15ms. (Since it
> has only 6GB, the test is ignoring the effect of cache that comes
> from the difference of the buffer size). (attached 1)
...
> For a 16MB file, the cost of write-fsyncing cost is almost the
> same to that of WAL-emitting cost. It was about 200 ms on the
> Vista-era machine with non-performant rotating magnetic disks
> with xfs. (attached 2, 3) Although write-fsyncing of relation
> file makes no lock conflict with other backends, WAL-emitting
> delays other backends' commits at most by that milliseconds.

FWIW, the attached are the programs I used to take the numbers.

testloop.c: measures the time to loop over buffers as in FlushRelationBuffers.

testfile.c: measures the time to sync a heap file (one file per size).

testfile2.c: measures the time to emit a WAL record (16MB per file).

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-08-28 06:42:10
Message-ID: 20190828.154210.204505676.horikyota.ntt@gmail.com

Hello, Noah.

At Tue, 27 Aug 2019 15:49:32 +0900 (Tokyo Standard Time), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in <20190827(dot)154932(dot)250364935(dot)horikyota(dot)ntt(at)gmail(dot)com>
> I'm not sure whether the knob shows apparent performance gain and
> whether we can offer the criteria to identify the proper
> value. But I'll add this feature with a GUC
> effective_io_block_size defaults to 64kB as the threshold in the
> next version. (The name and default value are arguable, of course.)

This is a new version of the patch based on the discussion.

The differences from v19 are as follows.

- Removed the new stuff in two-phase.c.

The action on PREPARE TRANSACTION is now taken in
PrepareTransaction(). Instead of storing pending syncs in
two-phase files, the function immediately syncs all files that
can survive the transaction end. (twophase.c, xact.c)

- Separate pendingSyncs from pendingDeletes.

pendingSyncs gets handled differently from pendingDeletes so it
is separated.

- Let smgrDoPendingSyncs() to avoid performing fsync on
to-be-deleted files.

In previous versions the function synced all recorded files even if
they were about to be deleted. Since we now use WAL-logging as the
alternative to fsync, performance matters more than before, so this
version avoids useless fsyncs.

- Use log_newpage instead of fsync for small tables.

As in the discussion up-thread, I now think WAL-logging can work
better than fsync here. smgrDoPendingSync issues log_newpage for
all blocks of any table smaller than the GUC variable
"effective_io_block_size". I found that log_newpage_range() does
exactly what is needed here, but it requires a Relation, which is
not available there. I removed an assertion in
CreateFakeRelcacheEntry so that it also works in non-recovery mode.

- Rebased and fixed some bugs.

I'm trying to measure performance difference on WAL/fsync.

By the way, smgrDoPendingDeletes is called from CommitTransaction
and AbortTransaction directly, and from AbortSubTransaction via
AtSubAbort_smgr(), which calls only smgrDoPendingDeletes() and is
itself called only from AbortSubTransaction. I think these should be
unified one way or the other. Any opinions?

CommitTransaction()
+ smgrDoPendingDeletes()

AbortTransaction()
+ smgrDoPendingDeletes()

AbortSubTransaction()
AtSubAbort_smgr()
+ smgrDoPendingDeletes()

# Looking around, the prefixes AtEOXact/PreCommit/AtAbort don't
# seem to be used according to any consistent principle.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v20-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 11.3 KB
v20-0002-Fix-WAL-skipping-feature.patch text/x-patch 50.0 KB
v20-0003-Documentation-for-effective_io_block_size.patch text/x-patch 1.6 KB
v20-0004-Additional-test-for-new-GUC-setting.patch text/x-patch 1.8 KB

From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: noah(at)leadboat(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-09-02 21:15:00
Message-ID: 20190902211500.GA32761@alvherre.pgsql

I have updated this patch's status to "needs review", since v20 has not
received any comments yet.

Noah, you're listed as committer for this patch. Are you still on the
hook for getting it done during the v13 timeframe?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Noah Misch <noah(at)leadboat(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-09-02 22:28:51
Message-ID: 20190902222851.GA3293865@rfd.leadboat.com

On Mon, Sep 02, 2019 at 05:15:00PM -0400, Alvaro Herrera wrote:
> I have updated this patch's status to "needs review", since v20 has not
> received any comments yet.
>
> Noah, you're listed as committer for this patch. Are you still on the
> hook for getting it done during the v13 timeframe?

Yes, assuming "getting it done" = "getting the CF entry to state other than
Needs Review".


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-09-10 11:45:17
Message-ID: 20190910114517.GA29650@gust.leadboat.com

[Casual readers with opinions on GUC naming: consider skipping to the end.]

MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or
rd_createSubid is set; see attached test case. It needs to skip WAL whenever
RelationNeedsWAL() returns false.

On Tue, Aug 27, 2019 at 03:49:32PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 25 Aug 2019 22:08:43 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in <20190826050843(dot)GB3153606(at)rfd(dot)leadboat(dot)com>
> > Consider a one-page relfilenode. Doing all the things you list for a single
> > page may be cheaper than locking millions of buffer headers.
>
> If I understand you correctly, I would say that *all* buffers
> that don't belong to in-transaction-created files are skipped
> before taking locks. No lock conflict happens with other
> backends.
>
> FlushRelationBuffers uses double-checked-locking as follows:

I had misread the code; you're right.

> > This should be GUC-controlled, especially since this is back-patch material.
>
> Is this size of patch back-patchable?

Its size is not an obstacle. It's not ideal to back-patch such a user-visible
performance change, but it would be worse to leave back branches able to
corrupt data during recovery.

On Wed, Aug 28, 2019 at 03:42:10PM +0900, Kyotaro Horiguchi wrote:
> - Use log_newpage instead of fsync for small tables.

> I'm trying to measure performance difference on WAL/fsync.

I would measure it with simultaneous pgbench instances:

1. DDL pgbench instance repeatedly creates and drops a table of X kilobytes,
using --rate to make this happen a fixed number of times per second.
2. Regular pgbench instance runs the built-in script at maximum qps.

For each X, try one test run with effective_io_block_size = X-1 and one with
effective_io_block_size = X. If the regular pgbench instance gets materially
higher qps with effective_io_block_size = X-1, the ideal default is <X.
Otherwise, the ideal default is >=X.

> + <varlistentry id="guc-effective-io-block-size" xreflabel="effective_io_block_size">
> + <term><varname>effective_io_block_size</varname> (<type>integer</type>)
> + <indexterm>
> + <primary><varname>effective_io_block_size</varname> configuration parameter</primary>
> + </indexterm>
> + </term>
> + <listitem>
> + <para>
> + Specifies the expected maximum size of a file for which <function>fsync</function> returns in the minimum required duration. It is approximately the size of a track or sylinder for magnetic disks.
> + The value is specified in kilobytes and the default is <literal>64</literal> kilobytes.
> + </para>
> + <para>
> + When <xref linkend="guc-wal-level"/> is <literal>minimal</literal>,
> + WAL-logging is skipped for tables created in-trasaction. If a table
> + is smaller than that size at commit, it is WAL-logged instead of
> + issueing <function>fsync</function> on it.
> +
> + </para>
> + </listitem>
> + </varlistentry>

Cylinder and track sizes are obsolete as user-visible concepts. (They're not
constant for a given drive, and I think modern disks provide no way to read
the relevant parameters.) I like the name "wal_skip_threshold", and my second
choice would be "wal_skip_min_size". Possibly documented as follows:

When wal_level is minimal and a transaction commits after creating or
rewriting a permanent table, materialized view, or index, this setting
determines how to persist the new data. If the data is smaller than this
setting, write it to the WAL log; otherwise, use an fsync of the data file.
Depending on the properties of your storage, raising or lowering this value
might help if such commits are slowing concurrent transactions. The default
is 64 kilobytes (64kB).

Any other opinions on the GUC name?

Attachment Content-Type Size
wal-optimize-noah-tests-v3.patch text/x-diff 1.4 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-10-25 04:12:51
Message-ID: 20191025.131251.322449872063947371.horikyota.ntt@gmail.com

Hello. Thanks for the comment.

# Sorry in advance for possibly breaking the thread.

> MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or
> rd_createSubid is set; see attached test case. It needs to skip WAL whenever
> RelationNeedsWAL() returns false.

Thanks for pointing that out. And the test patch helped me very much.

Most callers can tell that to the function, but SetHintBits() cannot
easily. Rather, I think we shouldn't even try to do that. Instead,
in the attached, MarkBufferDirtyHint() asks storage.c for the
sync-pending state of the relfilenode of the buffer. In the attached
patch (0003), RelFileNodeSkippingWAL loops over pendingSyncs, but it
is called only at the time an FPW is added, so I believe it doesn't
affect performance much. However, we could use a hash for
pendingSyncs instead of a linked list. Anyway, the change is in its
own file, v21-0003-Fix-MarkBufferDirtyHint.patch, which will be
merged into 0002.

AFAICS every XLogInsert is either guarded by RelationNeedsWAL() or
in non-wal_minimal code paths.

> Cylinder and track sizes are obsolete as user-visible concepts. (They're not
> constant for a given drive, and I think modern disks provide no way to read
> the relevant parameters.) I like the name "wal_skip_threshold", and my second

I strongly agree. Thanks for the draft; I used it as-is. I couldn't
come up with an appropriate second paragraph for the GUC
description, so I just removed it.

# it was "For rotating magnetic disks, it is around the size of a
# track or sylinder."

> the relevant parameters.) I like the name "wal_skip_threshold", and
> my second choice would be "wal_skip_min_size". Possibly documented
> as follows:
..
> Any other opinions on the GUC name?

I prefer the first candidate. I already used that terminology in
storage.c, and the name fits the context better.

> * We emit newpage WAL records for smaller size of relations.
> *
> * Small WAL records have a chance to be emitted at once along with
> * other backends' WAL records. We emit WAL records instead of syncing
> * for files that are smaller than a certain threshold expecting faster
- * commit. The threshold is defined by the GUC effective_io_block_size.
+ * commit. The threshold is defined by the GUC wal_skip_threshold.

The attached are:

- v21-0001-TAP-test-for-copy-truncation-optimization.patch
same as v20

- v21-0002-Fix-WAL-skipping-feature.patch
GUC name changed.

- v21-0003-Fix-MarkBufferDirtyHint.patch
PoC of fixing the function. will be merged into 0002. (New)

- v21-0004-Documentation-for-wal_skip_threshold.patch
GUC name and description changed. (Previous 0003)

- v21-0005-Additional-test-for-new-GUC-setting.patch
including adjusted version of wal-optimize-noah-tests-v3.patch
Maybe test names need further adjustment. (Previous 0004)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v21-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 11.3 KB
v21-0002-Fix-WAL-skipping-feature.patch text/x-patch 49.9 KB
v21-0003-Fix-MarkBufferDirtyHint.patch text/x-patch 2.3 KB
v21-0004-Documentation-for-wal_skip_threshold.patch text/x-patch 1.8 KB
v21-0005-Additional-test-for-new-GUC-setting.patch text/x-patch 2.8 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-10-25 04:46:14
Message-ID: CAKPRHzJopGtEdbDFuJ_zfhZ0QVBKGGMdQ4cKXC5k5+yqFbGJoQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Ugh!

On Fri, Oct 25, 2019 at 13:13, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote:

> that. Instead, In the attached, MarkBufferDirtyHint() asks storage.c
> for sync-pending state of the relfilenode for the buffer. In the
> attached patch (0003)
> regards.
>

It's wrong that it also skips changing the flags.
I'll fix it soon.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, robertmhaas(at)gmail(dot)com, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-10-25 13:20:32
Message-ID: CAKPRHzJSgMuNeCtKgSxBJd2zOgK3BKL13Pkn_6_Sr9qXCRU=fQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 25, 2019 at 1:13 PM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
> Hello. Thanks for the comment.
>
> # Sorry in advance for possilbe breaking the thread.
>
> > MarkBufferDirtyHint() writes WAL even when rd_firstRelfilenodeSubid or
> > rd_createSubid is set; see attached test case. It needs to skip WAL whenever
> > RelationNeedsWAL() returns false.
>
> Thanks for pointing out that. And the test patch helped me very much.
>
> Most of callers can tell that to the function, but SetHintBits()
> cannot easily. Rather I think we shouldn't even try to do
> that. Instead, In the attached, MarkBufferDirtyHint() asks storage.c
> for sync-pending state of the relfilenode for the buffer. In the
> attached patch (0003) RelFileNodeSkippingWAL loops over pendingSyncs
> but it is called only at the time FPW is added so I believe it doesn't
> affect performance so much. However, we can use hash for pendingSyncs
> instead of liked list. Anyway the change is in its own file
> v21-0003-Fix-MarkBufferDirtyHint.patch, which will be merged into
> 0002.
>
> AFAICS all XLogInsert is guarded by RelationNeedsWAL() or in the
> non-wal_minimal code paths.
>
> > Cylinder and track sizes are obsolete as user-visible concepts. (They're not
> > onstant for a given drive, and I think modern disks provide no way to read
> > the relevant parameters.) I like the name "wal_skip_threshold", and my second
>
> I strongly agree. Thanks for the draft. I used it as-is. I don't come
> up with an appropriate second description of the GUC so I just removed
> it.
>
> # it was "For rotating magnetic disks, it is around the size of a
> # track or sylinder."
>
> > the relevant parameters.) I like the name "wal_skip_threshold", and
> > my second choice would be "wal_skip_min_size". Possibly documented
> > as follows:
> ..
> > Any other opinions on the GUC name?
>
> I prefer the first candidate. I already used the terminology in
> storage.c and the name fits more to the context.
>
> > * We emit newpage WAL records for smaller size of relations.
> > *
> > * Small WAL records have a chance to be emitted at once along with
> > * other backends' WAL records. We emit WAL records instead of syncing
> > * for files that are smaller than a certain threshold expecting faster
> - * commit. The threshold is defined by the GUC effective_io_block_size.
> + * commit. The threshold is defined by the GUC wal_skip_threshold.

> It's wrong that it also skips changing flags.
> I"ll fix it soon

This is the fixed version, v22.

The attached are:

- v22-0001-TAP-test-for-copy-truncation-optimization.patch
Same as v20, 21

- v22-0002-Fix-WAL-skipping-feature.patch
GUC name changed. Same as v21.

- v22-0003-Fix-MarkBufferDirtyHint.patch
PoC of fixing the function. will be merged into 0002. (New in v21,
fixed in v22)

- v21-0004-Documentation-for-wal_skip_threshold.patch
GUC name and description changed. (Previous 0003, same as v21)

- v21-0005-Additional-test-for-new-GUC-setting.patch
including adjusted version of wal-optimize-noah-tests-v3.patch
Maybe test names need further adjustment. (Previous 0004, same as v21)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v22-0004-Documentation-for-wal_skip_threshold.patch application/octet-stream 1.8 KB
v22-0001-TAP-test-for-copy-truncation-optimization.patch application/octet-stream 11.3 KB
v22-0002-Fix-WAL-skipping-feature.patch application/octet-stream 49.6 KB
v22-0005-Additional-test-for-new-GUC-setting.patch application/octet-stream 2.8 KB
v22-0003-Fix-MarkBufferDirtyHint.patch application/octet-stream 2.8 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-05 21:16:14
Message-ID: CA+TgmoYzTPpxvPUUs_ALW1YYNxV6mCrj6zwqnaN6FqdNc1JUzw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Oct 25, 2019 at 9:21 AM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
> This is the fixed verison v22.

I'd like to offer a few thoughts on this thread and on these patches,
which is now more than 4 years old and more than 150 messages in
length.

First, I'd like to restate my understanding of the problem just to see
whether I've got the right idea and whether we're all on the same
page. When wal_level=minimal, we sometimes try to skip WAL logging on
newly-created relations in favor of fsync-ing the relation at commit
time. The idea is that if the transaction aborts or is aborted by a
crash, the contents of the relation don't need to be reproduced
because they are irrelevant, so no WAL is needed, and if the
transaction commits we can't lose any data on a crash because we've
already fsync'd, and standbys don't matter because wal_level=minimal
precludes having any. However, we're not entirely consistent about
skipping WAL-logging: some operations do and others don't, and this
causes confusion if a crash occurs, because we might try to replay
some of the things that happened to that relation but not all of them.
For example, the original poster complained about a sequence of steps
where an index truncation was logged but subsequent index insertions
were not; a badly-timed crash will replay the truncation but can't
replay the index insertions because they weren't logged in the first
place; consequently, while the state was actually OK at the beginning
of replay, it's no longer OK by the end. Replaying nothing would've
been OK, but replaying some things and not others isn't.

Second, for anyone who is not following this thread closely but is
interested in a summary, I'd like to summarize how I believe that the
current patch proposes to solve the problem. As I understand it, the
approach taken by the patch is to try to change things so that we log
nothing at all for relations created or truncated in the current
top-level transaction, and everything for others. To achieve this, the
patch makes a number of changes, three of which seem to me to be
particularly key. One, the patch changes the relcache infrastructure
with the goal of making it possible to reliably identify whether a
relation has been created or truncated in the current toplevel
transaction; our current code does have tracking for this, but it's
not 100% accurate. Two, the patch changes the definition of
RelationNeedsWAL() so that it not only checks that the relation is a
permanent one, but also that either wal_level != minimal or the
relation was not created in the current transaction. It seems to me
that if RelationNeedsWAL() is used to gate every test for whether or
not to write WAL pertaining to a particular relation, this ought to
achieve the desired behavior of logging either everything or nothing.
It is not quite clear to me how we can be sure that we use that in
every relevant place. Three, the patch replaces the various ad-hoc
bits of code which fsync relations which perform unlogged operations
on permanent relations with a new tracking mechanism that arranges to
perform all of the relevant fsync() calls at commit time. This is
further augmented with a mechanism that instead logs all the relation
pages in lieu of fsync()ing if the relation is very small, on the
theory that logging a few FPIs will be cheaper than an fsync(). I view
this additional mechanism as perhaps a bit much for a bug fix patch,
but I understand that the goal is to prevent a performance regression,
and it's not really over the top, so I think it's probably OK.
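To make the second of those three changes concrete, the revised gate can be modeled in a few lines of self-contained C. This is only a hedged sketch: the field names follow the discussion, but the types, the wal_level handling, and the function form are stand-ins, not PostgreSQL's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in types; the real definitions live in PostgreSQL's headers. */
typedef unsigned int SubTransactionId;
#define InvalidSubTransactionId ((SubTransactionId) 0)

enum { WAL_LEVEL_MINIMAL, WAL_LEVEL_REPLICA };
static int wal_level = WAL_LEVEL_MINIMAL;

typedef struct Relation
{
    bool             rd_is_permanent;           /* permanent relation? */
    SubTransactionId rd_createSubid;            /* created in this xact */
    SubTransactionId rd_firstRelfilenodeSubid;  /* new node in this xact */
} Relation;

/*
 * A permanent relation needs WAL unless wal_level is minimal AND its
 * storage is new in the current top-level transaction, in which case
 * nothing is logged and the relation is synced at commit instead.
 */
static bool
relation_needs_wal(const Relation *rel)
{
    return rel->rd_is_permanent &&
        (wal_level >= WAL_LEVEL_REPLICA ||
         (rel->rd_createSubid == InvalidSubTransactionId &&
          rel->rd_firstRelfilenodeSubid == InvalidSubTransactionId));
}
```

If every write of WAL pertaining to table contents is gated on this predicate, a relation gets either all of its WAL or none of it, which is the all-or-nothing property the problem statement above calls for.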

Third, I'd like to offer a couple of general comments on the state of
these patches. Broadly, I think they look pretty good. They seem quite
well-engineered to me and as far as I can see the overall approach is
sound. I think there are a number of places where the comments could
be better; I'll include a few points on that further down. I also
think that the code in swap_relation_files() which takes ExclusiveLock
on the relations looks quite strange. It's hard to understand why it's
needed at all, or why that lock level is used. On the flip side, I
think that the test suite looks really impressive and should be of
considerable help not only in making sure that this is fixed but
detecting if it gets broken again in the future. Perhaps it doesn't
cover every scenario we care about, but if that turns out to be the
case, it seems like it would be easy to generalize further. I really
like the idea of this *kind* of test framework.

Comments on comments, and other nitpicking:

- in-trasaction is mis-spelled in the doc patch. accidentially is
mis-spelled in the 0002 patch.
- I think the header comment for the new TAP test could do a far
better job explaining the overall goal of this testing than it
actually does.
- I think somewhere in relcache.c or rel.h there ought to be comments
explaining the precise degree to which rd_createSubid,
rd_newRelfilenodeSubid, and rd_firstRelfilenodeSubid are reliable,
including problem scenarios. This patch removes some language of this
sort from CopyFrom(), which was a funny place to have that information
in the first place, but I don't see that it adds anything to replace
it. I also think that we ought to explain - for the fields that are
reliable - that they need to be reliable precisely for the purpose of
not breaking this stuff. There's a bit of this right now:

+ * rd_firstRelfilenodeSubid is the ID of the first subtransaction the
+ * relfilenode change has took place in the current transaction. Unlike
+ * newRelfilenodeSubid, this won't be accidentially forgotten. A valid OID
+ * means that the currently active relfilenode is transaction-local and we
+ * sync the relation at commit instead of WAL-logging.

...but I think that needs to be somewhat expanded and clarified.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-05 22:53:35
Message-ID: 20191105225335.GA395764@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 05, 2019 at 04:16:14PM -0500, Robert Haas wrote:
> On Fri, Oct 25, 2019 at 9:21 AM Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > This is the fixed verison v22.
>
> I'd like to offer a few thoughts on this thread and on these patches,
> which is now more than 4 years old and more than 150 messages in
> length.
...

Your understanding matches mine. Thanks for studying this. I had been
feeling nervous about being the sole reviewer of the latest design.

> Comments on comments, and other nitpicking:

I started pre-commit editing on 2019-10-28, and comment+README updates have
been the largest part of that. I'll check my edits against the things you
list here, and I'll share on-list before committing. I've now marked the CF
entry Ready for Committer.


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: robertmhaas(at)gmail(dot)com
Cc: noah(at)leadboat(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-06 08:29:27
Message-ID: 20191106.172927.978481031093477019.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thank you for looking at this.

At Tue, 5 Nov 2019 16:16:14 -0500, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in
> On Fri, Oct 25, 2019 at 9:21 AM Kyotaro Horiguchi
> <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > This is the fixed verison v22.
> First, I'd like to restate my understanding of the problem just to see
..
> Second, for anyone who is not following this thread closely but is

Thanks for restating the issue and summarizing this patch. All of the
description matches my understanding.

> perform all of the relevant fsync() calls at commit time. This is
> further augmented with a mechanism that instead logs all the relation
> pages in lieu of fsync()ing if the relation is very small, on the
> theory that logging a few FPIs will be cheaper than an fsync(). I view
> this additional mechanism as perhaps a bit much for a bug fix patch,
> but I understand that the goal is to prevent a performance regression,
> and it's not really over the top, so I think it's probably OK.

Thanks. It would need some benchmarking, as mentioned upthread. My new
machine is now working stably, so I will do that.

> sound. I think there are a number of places where the comments could
> be better; I'll include a few points on that further down. I also
> think that the code in swap_relation_files() which takes ExclusiveLock
> on the relations looks quite strange. It's hard to understand why it's
> needed at all, or why that lock level is used. On the flip side, I

Right. Using AccessExclusiveLock there *was* a mistake. On second
thought, callers must already have taken locks on the relations at the
level required for relfilenode swapping. However, one problematic case
is the toast indexes of the target relation, which are not locked at
all. In the end I used AccessShareLock, since it is the weakest level
that still takes a lock, unlike NoLock. In any case, the toast
relation is not accessible from outside the session. (Done in the
attached.)

> think that the test suite looks really impressive and should be of
> considerable help not only in making sure that this is fixed but
> detecting if it gets broken again in the future. Perhaps it doesn't
> cover every scenario we care about, but if that turns out to be the
> case, it seems like it would be easily to further generalize. I really
> like the idea of this *kind* of test framework.

The paths running swap_relation_files() are not covered: CLUSTER,
REFRESH MATERIALIZED VIEW, and ALTER TABLE. CLUSTER and ALTER TABLE
can interact with INSERTs, but REFRESH MATERIALIZED VIEW cannot.
Copying some of the existing test cases that use them will work. (Not
yet done.)

> Comments on comments, and other nitpicking:
>
> - in-trasaction is mis-spelled in the doc patch. accidentially is
> mis-spelled in the 0002 patch.

Thanks. I found another couple of typos, "issueing"->"issuing" and
"skpped"->"skipped", by ispell'ing the git diff output; all are fixed.

> - I think the header comment for the new TAP test could do a far
> better job explaining the overall goal of this testing than it
> actually does.

I rewrote it...

> - I think somewhere in relcache.c or rel.h there ought to be comments
> explaining the precise degree to which rd_createSubid,
> rd_newRelfilenodeSubid, and rd_firstRelfilenodeSubid are reliable,
> including problem scenarios. This patch removes some language of this
> sort from CopyFrom(), which was a funny place to have that information
> in the first place, but I don't see that it adds anything to replace
> it. I also think that we ought to explain - for the fields that are
> reliable - that they need to be reliable precisely for the purpose of
> not breaking this stuff. There's a bit of this right now:
>
> + * rd_firstRelfilenodeSubid is the ID of the first subtransaction the
> + * relfilenode change has took place in the current transaction. Unlike
> + * newRelfilenodeSubid, this won't be accidentially forgotten. A valid OID
> + * means that the currently active relfilenode is transaction-local and we
> + * sync the relation at commit instead of WAL-logging.
>
> ...but I think that needs to be somewhat expanded and clarified.

Agreed. It may be crude, but I added descriptions of how the variables
work, contrasting them with rd_firstRelfilenodeSubid.

# rd_first* is not a hint, in the sense that it is reliable, but it is
# called a hint in some places; that will need fixing.

If the fix of MarkBufferDirtyHint is ok, I'll merge it into 0002.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v23-0001-TAP-test-for-copy-truncation-optimization.patch text/x-patch 11.6 KB
v23-0002-Fix-WAL-skipping-feature.patch text/x-patch 52.0 KB
v23-0003-Fix-MarkBufferDirtyHint.patch text/x-patch 2.3 KB
v23-0004-Documentation-for-wal_skip_threshold.patch text/x-patch 1.8 KB
v23-0005-Additional-test-for-new-GUC-setting.patch text/x-patch 2.9 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-18 04:54:34
Message-ID: 20191118045434.GA1173436@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> I started pre-commit editing on 2019-10-28, and comment+README updates have
> been the largest part of that. I'll check my edits against the things you
> list here, and I'll share on-list before committing. I've now marked the CF
> entry Ready for Committer.

Having dedicated many days to that, I am attaching v24nm. I know of two
remaining defects:

=== Defect 1: gistGetFakeLSN()

When I modified pg_regress.c to use wal_level=minimal for all suites,
src/test/isolation/specs/predicate-gist.spec failed the assertion in
gistGetFakeLSN(). One could reproduce the problem just by running this
sequence in psql:

begin;
create table gist_point_tbl(id int4, p point);
create index gist_pointidx on gist_point_tbl using gist(p);
insert into gist_point_tbl (id, p)
select g, point(g*10, g*10) from generate_series(1, 1000) g;

I've included a wrong-in-general hack to make the test pass. I see two main
options for fixing this:

(a) Introduce an empty WAL record that reserves an LSN and has no other
effect. Make GiST use that for permanent relations that are skipping WAL.
Further optimizations are possible. For example, we could use a backend-local
counter (like the one gistGetFakeLSN() uses for temp relations) until the
counter is greater than a recent real LSN. That optimization is probably too
clever, though it would make the new WAL record almost never appear.

(b) Exempt GiST from most WAL skipping. GiST index build could still skip
WAL, but it would do its own smgrimmedsync() in addition to the one done at
commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
RelationNeedsWAL(), and we'd need some hack for index_copy_data() and possibly
other AM-independent code that skips WAL.

Overall, I like the cleanliness of (a). The main argument for (b) is that it
ensures we have all the features to opt-out of WAL skipping, which could be
useful for out-of-tree index access methods. (I think we currently have the
features for a tableam to do so, but not for an indexam to do so.) Overall, I
lean toward (a). Any other ideas or preferences?
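To make the counter idea in (a) concrete, here is a minimal sketch. Everything here is hypothetical: the names and the stand-in LSN plumbing do not correspond to actual PostgreSQL code, only to the shape of the optimization (hand out backend-local values until they catch up with a recent real LSN, then pay for the empty record).

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Stand-in for the current end-of-WAL position. */
static XLogRecPtr current_insert_lsn = 5000;

/* Stand-in for emitting the proposed empty, LSN-reserving WAL record. */
static XLogRecPtr
reserve_real_lsn(void)
{
    return ++current_insert_lsn;
}

/*
 * Hand out backend-local counter values (as gistGetFakeLSN() does for
 * temp relations) while they are below a recent real LSN; only once
 * the counter catches up is the empty record emitted.
 */
static XLogRecPtr
get_fake_lsn(void)
{
    static XLogRecPtr counter = 1;

    if (counter < current_insert_lsn)
        return counter++;          /* cheap path: no WAL record */
    return reserve_real_lsn();     /* rare path: reserve a real LSN */
}
```

Values are strictly increasing within a backend, which is the property GiST needs from a fake LSN, and the empty record would indeed almost never appear.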

=== Defect 2: repetitive work when syncing many relations

For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
sophisticated about optimizing the shared buffers scan. Commit 279628a
introduced that, in 2013. I think smgrDoPendingSyncs() should do likewise, to
further reduce the chance of causing performance regressions. (One could,
however, work around the problem by raising wal_skip_threshold.) Kyotaro, if
you agree, could you modify v24nm to implement that?

Notable changes in v24nm:

- Wrote section "Skipping WAL for New RelFileNode" in
src/backend/access/transam/README to be the main source concerning the new
coding rules.

- Updated numerous comments and doc sections.

- Eliminated the pendingSyncs list in favor of a "sync" field in
pendingDeletes. I mostly did this to eliminate the possibility of the lists
getting out of sync. This removed considerable parallel code for managing a
second list at end-of-xact. We now call smgrDoPendingSyncs() only when
committing or preparing a top-level transaction.

- Whenever code sets an rd_*Subid field of a Relation, it must call
EOXactListAdd(). swap_relation_files() was not doing so, so the field
remained set during the next transaction. I introduced
RelationAssumeNewRelfilenode() to handle both tasks, and I located the call
so it also affects the mapped relation case.

- In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild,
rd_createSubid remained set. (That happened before this patch, but it has
been harmless.) I fixed this in heap_create().

- Made smgrDoPendingSyncs() stop exempting FSM_FORKNUM. A sync is necessary
when checksums are enabled. Observe the precedent that
RelationCopyStorage() has not been exempting FSM_FORKNUM.

- Pass log_newpage_range() a "false" for page_std, for the same reason
RelationCopyStorage() does.

- log_newpage_range() ignored its forkNum and page_std arguments, so we logged
the wrong data for non-main forks. Before this patch, callers always passed
MAIN_FORKNUM and "true", hence the lack of complaints.

- Restored table_finish_bulk_insert(), though heapam no longer provides a
callback. The API is still well-defined, and other table AMs might have use
for it. Removing it feels like a separate proposal.

- Removed TABLE_INSERT_SKIP_WAL. Any out-of-tree code using it should revisit
itself in light of this patch.

- Fixed smgrDoPendingSyncs() to reinitialize total_blocks for each relation;
it was overcounting.

- Made us skip WAL after SET TABLESPACE, like we do after CLUSTER.

- Moved the wal_skip_threshold docs from "Resource Consumption" -> "Disk" to
"Write Ahead Log" -> "Settings", between similar settings
wal_writer_flush_after and commit_delay. The other place I considered was
"Resource Consumption" -> "Asynchronous Behavior", due to the similarity of
backend_flush_after.

- Gave each test a unique name. Changed test table names to be descriptive,
e.g. test7 became trunc_trig.

- Squashed all patches into one. Split patches are good when one could
reasonably choose to push a subset, but that didn't apply here. I wouldn't
push a GUC implementation without its documentation. Since the tests fail
without the main bug fix, I wouldn't push tests separately.

By the way, based on the comment at zheap_prepare_insert(), I expect zheap
will exempt itself from skipping WAL. It may stop calling RelationNeedsWAL()
and instead test for RELPERSISTENCE_PERMANENT.

nm

Attachment Content-Type Size
skip-wal-v24nm.patch text/plain 81.3 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-20 06:05:46
Message-ID: 20191120.150546.1050157217400213784.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I'm in the middle of my benchmarking week.

Thanks for reviewing!

At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > I started pre-commit editing on 2019-10-28, and comment+README updates have
> > been the largest part of that. I'll check my edits against the things you
> > list here, and I'll share on-list before committing. I've now marked the CF
> > entry Ready for Committer.

I'll look into that soon.

By the way, before finalizing this, I'd like to share the results of
some brief benchmarking.

First, I measured the direct effect of WAL skipping: the time required
to run the following sequence in the commit-time-FPW-WAL case and the
commit-time-fsync case. WAL and heap files are on a non-server-spec
HDD.

BEGIN;
TRUNCATE t;
INSERT INTO t (SELECT a FROM generate_series(1, n) a);
COMMIT;

REPLICA means the time with wal_level = replica
SYNC means the time with wal_level = minimal and force file sync.
WAL means the time with wal_level = minimal and force commit-WAL.
pages is the number of pages of the table.
(REPLICA comes from run.sh 1, SYNC/WAL comes from run.sh 2)

pages REPLICA SYNC WAL
1: 144 ms 683 ms 217 ms
3: 303 ms 995 ms 385 ms
5: 271 ms 1007 ms 217 ms
10: 157 ms 1043 ms 224 ms
17: 189 ms 1007 ms 193 ms
31: 202 ms 1091 ms 230 ms
56: 265 ms 1175 ms 226 ms
100: 510 ms 1307 ms 270 ms
177: 790 ms 1523 ms 524 ms
316: 1827 ms 1643 ms 719 ms
562: 1904 ms 2109 ms 1148 ms
1000: 3060 ms 2979 ms 2113 ms
1778: 6077 ms 3945 ms 3618 ms
3162: 13038 ms 7078 ms 6734 ms

There was a crossing point around 3000 pages. (bench1() finds it by
bisection; run.sh 3.)

With multiple sessions, the crossing point moves, but it does not get
very small.

10 processes (run.pl 4 10) The numbers in parentheses are WAL[n]/WAL[n-1].
pages SYNC WAL
316: 8436 ms 4694 ms
562: 12067 ms 9627 ms (x2.1) # WAL wins
1000: 19154 ms 43262 ms (x4.5) # SYNC wins. WAL's slope becomes steep.
1778: 32495 ms 63863 ms (x1.4)

100 processes (run.pl 4 100)
pages SYNC WAL
10: 13275 ms 1868 ms
17: 15919 ms 4438 ms (x2.3)
31: 17063 ms 6431 ms (x1.5)
56: 23193 ms 14276 ms (x2.2) # WAL wins
100: 35220 ms 67843 ms (x4.8) # SYNC wins. WAL's slope becomes steep.

With 10 pgbench sessions.
pages SYNC WAL
1: 915 ms 301 ms
3: 1634 ms 508 ms
5: 1634 ms 293 ms
10: 1671 ms 1043 ms
17: 1600 ms 333 ms
31: 1864 ms 314 ms
56: 1562 ms 448 ms
100: 1538 ms 394 ms
177: 1697 ms 1047 ms
316: 3074 ms 1788 ms
562: 3306 ms 1245 ms
1000: 3440 ms 2182 ms
1778: 5064 ms 6464 ms # WAL's slope becomes steep
3162: 8675 ms 8165 ms

I don't think the 100-process result is meaningful, so, excluding it,
a candidate for wal_skip_threshold would be 1000 pages.
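The crossing point is exactly what wal_skip_threshold models. As a sketch of the commit-time decision, assuming the GUC is measured in kilobytes and BLCKSZ is 8192 (the helper name and the 8000 kB value, roughly 1000 pages, are illustrative, not the actual code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ 8192                 /* assumed standard block size */

/* Hypothetical GUC value in kilobytes: about 1000 pages of 8 kB. */
static int wal_skip_threshold = 8000;

/*
 * For each WAL-skipped relfilenode at commit: WAL-log every page of a
 * small relation (a few FPIs beat an fsync), fsync a large one.
 */
static bool
log_pages_instead_of_sync(uint64_t total_blocks)
{
    return total_blocks * (BLCKSZ / 1024) < (uint64_t) wal_skip_threshold;
}
```

With this cutoff, the sub-1000-page cases above land on the WAL column and the larger ones on the SYNC column, matching where each strategy wins in the single-session measurements.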

Thoughts? Attached is the benchmark script.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-20 08:31:43
Message-ID: 20191120.173143.1442042654954107403.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > I started pre-commit editing on 2019-10-28, and comment+README updates have
> > been the largest part of that. I'll check my edits against the things you
> > list here, and I'll share on-list before committing. I've now marked the CF
> > entry Ready for Committer.

I looked at this version.

> Notable changes in v24nm:
>
> - Wrote section "Skipping WAL for New RelFileNode" in
> src/backend/access/transam/README to be the main source concerning the new
> coding rules.

Thanks for writing this.

+Prefer to do the same in future access methods. However, two other approaches
+can work. First, an access method can irreversibly transition a given fork
+from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
+smgrimmedsync(). Second, an access method can opt to write WAL
+unconditionally for permanent relations. When using the second method, do not
+call RelationCopyStorage(), which skips WAL.

Even using these methods, CommitTransaction flushes out the buffers
and then syncs the files again. Isn't a description something like the
following needed?

===
Even if an access method has switched an in-transaction-created
relfilenode to WAL-writing, Commit(Prepare)Transaction still flushes
all buffers for the file and then smgrimmedsync()s it.
===

> - Updated numerous comments and doc sections.
>
> - Eliminated the pendingSyncs list in favor of a "sync" field in
> pendingDeletes. I mostly did this to eliminate the possibility of the lists
> getting out of sync. This removed considerable parallel code for managing a
> second list at end-of-xact. We now call smgrDoPendingSyncs() only when
> committing or preparing a top-level transaction.

Mmm. Right. The second list was a leftover from older versions, which
perhaps needed additional work at rollback. Actually, as of v23 the
function syncs no files at rollback. Merging the two is wiser.

> - Whenever code sets an rd_*Subid field of a Relation, it must call
> EOXactListAdd(). swap_relation_files() was not doing so, so the field
> remained set during the next transaction. I introduced
> RelationAssumeNewRelfilenode() to handle both tasks, and I located the call
> so it also affects the mapped relation case.

Ugh. Thanks for pointing that out. By the way,

+ /*
+ * Recognize that rel1's relfilenode (swapped from rel2) is new in this
+ * subtransaction. Since the next step for rel2 is deletion, don't bother
+ * recording the newness of its relfilenode.
+ */
+ rel1 = relation_open(r1, AccessExclusiveLock);
+ RelationAssumeNewRelfilenode(rel1);

The relation cannot be accessed from other sessions. Theoretically it
doesn't need a lock, but NoLock cannot be used there since there's a
path that doesn't take a lock on the relation. Still,
AccessExclusiveLock seems too strong and causes unnecessary side
effects. Couldn't we use a weaker lock?

... Time is up. I'll continue looking into this.

regards.

> - In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild,
> rd_createSubid remained set. (That happened before this patch, but it has
> been harmless.) I fixed this in heap_create().
>
> - Made smgrDoPendingSyncs() stop exempting FSM_FORKNUM. A sync is necessary
> when checksums are enabled. Observe the precedent that
> RelationCopyStorage() has not been exempting FSM_FORKNUM.
>
> - Pass log_newpage_range() a "false" for page_std, for the same reason
> RelationCopyStorage() does.
>
> - log_newpage_range() ignored its forkNum and page_std arguments, so we logged
> the wrong data for non-main forks. Before this patch, callers always passed
> MAIN_FORKNUM and "true", hence the lack of complaints.
>
> - Restored table_finish_bulk_insert(), though heapam no longer provides a
> callback. The API is still well-defined, and other table AMs might have use
> for it. Removing it feels like a separate proposal.
>
> - Removed TABLE_INSERT_SKIP_WAL. Any out-of-tree code using it should revisit
> itself in light of this patch.
>
> - Fixed smgrDoPendingSyncs() to reinitialize total_blocks for each relation;
> it was overcounting.
>
> - Made us skip WAL after SET TABLESPACE, like we do after CLUSTER.
>
> - Moved the wal_skip_threshold docs from "Resource Consumption" -> "Disk" to
> "Write Ahead Log" -> "Settings", between similar settings
> wal_writer_flush_after and commit_delay. The other place I considered was
> "Resource Consumption" -> "Asynchronous Behavior", due to the similarity of
> backend_flush_after.
>
> - Gave each test a unique name. Changed test table names to be descriptive,
> e.g. test7 became trunc_trig.
>
> - Squashed all patches into one. Split patches are good when one could
> reasonably choose to push a subset, but that didn't apply here. I wouldn't
> push a GUC implementation without its documentation. Since the tests fail
> without the main bug fix, I wouldn't push tests separately.
>
> By the way, based on the comment at zheap_prepare_insert(), I expect zheap
> will exempt itself from skipping WAL. It may stop calling RelationNeedsWAL()
> and instead test for RELPERSISTENCE_PERMANENT.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-21 07:01:07
Message-ID: 20191121.160107.1405593316918593372.horikyota.ntt@gmail.com

I should have replied to this first.

At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > I started pre-commit editing on 2019-10-28, and comment+README updates have
> > been the largest part of that. I'll check my edits against the things you
> > list here, and I'll share on-list before committing. I've now marked the CF
> > entry Ready for Committer.
>
> Having dedicated many days to that, I am attaching v24nm. I know of two
> remaining defects:
>
> === Defect 1: gistGetFakeLSN()
>
> When I modified pg_regress.c to use wal_level=minimal for all suites,
> src/test/isolation/specs/predicate-gist.spec failed the assertion in
> gistGetFakeLSN(). One could reproduce the problem just by running this
> sequence in psql:
>
> begin;
> create table gist_point_tbl(id int4, p point);
> create index gist_pointidx on gist_point_tbl using gist(p);
> insert into gist_point_tbl (id, p)
> select g, point(g*10, g*10) from generate_series(1, 1000) g;
>
> I've included a wrong-in-general hack to make the test pass. I see two main
> options for fixing this:
>
> (a) Introduce an empty WAL record that reserves an LSN and has no other
> effect. Make GiST use that for permanent relations that are skipping WAL.
> Further optimizations are possible. For example, we could use a backend-local
> counter (like the one gistGetFakeLSN() uses for temp relations) until the
> counter is greater a recent real LSN. That optimization is probably too
> clever, though it would make the new WAL record almost never appear.
>
> (b) Exempt GiST from most WAL skipping. GiST index build could still skip
> WAL, but it would do its own smgrimmedsync() in addition to the one done at
> commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
> RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
> other AM-independent code that skips WAL.
>
> Overall, I like the cleanliness of (a). The main argument for (b) is that it
> ensures we have all the features to opt-out of WAL skipping, which could be
> useful for out-of-tree index access methods. (I think we currently have the
> features for a tableam to do so, but not for an indexam to do so.) Overall, I
> lean toward (a). Any other ideas or preferences?

I don't like (b) either.

What we need there is some sequence of numbers to use as page LSNs that
is compatible with real LSNs. Couldn't we use GetXLogInsertRecPtr() in
that case? Or, I'm not sure, but I suppose nothing happens when an
UNLOGGED GiST index gets turned into a LOGGED one.

Rewriting the table, as SET LOGGED does, would work but is not realistic.
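As a rough illustration of the refinement mentioned under option (a) above, a backend-local counter used until it would pass a recent real LSN, here is a hedged Python sketch. All names are invented, and `real_lsn` merely stands in for a value like GetXLogInsertRecPtr(); the real implementation would be C inside gistGetFakeLSN().

```python
class FakeLSNSource:
    """Model of option (a): hand out increasing page LSNs, emitting an
    empty WAL record only when the local counter would overtake a real LSN."""
    def __init__(self, initial_real_lsn):
        self.real_lsn = initial_real_lsn   # stand-in for GetXLogInsertRecPtr()
        self.counter = 0                   # backend-local counter
        self.records_emitted = 0           # no-op WAL records written

    def next_lsn(self):
        if self.counter + 1 < self.real_lsn:
            # Cheap path: counter still lags real WAL; no record needed.
            self.counter += 1
            return self.counter
        # Counter caught up: reserve a genuine LSN with an empty record.
        self.real_lsn += 1
        self.records_emitted += 1
        self.counter = self.real_lsn
        return self.counter
```

The point of the sketch is that the returned values are strictly increasing, which is all GiST page LSNs require, while the no-op record is emitted only rarely.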

> === Defect 2: repetitive work when syncing many relations
>
> For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
> smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
> sophisticated about optimizing the shared buffers scan. Commit 279628a
> introduced that, in 2013. I think smgrDoPendingSyncs() should do likewise, to
> further reduce the chance of causing performance regressions. (One could,
> however, work around the problem by raising wal_skip_threshold.) Kyotaro, if
> you agree, could you modify v24nm to implement that?

Seems reasonable. Please wait a minute.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
use_real_lsn_as_fake_lsn.patch text/x-patch 1.0 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-21 07:11:23
Message-ID: 20191121.161123.1761509341556304095.horikyota.ntt@gmail.com

Wow.. This is embarrassing.. *^^*.

At Thu, 21 Nov 2019 16:01:07 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> I should have replied this first.
>
> At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > On Tue, Nov 05, 2019 at 02:53:35PM -0800, Noah Misch wrote:
> > > I started pre-commit editing on 2019-10-28, and comment+README updates have
> > > been the largest part of that. I'll check my edits against the things you
> > > list here, and I'll share on-list before committing. I've now marked the CF
> > > entry Ready for Committer.
> >
> > Having dedicated many days to that, I am attaching v24nm. I know of two
> > remaining defects:
> >
> > === Defect 1: gistGetFakeLSN()
> >
> > When I modified pg_regress.c to use wal_level=minimal for all suites,
> > src/test/isolation/specs/predicate-gist.spec failed the assertion in
> > gistGetFakeLSN(). One could reproduce the problem just by running this
> > sequence in psql:
> >
> > begin;
> > create table gist_point_tbl(id int4, p point);
> > create index gist_pointidx on gist_point_tbl using gist(p);
> > insert into gist_point_tbl (id, p)
> > select g, point(g*10, g*10) from generate_series(1, 1000) g;
> >
> > I've included a wrong-in-general hack to make the test pass. I see two main
> > options for fixing this:
> >
> > (a) Introduce an empty WAL record that reserves an LSN and has no other
> > effect. Make GiST use that for permanent relations that are skipping WAL.
> > Further optimizations are possible. For example, we could use a backend-local
> > counter (like the one gistGetFakeLSN() uses for temp relations) until the
> > counter is greater a recent real LSN. That optimization is probably too
> > clever, though it would make the new WAL record almost never appear.
> >
> > (b) Exempt GiST from most WAL skipping. GiST index build could still skip
> > WAL, but it would do its own smgrimmedsync() in addition to the one done at
> > commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
> > RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
> > other AM-independent code that skips WAL.
> >
> > Overall, I like the cleanliness of (a). The main argument for (b) is that it
> > ensures we have all the features to opt-out of WAL skipping, which could be
> > useful for out-of-tree index access methods. (I think we currently have the
> > features for a tableam to do so, but not for an indexam to do so.) Overall, I
> > lean toward (a). Any other ideas or preferences?
>
> I don't like (b) either.
>
> What we need there is any sequential numbers for page LSN but really
> compatible with real LSN. Couldn't we use GetXLogInsertRecPtr() in the

> case? Or, I'm not sure but I suppose that nothing happens when
> UNLOGGED GiST index gets turned into LOGGED one.

Yes, I just forgot to remove these lines when writing the following.

> Rewriting table like SET LOGGED will work but not realistic.
>
> > === Defect 2: repetitive work when syncing many relations
> >
> > For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
> > smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
> > sophisticated about optimizing the shared buffers scan. Commit 279628a
> > introduced that, in 2013. I think smgrDoPendingSyncs() should do likewise, to
> > further reduce the chance of causing performance regressions. (One could,
> > however, work around the problem by raising wal_skip_threshold.) Kyotaro, if
> > you agree, could you modify v24nm to implement that?
>
> Seems reasonable. Please wait a minute.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-21 10:48:58
Message-ID: 20191121.194858.792948544668017842.horikyota.ntt@gmail.com

At Thu, 21 Nov 2019 16:01:07 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> > For deleting relfilenodes, smgrDoPendingDeletes() collects a list for
> > smgrdounlinkall() to pass to DropRelFileNodesAllBuffers(), which is
> > sophisticated about optimizing the shared buffers scan. Commit 279628a
> > introduced that, in 2013. I think smgrDoPendingSyncs() should do likewise, to
> Seems reasonable. Please wait a minute.

This is the first cut of that. It makes the function FlushRelationBuffersWithoutRelcache, which was introduced in this work, useless; the first patch reverts it, then the second patch adds the bulk sync feature.

The new function FlushRelFileNodesAllBuffers, unlike
DropRelFileNodesAllBuffers, takes SMgrRelations, which FlushBuffer()
requires. So it takes a somewhat tricky approach: it uses the type
SMgrSortArray, a pointer to which is compatible with a pointer to
RelFileNode.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Revert-FlushRelationBuffersWithoutRelcache.patch text/x-patch 3.3 KB
0002-Improve-the-performance-of-relation-syncs.patch text/x-patch 8.6 KB

From: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-22 12:21:31
Message-ID: 1f9a76fe-6f77-eb5f-9292-9b1c92f4f5bd@2ndquadrant.com

On 2019-11-05 22:16, Robert Haas wrote:
> First, I'd like to restate my understanding of the problem just to see
> whether I've got the right idea and whether we're all on the same
> page. When wal_level=minimal, we sometimes try to skip WAL logging on
> newly-created relations in favor of fsync-ing the relation at commit
> time.

How useful is this behavior, relative to all the effort required?

Even if the benefit is significant, how many users can accept running
with wal_level=minimal and thus without replication or efficient backups?

Is there perhaps an alternative approach involving unlogged tables to
get a similar performance benefit?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


From: Noah Misch <noah(at)leadboat(dot)com>
To: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-23 16:35:09
Message-ID: 20191123163509.GA39577@gust.leadboat.com

On Fri, Nov 22, 2019 at 01:21:31PM +0100, Peter Eisentraut wrote:
> On 2019-11-05 22:16, Robert Haas wrote:
> >First, I'd like to restate my understanding of the problem just to see
> >whether I've got the right idea and whether we're all on the same
> >page. When wal_level=minimal, we sometimes try to skip WAL logging on
> >newly-created relations in favor of fsync-ing the relation at commit
> >time.
>
> How useful is this behavior, relative to all the effort required?
>
> Even if the benefit is significant, how many users can accept running with
> wal_level=minimal and thus without replication or efficient backups?

That longstanding optimization is too useful to remove, but likely not useful
enough to add today if we didn't already have it. The initial-data-load use
case remains plausible. I can also imagine using wal_level=minimal for data
warehouse applications where one can quickly rebuild from the authoritative
data.

> Is there perhaps an alternative approach involving unlogged tables to get a
> similar performance benefit?

At wal_level=replica, it seems inevitable that ALTER TABLE SET LOGGED will
need to WAL-log the table contents. I suppose we could keep wal_level=minimal
and change its only difference from wal_level=replica to be that ALTER TABLE
SET LOGGED skips WAL. Currently, ALTER TABLE SET LOGGED also rewrites the
table; that would need to change. I'd want to add ALTER INDEX SET LOGGED,
too. After all that, users would need to modify their applications. Overall,
it's possible, but it's not a clear win over the status quo.


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-23 21:21:36
Message-ID: 20191123212136.GA41522@gust.leadboat.com

On Wed, Nov 20, 2019 at 03:05:46PM +0900, Kyotaro Horiguchi wrote:
> By the way, before finalize this, I'd like to share the result of a
> brief benchmarking.

What non-default settings did you use? Please give the output of this or a
similar command:

select name, setting from pg_settings where setting <> boot_val;

If you run more benchmarks and weren't already using wal_buffers=16MB, I
recommend using it.

> With 10 pgbench sessions.
> pages SYNC WAL
> 1: 915 ms 301 ms
> 3: 1634 ms 508 ms
> 5: 1634 ms 293 ms
> 10: 1671 ms 1043 ms
> 17: 1600 ms 333 ms
> 31: 1864 ms 314 ms
> 56: 1562 ms 448 ms
> 100: 1538 ms 394 ms
> 177: 1697 ms 1047 ms
> 316: 3074 ms 1788 ms
> 562: 3306 ms 1245 ms
> 1000: 3440 ms 2182 ms
> 1778: 5064 ms 6464 ms # WAL's slope becomes steep
> 3162: 8675 ms 8165 ms

For picking a default wal_skip_threshold, it would have been more informative
to see how this changes pgbench latency statistics. Some people want DDL to
be fast, but more people want DDL not to reduce the performance of concurrent
non-DDL. This benchmark procedure may help:

1. Determine $DDL_COUNT, a number of DDL transactions that take about one
minute when done via syncs.
2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
3. Wait 10s.
4. Start one DDL backend that runs $DDL_COUNT transactions.
5. Save DDL start timestamp, DDL end timestamp, and pgbench output.

I would compare pgbench tps and latency between the seconds when DDL is and is
not running. As you did in earlier tests, I would repeat it using various
page counts, with and without sync.

On Wed, Nov 20, 2019 at 05:31:43PM +0900, Kyotaro Horiguchi wrote:
> +Prefer to do the same in future access methods. However, two other approaches
> +can work. First, an access method can irreversibly transition a given fork
> +from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
> +smgrimmedsync(). Second, an access method can opt to write WAL
> +unconditionally for permanent relations. When using the second method, do not
> +call RelationCopyStorage(), which skips WAL.
>
> Even using these methods, TransactionCommit flushes out the buffers and
> then syncs the files again. Isn't a description like the following
> needed?
>
> ===
> Even an access method switched a in-transaction created relfilenode to
> WAL-writing, Commit(Prepare)Transaction flushed all buffers for the
> file then smgrimmedsync() the file.
> ===

It is enough that the text says to prefer the approach that core access
methods use. The extra flush and sync when using a non-preferred approach
wastes some performance, but it is otherwise harmless.

> + rel1 = relation_open(r1, AccessExclusiveLock);
> + RelationAssumeNewRelfilenode(rel1);
>
> It cannot be accessed from other sessions. Theoretically it doesn't
> need a lock but NoLock cannot be used there since there's a path that
> doesn't take lock on the relation. But AEL seems too strong and it
> causes unnecessary side effects. Couldn't we use weaker locks?

We could use NoLock. I assumed we already hold AccessExclusiveLock, in which
case this has no side effects.

On Thu, Nov 21, 2019 at 04:01:07PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > === Defect 1: gistGetFakeLSN()
> >
> > When I modified pg_regress.c to use wal_level=minimal for all suites,
> > src/test/isolation/specs/predicate-gist.spec failed the assertion in
> > gistGetFakeLSN(). One could reproduce the problem just by running this
> > sequence in psql:
> >
> > begin;
> > create table gist_point_tbl(id int4, p point);
> > create index gist_pointidx on gist_point_tbl using gist(p);
> > insert into gist_point_tbl (id, p)
> > select g, point(g*10, g*10) from generate_series(1, 1000) g;
> >
> > I've included a wrong-in-general hack to make the test pass. I see two main
> > options for fixing this:
> >
> > (a) Introduce an empty WAL record that reserves an LSN and has no other
> > effect. Make GiST use that for permanent relations that are skipping WAL.
> > Further optimizations are possible. For example, we could use a backend-local
> > counter (like the one gistGetFakeLSN() uses for temp relations) until the
> > counter is greater a recent real LSN. That optimization is probably too
> > clever, though it would make the new WAL record almost never appear.
> >
> > (b) Exempt GiST from most WAL skipping. GiST index build could still skip
> > WAL, but it would do its own smgrimmedsync() in addition to the one done at
> > commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
> > RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
> > other AM-independent code that skips WAL.
> >
> > Overall, I like the cleanliness of (a). The main argument for (b) is that it
> > ensures we have all the features to opt-out of WAL skipping, which could be
> > useful for out-of-tree index access methods. (I think we currently have the
> > features for a tableam to do so, but not for an indexam to do so.) Overall, I
> > lean toward (a). Any other ideas or preferences?
>
> > I don't like (b) either.
>
> What we need there is any sequential numbers for page LSN but really
> compatible with real LSN. Couldn't we use GetXLogInsertRecPtr() in the
> case?

No. If nothing is inserting WAL, GetXLogInsertRecPtr() does not increase.
GiST pages need an increasing LSN value.

I noticed an additional defect:

BEGIN;
CREATE TABLE t (c) AS SELECT 1;
CHECKPOINT; -- write and fsync the table's one page
TRUNCATE t; -- no WAL
COMMIT; -- no FPI, just the commit record

If we crash after the COMMIT and before the next fsync or OS-elected sync of
the table's file, the table will stay on disk with its pre-TRUNCATE content.


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-24 12:53:16
Message-ID: 20191124125316.GC2266@paquier.xyz

On Sat, Nov 23, 2019 at 11:35:09AM -0500, Noah Misch wrote:
> That longstanding optimization is too useful to remove, but likely not useful
> enough to add today if we didn't already have it. The initial-data-load use
> case remains plausible. I can also imagine using wal_level=minimal for data
> warehouse applications where one can quickly rebuild from the authoritative
> data.

I can easily imagine cases where a user would like to use the benefit
of the optimization for an initial data load and afterwards update
wal_level to replica, so that they avoid the initial WAL burst, which
serves no real purpose. So the first argument is pretty strong IMO;
the second, much less so.
--
Michael


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-25 02:08:54
Message-ID: 20191125.110854.1439648354665518969.horikyota.ntt@gmail.com

At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Wed, Nov 20, 2019 at 03:05:46PM +0900, Kyotaro Horiguchi wrote:
> > By the way, before finalize this, I'd like to share the result of a
> > brief benchmarking.
>
> What non-default settings did you use? Please give the output of this or a
> similar command:

Only wal_level=minimal and max_wal_senders=0.

> select name, setting from pg_settings where setting <> boot_val;
>
> If you run more benchmarks and weren't already using wal_buffers=16MB, I
> recommend using it.

Roger.

> > With 10 pgbench sessions.
> > pages SYNC WAL
> > 1: 915 ms 301 ms
> > 3: 1634 ms 508 ms
> > 5: 1634 ms 293 ms
> > 10: 1671 ms 1043 ms
> > 17: 1600 ms 333 ms
> > 31: 1864 ms 314 ms
> > 56: 1562 ms 448 ms
> > 100: 1538 ms 394 ms
> > 177: 1697 ms 1047 ms
> > 316: 3074 ms 1788 ms
> > 562: 3306 ms 1245 ms
> > 1000: 3440 ms 2182 ms
> > 1778: 5064 ms 6464 ms # WAL's slope becomes steep
> > 3162: 8675 ms 8165 ms
>
> For picking a default wal_skip_threshold, it would have been more informative
> to see how this changes pgbench latency statistics. Some people want DDL to
> be fast, but more people want DDL not to reduce the performance of concurrent
> non-DDL. This benchmark procedure may help:
>
> 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> minute when done via syncs.
> 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> 3. Wait 10s.
> 4. Start one DDL backend that runs $DDL_COUNT transactions.
> 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
>
> I would compare pgbench tps and latency between the seconds when DDL is and is
> not running. As you did in earlier tests, I would repeat it using various
> page counts, with and without sync.

I understood that the "DDL" here is not pure DDL but a kind of
define-then-load, like "CREATE TABLE AS", or "CREATE TABLE" followed by
"COPY FROM".

> On Wed, Nov 20, 2019 at 05:31:43PM +0900, Kyotaro Horiguchi wrote:
> > +Prefer to do the same in future access methods. However, two other approaches
> > +can work. First, an access method can irreversibly transition a given fork
> > +from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
> > +smgrimmedsync(). Second, an access method can opt to write WAL
> > +unconditionally for permanent relations. When using the second method, do not
> > +call RelationCopyStorage(), which skips WAL.
> >
> > Even using these methods, TransactionCommit flushes out buffers then
> > sync files again. Isn't a description something like the following
> > needed?
> >
> > ===
> > Even an access method switched a in-transaction created relfilenode to
> > WAL-writing, Commit(Prepare)Transaction flushed all buffers for the
> > file then smgrimmedsync() the file.
> > ===
>
> It is enough that the text says to prefer the approach that core access
> methods use. The extra flush and sync when using a non-preferred approach
> wastes some performance, but it is otherwise harmless.

Ah, right. I agree.

> > + rel1 = relation_open(r1, AccessExclusiveLock);
> > + RelationAssumeNewRelfilenode(rel1);
> >
> > It cannot be accessed from other sessions. Theoretically it doesn't
> > need a lock but NoLock cannot be used there since there's a path that
> > doesn't take lock on the relation. But AEL seems too strong and it
> > causes unnecessary side effects. Couldn't we use weaker locks?
>
> We could use NoLock. I assumed we already hold AccessExclusiveLock, in which
> case this has no side effects.

I forgot that this optimization is used only in non-replication
configurations. So I agree that AEL has no side effects.

> On Thu, Nov 21, 2019 at 04:01:07PM +0900, Kyotaro Horiguchi wrote:
> > At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > > === Defect 1: gistGetFakeLSN()
> > >
> > > When I modified pg_regress.c to use wal_level=minimal for all suites,
> > > src/test/isolation/specs/predicate-gist.spec failed the assertion in
> > > gistGetFakeLSN(). One could reproduce the problem just by running this
> > > sequence in psql:
> > >
> > > begin;
> > > create table gist_point_tbl(id int4, p point);
> > > create index gist_pointidx on gist_point_tbl using gist(p);
> > > insert into gist_point_tbl (id, p)
> > > select g, point(g*10, g*10) from generate_series(1, 1000) g;
> > >
> > > I've included a wrong-in-general hack to make the test pass. I see two main
> > > options for fixing this:
> > >
> > > (a) Introduce an empty WAL record that reserves an LSN and has no other
> > > effect. Make GiST use that for permanent relations that are skipping WAL.
> > > Further optimizations are possible. For example, we could use a backend-local
> > > counter (like the one gistGetFakeLSN() uses for temp relations) until the
> > > counter is greater a recent real LSN. That optimization is probably too
> > > clever, though it would make the new WAL record almost never appear.
> > >
> > > (b) Exempt GiST from most WAL skipping. GiST index build could still skip
> > > WAL, but it would do its own smgrimmedsync() in addition to the one done at
> > > commit. Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
> > > RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
> > > other AM-independent code that skips WAL.
> > >
> > > Overall, I like the cleanliness of (a). The main argument for (b) is that it
> > > ensures we have all the features to opt-out of WAL skipping, which could be
> > > useful for out-of-tree index access methods. (I think we currently have the
> > > features for a tableam to do so, but not for an indexam to do so.) Overall, I
> > > lean toward (a). Any other ideas or preferences?
> >
> > I don't like (b) either.
> >
> > What we need there is any sequential numbers for page LSN but really
> > compatible with real LSN. Couldn't we use GetXLogInsertRecPtr() in the
> > case?
>
> No. If nothing is inserting WAL, GetXLogInsertRecPtr() does not increase.
> GiST pages need an increasing LSN value.

Sorry, I noticed that after the mail went out. I agree with (a) and will
do that.

> I noticed an additional defect:
>
> BEGIN;
> CREATE TABLE t (c) AS SELECT 1;
> CHECKPOINT; -- write and fsync the table's one page
> TRUNCATE t; -- no WAL
> COMMIT; -- no FPI, just the commit record
>
> If we crash after the COMMIT and before the next fsync or OS-elected sync of
> the table's file, the table will stay on disk with its pre-TRUNCATE content.

The TRUNCATE replaces the relfilenode in the catalog, and the
pre-TRUNCATE content wouldn't be seen after COMMIT. Since the file has
no pages, it's right that no FPI is emitted. What we should make sure
of is that the empty file's metadata is synced out. But I think that
kind of failure shouldn't happen on modern file systems. If we don't
want to rely on such behavior, we can make sure of it by turning the
zero-pages case from WAL into file sync. I'll do that in the next
version.
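For illustration, here is a simplified Python model of the commit-time choice this implies: WAL-log small WAL-skipped relations, sync large ones, and treat the zero-pages case as a sync per the proposal above. The function name, the kilobyte unit, and the exact threshold semantics are assumptions for illustration, not the final GUC behavior.

```python
BLCKSZ = 8192  # assumed standard PostgreSQL block size

def commit_action(total_blocks, wal_skip_threshold_kb):
    """Choose how to persist a WAL-skipped relation at commit:
    'wal'  - log its pages (cheap for small relations),
    'sync' - fsync the file (cheaper than WAL for large relations)."""
    if total_blocks == 0:
        # Per the proposal above: sync so the empty file's metadata
        # reaches disk even with no pages to log.
        return "sync"
    size_kb = total_blocks * BLCKSZ // 1024
    return "wal" if size_kb < wal_skip_threshold_kb else "sync"
```

With, say, a 2048 kB threshold, a 4-block relation would be WAL-logged while a 1000-block relation would be synced, which is the trade-off the benchmarks earlier in the thread are probing.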

I'll post the next version as a single patch.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-25 03:08:39
Message-ID: 20191125030839.GA51906@gust.leadboat.com

On Mon, Nov 25, 2019 at 11:08:54AM +0900, Kyotaro Horiguchi wrote:
> At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > This benchmark procedure may help:
> >
> > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> > minute when done via syncs.
> > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > 3. Wait 10s.
> > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
> >
> > I would compare pgbench tps and latency between the seconds when DDL is and is
> > not running. As you did in earlier tests, I would repeat it using various
> > page counts, with and without sync.
>
> I understood the "DDL" is not pure DDLs but a kind of
> define-then-load, like "CREATE TABLE AS" , "CREATE TABLE" then "COPY
> FROM".

When I wrote "DDL", I meant the four-command transaction that you already used
in benchmarks.

> > I noticed an additional defect:
> >
> > BEGIN;
> > CREATE TABLE t (c) AS SELECT 1;
> > CHECKPOINT; -- write and fsync the table's one page
> > TRUNCATE t; -- no WAL
> > COMMIT; -- no FPI, just the commit record
> >
> > If we crash after the COMMIT and before the next fsync or OS-elected sync of
> > the table's file, the table will stay on disk with its pre-TRUNCATE content.
>
> The TRUNCATE replaces relfilenode in the catalog

No, it does not. Since the relation is new in the transaction, the TRUNCATE
uses the heap_truncate_one_rel() strategy.

> Since the file has no pages, it's right that no FPI is emitted.

Correct.

> If we don't want to rely on such
> behavior, we can make sure of it by turning the zero-pages case from
> WAL into file sync. I'll do that in the next version.

The zero-pages case is not special. Here's an example of the problem with a
nonzero size:

BEGIN;
CREATE TABLE t (c) AS SELECT * FROM generate_series(1,100000);
CHECKPOINT; -- write and fsync the table's many pages
TRUNCATE t; -- no WAL
INSERT INTO t VALUES (0); -- no WAL
COMMIT; -- FPI for one page; nothing removes the additional pages


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-25 20:58:14
Message-ID: CA+TgmoYzjPYCyU5o6XQ1B6JGhYM8MXGNGxhPpz=tFYdHEHGeGA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Nov 23, 2019 at 4:21 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> I noticed an additional defect:
>
> BEGIN;
> CREATE TABLE t (c) AS SELECT 1;
> CHECKPOINT; -- write and fsync the table's one page
> TRUNCATE t; -- no WAL
> COMMIT; -- no FPI, just the commit record
>
> If we crash after the COMMIT and before the next fsync or OS-elected sync of
> the table's file, the table will stay on disk with its pre-TRUNCATE content.

Shouldn't the TRUNCATE be triggering an fsync() to happen before
COMMIT is permitted to complete? You'd have the same problem if the
TRUNCATE were replaced by INSERT, unless fsync() happens in that case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Noah Misch <noah(at)leadboat(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-25 21:50:25
Message-ID: 20191125215025.GA53580@gust.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 25, 2019 at 03:58:14PM -0500, Robert Haas wrote:
> On Sat, Nov 23, 2019 at 4:21 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> > I noticed an additional defect:
> >
> > BEGIN;
> > CREATE TABLE t (c) AS SELECT 1;
> > CHECKPOINT; -- write and fsync the table's one page
> > TRUNCATE t; -- no WAL
> > COMMIT; -- no FPI, just the commit record
> >
> > If we crash after the COMMIT and before the next fsync or OS-elected sync of
> > the table's file, the table will stay on disk with its pre-TRUNCATE content.
>
> Shouldn't the TRUNCATE be triggering an fsync() to happen before
> COMMIT is permitted to complete?

With wal_skip_threshold=0, you do get an fsync(). The patch tries to avoid
at-commit fsync of small files by WAL-logging file contents instead. However,
the patch doesn't WAL-log enough to handle files that decreased in size.
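For illustration, the size-based choice described above might be sketched like this (a simplification with invented names, not the patch's actual code):

```python
# Hypothetical sketch of the at-commit decision for a relation whose
# writes skipped WAL. Names are illustrative, not the patch's identifiers.

BLCKSZ = 8192  # PostgreSQL's default page size in bytes

def commit_action(nblocks: int, wal_skip_threshold_kb: int) -> str:
    """Return how a WAL-skipped relation is made durable at commit.

    Small relations are cheaper to copy into WAL as full-page images;
    large ones are fsynced directly.
    """
    size_kb = nblocks * BLCKSZ // 1024
    if size_kb >= wal_skip_threshold_kb:
        return "fsync"      # sync the file before commit completes
    return "wal-log"        # emit FPIs for every remaining page instead

# With wal_skip_threshold=0 every relation takes the fsync path,
# matching the behavior described above.
print(commit_action(1, 0))    # fsync
print(commit_action(1, 64))   # wal-log
```

As Noah notes, WAL-logging the final pages is not sufficient when the file shrank: replay recreates those pages, but nothing removes the stale on-disk pages beyond them.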

> You'd have the same problem if the
> TRUNCATE were replaced by INSERT, unless fsync() happens in that case.

I think an insert would be fine. You'd get an FPI record for the relation's
one page, which fully reproduces the relation.


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-26 12:37:52
Message-ID: 20191126.213752.2132434859202124793.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Sun, 24 Nov 2019 22:08:39 -0500, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Mon, Nov 25, 2019 at 11:08:54AM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > > I noticed an additional defect:
> > >
> > > BEGIN;
> > > CREATE TABLE t (c) AS SELECT 1;
> > > CHECKPOINT; -- write and fsync the table's one page
> > > TRUNCATE t; -- no WAL
> > > COMMIT; -- no FPI, just the commit record
> > >
> > > If we crash after the COMMIT and before the next fsync or OS-elected sync of
> > > the table's file, the table will stay on disk with its pre-TRUNCATE content.
> >
> > The TRUNCATE replaces relfilenode in the catalog
>
> No, it does not. Since the relation is new in the transaction, the TRUNCATE
> uses the heap_truncate_one_rel() strategy.
..
> The zero-pages case is not special. Here's an example of the problem with a
> nonzero size:

I got it. That is, if the file has had blocks beyond its size at
commit, we should sync the file even if it is small enough. We need to
track the before-truncation size, as this patch used to do.

pendingSyncHash is resurrected to do truncation-size tracking. That
information cannot be stored in SMgrRelation, which can disappear on
invalidation, nor in Relation, which is not available in the storage
layer. smgrDoPendingDeletes needs to be called at abort again to clean
up the now-useless hash. I'm not sure of the exact cause, but
AssertPendingSyncs_RelationCache() fails at abort (so it is not called
at abort).

smgrDoPendingSyncs and RelFileNodeSkippingWAL() become simpler by
using the hash.

It is not fully checked. I haven't merged it and measured performance yet,
but I post the status-quo patch for now.

- v25-0001-version-nm.patch

Noah's v24 patch.

- v25-0002-Revert-FlushRelationBuffersWithoutRelcache.patch

Remove useless function (added by this patch..).

- v25-0003-Improve-the-performance-of-relation-syncs.patch

Make smgrDoPendingSyncs scan shared buffer once.

- v25-0004-Adjust-gistGetFakeLSN.patch

Amendment for gistGetFakeLSN. This uses GetXLogInsertRecPtr as long as
it differs from the previous call, and emits a dummy WAL record if we
need a new LSN. Since no record other than switch_wal can be empty, the
dummy WAL record carries an integer payload for now.

- v25-0005-Sync-files-shrinked-by-truncation.patch

Amendment for the truncation problem.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v25-0001-version-nm.patch text/x-patch 70.5 KB
v25-0002-Revert-FlushRelationBuffersWithoutRelcache.patch text/x-patch 3.3 KB
v25-0003-Improve-the-performance-of-relation-syncs.patch text/x-patch 8.6 KB
v25-0004-Adjust-gistGetFakeLSN.patch text/x-patch 5.1 KB
v25-0005-Sync-files-shrinked-by-truncation.patch text/x-patch 10.0 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-28 11:56:20
Message-ID: 20191128.205620.2015649987051831334.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Tue, 26 Nov 2019 21:37:52 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail> wrote in
> It is not fully checked. I haven't merged it and measured performance yet,
> but I post the status-quo patch for now.

It was actually an inconsistency caused by swap_relation_files.

1. rd_createSubid of the relcache entry for r2 is not turned off. This
prevents the relcache entry from being flushed. Commit processes
pendingSyncs and leaves the relcache entry with rd_createSubid !=
Invalid. That is an inconsistency.

2. relation_open(r1) returns a relcache entry whose relfilenode has
the old value (relfilenode1), since the command counter has not been
incremented. On the other hand, if it is incremented just before,
AssertPendingSyncConsistency() aborts because of the inconsistency
between relfilenode and rd_firstRel*.

As a result, I came back to thinking that we need to modify both
relcache entries with the right relfilenode.

I once thought that taking AEL in the function had no side effects,
but the code path is also executed when wal_level = replica or higher.
And as I mentioned upthread, we can even get there without taking any
lock on r1, or sometimes with only ShareLock. So upgrading to AEL
emits Standby/LOCK WAL that propagates to standbys. After all, I'd
like to take the weakest lock (AccessShareLock) there.

The attached is the new version of the patch.

- v26-0001-version-nm24.patch
Same with v24

- v26-0002-change-swap_relation_files.patch

Changes to swap_relation_files as mentioned above.

- v26-0003-Improve-the-performance-of-relation-syncs.patch

Do multiple pending syncs by one shared_buffers scanning.

- v26-0004-Revert-FlushRelationBuffersWithoutRelcache.patch

v26-0003 makes the function useless. Remove it.

- v26-0005-Fix-gistGetFakeLSN.patch

gistGetFakeLSN fix.

- v26-0006-Sync-files-shrinked-by-truncation.patch

Fix the problem of commit-time-FPI after truncation after checkpoint.
I'm not sure this is the right direction but pendingSyncHash is
removed from pendingDeletes list again.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v26-0001-version-nm24.patch text/x-patch 70.5 KB
v26-0002-change-swap_relation_files.patch text/x-patch 2.4 KB
v26-0003-Improve-the-performance-of-relation-syncs.patch text/x-patch 8.5 KB
v26-0004-Revert-FlushRelationBuffersWithoutRelcache.patch text/x-patch 3.3 KB
v26-0005-Fix-gistGetFakeLSN.patch text/x-patch 5.7 KB
v26-0006-Sync-files-shrinked-by-truncation.patch text/x-patch 9.8 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-28 12:35:08
Message-ID: 20191128.213508.1108483203561250557.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I measured the performance with the latest patch set.

> 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> minute when done via syncs.
> 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> 3. Wait 10s.
> 4. Start one DDL backend that runs $DDL_COUNT transactions.
> 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.

I did the following benchmarking.

1. Initialize bench database

$ pgbench -i -s 20

2. Start the server with wal_level = replica (all other variables
unchanged), then run the attached ./bench.sh

$ ./bench.sh <count> <pages> <mode>

where count is the number of repetitions, pages is the number of pages
to write in a run, and mode is "s" (sync) or "w" (WAL). The <mode>
has no effect if wal_level = replica. The script shows the following
result.

| before: tps 240.2, lat 44.087 ms (29 samples)
| during: tps 109.1, lat 114.887 ms (14 samples)
| after : tps 269.9, lat 39.557 ms (107 samples)
| DDL time = 13965 ms
| # transaction type: <builtin: TPC-B (sort of)>

before: mean numbers before "the DDL" starts.
during: mean numbers while "the DDL" is running.
after : mean numbers after "the DDL" ends.
DDL time: the time taken to run "the DDL".

3. Restart server with wal_level = replica then run the bench.sh
twice.

$ ./bench.sh <count> <pages> s
$ ./bench.sh <count> <pages> w

Finally I got three graphs. (attached 1, 2, 3. PNGs)

* Graph 1 - The effect of the DDL on pgbench's TPS

The vertical axis shows "during TPS" / "before TPS" in %. Larger is
better. The horizontal axis shows the table size in pages.

Replica and Minimal-sync are almost flat. Minimal-WAL gets worse as
table size increases. 500 pages seems to be the crossover point.

* Graph 2 - The effect of the DDL on pgbench's latency.

The vertical axis shows "during latency" / "before latency" in %.
Smaller is better. As with TPS, but more quickly, WAL latency gets
worse as table size increases. The crossover point seems to be 300
pages or so.

* Graph 3 - The effect of pgbench's workload on DDL runtime.

The vertical axis shows "time the DDL takes to run with pgbench" /
"time the DDL takes to run alone". Smaller is better. Replica and
Minimal-sync show a similar tendency. With Minimal-WAL the DDL runs
quite fast on small tables. The crossover point seems to be about
2500 pages.

Seeing this, I started to worry that the optimization might give a far
smaller advantage than expected. Putting that aside, it seems to me
that the default value for the threshold should be 500-1000, the same
as the previous benchmark showed.
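The before/during/after averaging used above can be sketched as follows (illustrative only; summarize.pl's actual logic may differ):

```python
# Split pgbench's per-second progress samples (-P1) into before/during/
# after windows around the DDL run and average the tps in each window.
# A sketch, not summarize.pl itself.

def split_means(samples, ddl_start, ddl_end):
    """samples: iterable of (timestamp, tps). Returns mean tps per window."""
    buckets = {"before": [], "during": [], "after": []}
    for ts, tps in samples:
        if ts < ddl_start:
            buckets["before"].append(tps)
        elif ts <= ddl_end:
            buckets["during"].append(tps)
        else:
            buckets["after"].append(tps)
    return {k: sum(v) / len(v) if v else 0.0 for k, v in buckets.items()}

# Synthetic samples shaped like the numbers in the post above.
samples = [(t, 240.0) for t in range(0, 10)] \
        + [(t, 110.0) for t in range(10, 24)] \
        + [(t, 270.0) for t in range(24, 40)]
print(split_means(samples, ddl_start=10, ddl_end=23))
# {'before': 240.0, 'during': 110.0, 'after': 270.0}
```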

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
image/png 8.5 KB
image/png 11.7 KB
image/png 8.9 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-11-28 22:23:19
Message-ID: 20191128222319.GA89611@gust.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
> I measured the performance with the latest patch set.
>
> > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> > minute when done via syncs.
> > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > 3. Wait 10s.
> > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.

If you have the raw data requested in (5), please share them here so folks
have the option to reproduce your graphs and calculations.

> I did the following benchmarking.
>
> 1. Initialize bench database
>
> $ pgbench -i -s 20
>
> 2. Start server with wal_level = replica (all other variables are not
> changed) then run the attached ./bench.sh

The bench.sh attachment was missing; please attach it. Please give the output
of this command:

select name, setting from pg_settings where setting <> boot_val;

> 3. Restart server with wal_level = replica then run the bench.sh
> twice.

I assume this is wal_level=minimal, not wal_level=replica.


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-12-03 11:51:46
Message-ID: 20191203.205146.1521643852457054060.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello.

At Thu, 28 Nov 2019 17:23:19 -0500, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
> > I measured the performance with the latest patch set.
> >
> > > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> > > minute when done via syncs.
> > > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > > 3. Wait 10s.
> > > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
>
> If you have the raw data requested in (5), please share them here so folks
> have the option to reproduce your graphs and calculations.

Sorry, I forgot to attach the scripts. The raw data vanished due to an
unstable connection and the steps were quite crude; I prioritized
showing some numbers at the time. I have revised the scripts to be more
automated and will take the numbers again.

> > > 2. Start server with wal_level = replica (all other variables are not
> > > changed) then run the attached ./bench.sh
> >
> > The bench.sh attachment was missing; please attach it. Please give the output
> > of this command:
> >
> > select name, setting from pg_settings where setting <> boot_val;

(I intentionally show all the results..)
=# select name, setting from pg_settings where setting<> boot_val;
name | setting
----------------------------+--------------------
application_name | psql
archive_command | (disabled)
client_encoding | UTF8
data_directory_mode | 0700
default_text_search_config | pg_catalog.english
lc_collate | en_US.UTF-8
lc_ctype | en_US.UTF-8
lc_messages | en_US.UTF-8
lc_monetary | en_US.UTF-8
lc_numeric | en_US.UTF-8
lc_time | en_US.UTF-8
log_checkpoints | on
log_file_mode | 0600
log_timezone | Asia/Tokyo
max_stack_depth | 2048
max_wal_senders | 0
max_wal_size | 10240
server_encoding | UTF8
shared_buffers | 16384
TimeZone | Asia/Tokyo
unix_socket_permissions | 0777
wal_buffers | 512
wal_level | minimal
(23 rows)

The results for the "replica" setting in the benchmark script are used
as the base numbers (the denominator of the percentages).

> > 3. Restart server with wal_level = replica then run the bench.sh
> > twice.
>
> I assume this is wal_level=minimal, not wal_level=replica.

Oops! That's wrong. I ran once with replica, then twice with minimal.

Anyway, I revised the benchmarking scripts and attached them. The
parameters written in benchmain.sh were chosen so that "./bench2.pl 5
<count> <pages> s" against a wal_level=minimal server takes around 60
seconds.

I'll send the complete data tomorrow (in JST). The attached f.txt is
the result of preliminary test only with pages=100 and 250 (with HDD).

The attached files are:
benchmain.sh - main script
bench2.sh - run a benchmark with a single set of parameters
bench1.pl - benchmark client program
summarize.pl - script to summarize benchmain.sh's output
f.txt.gz - result only for pages=100, DDL count = 2200 (not 2250)

How to run:

$ /..unpatched_path../initdb -D <unpatched_datadir>
(wal_level=replica, max_wal_senders=0, log_checkpoints=yes, max_wal_size=10GB)
$ /..patched_path../initdb -D <patched_datadir>
(wal_level=minimal, max_wal_senders=0, log_checkpoints=yes, max_wal_size=10GB)
$./benchmain.sh > <result_file> # output raw data
$./summarize.pl [-v] < <result_file> # show summary

With the attached f.txt, summarize.pl gives the following output.
WAL wins at those page counts.

$ cat f.txt | ./summarize.pl
## params: wal_level=replica mode=none pages=100 count=353 scale=20
(% are relative to "before")
before: tps 262.3 (100.0%), lat 39.840 ms (100.0%) (29 samples)
during: tps 120.7 ( 46.0%), lat 112.508 ms (282.4%) (35 samples)
after: tps 106.3 ( 40.5%), lat 163.492 ms (410.4%) (86 samples)
DDL time: 34883 ms ( 100.0% relative to mode=none)
## params: wal_level=minimal mode=sync pages=100 count=353 scale=20
(% are relative to "before")
before: tps 226.3 (100.0%), lat 48.091 ms (100.0%) (29 samples)
during: tps 83.0 ( 36.7%), lat 184.942 ms (384.6%) (100 samples)
after: tps 82.6 ( 36.5%), lat 196.863 ms (409.4%) (21 samples)
DDL time: 99239 ms ( 284.5% relative to mode=none)
## params: wal_level=minimal mode=WAL pages=100 count=353 scale=20
(% are relative to "before")
before: tps 240.3 (100.0%), lat 44.686 ms (100.0%) (29 samples)
during: tps 129.6 ( 53.9%), lat 113.585 ms (254.2%) (31 samples)
after: tps 124.5 ( 51.8%), lat 141.992 ms (317.8%) (90 samples)
DDL time: 30392 ms ( 87.1% relative to mode=none)
## params: wal_level=replica mode=none pages=250 count=258 scale=20
(% are relative to "before")
before: tps 266.3 (100.0%), lat 45.884 ms (100.0%) (29 samples)
during: tps 87.9 ( 33.0%), lat 148.433 ms (323.5%) (54 samples)
after: tps 105.6 ( 39.6%), lat 153.216 ms (333.9%) (67 samples)
DDL time: 53176 ms ( 100.0% relative to mode=none)
## params: wal_level=minimal mode=sync pages=250 count=258 scale=20
(% are relative to "before")
before: tps 225.1 (100.0%), lat 47.705 ms (100.0%) (29 samples)
during: tps 93.7 ( 41.6%), lat 143.231 ms (300.2%) (83 samples)
after: tps 93.8 ( 41.7%), lat 186.097 ms (390.1%) (38 samples)
DDL time: 82104 ms ( 154.4% relative to mode=none)
## params: wal_level=minimal mode=WAL pages=250 count=258 scale=20
(% are relative to "before")
before: tps 230.2 (100.0%), lat 48.472 ms (100.0%) (29 samples)
during: tps 90.3 ( 39.2%), lat 183.365 ms (378.3%) (48 samples)
after: tps 123.9 ( 53.8%), lat 131.129 ms (270.5%) (73 samples)
DDL time: 47660 ms ( 89.6% relative to mode=none)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
f.txt.gz application/octet-stream 17.7 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-12-04 04:47:35
Message-ID: 20191204.134735.906643794665403860.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Tue, 03 Dec 2019 20:51:46 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> I'll send the complete data tomorrow (in JST). The attached f.txt is
> the result of preliminary test only with pages=100 and 250 (with HDD).

The attached files are the latest set of the test scripts and the result:
benchmark_scripts.tar.gz
benchmain.sh - main script
bench2.sh - run a benchmark with a single set of parameters
bench1.pl - benchmark client program
summarize.pl - script to summarize benchmain.sh's output
graph.xlsx - MS-Excel file for the graph below.
result.txt.gz - raw result of benchmain.sh
summary.txt.gz - cooked result by summarize.pl -s
graph.png - graphs

summarize.pl [-v|-s|-d]
-s: print summary table for spreadsheets (TSV)
-v: show pgbench summary
-d: debug print

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
benchmark_scripts.tar.gz application/octet-stream 38.6 KB
result.txt.gz application/octet-stream 108.9 KB
summary.txt.gz application/octet-stream 3.1 KB
image/png 46.4 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-12-08 18:09:51
Message-ID: 20191208180951.GA1694300@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I reviewed your latest code, and it's nearly complete. mdimmedsync() syncs
only "active segments" (as defined in md.c), but smgrDoPendingSyncs() must
sync active and inactive segments. This matters when mdtruncate() truncated
the relation after the last checkpoint, causing active segments to become
inactive. In such cases, syncs of the inactive segments will have been queued
for execution during the next checkpoint. Since we skipped the
XLOG_SMGR_TRUNCATE record, we must complete those syncs before commit. Let's
just modify smgrimmedsync() to always sync active and inactive segments;
that's fine to do in other smgrimmedsync() callers, even though they operate
on relations that can't have inactive segments.
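For background, md.c stores each relation in segment files of RELSEG_SIZE blocks (1 GB with 8 kB pages). A rough sketch of the mapping, showing how truncation leaves inactive segments behind (my own illustration, not PostgreSQL code):

```python
# Map a relation of nblocks pages onto md.c-style segment files:
# <relfilenode>, <relfilenode>.1, <relfilenode>.2, ...
# Illustrative sketch only.

BLCKSZ = 8192
RELSEG_SIZE = 131072  # blocks per segment file (1 GB with 8 kB pages)

def segment_files(relfilenode: int, nblocks: int) -> list:
    """Names of the segment files an nblocks-long relation occupies."""
    nsegs = max(1, -(-nblocks // RELSEG_SIZE))  # ceiling, at least one file
    return [str(relfilenode) if n == 0 else f"{relfilenode}.{n}"
            for n in range(nsegs)]

before = segment_files(16389, 300000)  # ~2.3 GB: three segments
after = segment_files(16389, 1000)     # post-truncate: one active segment
# Segments in `before` but not in `after` are now inactive; a sync that
# covers only active segments would miss them.
print(before)  # ['16389', '16389.1', '16389.2']
print(after)   # ['16389']
```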

On Tue, Dec 03, 2019 at 08:51:46PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 28 Nov 2019 17:23:19 -0500, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
> > > I measured the performance with the latest patch set.
> > >
> > > > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> > > > minute when done via syncs.
> > > > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > > > 3. Wait 10s.
> > > > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > > > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.

> wal_buffers | 512

This value (4 MiB) is lower than a tuned production system would have. In
future benchmarks (if any) use wal_buffers=2048 (16 MiB).


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-12-09 09:04:06
Message-ID: 20191209.180406.694980847634806231.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello.

At Sun, 8 Dec 2019 10:09:51 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> I reviewed your latest code, and it's nearly complete. mdimmedsync() syncs
> only "active segments" (as defined in md.c), but smgrDoPendingSyncs() must
> sync active and inactive segments. This matters when mdtruncate() truncated
> the relation after the last checkpoint, causing active segments to become
> inactive. In such cases, syncs of the inactive segments will have been queued
> for execution during the next checkpoint. Since we skipped the
> XLOG_SMGR_TRUNCATE record, we must complete those syncs before commit. Let's

Got it! You're so great. Thanks.

> just modify smgrimmedsync() to always sync active and inactive segments;
> that's fine to do in other smgrimmedsync() callers, even though they operate
> on relations that can't have inactive segments.

Agreed and done that way. Even though it's not harmful to leave
inactive segments open, I chose to close them after syncing. As
mentioned in the comment added to the function, inactive segments may
not be closed if an error happens during file sync. md works properly
even in that case (as the file comment says) and, anyway, mdnblocks
leaves the first inactive segment open if there's no partial segment.

I don't understand why mdclose checks for (v->mdfd_vfd >= 0) on an open
segment, but anyway mdimmedsync believes that won't happen and I follow
that assumption. (I suspect that the if condition in mdclose should be
an assertion..)

> On Tue, Dec 03, 2019 at 08:51:46PM +0900, Kyotaro Horiguchi wrote:
> > At Thu, 28 Nov 2019 17:23:19 -0500, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > > On Thu, Nov 28, 2019 at 09:35:08PM +0900, Kyotaro Horiguchi wrote:
> > > > I measured the performance with the latest patch set.
> > > >
> > > > > 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
> > > > > minute when done via syncs.
> > > > > 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> > > > > 3. Wait 10s.
> > > > > 4. Start one DDL backend that runs $DDL_COUNT transactions.
> > > > > 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
>
> > wal_buffers | 512
>
> This value (4 MiB) is lower than a tuned production system would have. In
> future benchmarks (if any) use wal_buffers=2048 (16 MiB).

Yeah, only 0.5GB of shared_buffers makes the default value of
wal_buffers reach heaven. I think I can take numbers under that
condition. (I doubt it's meaningful if I increase only wal_buffers
manually.)

Anyway, the default value ought to be defined based on the default
configuration.

In the attached patch, I merged all the pieces of the previous version
plus the change made this time (only md.c is changed this time).

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v27-0001-Rework-WAL-skipping-optimization.patch text/x-patch 82.7 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-12-09 15:56:40
Message-ID: CA+TgmoZ0i_OZD+E5Fytv_x8VLkxeKYiNVXY2bF4xw_+er7SiBA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Dec 9, 2019 at 4:04 AM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
> Yeah, only 0.5GB of shared_buffers makes the default value of
> wal_buffers reach heaven. I think I can take numbers under that
> condition. (I doubt it's meaningful if I increase only
> wal_buffers manually.)

Heaven seems a bit exalted, but I think we really only have a formula
because somebody might have really small shared_buffers for some
reason and be unhappy about us gobbling up a comparatively large
amount of memory for WAL buffers. The current limit means that normal
installations get what they need without manual tuning, and small
installations - where performance presumably sucks anyway for other
reasons - keep a small memory footprint.
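The formula in question (wal_buffers = -1, the default, selects shared_buffers/32, clamped between 64kB and one 16MB WAL segment) can be sketched as:

```python
# Sketch of PostgreSQL's default wal_buffers sizing, per the documented
# behavior of wal_buffers = -1. Simplified; see xlog.c for the real logic.

XLOG_SEG_SIZE_KB = 16 * 1024  # one WAL segment, 16 MB

def default_wal_buffers_kb(shared_buffers_kb: int) -> int:
    """shared_buffers/32, not less than 64 kB nor more than one segment."""
    return min(max(shared_buffers_kb // 32, 64), XLOG_SEG_SIZE_KB)

# The settings shown upthread: shared_buffers = 16384 pages = 128 MB
# gives 4 MB of WAL buffers (the "wal_buffers | 512" pages listed),
# while 512 MB of shared_buffers already hits the 16 MB ceiling.
print(default_wal_buffers_kb(128 * 1024))  # 4096  (4 MB)
print(default_wal_buffers_kb(512 * 1024))  # 16384 (16 MB)
```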

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: robertmhaas(at)gmail(dot)com
Cc: noah(at)leadboat(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-12-10 07:59:25
Message-ID: 20191210.165925.1031511471757182290.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Mon, 9 Dec 2019 10:56:40 -0500, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in
> On Mon, Dec 9, 2019 at 4:04 AM Kyotaro Horiguchi
> <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > Yeah, only 0.5GB of shared_buffers makes the default value of
> > wal_buffers reach heaven. I think I can take numbers under that
> > condition. (I doubt it's meaningful if I increase only
> > wal_buffers manually.)
>
> Heaven seems a bit exalted, but I think we really only have a formula
> because somebody might have really small shared_buffers for some
> reason and be unhappy about us gobbling up a comparatively large
> amount of memory for WAL buffers. The current limit means that normal
> installations get what they need without manual tuning, and small
> installations - where performance presumably sucks anyway for other
> reasons - keep a small memory footprint.

True. I meant the ceiling of the default-tuned value; a larger value
may work on a larger system.

Anyway, I ran the benchmark with
shared_buffers=1GB/wal_buffers=16MB (default). pgbench -s 20 uses 256MB
of storage, so all of it can be loaded into shared memory.

The attached graph shows a larger benefit in TPS drop and latency
increase for HDD. The DDL page count at the crossover point between
commit-FPW and commit-sync moves from roughly 300 to 200 for TPS and
latency, and from 1000 to 600 for DDL runtime. If we can rely on the
two graphs, 500 (or 512) pages seems to be the most promising candidate
for the default value of wal_skip_threshold.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
result_shmem1GB.tar.gz application/octet-stream 111.8 KB
image/png 49.9 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: robertmhaas(at)gmail(dot)com
Cc: noah(at)leadboat(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-12-24 07:35:35
Message-ID: 20191224.163535.585049355887962215.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Tue, 10 Dec 2019 16:59:25 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> shared_buffers=1GB/wal_buffers=16MB(defalut). pgbench -s 20 uses 256MB
> of storage so all of them can be loaded on shared memory.
>
> The attached graph shows larger benefit in TPS drop and latency
> increase for HDD. The DDL pages at the corsspoint between commit-FPW
> and commit-sync moves from roughly 300 to 200 in TPS and latency, and
> 1000 to 600 in DDL runtime. If we can rely on the two graphs, 500 (or
> 512) pages seems to be the most promising candidate for the default
> value of wal_skip_threshold.
> regards.

I rebased the patch and changed the default value for the GUC variable
wal_skip_threshold to 4096 kilobytes in config.sgml, storage.c and
guc.c. 4096kB was chosen as the nearest round (power-of-two) number
above 500 pages * 8kB = 4000kB.
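For reference, a sketch of how one would set that value on a patched
server (wal_skip_threshold is the GUC introduced by this patch series;
it takes kB units):

```sql
-- Sketch, assuming a server with this patch applied and wal_level = minimal.
-- 500 pages * 8kB/page = 4000kB, rounded to the power of two 4096kB (4MB).
ALTER SYSTEM SET wal_skip_threshold = '4096kB';
SELECT pg_reload_conf();
```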

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Rework-WAL-skipping-optimization.patch text/x-patch 82.7 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: robertmhaas(at)gmail(dot)com
Cc: noah(at)leadboat(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-12-25 08:26:30
Message-ID: 20191225.172630.1945561382648080417.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Tue, 24 Dec 2019 16:35:35 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> I rebased the patch and changed the default value for the GUC variable
> wal_skip_threshold to 4096 kilobytes in config.sgml, storage.c and
> guc.c. 4096kB is choosed as it is the nice round number of 500 pages *
> 8kB = 4000kB.

The value in the doc was not correct. I fixed only that value, from
3192kB to 4096kB.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v29-0001-Rework-WAL-skipping-optimization.patch text/x-patch 82.7 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-12-26 00:15:21
Message-ID: 20191226001521.GA1772687@rfd.leadboat.com
Lists: pgsql-hackers

By improving AssertPendingSyncs_RelationCache() and by testing with
-DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm.
Would you fix these?

=== Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO

A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
when running "make check" under wal_level=minimal. I test this way:

printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
make check TEMP_CONFIG=$PWD/minimal.conf

Self-contained demonstration:
begin;
create table t (c int);
savepoint q; drop table t; rollback to q; -- forgets table is skipping wal
commit; -- assertion failure

=== Defect 2: Forgets to skip WAL due to oversimplification in heap_create()

In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
to transfer WAL-skipped state to the new index relation. Before v24nm, the
new index relation skipped WAL unconditionally. Since v24nm, the new index
relation never skips WAL. I've added a test to alter_table.sql that reveals
this problem under wal_level=minimal.

=== Defect 3: storage.c checks size decrease of MAIN_FORKNUM only

storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated. Is it
possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a
net size decrease? I haven't tested, but this sequence seems possible:

TRUNCATE
reduces MAIN_FORKNUM from 100 blocks to 0 blocks
reduces FSM_FORKNUM from 3 blocks to 0 blocks
COPY
raises MAIN_FORKNUM from 0 blocks to 110 blocks
does not change FSM_FORKNUM
COMMIT
should fsync, but wrongly chooses log_newpage_range() approach

If that's indeed a problem, beside the obvious option of tracking every fork's
max_truncated, we could convert max_truncated to a bool and use fsync anytime
the relation experienced an mdtruncate(). (While FSM_FORKNUM is not critical
for database operations, the choice to subject it to checksums entails
protecting it here.) If that's not a problem, would you explain?
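A hypothetical SQL rendering of that sequence (the table and file
names are invented; as noted above, this is untested):

```sql
-- Assumes wal_level = minimal and a relation small enough to skip WAL.
BEGIN;
TRUNCATE t;               -- MAIN fork: 100 -> 0 blocks; FSM fork: 3 -> 0
COPY t FROM '/tmp/t.dat'; -- MAIN fork: 0 -> 110 blocks; FSM unchanged
COMMIT;                   -- sees no net MAIN-fork shrink, so it may pick
                          -- log_newpage_range() and leave the truncated
                          -- FSM fork unsynced
```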

=== Non-defect notes

Once you have a correct patch, would you run check-world with
-DCLOBBER_CACHE_ALWAYS? That may reveal additional defects. It may take a
day or more, but that's fine.

The new smgrimmedsync() calls are potentially fragile, because they sometimes
target a file of a dropped relation. However, the mdexists() test prevents
anything bad from happening. No change is necessary. Example:

SET wal_skip_threshold = 0;
BEGIN;
SAVEPOINT q;
CREATE TABLE t (c) AS SELECT 1;
ROLLBACK TO q; -- truncates the relfilenode
CHECKPOINT; -- unlinks the relfilenode
COMMIT; -- calls mdexists() on the relfilenode

=== Notable changes in v30nm

- Changed "wal_skip_threshold * 1024" to an expression that can't overflow.
Without this, wal_skip_threshold=1TB behaved like wal_skip_threshold=0.

- Changed AssertPendingSyncs_RelationCache() to open all relations on which
the transaction holds locks. This permits detection of cases where
RelationNeedsWAL() returns true but storage.c will sync the relation.

Removed the assertions from RelationIdGetRelation(). Using
"-DRELCACHE_FORCE_RELEASE" made them fail for usage patterns that aren't
actually problematic, since invalidation updates rd_node while other code
updates rd_firstRelfilenodeSubid. This is not a significant loss, now that
AssertPendingSyncs_RelationCache() opens relations. (I considered making
the update of rd_firstRelfilenodeSubid more like rd_node, where we store it
somewhere until the next CommandCounterIncrement(), which would make it
actually affect RelationNeedsWAL(). That might have been better in general,
but it felt complex without clear benefits.)

Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did. Making
that work no matter what does ereport(ERROR) would be tricky and low-value.

- Extracted the RelationTruncate() changes into new function
RelationPreTruncate(), so table access methods that can't use
RelationTruncate() have another way to request that work.

- Changed wal_skip_threshold default to 2MB. My second preference was for
4MB. In your data, 2MB and 4MB had similar performance at optimal
wal_buffers, but 4MB performed worse at low wal_buffers.

- Reverted most post-v24nm changes to swap_relation_files(). Under
"-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
rel1->rd_node.relNode update. Clearing rel2->rd_createSubid is not right if
we're running CLUSTER for the second time in one transaction. I used
relation_open(r1, NoLock) instead of AccessShareLock, because we deserve an
assertion failure if we hold no lock at that point.

- Change toast_get_valid_index() to retain locks until end of transaction.
When I adopted relation_open(r1, NoLock) in swap_relation_files(), that
revealed that we retain no lock on the TOAST index.

- Ran pgindent and perltidy. Updated some comments and names.

On Mon, Dec 09, 2019 at 06:04:06PM +0900, Kyotaro Horiguchi wrote:
> Anyway the default value ought to be defined based on the default
> configuration.

PostgreSQL does not follow that principle. Settings that change permanent
resource consumption, such as wal_buffers, have small defaults. Settings that
don't change permanent resource consumption can have defaults that favor a
well-tuned system.

Attachment Content-Type Size
skip-wal-v30nm.patch text/plain 95.5 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-12-26 03:46:39
Message-ID: 20191226.124639.5401358775142406.horikyota.ntt@gmail.com
Lists: pgsql-hackers

Thank you for the findings.

At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> By improving AssertPendingSyncs_RelationCache() and by testing with
> -DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm.
> Would you fix these?

I'd like to do that; please give me some time.

> === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO
>
> A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
> when running "make check" under wal_level=minimal. I test this way:
>
> printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
> make check TEMP_CONFIG=$PWD/minimal.conf
>
> Self-contained demonstration:
> begin;
> create table t (c int);
> savepoint q; drop table t; rollback to q; -- forgets table is skipping wal
> commit; -- assertion failure
>
>
> === Defect 2: Forgets to skip WAL due to oversimplification in heap_create()
>
> In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
> to transfer WAL-skipped state to the new index relation. Before v24nm, the
> new index relation skipped WAL unconditionally. Since v24nm, the new index
> relation never skips WAL. I've added a test to alter_table.sql that reveals
> this problem under wal_level=minimal.
>
>
> === Defect 3: storage.c checks size decrease of MAIN_FORKNUM only
>
> storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated. Is it
> possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a
> net size decrease? I haven't tested, but this sequence seems possible:
>
> TRUNCATE
> reduces MAIN_FORKNUM from 100 blocks to 0 blocks
> reduces FSM_FORKNUM from 3 blocks to 0 blocks
> COPY
> raises MAIN_FORKNUM from 0 blocks to 110 blocks
> does not change FSM_FORKNUM
> COMMIT
> should fsync, but wrongly chooses log_newpage_range() approach
>
> If that's indeed a problem, beside the obvious option of tracking every fork's
> max_truncated, we could convert max_truncated to a bool and use fsync anytime
> the relation experienced an mdtruncate(). (While FSM_FORKNUM is not critical
> for database operations, the choice to subject it to checksums entails
> protecting it here.) If that's not a problem, would you explain?
>

> === Non-defect notes
>
> Once you have a correct patch, would you run check-world with
> -DCLOBBER_CACHE_ALWAYS? That may reveal additional defects. It may take a
> day or more, but that's fine.

Sure.

> The new smgrimmedsync() calls are potentially fragile, because they sometimes
> target a file of a dropped relation. However, the mdexists() test prevents
> anything bad from happening. No change is necessary. Example:
>
> SET wal_skip_threshold = 0;
> BEGIN;
> SAVEPOINT q;
> CREATE TABLE t (c) AS SELECT 1;
> ROLLBACK TO q; -- truncates the relfilenode
> CHECKPOINT; -- unlinks the relfilenode
> COMMIT; -- calls mdexists() on the relfilenode
>
>
> === Notable changes in v30nm
>
> - Changed "wal_skip_threshold * 1024" to an expression that can't overflow.
> Without this, wal_skip_threshold=1TB behaved like wal_skip_threshold=0.

Ahh, I wrongly understood that MAX_KILOBYTES inhibits that
setting. work_mem and maintenance_work_mem are cast to double or
long before calculation. In this case it's enough that the calculation
unit becomes kilobytes instead of bytes.

> - Changed AssertPendingSyncs_RelationCache() to open all relations on which
> the transaction holds locks. This permits detection of cases where
> RelationNeedsWAL() returns true but storage.c will sync the relation.
>
> Removed the assertions from RelationIdGetRelation(). Using
> "-DRELCACHE_FORCE_RELEASE" made them fail for usage patterns that aren't
> actually problematic, since invalidation updates rd_node while other code
> updates rd_firstRelfilenodeSubid. This is not a significant loss, now that
> AssertPendingSyncs_RelationCache() opens relations. (I considered making
> the update of rd_firstRelfilenodeSubid more like rd_node, where we store it
> somewhere until the next CommandCounterIncrement(), which would make it
> actually affect RelationNeedsWAL(). That might have been better in general,
> but it felt complex without clear benefits.)
>
> Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did. Making
> that work no matter what does ereport(ERROR) would be tricky and low-value.

Right about ereport, but I'm not sure about removing the whole assertion from abort.

> - Extracted the RelationTruncate() changes into new function
> RelationPreTruncate(), so table access methods that can't use
> RelationTruncate() have another way to request that work.

Sounds reasonable. Also the new behavior of max_truncated looks fine.

> - Changed wal_skip_threshold default to 2MB. My second preference was for
> 4MB. In your data, 2MB and 4MB had similar performance at optimal
> wal_buffers, but 4MB performed worse at low wal_buffers.

That's fine with me.

> - Reverted most post-v24nm changes to swap_relation_files(). Under
> "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
> rel1->rd_node.relNode update. Clearing rel2->rd_createSubid is not right if
> we're running CLUSTER for the second time in one transaction. I used

I don't agree with that. As I think I mentioned upthread, rel2 is
wrongly marked as "new in this transaction" at that time, which blocks
the opportunity for removal; such entries wrongly persist for the life
of the backend and cause problems. (That was found by the abort-time
AssertPendingSyncs_RelationCache().)

> relation_open(r1, NoLock) instead of AccessShareLock, because we deserve an
> assertion failure if we hold no lock at that point.

I agree to that.

> - Change toast_get_valid_index() to retain locks until end of transaction.
> When I adopted relation_open(r1, NoLock) in swap_relation_files(), that
> revealed that we retain no lock on the TOAST index.

Sounds more reasonable than relation_open(AnyLock) in swap_relation_files.

> - Ran pgindent and perltidy. Updated some comments and names.
>
> On Mon, Dec 09, 2019 at 06:04:06PM +0900, Kyotaro Horiguchi wrote:
> > Anyway the default value ought to be defined based on the default
> > configuration.
>
> PostgreSQL does not follow that principle. Settings that change permanent
> resource consumption, such as wal_buffers, have small defaults. Settings that
> don't change permanent resource consumption can have defaults that favor a
> well-tuned system.

I think I understand that; actually, 4MB was too large, though.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-12-26 04:22:04
Message-ID: 20191226042204.GB1772687@rfd.leadboat.com
Lists: pgsql-hackers

On Thu, Dec 26, 2019 at 12:46:39PM +0900, Kyotaro Horiguchi wrote:
> At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did. Making
> > that work no matter what does ereport(ERROR) would be tricky and low-value.
>
> Right about ereport, but I'm not sure remove the whole assertion from abort.

You may think of a useful assert location that lacks the problems of asserting
at abort. For example, I considered asserting in PortalRunMulti() and
PortalRun(), just after each command, if still in a transaction.

> > - Reverted most post-v24nm changes to swap_relation_files(). Under
> > "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
> > rel1->rd_node.relNode update. Clearing rel2->rd_createSubid is not right if
> > we're running CLUSTER for the second time in one transaction. I used
>
> I don't agree to that. As I think I have mentioned upthread, rel2 is
> wrongly marked as "new in this tranction" at that time, which hinders
> the opportunity of removal and such entries wrongly persist for the
> backend life and causes problems. (That was found by abort-time
> AssertPendingSyncs_RelationCache()..)

I can't reproduce rel2's relcache entry wrongly persisting for the life of a
backend. If that were happening, I would expect repeating a CLUSTER command N
times to increase hash_get_num_entries(RelationIdCache) by at least N. I
tried that, but hash_get_num_entries(RelationIdCache) did not increase. In a
non-assert build, how can I reproduce problems caused by incorrect
rd_createSubid on rel2?


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-12-26 05:45:26
Message-ID: 20191226.144526.1274190438297223221.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> > - Reverted most post-v24nm changes to swap_relation_files(). Under
> > "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
> > rel1->rd_node.relNode update. Clearing rel2->rd_createSubid is not right if
> > we're running CLUSTER for the second time in one transaction. I used
>
> I don't agree to that. As I think I have mentioned upthread, rel2 is
> wrongly marked as "new in this tranction" at that time, which hinders
> the opportunity of removal and such entries wrongly persist for the
> backend life and causes problems. (That was found by abort-time
> AssertPendingSyncs_RelationCache()..)

I played with the new version for a while, and I don't see such a
problem. I don't clearly recall what I saw when I thought there was a
problem, but I have changed my mind and agree with that change. It's
far more reasonable and clearer, as long as it works correctly.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2019-12-26 09:03:21
Message-ID: 20191226.180321.1191503580459144209.horikyota.ntt@gmail.com
Lists: pgsql-hackers

Hello, Noah.

At Wed, 25 Dec 2019 20:22:04 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Thu, Dec 26, 2019 at 12:46:39PM +0900, Kyotaro Horiguchi wrote:
> > At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > > Skip AssertPendingSyncs_RelationCache() at abort, like v24nm did. Making
> > > that work no matter what does ereport(ERROR) would be tricky and low-value.
> >
> > Right about ereport, but I'm not sure remove the whole assertion from abort.
>
> You may think of a useful assert location that lacks the problems of asserting
> at abort. For example, I considered asserting in PortalRunMulti() and
> PortalRun(), just after each command, if still in a transaction.

Thanks for the suggestion. I'll consider that.

> > > - Reverted most post-v24nm changes to swap_relation_files(). Under
> > > "-DRELCACHE_FORCE_RELEASE", relcache.c quickly discards the
> > > rel1->rd_node.relNode update. Clearing rel2->rd_createSubid is not right if
> > > we're running CLUSTER for the second time in one transaction. I used
> >
> > I don't agree to that. As I think I have mentioned upthread, rel2 is
> > wrongly marked as "new in this tranction" at that time, which hinders
> > the opportunity of removal and such entries wrongly persist for the
> > backend life and causes problems. (That was found by abort-time
> > AssertPendingSyncs_RelationCache()..)
>
> I can't reproduce rel2's relcache entry wrongly persisting for the life of a
> backend. If that were happening, I would expect repeating a CLUSTER command N
> times to increase hash_get_num_entries(RelationIdCache) by at least N. I
> tried that, but hash_get_num_entries(RelationIdCache) did not increase. In a
> non-assert build, how can I reproduce problems caused by incorrect
> rd_createSubid on rel2?

As I wrote in the other mail, I don't see such a problem and agree to
the removal.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-01-14 10:35:22
Message-ID: 20200114.193522.177274387863061991.horikyota.ntt@gmail.com
Lists: pgsql-hackers

Hello, this is a fix for the defect 1 of 3.

At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> Thank you for the findings.
>
> At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > By improving AssertPendingSyncs_RelationCache() and by testing with
> > -DRELCACHE_FORCE_RELEASE, I now know of three defects in the attached v30nm.
> > Would you fix these?
>
> I'd like to do that, please give me som time.
>
> > === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO
> >
> > A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
> > when running "make check" under wal_level=minimal. I test this way:
> >
> > printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
> > make check TEMP_CONFIG=$PWD/minimal.conf
> >
> > Self-contained demonstration:
> > begin;
> > create table t (c int);
> > savepoint q; drop table t; rollback to q; -- forgets table is skipping wal
> > commit; -- assertion failure

This is more complex than expected. DROP TABLE unconditionally removed
the relcache entry. To fix that, I tried to use rd_isinvalid, but it
failed because there's a state where a relcache entry is invalid but
the corresponding catalog entry is still alive.

In the attached patch 0002, I added a boolean in the relcache that
indicates that the relation has already been removed from the catalog
but the removal is not yet committed. I needed to ignore invalid
relcache entries in AssertPendingSyncs_RelationCache, but I think that
is the right thing to do.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
0001-Rework-WAL-skipping-optimization.patch text/x-patch 86.0 KB
0002-Fix-the-defect-1.patch text/x-patch 6.0 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-01-15 08:18:57
Message-ID: 20200115.171857.1170166613760719188.horikyota.ntt@gmail.com
Lists: pgsql-hackers

Hello. I added a fix for the defect 2.

At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> === Defect 2: Forgets to skip WAL due to oversimplification in heap_create()
>
> In ALTER TABLE cases where TryReuseIndex() avoided an index rebuild, we need
> to transfer WAL-skipped state to the new index relation. Before v24nm, the
> new index relation skipped WAL unconditionally. Since v24nm, the new index
> relation never skips WAL. I've added a test to alter_table.sql that reveals
> this problem under wal_level=minimal.

The fix for this defect utilizes the mechanism that preserves the
relcache entry for a dropped relation. If ATExecAddIndex can obtain
such a relcache entry for the old relation, it should hold the newness
flags, and we can copy them to the new relcache entry. I added one
member named oldRelId to the struct IndexStmt to let the function
access the relcache entry for the old index relation.

I forgot to assign version 31 to the last patch, so I used that number
for this version.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v31-0001-Rework-WAL-skipping-optimization.patch text/x-patch 86.0 KB
v31-0002-Fix-the-defect-1.patch text/x-patch 6.0 KB
v31-0003-Fix-the-defect-2.patch text/x-patch 2.9 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-01-16 05:20:57
Message-ID: 20200116.142057.1250623796779593147.horikyota.ntt@gmail.com
Lists: pgsql-hackers

All the known defects are fixed.

At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> === Defect 3: storage.c checks size decrease of MAIN_FORKNUM only
>
> storage.c tracks only MAIN_FORKNUM in pendingsync->max_truncated. Is it
> possible for MAIN_FORKNUM to have a net size increase while FSM_FORKNUM has a
> net size decrease? I haven't tested, but this sequence seems possible:
>
> TRUNCATE
> reduces MAIN_FORKNUM from 100 blocks to 0 blocks
> reduces FSM_FORKNUM from 3 blocks to 0 blocks
> COPY
> raises MAIN_FORKNUM from 0 blocks to 110 blocks
> does not change FSM_FORKNUM
> COMMIT
> should fsync, but wrongly chooses log_newpage_range() approach
>
> If that's indeed a problem, beside the obvious option of tracking every fork's
> max_truncated, we could convert max_truncated to a bool and use fsync anytime
> the relation experienced an mdtruncate(). (While FSM_FORKNUM is not critical
> for database operations, the choice to subject it to checksums entails
> protecting it here.) If that's not a problem, would you explain?

That causes a page-load failure, since the FSM can point to a
nonexistent heap block, and that failure leads to an ERROR for an SQL
statement. It's not critical, but it is surely a problem. I'd like to
take the bool option, because a truncate-then-insert sequence rarely
happens. That case is not the main target of this optimization, so it
is enough for us to make sure that the operation doesn't lead to such
errors.

The attached is the nm30 patch followed by the three fix patches for
the three defects. The new member "RelationData.isremoved" is renamed
to "isdropped" in this version.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v32-0001-Rework-WAL-skipping-optimization.patch text/x-patch 98.7 KB
v32-0002-Fix-the-defect-1.patch text/x-patch 6.3 KB
v32-0003-Fix-the-defect-2.patch text/x-patch 3.0 KB
v32-0004-Fix-the-defect-3.patch text/x-patch 4.1 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-01-19 03:51:39
Message-ID: 20200119035139.GA2811524@rfd.leadboat.com
Lists: pgsql-hackers

On Tue, Jan 14, 2020 at 07:35:22PM +0900, Kyotaro Horiguchi wrote:
> At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> > At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > > === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO
> > >
> > > A test in transactions.sql now fails in AssertPendingSyncs_RelationCache(),
> > > when running "make check" under wal_level=minimal. I test this way:
> > >
> > > printf '%s\n%s\n' 'wal_level = minimal' 'max_wal_senders = 0' >$PWD/minimal.conf
> > > make check TEMP_CONFIG=$PWD/minimal.conf
> > >
> > > Self-contained demonstration:
> > > begin;
> > > create table t (c int);
> > > savepoint q; drop table t; rollback to q; -- forgets table is skipping wal
> > > commit; -- assertion failure
>
> This is complex than expected. The DROP TABLE unconditionally removed
> relcache entry. To fix that, I tried to use rd_isinvalid but it failed
> because there's a state that a relcache invalid but the corresponding
> catalog entry is alive.
>
> In the attached patch 0002, I added a boolean in relcache that
> indicates that the relation is already removed in catalog but not
> committed.

This design could work, but some of its properties aren't ideal. For example,
RelationIdGetRelation() can return a !rd_isvalid relation when the relation
has been dropped. What other designs did you consider, if any?

On Thu, Jan 16, 2020 at 02:20:57PM +0900, Kyotaro Horiguchi wrote:
> --- a/src/backend/utils/cache/relcache.c
> +++ b/src/backend/utils/cache/relcache.c
> @@ -3114,8 +3153,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
> */
> if (relation->rd_createSubid != InvalidSubTransactionId)
> {
> - if (isCommit)
> - relation->rd_createSubid = InvalidSubTransactionId;
> + relation->rd_createSubid = InvalidSubTransactionId;
> +
> + if (isCommit && !relation->rd_isdropped)
> + {} /* Nothing to do */

What is the purpose of this particular change? This executes at the end of a
top-level transaction. We've already done any necessary syncing, and we're
clearing any flags that caused WAL skipping. I think it's no longer
productive to treat dropped relations differently.

> @@ -3232,6 +3272,19 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
> }
> }
>
> + /*
> + * If this relation registered pending sync then dropped, subxact rollback
> + * cancels the uncommitted drop, and commit propagates it to the parent.
> + */
> + if (relation->rd_isdropped)
> + {
> + Assert (!relation->rd_isvalid &&
> + (relation->rd_createSubid != InvalidSubTransactionId ||
> + relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId));
> + if (!isCommit)
> + relation->rd_isdropped = false;

This does the wrong thing when there exists some subtransaction rollback that
does not rollback the DROP:

\pset null 'NULL'
begin;
create extension pg_visibility;
create table droppedtest (c int);
select 'droppedtest'::regclass::oid as oid \gset
savepoint q; drop table droppedtest; release q; -- rd_dropped==true
select * from pg_visibility_map(:oid); -- processes !rd_isvalid rel (not ideal)
savepoint q; select 1; rollback to q; -- rd_dropped==false (wrong)
savepoint q; select 1; rollback to q;
select pg_relation_size(:oid), pg_relation_filepath(:oid),
has_table_privilege(:oid, 'SELECT'); -- all nulls, okay
select * from pg_visibility_map(:oid); -- assertion failure
rollback;
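The behavior the test above demands can be sketched in self-contained C (stand-in names, not the relcache code): a "dropped" flag must record which subtransaction performed the drop, the way rd_createSubid records the creating subtransaction, so that an unrelated subtransaction rollback leaves it alone.

```c
#include <assert.h>

typedef unsigned int SubTransactionId;
#define InvalidSubTransactionId ((SubTransactionId) 0)

/* Hypothetical stand-in for a relcache entry's drop bookkeeping. */
typedef struct
{
	SubTransactionId droppedSubid;	/* subxact that dropped the rel, or 0 */
} FakeRelcacheEntry;

/*
 * At end of each subtransaction: commit propagates the drop to the parent
 * subtransaction; rollback cancels it only if this very subtransaction
 * performed it.  A bare boolean cleared on any rollback cannot express this.
 */
static void
fake_at_eosubxact(FakeRelcacheEntry *rel, int isCommit,
				  SubTransactionId mySubid, SubTransactionId parentSubid)
{
	if (rel->droppedSubid == mySubid)
		rel->droppedSubid = isCommit ? parentSubid : InvalidSubTransactionId;
}
```

With this shape, the "savepoint q; select 1; rollback to q;" step above cannot cancel a drop that was already released to the parent.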


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-01-21 09:45:57
Message-ID: 20200121.184557.1462314355964996736.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thank you for the comment.

At Sat, 18 Jan 2020 19:51:39 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Tue, Jan 14, 2020 at 07:35:22PM +0900, Kyotaro Horiguchi wrote:
> > At Thu, 26 Dec 2019 12:46:39 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> > > At Wed, 25 Dec 2019 16:15:21 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > > > === Defect 1: Forgets to skip WAL after SAVEPOINT; DROP TABLE; ROLLBACK TO
...
> > This is more complex than expected. The DROP TABLE unconditionally
> > removed the relcache entry. To fix that, I tried to use rd_isinvalid,
> > but it failed because there is a state where a relcache entry is
> > invalid but the corresponding catalog entry is still alive.
> >
> > In the attached patch 0002, I added a boolean to the relcache entry
> > that indicates that the relation has already been removed from the
> > catalogs but the removal is not yet committed.
>
> This design could work, but some of its properties aren't ideal. For example,
> RelationIdGetRelation() can return a !rd_isvalid relation when the relation
> has been dropped. What other designs did you consider, if any?

I thought that entries with rd_isdropped set to true could not be
fetched by other transactions, because the relid is not visible to
them. Still, the same session could do so by repeatedly reindexing or
invalidating the same relation, and I think that is safe because the
content of the entry cannot change and the cached content is reusable.
That being said, it does make things unclear.

I came up with two alternatives. One is a variant of
RelationIdGetRelation for this purpose. The new function
RelationIdGetRelationCache is currently used (only) in ATExecAddIndex,
so we could restrict it to returning only dropped relations.

The other is a separate "stashed" relcache, but that seems to make
things too complex.

> On Thu, Jan 16, 2020 at 02:20:57PM +0900, Kyotaro Horiguchi wrote:
> > --- a/src/backend/utils/cache/relcache.c
> > +++ b/src/backend/utils/cache/relcache.c
> > @@ -3114,8 +3153,10 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
> > */
> > if (relation->rd_createSubid != InvalidSubTransactionId)
> > {
> > - if (isCommit)
> > - relation->rd_createSubid = InvalidSubTransactionId;
> > + relation->rd_createSubid = InvalidSubTransactionId;
> > +
> > + if (isCommit && !relation->rd_isdropped)
> > + {} /* Nothing to do */
>
> What is the purpose of this particular change? This executes at the end of a
> top-level transaction. We've already done any necessary syncing, and we're
> clearing any flags that caused WAL skipping. I think it's no longer
> productive to treat dropped relations differently.

It executes the pending *relcache* drop that we should have done in
ATPostAlterTypeCleanup (or in RelationClearRelation) if the newness
flags had been false. The entry misses its chance of being removed
(and then bloats the relcache) if we don't do that there. I added a
comment there explaining that.
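The end-of-transaction cleanup being discussed can be sketched like this (hypothetical names, not the real AtEOXact_cleanup): an entry for a relation created in the current transaction must leave the cache on abort, and also on commit if the relation was dropped, since nothing can rebuild it from the catalogs and it would otherwise linger forever.

```c
#include <assert.h>

/* Hypothetical stand-in for the relevant relcache entry state. */
typedef struct
{
	int		created_in_xact;	/* stand-in for rd_createSubid != 0 */
	int		is_dropped;			/* stand-in for an rd_isdropped-style flag */
	int		in_cache;			/* still linked into the relcache hash? */
} FakeRel;

/*
 * At top-level transaction end: a created-in-this-xact entry survives a
 * commit, unless the relation was dropped; aborted creations and committed
 * drops are unlinked so the dead entry does not bloat the cache.
 */
static void
fake_at_eoxact_cleanup(FakeRel *rel, int isCommit)
{
	if (rel->created_in_xact)
	{
		rel->created_in_xact = 0;
		if (!isCommit || rel->is_dropped)
			rel->in_cache = 0;	/* unlink: nothing can rebuild this entry */
	}
}
```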

> > @@ -3232,6 +3272,19 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
> > }
> > }
> >
> > + /*
> > + * If this relation registered pending sync then dropped, subxact rollback
> > + * cancels the uncommitted drop, and commit propagates it to the parent.
> > + */
> > + if (relation->rd_isdropped)
> > + {
> > + Assert (!relation->rd_isvalid &&
> > + (relation->rd_createSubid != InvalidSubTransactionId ||
> > + relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId));
> > + if (!isCommit)
> > + relation->rd_isdropped = false;
>
> This does the wrong thing when there exists some subtransaction rollback that
> does not rollback the DROP:

Sorry for my mistake. I actually considered something like that along
the way. In the end, I concluded that the dropped flag ought to behave
the same way as rd_createSubid.

> \pset null 'NULL'
> begin;
> create extension pg_visibility;
> create table droppedtest (c int);
> select 'droppedtest'::regclass::oid as oid \gset
> savepoint q; drop table droppedtest; release q; -- rd_dropped==true
> select * from pg_visibility_map(:oid); -- processes !rd_isvalid rel (not ideal)
> savepoint q; select 1; rollback to q; -- rd_dropped==false (wrong)
> savepoint q; select 1; rollback to q;
> select pg_relation_size(:oid), pg_relation_filepath(:oid),
> has_table_privilege(:oid, 'SELECT'); -- all nulls, okay
> select * from pg_visibility_map(:oid); -- assertion failure
> rollback;

I also taught RelationIdGetRelation not to return dropped relations,
so the (not ideal) cases simply fail as before.

Three other fixes not mentioned above are made. One is the useless
rd_firstRelfilenodeSubid in the condition to decide whether or not to
preserve a relcache entry, plus the forgotten copying of the other
newness flags. Another is the forgotten SWAPFIELD on
rd_droppedSubid. The last is the forgotten change in
out/equal/copyfuncs.

Please find the attached.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v33-0001-Rework-WAL-skipping-optimization.patch text/x-patch 98.7 KB
v33-0002-Fix-the-defect-1.patch text/x-patch 10.3 KB
v33-0003-Fix-the-defect-2.patch text/x-patch 4.7 KB
v33-0004-Fix-the-defect-3.patch text/x-patch 4.1 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-01-27 04:22:01
Message-ID: 20200127042201.GA3119606@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Diffing the two latest versions of one patch:
> --- v32-0002-Fix-the-defect-1.patch 2020-01-18 14:32:47.499129940 -0800
> +++ v33-0002-Fix-the-defect-1.patch 2020-01-26 16:23:52.846391035 -0800
> +@@ -2978,8 +3054,8 @@ AssertPendingSyncs_RelationCache(void)
> + LOCKTAG_RELATION)
> + continue;
> + relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
> +- r = RelationIdGetRelation(relid);
> +- if (r == NULL)
> ++ r = RelationIdGetRelationCache(relid);

The purpose of this loop is to create relcache entries for rels locked in the
current transaction. (The "r == NULL" case happens for rels no longer visible
in catalogs. It is harmless.) Since RelationIdGetRelationCache() never
creates a relcache entry, calling it defeats that purpose.
RelationIdGetRelation() is the right function to call.

On Tue, Jan 21, 2020 at 06:45:57PM +0900, Kyotaro Horiguchi wrote:
> Three other fixes not mentioned above are made. One is the useless
> rd_firstRelfilenodeSubid in the condition to decide whether or not to
> preserve a relcache entry

It was not useless. Test case:

create table t (c int);
begin;
alter table t alter c type bigint; -- sets rd_firstRelfilenodeSubid
savepoint q; drop table t; rollback to q; -- forgets rd_firstRelfilenodeSubid
commit; -- assertion failure, after s/RelationIdGetRelationCache/RelationIdGetRelation/ discussed above
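The point of the test above can be sketched as a predicate (illustrative names, not the relcache code): the entry must be preserved across a rolled-back DROP when *either* newness flag is set, because a table rewritten in this transaction (rd_firstRelfilenodeSubid) still needs its pending-sync bookkeeping at commit even though rd_createSubid is unset.

```c
#include <assert.h>

typedef unsigned int SubTransactionId;
#define InvalidSubTransactionId ((SubTransactionId) 0)

/*
 * Should a relcache entry be preserved (as "dropped") rather than destroyed
 * when a subtransaction drops the relation?  Testing only createSubid would
 * lose the ALTER TYPE case exercised above.
 */
static int
fake_preserve_entry(SubTransactionId createSubid,
					SubTransactionId firstRelfilenodeSubid)
{
	return createSubid != InvalidSubTransactionId ||
		firstRelfilenodeSubid != InvalidSubTransactionId;
}
```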


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-01-27 04:44:13
Message-ID: 20200127.134413.1734378790610835095.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Thanks!

At Sun, 26 Jan 2020 20:22:01 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> Diffing the two latest versions of one patch:
> > --- v32-0002-Fix-the-defect-1.patch 2020-01-18 14:32:47.499129940 -0800
> > +++ v33-0002-Fix-the-defect-1.patch 2020-01-26 16:23:52.846391035 -0800
> > +@@ -2978,8 +3054,8 @@ AssertPendingSyncs_RelationCache(void)
> > + LOCKTAG_RELATION)
> > + continue;
> > + relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
> > +- r = RelationIdGetRelation(relid);
> > +- if (r == NULL)
> > ++ r = RelationIdGetRelationCache(relid);
>
> The purpose of this loop is to create relcache entries for rels locked in the
> current transaction. (The "r == NULL" case happens for rels no longer visible
> in catalogs. It is harmless.) Since RelationIdGetRelationCache() never
> creates a relcache entry, calling it defeats that purpose.
> RelationIdGetRelation() is the right function to call.

I thought that all the required entries already exist in the cache,
but it is actually safer to recreate dropped entries. Does the
following work?

r = RelationIdGetRelation(relid);
+ /* if not found, fetch a "dropped" entry if any */
+ if (r == NULL)
+ r = RelationIdGetRelationCache(relid);
if (r == NULL)
continue;

> On Tue, Jan 21, 2020 at 06:45:57PM +0900, Kyotaro Horiguchi wrote:
> > Three other fixes not mentioned above are made. One is the useless
> > rd_firstRelfilenodeSubid in the condition to decide whether or not to
> > preserve a relcache entry
>
> It was not useless. Test case:
>
> create table t (c int);
> begin;
> alter table t alter c type bigint; -- sets rd_firstRelfilenodeSubid
> savepoint q; drop table t; rollback to q; -- forgets rd_firstRelfilenodeSubid
> commit; -- assertion failure, after s/RelationIdGetRelationCache/RelationIdGetRelation/ discussed above

Mmm? I somehow thought that that relcache entry would never be dropped,
and I believe I considered that case, of course. But yes, you're right.

I'll post an updated version.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-01-27 04:57:00
Message-ID: 20200127045700.GB3119606@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jan 27, 2020 at 01:44:13PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 26 Jan 2020 20:22:01 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > Diffing the two latest versions of one patch:
> > > --- v32-0002-Fix-the-defect-1.patch 2020-01-18 14:32:47.499129940 -0800
> > > +++ v33-0002-Fix-the-defect-1.patch 2020-01-26 16:23:52.846391035 -0800
> > > +@@ -2978,8 +3054,8 @@ AssertPendingSyncs_RelationCache(void)
> > > + LOCKTAG_RELATION)
> > > + continue;
> > > + relid = ObjectIdGetDatum(locallock->tag.lock.locktag_field2);
> > > +- r = RelationIdGetRelation(relid);
> > > +- if (r == NULL)
> > > ++ r = RelationIdGetRelationCache(relid);
> >
> > The purpose of this loop is to create relcache entries for rels locked in the
> > current transaction. (The "r == NULL" case happens for rels no longer visible
> > in catalogs. It is harmless.) Since RelationIdGetRelationCache() never
> > creates a relcache entry, calling it defeats that purpose.
> > RelationIdGetRelation() is the right function to call.
>
> I thought that all the required entries already exist in the cache,
> but it is actually safer to recreate dropped entries. Does the
> following work?
>
> r = RelationIdGetRelation(relid);
> + /* if not found, fetch a "dropped" entry if any */
> + if (r == NULL)
> + r = RelationIdGetRelationCache(relid);
> if (r == NULL)
> continue;

That does not materially change the function's behavior. Notice that the
function does one thing with "r", which is to call RelationClose(r). The
function calls RelationIdGetRelation() for its side effects, not for its
return value.


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-01-27 06:08:18
Message-ID: 20200127.150818.232645484356723781.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

By the way, the previous version looks somewhat different from what I
thought I posted..

At Sun, 26 Jan 2020 20:57:00 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Mon, Jan 27, 2020 at 01:44:13PM +0900, Kyotaro Horiguchi wrote:
> > > The purpose of this loop is to create relcache entries for rels locked in the
> > > current transaction. (The "r == NULL" case happens for rels no longer visible
> > > in catalogs. It is harmless.) Since RelationIdGetRelationCache() never
> > > creates a relcache entry, calling it defeats that purpose.
> > > RelationIdGetRelation() is the right function to call.
> >
> > I thought that all the required entries already exist in the cache,
> > but it is actually safer to recreate dropped entries. Does the
> > following work?
> >
> > r = RelationIdGetRelation(relid);
> > + /* if not found, fetch a "dropped" entry if any */
> > + if (r == NULL)
> > + r = RelationIdGetRelationCache(relid);
> > if (r == NULL)
> > continue;
>
> That does not materially change the function's behavior. Notice that the
> function does one thing with "r", which is to call RelationClose(r). The
> function calls RelationIdGetRelation() for its side effects, not for its
> return value.

..Right. The following loop accesses the relcache hash directly, so
there is no need to store the returned r into the rels array..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-01-27 10:28:31
Message-ID: 20200127.192831.1430769560984064622.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello, this is the rebased and updated version.

- A valid rd_firstRelfilenodeSubid now causes a pending relcache drop,
  as rd_createSubid does. The oversight in the last example no longer
  happens.

- Revert the (really) useless change of AssertPendingSyncs_RelationCache.

- Fix several comments. Some of the fixes are just rewording and some
are related to the first change above.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v34-0001-Rework-WAL-skipping-optimization.patch text/x-patch 98.7 KB
v34-0002-Fix-the-defect-1.patch text/x-patch 12.4 KB
v34-0003-Fix-the-defect-2.patch text/x-patch 4.7 KB
v34-0004-Fix-the-defect-3.patch text/x-patch 4.1 KB

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-02-18 02:56:15
Message-ID: CA+hUKGLSyn78-9MtM5W4=17Aeq3+71Fyc6jLSbi22adRKZv55g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jan 27, 2020 at 11:30 PM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
> Hello, this is the rebased and updated version.

Hi, I haven't followed this thread but I just noticed this strange
looking failure:

CREATE TYPE priv_testtype1 AS (a int, b text);
+ERROR: relation 24844 deleted while still in use
REVOKE USAGE ON TYPE priv_testtype1 FROM PUBLIC;

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.79923

It didn't fail on the same OS a couple of days earlier:

https://ci.appveyor.com/project/postgresql-cfbot/postgresql/builds/30829686


From: Noah Misch <noah(at)leadboat(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-02-18 08:53:37
Message-ID: 20200218085337.GB3781216@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 18, 2020 at 03:56:15PM +1300, Thomas Munro wrote:
> CREATE TYPE priv_testtype1 AS (a int, b text);
> +ERROR: relation 24844 deleted while still in use
> REVOKE USAGE ON TYPE priv_testtype1 FROM PUBLIC;
>
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.79923
>
> It didn't fail on the same OS a couple of days earlier:
>
> https://ci.appveyor.com/project/postgresql-cfbot/postgresql/builds/30829686

Thanks for the report. This reproduces consistently under
CLOBBER_CACHE_ALWAYS (which, coincidentally, I started today). Removing the
heap_create() change fixes it. Since we now restore a saved rd_createSubid,
the heap_create() change is obsolete. My next version will include that fix.

The system uses rd_createSubid to mean two things. First, rd_node is new.
Second, the rel might not yet be in catalogs, so we can't rebuild its relcache
entry. The first can be false while the second is true, hence this failure.
However, the second is true in a relatively-narrow period in which we don't
run arbitrary user code. Hence, that simple fix suffices.
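The two conflated meanings can be made explicit in a sketch (hypothetical field names, not a proposed patch): splitting them shows the failing window, where "new relfilenode" is false while "not yet rebuildable from catalogs" is true, and rebuilding the entry there raises "relation ... deleted while still in use".

```c
#include <assert.h>

/* The two facts rd_createSubid currently stands in for, separated. */
typedef struct
{
	int		rnode_is_new;			/* meaning 1: rd_node created this xact */
	int		missing_from_catalogs;	/* meaning 2: entry can't be rebuilt yet */
} FakeNewness;

/*
 * Rebuilding a relcache entry from the catalogs is safe only with respect
 * to meaning 2; meaning 1 is irrelevant to rebuildability.
 */
static int
fake_safe_to_rebuild(const FakeNewness *n)
{
	return !n->missing_from_catalogs;
}
```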


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: thomas(dot)munro(at)gmail(dot)com, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-02-18 10:50:16
Message-ID: 20200218.195016.1845965961888951051.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Oops. I was working on the wrong branch and got stuck in a slow build
on Windows...

At Tue, 18 Feb 2020 00:53:37 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Tue, Feb 18, 2020 at 03:56:15PM +1300, Thomas Munro wrote:
> > CREATE TYPE priv_testtype1 AS (a int, b text);
> > +ERROR: relation 24844 deleted while still in use
> > REVOKE USAGE ON TYPE priv_testtype1 FROM PUBLIC;
> >
> > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.79923
> >
> > It didn't fail on the same OS a couple of days earlier:
> >
> > https://ci.appveyor.com/project/postgresql-cfbot/postgresql/builds/30829686
>
> Thanks for the report. This reproduces consistently under
> CLOBBER_CACHE_ALWAYS (which, coincidentally, I started today). Removing the
> heap_create() change fixes it. Since we now restore a saved rd_createSubid,
> the heap_create() change is obsolete. My next version will include that fix.

Yes, ATExecAddIndex sets createSubid correctly without that.

> The system uses rd_createSubid to mean two things. First, rd_node is new.
> Second, the rel might not yet be in catalogs, so we can't rebuild its relcache
> entry. The first can be false while the second is true, hence this failure.
> However, the second is true in a relatively-narrow period in which we don't
> run arbitrary user code. Hence, that simple fix suffices.

I didn't pay attention to the second meaning. I thought the failure
was caused by invalidation, but I couldn't get a core dump on Windows
10.. The comment for RelationCacheInvalidate seems to explain the
second meaning only faintly.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-02-19 07:44:52
Message-ID: 20200219074452.GA4006615@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

I think attached v35nm is ready for commit to master. Would anyone like to
talk me out of back-patching this? I would not enjoy back-patching it, but
it's hard to justify lack of back-patch for a data-loss bug.

Notable changes since v34:

- Separate a few freestanding fixes into their own patches.

On Mon, Jan 27, 2020 at 07:28:31PM +0900, Kyotaro Horiguchi wrote:
> --- a/src/backend/catalog/storage.c
> +++ b/src/backend/catalog/storage.c
> @@ -388,13 +388,7 @@ RelationPreTruncate(Relation rel)
> /* Record largest maybe-unsynced block of files under tracking */
> pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
> HASH_FIND, NULL);
> - if (pending)
> - {
> - BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
> -
> - if (pending->max_truncated < nblocks)
> - pending->max_truncated = nblocks;
> - }
> + pending->is_truncated = true;

- Fix this crashing when "pending" is NULL, as it is in this test case:

begin;
create temp table t ();
create table t2 (); -- cause pendingSyncHash to exist
truncate t;
rollback;

- Fix the "deleted while still in use" problem that Thomas Munro reported, by
removing the heap_create() change. Restoring the saved rd_createSubid had
made obsolete the heap_create() change. check-world now passes with
wal_level=minimal and CLOBBER_CACHE_ALWAYS.

- Set rd_droppedSubid in RelationForgetRelation(), not
RelationClearRelation(). RelationForgetRelation() knows it is processing a
drop, but RelationClearRelation() could only infer that from circumstantial
evidence. This seems more future-proof to me.

- When reusing an index build, instead of storing the dropped relid in the
IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), store
the subid fields in the IndexStmt. This is less code, and I felt
RelationIdGetRelationCache() invited misuse.
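The first fix in the list above boils down to a NULL guard; it can be sketched with stand-in types (not the storage.c code): a HASH_FIND-style lookup returns NULL when the relation was never registered for pending sync, as with the temp table in the test case, so the result must be checked before dereferencing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a pending-sync hash entry. */
typedef struct
{
	int		is_truncated;
} FakePendingSync;

/*
 * Record that the relation was truncated, but only if it is under
 * tracking at all; the lookup legitimately returns NULL otherwise.
 */
static void
fake_pre_truncate(FakePendingSync *pending)
{
	if (pending)
		pending->is_truncated = 1;
}
```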

Attachment Content-Type Size
createsubid-cosmetics-v35nm.patch text/plain 2.4 KB
toast_get_valid_index-lock-v35nm.patch text/plain 1.9 KB
log_newpage_range-args-v35nm.patch text/plain 1.7 KB
skip-wal-v35nm.patch text/plain 115.7 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-02-19 08:29:08
Message-ID: 20200219.172908.1235030736223943908.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> I think attached v35nm is ready for commit to master. Would anyone like to
> talk me out of back-patching this? I would not enjoy back-patching it, but
> it's hard to justify lack of back-patch for a data-loss bug.
>
> Notable changes since v34:
>
> - Separate a few freestanding fixes into their own patches.

All three patches look fine.

> On Mon, Jan 27, 2020 at 07:28:31PM +0900, Kyotaro Horiguchi wrote:
> > --- a/src/backend/catalog/storage.c
> > +++ b/src/backend/catalog/storage.c
> > @@ -388,13 +388,7 @@ RelationPreTruncate(Relation rel)
> > /* Record largest maybe-unsynced block of files under tracking */
> > pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
> > HASH_FIND, NULL);
> > - if (pending)
> > - {
> > - BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
> > -
> > - if (pending->max_truncated < nblocks)
> > - pending->max_truncated = nblocks;
> > - }
> > + pending->is_truncated = true;
>
> - Fix this crashing when "pending" is NULL, as it is in this test case:
>
> begin;
> create temp table t ();
> create table t2 (); -- cause pendingSyncHash to exist
> truncate t;
> rollback;

That's terrible... Thanks for fixing it.

> - Fix the "deleted while still in use" problem that Thomas Munro reported, by
> removing the heap_create() change. Restoring the saved rd_createSubid had
> made obsolete the heap_create() change. check-world now passes with
> wal_level=minimal and CLOBBER_CACHE_ALWAYS.

Ok, as in the previous mail.

> - Set rd_droppedSubid in RelationForgetRelation(), not
> RelationClearRelation(). RelationForgetRelation() knows it is processing a
> drop, but RelationClearRelation() could only infer that from circumstantial
> evidence. This seems more future-proof to me.

Agreed. Unlike RelationClearRelation, RelationForgetRelation is called
only when dropping the relation.

> - When reusing an index build, instead of storing the dropped relid in the
> IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), store
> the subid fields in the IndexStmt. This is less code, and I felt
> RelationIdGetRelationCache() invited misuse.

Hmm. I'm not sure that it is good for index_create to take the new
subid parameters. And the last if (OidIsValid) clause handles storage
persistence, so I did that there. But I don't feel strongly against it.

Please give me a bit more time to look at it.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-02-19 08:31:36
Message-ID: 20200219.173136.220467556238936738.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Sorry, just one fix. (omitting some typos, though..)

At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > I think attached v35nm is ready for commit to master. Would anyone like to
> > talk me out of back-patching this? I would not enjoy back-patching it, but
> > it's hard to justify lack of back-patch for a data-loss bug.
> >
> > Notable changes since v34:
> >
> > - Separate a few freestanding fixes into their own patches.
>
> All three patches look fine.
>
> > On Mon, Jan 27, 2020 at 07:28:31PM +0900, Kyotaro Horiguchi wrote:
> > > --- a/src/backend/catalog/storage.c
> > > +++ b/src/backend/catalog/storage.c
> > > @@ -388,13 +388,7 @@ RelationPreTruncate(Relation rel)
> > > /* Record largest maybe-unsynced block of files under tracking */
> > > pending = hash_search(pendingSyncHash, &(rel->rd_smgr->smgr_rnode.node),
> > > HASH_FIND, NULL);
> > > - if (pending)
> > > - {
> > > - BlockNumber nblocks = smgrnblocks(rel->rd_smgr, MAIN_FORKNUM);
> > > -
> > > - if (pending->max_truncated < nblocks)
> > > - pending->max_truncated = nblocks;
> > > - }
> > > + pending->is_truncated = true;
> >
> > - Fix this crashing when "pending" is NULL, as it is in this test case:
> >
> > begin;
> > create temp table t ();
> > create table t2 (); -- cause pendingSyncHash to exist
> > truncate t;
> > rollback;
>
> That's terrible... Thanks for fixing it.
>
> > - Fix the "deleted while still in use" problem that Thomas Munro reported, by
> > removing the heap_create() change. Restoring the saved rd_createSubid had
> > made obsolete the heap_create() change. check-world now passes with
> > wal_level=minimal and CLOBBER_CACHE_ALWAYS.
>
> Ok, as in the previous mail.
>
> > - Set rd_droppedSubid in RelationForgetRelation(), not
> > RelationClearRelation(). RelationForgetRelation() knows it is processing a
> > drop, but RelationClearRelation() could only infer that from circumstantial
> > evidence. This seems more future-proof to me.
>
> Agreed. Unlike RelationClearRelation, RelationForgetRelation is called
> only when dropping the relation.
>
> > - When reusing an index build, instead of storing the dropped relid in the
> > IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), store
> > the subid fields in the IndexStmt. This is less code, and I felt
> > RelationIdGetRelationCache() invited misuse.
>
> Hmm. I'm not sure that it is good for index_create to take the new
> subid parameters. And the last if (OidIsValid) clause handles storage
> persistence, so I did that there. But I don't feel strongly against it.

Hmm. I'm not sure that it is good for index_create to take the new
subid parameters. And the last if (OidIsValid) clause in ATExecAddIndex
handles storage persistence, so I did that there. But I don't feel
strongly against it.

> Please give me a bit more time to look at it.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-02-21 07:49:59
Message-ID: 20200221.164959.653062648402657703.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello. I looked through the latest patch.

At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > - When reusing an index build, instead of storing the dropped relid in the
> > IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), store
> > the subid fields in the IndexStmt. This is less code, and I felt
> > RelationIdGetRelationCache() invited misuse.
>
> Hmm. I'm not sure that it is good for index_create to take the new
> subid parameters. And the last if (OidIsValid) clause handles storage
> persistence, so I did that there. But I don't feel strongly against it.
>
> Please give me a bit more time to look at it.

The changes to alter_table.sql and create_table.sql are expected to
cause an assertion failure. Don't we need that kind of explanation in
the comment?

In swap_relation_files, we can remove the rel2-related code when
USE_ASSERT_CHECKING is not defined.

The patch adds the test for createSubid to pg_visibility.out. It
doesn't fail without CLOBBER_CACHE_ALWAYS during the regression tests,
but CLOBBER_CACHE_ALWAYS makes initdb fail, so the added check won't
be reached. I'm not sure it is useful.

config.sgml:
+ When <varname>wal_level</varname> is <literal>minimal</literal> and a
+ transaction commits after creating or rewriting a permanent table,
+ materialized view, or index, this setting determines how to persist

"creating or truncation" a permanent table? and maybe "refreshing
matview and reindex". I'm not sure that they can be merged that way.

Fixes for everything other than the pg_visibility.sql item are in the attached.

The others look good to me.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-02-23 05:12:20
Message-ID: 20200223051220.GA4150059@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote:
> At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> > At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > > - When reusing an index build, instead of storing the dropped relid in the
> > > IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), store
> > > the subid fields in the IndexStmt. This is less code, and I felt
> > > RelationIdGetRelationCache() invited misuse.
> >
> > Hmm. I'm not sure that index_create having the new subid parameters is
> > good. And the last if(OidIsValid) clause handles storage persistence
> > so I did that there. But I don't strongly against it.

Agreed. My choice there was not a clear improvement.

> The change on alter_table.sql and create_table.sql is expecting to
> cause assertion failure. Don't we need that kind of explanation in
> the comment?

Test comments generally describe the feature unique to that test, not how the
test might break. Some tests list bug numbers, but that doesn't apply here.

> In swap_relation_files, we can remove rel2-related code when #ifndef
> USE_ASSERT_CHECKING.

When state is visible to many compilation units, we should avoid making that
state depend on --enable-cassert. That would be a recipe for a Heisenbug. In
a hot code path, it might be worth the risk.

> The patch adds the test for createSubid to pg_visibility.out. It
> doesn't fail without CLOBBER_CACHE_ALWAYS while regression test but
> CLOBBER_CACHE_ALWAYS causes initdb fail and the added check won't be
> reached. I'm not sure it is useful.

I agree it's not clearly useful, but tests don't need to meet a "clearly
useful" standard. When a fast test is not clearly redundant with another
test, we generally accept it. In the earlier patch version that inspired this
test, RELCACHE_FORCE_RELEASE sufficed to make it fail.

> config.sgml:
> + When <varname>wal_level</varname> is <literal>minimal</literal> and a
> + transaction commits after creating or rewriting a permanent table,
> + materialized view, or index, this setting determines how to persist
>
> "creating or truncation" a permanent table? and maybe "refreshing
> matview and reindex". I'm not sure that they can be merged that way.

> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -2889,13 +2889,13 @@ include_dir 'conf.d'
> <listitem>
> <para>
> When <varname>wal_level</varname> is <literal>minimal</literal> and a
> - transaction commits after creating or rewriting a permanent table,
> - materialized view, or index, this setting determines how to persist
> - the new data. If the data is smaller than this setting, write it to
> - the WAL log; otherwise, use an fsync of the data file. Depending on
> - the properties of your storage, raising or lowering this value might
> - help if such commits are slowing concurrent transactions. The default
> - is two megabytes (<literal>2MB</literal>).
> + transaction commits after creating or truncating a permanent table,
> + refreshing a materialized view, or reindexing, this setting determines
> + how to persist the new data. If the data is smaller than this
> + setting, write it to the WAL log; otherwise, use an fsync of the data
> + file. Depending on the properties of your storage, raising or
> + lowering this value might help if such commits are slowing concurrent
> + transactions. The default is two megabytes (<literal>2MB</literal>).
> </para>

I like mentioning truncation, but I dislike how this implies that CREATE
INDEX, CREATE MATERIALIZED VIEW, and ALTER INDEX SET TABLESPACE aren't in
scope. While I usually avoid the word "relation" in documentation, I can
justify it here to make the sentence less complex. How about the following?

--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -2484,9 +2484,9 @@ include_dir 'conf.d'
In <literal>minimal</literal> level, no information is logged for
- tables or indexes for the remainder of a transaction that creates or
- truncates them. This can make bulk operations much faster (see
- <xref linkend="populate-pitr"/>). But minimal WAL does not contain
- enough information to reconstruct the data from a base backup and the
- WAL logs, so <literal>replica</literal> or higher must be used to
- enable WAL archiving (<xref linkend="guc-archive-mode"/>) and
- streaming replication.
+ permanent relations for the remainder of a transaction that creates,
+ rewrites, or truncates them. This can make bulk operations much
+ faster (see <xref linkend="populate-pitr"/>). But minimal WAL does
+ not contain enough information to reconstruct the data from a base
+ backup and the WAL logs, so <literal>replica</literal> or higher must
+ be used to enable WAL archiving (<xref linkend="guc-archive-mode"/>)
+ and streaming replication.
</para>
@@ -2891,9 +2891,9 @@ include_dir 'conf.d'
When <varname>wal_level</varname> is <literal>minimal</literal> and a
- transaction commits after creating or rewriting a permanent table,
- materialized view, or index, this setting determines how to persist
- the new data. If the data is smaller than this setting, write it to
- the WAL log; otherwise, use an fsync of the data file. Depending on
- the properties of your storage, raising or lowering this value might
- help if such commits are slowing concurrent transactions. The default
- is two megabytes (<literal>2MB</literal>).
+ transaction commits after creating, rewriting, or truncating a
+ permanent relation, this setting determines how to persist the new
+ data. If the data is smaller than this setting, write it to the WAL
+ log; otherwise, use an fsync of the data file. Depending on the
+ properties of your storage, raising or lowering this value might help
+ if such commits are slowing concurrent transactions. The default is
+ two megabytes (<literal>2MB</literal>).
</para>


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-02-25 01:01:51
Message-ID: 20200225.100151.2230637753040571699.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Sat, 22 Feb 2020 21:12:20 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote:
> > At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> > > At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > > > - When reusing an index build, instead of storing the dropped relid in the
> > > > IndexStmt and opening the dropped relcache entry in ATExecAddIndex(), store
> > > > the subid fields in the IndexStmt. This is less code, and I felt
> > > > RelationIdGetRelationCache() invited misuse.
> > >
> > > Hmm. I'm not sure that index_create having the new subid parameters is
> > > good. And the last if(OidIsValid) clause handles storage persistence
> > > so I did that there. But I don't strongly against it.
>
> Agreed. My choice there was not a clear improvement.
>
> > The change on alter_table.sql and create_table.sql is expecting to
> > cause assertion failure. Don't we need that kind of explanation in
> > the comment?
>
> Test comments generally describe the feature unique to that test, not how the
> test might break. Some tests list bug numbers, but that doesn't apply here.

Agreed.

> > In swap_relation_files, we can remove rel2-related code when #ifndef
> > USE_ASSERT_CHECKING.
>
> When state is visible to many compilation units, we should avoid making that
> state depend on --enable-cassert. That would be a recipe for a Heisenbug. In
> a hot code path, it might be worth the risk.

I agree that the new #ifdef could invite a Heisenbug. I thought that
you didn't want the code because it doesn't make a substantial
difference. If we decide to keep it for consistency, I would like to
say that the code is there for consistency, not for the benefit of a
specific assertion.

(cluster.c:1116)
- * new. The next step for rel2 is deletion, but copy rd_*Subid for the
- * benefit of AssertPendingSyncs_RelationCache().
+ * new. The next step for rel2 is deletion, but copy rd_*Subid for the
+ * consistency of the fields. It is checked later by
+ * AssertPendingSyncs_RelationCache().

> > The patch adds the test for createSubid to pg_visibility.out. It
> > doesn't fail without CLOBBER_CACHE_ALWAYS while regression test but
> > CLOBBER_CACHE_ALWAYS causes initdb fail and the added check won't be
> > reached. I'm not sure it is useful.
>
> I agree it's not clearly useful, but tests don't need to meet a "clearly
> useful" standard. When a fast test is not clearly redundant with another
> test, we generally accept it. In the earlier patch version that inspired this
> test, RELCACHE_FORCE_RELEASE sufficed to make it fail.
>
> > config.sgml:
> > + When <varname>wal_level</varname> is <literal>minimal</literal> and a
> > + transaction commits after creating or rewriting a permanent table,
> > + materialized view, or index, this setting determines how to persist
> >
> > "creating or truncation" a permanent table? and maybe "refreshing
> > matview and reindex". I'm not sure that they can be merged that way.
...
> I like mentioning truncation, but I dislike how this implies that CREATE
> INDEX, CREATE MATERIALIZED VIEW, and ALTER INDEX SET TABLESPACE aren't in
> scope. While I usually avoid the word "relation" in documentation, I can
> justify it here to make the sentence less complex. How about the following?
>
> --- a/doc/src/sgml/config.sgml
> +++ b/doc/src/sgml/config.sgml
> @@ -2484,9 +2484,9 @@ include_dir 'conf.d'
> In <literal>minimal</literal> level, no information is logged for
> - tables or indexes for the remainder of a transaction that creates or
> - truncates them. This can make bulk operations much faster (see
> - <xref linkend="populate-pitr"/>). But minimal WAL does not contain
> - enough information to reconstruct the data from a base backup and the
> - WAL logs, so <literal>replica</literal> or higher must be used to
> - enable WAL archiving (<xref linkend="guc-archive-mode"/>) and
> - streaming replication.
> + permanent relations for the remainder of a transaction that creates,
> + rewrites, or truncates them. This can make bulk operations much
> + faster (see <xref linkend="populate-pitr"/>). But minimal WAL does
> + not contain enough information to reconstruct the data from a base
> + backup and the WAL logs, so <literal>replica</literal> or higher must
> + be used to enable WAL archiving (<xref linkend="guc-archive-mode"/>)
> + and streaming replication.
> </para>
> @@ -2891,9 +2891,9 @@ include_dir 'conf.d'
> When <varname>wal_level</varname> is <literal>minimal</literal> and a
> - transaction commits after creating or rewriting a permanent table,
> - materialized view, or index, this setting determines how to persist
> - the new data. If the data is smaller than this setting, write it to
> - the WAL log; otherwise, use an fsync of the data file. Depending on
> - the properties of your storage, raising or lowering this value might
> - help if such commits are slowing concurrent transactions. The default
> - is two megabytes (<literal>2MB</literal>).
> + transaction commits after creating, rewriting, or truncating a
> + permanent relation, this setting determines how to persist the new
> + data. If the data is smaller than this setting, write it to the WAL
> + log; otherwise, use an fsync of the data file. Depending on the
> + properties of your storage, raising or lowering this value might help
> + if such commits are slowing concurrent transactions. The default is
> + two megabytes (<literal>2MB</literal>).
> </para>

I agree that "relation" works as the generic name for table-like
objects. In addition to that, doesn't using the words "storage file"
make it clearer? I'm not confident in the wording itself, but it
would look like the following.

> @@ -2484,9 +2484,9 @@ include_dir 'conf.d'
In <literal>minimal</literal> level, no information is logged for
permanent relations for the remainder of a transaction that creates,
replaces, or truncates the on-disk file. This can make bulk
operations much

> @@ -2891,9 +2891,9 @@ include_dir 'conf.d'
When <varname>wal_level</varname> is <literal>minimal</literal> and a
transaction commits after creating, replacing, or truncating the
on-disk file, this setting determines how to persist the new data. If
the data is smaller than this setting, write it to the WAL

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-02-26 05:36:12
Message-ID: 20200226053612.GA22911@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Feb 25, 2020 at 10:01:51AM +0900, Kyotaro Horiguchi wrote:
> At Sat, 22 Feb 2020 21:12:20 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote:
> > > At Wed, 19 Feb 2020 17:29:08 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> > > > At Tue, 18 Feb 2020 23:44:52 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in

> > > In swap_relation_files, we can remove rel2-related code when #ifndef
> > > USE_ASSERT_CHECKING.
> >
> > When state is visible to many compilation units, we should avoid making that
> > state depend on --enable-cassert. That would be a recipe for a Heisenbug. In
> > a hot code path, it might be worth the risk.
>
> I aggree that the new #ifdef can invite a Heisenbug. I thought that
> you didn't want that because it doesn't make substantial difference.

v35nm added swap_relation_files() code so AssertPendingSyncs_RelationCache()
could check rd_droppedSubid relations. v30nm, which did not have
rd_droppedSubid, removed swap_relation_files() code that wasn't making a
difference.

> If we decide to keep the consistency there, I would like to describe
> the code is there for consistency, not for the benefit of a specific
> assertion.
>
> (cluster.c:1116)
> - * new. The next step for rel2 is deletion, but copy rd_*Subid for the
> - * benefit of AssertPendingSyncs_RelationCache().
> + * new. The next step for rel2 is deletion, but copy rd_*Subid for the
> > + * consistency of the fields. It is checked later by
> + * AssertPendingSyncs_RelationCache().

I think the word "consistency" is too vague for "consistency of the fields" to
convey information. May I just remove the last sentence of the comment
(everything after "* new.")?

> > > config.sgml:
> > > + When <varname>wal_level</varname> is <literal>minimal</literal> and a
> > > + transaction commits after creating or rewriting a permanent table,
> > > + materialized view, or index, this setting determines how to persist
> > >
> > > "creating or truncation" a permanent table? and maybe "refreshing
> > > matview and reindex". I'm not sure that they can be merged that way.
> ...
> > I like mentioning truncation, but I dislike how this implies that CREATE
> > INDEX, CREATE MATERIALIZED VIEW, and ALTER INDEX SET TABLESPACE aren't in
> > scope. While I usually avoid the word "relation" in documentation, I can
> > justify it here to make the sentence less complex. How about the following?
> >
> > --- a/doc/src/sgml/config.sgml
> > +++ b/doc/src/sgml/config.sgml
> > @@ -2484,9 +2484,9 @@ include_dir 'conf.d'
> > In <literal>minimal</literal> level, no information is logged for
> > - tables or indexes for the remainder of a transaction that creates or
> > - truncates them. This can make bulk operations much faster (see
> > - <xref linkend="populate-pitr"/>). But minimal WAL does not contain
> > - enough information to reconstruct the data from a base backup and the
> > - WAL logs, so <literal>replica</literal> or higher must be used to
> > - enable WAL archiving (<xref linkend="guc-archive-mode"/>) and
> > - streaming replication.
> > + permanent relations for the remainder of a transaction that creates,
> > + rewrites, or truncates them. This can make bulk operations much
> > + faster (see <xref linkend="populate-pitr"/>). But minimal WAL does
> > + not contain enough information to reconstruct the data from a base
> > + backup and the WAL logs, so <literal>replica</literal> or higher must
> > + be used to enable WAL archiving (<xref linkend="guc-archive-mode"/>)
> > + and streaming replication.
> > </para>
> > @@ -2891,9 +2891,9 @@ include_dir 'conf.d'
> > When <varname>wal_level</varname> is <literal>minimal</literal> and a
> > - transaction commits after creating or rewriting a permanent table,
> > - materialized view, or index, this setting determines how to persist
> > - the new data. If the data is smaller than this setting, write it to
> > - the WAL log; otherwise, use an fsync of the data file. Depending on
> > - the properties of your storage, raising or lowering this value might
> > - help if such commits are slowing concurrent transactions. The default
> > - is two megabytes (<literal>2MB</literal>).
> > + transaction commits after creating, rewriting, or truncating a
> > + permanent relation, this setting determines how to persist the new
> > + data. If the data is smaller than this setting, write it to the WAL
> > + log; otherwise, use an fsync of the data file. Depending on the
> > + properties of your storage, raising or lowering this value might help
> > + if such commits are slowing concurrent transactions. The default is
> > + two megabytes (<literal>2MB</literal>).
> > </para>
>
> I agree that relation works as the generic name of table-like
> objects. Addition to that, doesn't using the word "storage file" make
> it more clearly? I'm not confident on the wording itself, but it will
> look like the following.
>
> > @@ -2484,9 +2484,9 @@ include_dir 'conf.d'
> In <literal>minimal</literal> level, no information is logged for
> permanent relations for the remainder of a transaction that creates,
> replaces, or truncates the on-disk file. This can make bulk
> operations much

The docs rarely use "storage file" or "on-disk file" as terms. I hesitate to
put more emphasis on files, because they are part of the implementation, not
part of the user interface. The term "rewrites"/"rewriting" has the same
problem, though. Yet another alternative would be to talk about operations
that change the pg_relation_filenode() return value:

In <literal>minimal</literal> level, no information is logged for permanent
relations for the remainder of a transaction that creates them or changes
what <function>pg_relation_filenode</function> returns for them.

What do you think?


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-02-27 07:00:24
Message-ID: 20200227.160024.1603714516899165010.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Tue, 25 Feb 2020 21:36:12 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Tue, Feb 25, 2020 at 10:01:51AM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 22 Feb 2020 21:12:20 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > > On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote:
> > I aggree that the new #ifdef can invite a Heisenbug. I thought that
> > you didn't want that because it doesn't make substantial difference.
>
> v35nm added swap_relation_files() code so AssertPendingSyncs_RelationCache()
> could check rd_droppedSubid relations. v30nm, which did not have
> rd_droppedSubid, removed swap_relation_files() code that wasn't making a
> difference.

OK, I understand it to mean that the additional code still makes a
difference in an --enable-cassert build.

> > If we decide to keep the consistency there, I would like to describe
> > the code is there for consistency, not for the benefit of a specific
> > assertion.
> >
> > (cluster.c:1116)
> > - * new. The next step for rel2 is deletion, but copy rd_*Subid for the
> > - * benefit of AssertPendingSyncs_RelationCache().
> > + * new. The next step for rel2 is deletion, but copy rd_*Subid for the
> > + * consistency of the fields. It is checked later by
> > + * AssertPendingSyncs_RelationCache().
>
> I think the word "consistency" is too vague for "consistency of the fields" to
> convey information. May I just remove the last sentence of the comment
> (everything after "* new.")?

I'm fine with that:)

> > I agree that relation works as the generic name of table-like
> > objects. Addition to that, doesn't using the word "storage file" make
> > it more clearly? I'm not confident on the wording itself, but it will
> > look like the following.
>
> The docs rarely use "storage file" or "on-disk file" as terms. I hesitate to
> put more emphasis on files, because they are part of the implementation, not
> part of the user interface. The term "rewrites"/"rewriting" has the same
> problem, though. Yet another alternative would be to talk about operations
> that change the pg_relation_filenode() return value:
>
> In <literal>minimal</literal> level, no information is logged for permanent
> relations for the remainder of a transaction that creates them or changes
> what <function>pg_relation_filenode</function> returns for them.
>
> What do you think?

It sounds somewhat obscure. Couldn't we enumerate examples? And if
we can use pg_relation_filenode, I think we can use just
"filenode". (Though the word is used in the documentation, it is not
defined anywhere..)

====
In <literal>minimal</literal> level, no information is logged for
permanent relations for the remainder of a transaction that creates
them or changes their <code>filenode</code>. For example, CREATE
TABLE, CLUSTER, and REFRESH MATERIALIZED VIEW are commands in that
category.
====

# sorry for bothering you..

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-01 19:56:32
Message-ID: 20200301195632.GA135286@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Feb 27, 2020 at 04:00:24PM +0900, Kyotaro Horiguchi wrote:
> At Tue, 25 Feb 2020 21:36:12 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > On Tue, Feb 25, 2020 at 10:01:51AM +0900, Kyotaro Horiguchi wrote:
> > > At Sat, 22 Feb 2020 21:12:20 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > > > On Fri, Feb 21, 2020 at 04:49:59PM +0900, Kyotaro Horiguchi wrote:
> > > If we decide to keep the consistency there, I would like to describe
> > > the code is there for consistency, not for the benefit of a specific
> > > assertion.
> > >
> > > (cluster.c:1116)
> > > - * new. The next step for rel2 is deletion, but copy rd_*Subid for the
> > > - * benefit of AssertPendingSyncs_RelationCache().
> > > + * new. The next step for rel2 is deletion, but copy rd_*Subid for the
> > > + * consistency of the fields. It is checked later by
> > > + * AssertPendingSyncs_RelationCache().
> >
> > I think the word "consistency" is too vague for "consistency of the fields" to
> > convey information. May I just remove the last sentence of the comment
> > (everything after "* new.")?
>
> I'm fine with that:)
>
> > > I agree that relation works as the generic name of table-like
> > > objects. Addition to that, doesn't using the word "storage file" make
> > > it more clearly? I'm not confident on the wording itself, but it will
> > > look like the following.
> >
> > The docs rarely use "storage file" or "on-disk file" as terms. I hesitate to
> > put more emphasis on files, because they are part of the implementation, not
> > part of the user interface. The term "rewrites"/"rewriting" has the same
> > problem, though. Yet another alternative would be to talk about operations
> > that change the pg_relation_filenode() return value:
> >
> > In <literal>minimal</literal> level, no information is logged for permanent
> > relations for the remainder of a transaction that creates them or changes
> > what <function>pg_relation_filenode</function> returns for them.
> >
> > What do you think?
>
> It sounds somewhat obscure.

I see. I won't use that.

> Coulnd't we enumetate examples? And if we
> could use pg_relation_filenode, I think we can use just
> "filenode". (Thuogh the word is used in the documentation, it is not
> defined anywhere..)

func.sgml does define the term. Nonetheless, I'm not using it.

> ====
> In <literal>minimal</literal> level, no information is logged for
> permanent relations for the remainder of a transaction that creates
> them or changes their <code>filenode</code>. For example, CREATE
> TABLE, CLUSTER or REFRESH MATERIALIZED VIEW are the command of that
> category.
> ====
>
> # sorry for bothering you..

Including examples is fine. Attached v36nm has just comment and doc changes.
Would you translate this into back-patch versions for v9.5 through v12?

Attachment Content-Type Size
createsubid-cosmetics-v35nm.patch text/plain 2.4 KB
toast_get_valid_index-lock-v35nm.patch text/plain 1.9 KB
log_newpage_range-args-v35nm.patch text/plain 1.7 KB
skip-wal-v36nm.patch text/plain 115.6 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-02 07:53:53
Message-ID: 20200302.165353.1408159864331844170.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

At Sun, 1 Mar 2020 11:56:32 -0800, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Thu, Feb 27, 2020 at 04:00:24PM +0900, Kyotaro Horiguchi wrote:
> > It sounds somewhat obscure.
>
> I see. I won't use that.

Thanks.

> > Coulnd't we enumetate examples? And if we
> > could use pg_relation_filenode, I think we can use just
> > "filenode". (Thuogh the word is used in the documentation, it is not
> > defined anywhere..)
>
> func.sgml does define the term. Nonetheless, I'm not using it.

Ah, "The filenode is the base component oif the file name(s) used for
the relation".. So it's very similar to "on-disk file" in a sense.

> > ====
> > In <literal>minimal</literal> level, no information is logged for
> > permanent relations for the remainder of a transaction that creates
> > them or changes their <code>filenode</code>. For example, CREATE
> > TABLE, CLUSTER or REFRESH MATERIALIZED VIEW are the command of that
> > category.
> > ====
> >
> > # sorry for bothering you..
>
> Including examples is fine. Attached v36nm has just comment and doc changes.
> Would you translate this into back-patch versions for v9.5 through v12?

The explicit list of commands that initiate the WAL-skipping mode
works for me. I'm going to work on the translation right now.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-04 07:29:19
Message-ID: 20200304.162919.898938381201316571.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hello.

The attached is back-patches from 9.5 through master.

At Mon, 02 Mar 2020 16:53:53 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> > Would you translate this into back-patch versions for v9.5 through v12?
>
> The explicit list of commands that initiate the WAL-skipping mode
> works for me. I'm going to work on the tranlation right now.

First, I fixed several issues in 018_wal_optimize.pl:

- TRUNCATE INSERT, TRUNCATE INSERT PREPARE

These wrongly pass if we end up seeing only the value from the first
INSERT. I changed them so that they check the values themselves, not
the number of values.

- TRUNCATE with end-of-xact WAL => lengthy end-of-xact WAL

TRUNCATE inhibits end-of-xact WAL, so I removed the TRUNCATE. The
test uses only 1 page, so it fails to exercise the multi-page
behavior of log_newpage_range. At least 33 pages are needed to check
whether it works correctly. 10000 rows would be sufficient, but I
chose 20000 rows to include a margin.

- COPY with INSERT triggers
It wrongly refers to OLD in an AFTER-INSERT trigger. That yields
NULL on 11 and later, or ends in an ERROR otherwise. In addition,
the AFTER-INSERT row-level trigger is fired after the *statement*
(but before the AFTER-INSERT statement-level triggers). That said,
this doesn't affect the result of the test, so I left it alone apart
from modifying it not to refer to OLD.

log_newpage_range was introduced in PG12. Fortunately the required
infrastructure was introduced in PG9.5, so what I need to do for
PG9.5-PG11 is back-patch the function and its counterpart in
xlog_redo. It doesn't change the WAL format itself, but XLOG_FPI
records can now carry 2 or more backup pages, so the compatibility is
forward-only. That is, newer minor versions can read WAL from older
minor versions, but not vice versa. I'm not sure that is
back-patchable, so in the attached patches the end-of-xact WAL
feature is separated out for PG9.5-PG11.
(000x-Add-end-of-xact-WAL-feature-of-WAL-skipping.patch)

====

In the patchset for 12, I left the functions heap_sync,
heapam_methods.finish_bulk_insert, and table_finish_bulk_insert
as-is. As a result, heapam_finish_bulk_insert becomes a no-op.
begin_heap_rewrite is a public function, but its last parameter is
useless and rather harmful, since it looks as if it works. So I
removed the parameter.

For 11 and 10, heap_sync and begin_heap_rewrite are treated the same
way as in 12.

For 9.6, mdexists() creates the specified file in bootstrap mode,
which leads to an assertion failure in smgrDoPendingSyncs. So I made
CreateStorage not register a pending sync in bootstrap mode.
gistbuild generates the LSN for the root page of a newly created
index using gistGetFakeLSN(heap), which fires an assertion failure in
gistGetFakeLSN. I think we should use index instead of heap there,
but it doesn't matter if we don't have the new pending-sync
mechanism, so I didn't split it out as a separate patch.
pg_visibility doesn't have a regression test, but I added files
containing only the test for this feature.

For 9.5, pg_visibility does not exist, so I dropped the test for the
module. That branch lacks part of the TAP infrastructure we have
nowadays, but I want to have the test (and it actually found a bug I
made during this work). So I added a patch to back-patch TestLib.pm,
PostgresNode.pm and RecursiveCopy.pm along with 018_wal_optimize.pl.
(0004-Add-TAP-test-for-WAL-skipping-feature.patch)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
wal_skip_optimize_patchset_20200304.tar.gz application/octet-stream 226.1 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-04 07:44:25
Message-ID: 20200304.164425.1403366106926961143.horikyota.ntt@gmail.com
Lists: pgsql-hackers

Some fixes..

At Wed, 04 Mar 2020 16:29:19 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> At first I fixed several ssues in 018_wal_optimize.pl:
>
> - TRUNCATE INSERT, TRUNCATE INSERT PREPARE
>
> It wrongly passes if finally we see the value only from the first
> INSERT. I changed it so that it checks the value, not the number of
> values.

Now it checks both the number of values and the largest value.

...
> log_newpage_range has been introduced at PG12. Fortunately the
> required infrastructure is introduced at PG9.5 so what I need to do
> for PG95-PG11 is back-patching the function and its counter part in
- xlog_redo. It doesn't WAL format itself but XLOG_FPI gets to have 2 or
+ xlog_redo. It doesn't change WAL format itself but XLOG_FPI gets to have 2 or
> more backup pages so the compatibility is forward only. That is, newer
> minor versions read WAL from older minor versions, but not vise
> versea. I'm not sure it is back-patchable so in the attached the
> end-of-xact WAL feature is separated for PG9.5-PG11.
> (000x-Add-end-of-xact-WAL-feature-of-WAL-skipping.patch)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-16 03:46:47
Message-ID: 20200316034647.GA1121601@rfd.leadboat.com
Lists: pgsql-hackers

On Wed, Mar 04, 2020 at 04:29:19PM +0900, Kyotaro Horiguchi wrote:
> The attached is back-patches from 9.5 through master.

Thanks. I've made some edits. I'll plan to push the attached patches on
Friday or Saturday.

> log_newpage_range has been introduced at PG12. Fortunately the
> required infrastructure is introduced at PG9.5 so what I need to do
> for PG95-PG11 is back-patching the function and its counter part in
> xlog_redo. It doen't WAL format itself but XLOG_FPI gets to have 2 or
> more backup pages so the compatibility is forward only. That is, newer
> minor versions read WAL from older minor versions, but not vise
> versea. I'm not sure it is back-patchable so in the attached the
> end-of-xact WAL feature is separated for PG9.5-PG11.
> (000x-Add-end-of-xact-WAL-feature-of-WAL-skipping.patch)

The main patch's introduction of XLOG_GIST_ASSIGN_LSN already creates a WAL
upgrade hazard. Changing XLOG_FPI is riskier, because an old server will
apply the first FPI and ignore the rest. For v11 and earlier, I decided to
introduce XLOG_FPI_MULTI. It behaves exactly like XLOG_FPI, but this PANICs
if one reads post-update WAL with a pre-update server. The main alternative
would be to issue one XLOG_FPI per page, but I was concerned that would cause
a notable performance loss.

> In the patchset for 12, I let the functions heap_sync,
> heapam_methods.finish_bulk_insert and table_finish_bulk_insert left
> as-is. As the result heapam_finish_bulk_insert becomes no-op.

heapam_finish_bulk_insert() is a static function, so I deleted it.

> begin_heap_rewrite is a public function but the last parameter is
> useless and rather harmful as it looks as if it works. So I removed
> the parameter.

Agreed. Also, pgxn contains no references to begin_heap_rewrite().

> For 9.6, mdexists() creates the specified file while bootstrap mode
> and that leads to assertion failure of smgrDoPendingSyncs. So I made
> CreateStorage not register pending sync while bootstrap mode.
> gistbuild generates the LSN for root page of a newly created index
> using gistGetFakeLSN(heap), which fires assertion failure in
> gistGetFakeLSN. I think we should use index instead of heap there,
> but it doesn't matter if we don't have the new pending sync mechanism,
> so I didn't split it as a separate patch.

v11 and v10, too, had the gistGetFakeLSN(heap) problem. I saw that and other
problems by running the following on each branch:

make check-world
printf '%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 'max_wal_senders = 0' >/tmp/minimal.conf
make check-world TEMP_CONFIG=/tmp/minimal.conf
make -C doc # catch breakage when XML changes don't work in SGML

> For 9.5, pg_visibility does not exist so I dropped the test for the
> module.

The test would have required further changes to work in v11 or earlier, so I
deleted the test. It was a low-importance test.

> It lacks a part of TAP infrastructure nowadays we have, but I
> want to have the test (and it actually found a bug I made during this
> work). So I added a patch to back-patch TestLib.pm, PostgresNode.pm
> and RecursiveCopy.pm along with 018_wal_optimize.pl.
> (0004-Add-TAP-test-for-WAL-skipping-feature.patch)

That is a good idea. Rather than make it specific to this test, I would like
to back-patch all applicable test files from 9.6 src/test/recovery. I'll plan
to push that one part on Thursday.

Other notable changes:

- Like you suggested earlier, I moved restoration of old*Subid from
index_create() back to ATExecAddIndex(). I decided to do this when I
observed that pg_idx_advisor calls index_create(). That's not a strong
reason, but it was enough to change a decision that had been arbitrary.

- Updated the wal_skip_threshold GUC category to WAL_SETTINGS, for consistency
with the documentation move. Added the GUC to postgresql.conf.sample.

- In released branches, I moved the new public struct fields to the end. This
reduces the number of extensions requiring a recompile. From a grep of
pgxn, one extension ("citus") relies on sizeof(RelationData), and nothing
relies on sizeof(IndexStmt).

- In 9.6, I rewrote the mdimmedsync() changes so the function never ignores
FileSync() failure.

Other observations:

- The new test file takes ~62s longer on 9.6 and 9.5, mostly due to commit
c61559ec3 first appearing in v10. I am fine with this.

- This is the most-demanding back-branch fix I've ever attempted. Hopefully
I've been smarter than usual while reviewing it, but that is unlikely.

Thanks,
nm

Attachment Content-Type Size
skip-wal-v38nm.tar.gz application/x-tar-gz 216.4 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-21 19:01:27
Message-ID: 20200321190127.GA1763544@rfd.leadboat.com
Lists: pgsql-hackers

On Sun, Mar 15, 2020 at 08:46:47PM -0700, Noah Misch wrote:
> On Wed, Mar 04, 2020 at 04:29:19PM +0900, Kyotaro Horiguchi wrote:
> > The attached is back-patches from 9.5 through master.
>
> Thanks. I've made some edits. I'll plan to push the attached patches on
> Friday or Saturday.

Pushed, after adding a missing "break" to gist_identify() and tweaking two
more comments. However, a diverse minority of buildfarm members are failing
like this, in most branches:

Mar 21 13:16:37 # Failed test 'wal_level = minimal, SET TABLESPACE, hint bit'
Mar 21 13:16:37 # at t/018_wal_optimize.pl line 231.
Mar 21 13:16:37 # got: '1'
Mar 21 13:16:37 # expected: '2'
Mar 21 13:16:46 # Looks like you failed 1 test of 34.
Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl ................
-- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05

Since I run two of the failing animals, I expect to reproduce this soon.

fairywren failed differently on 9.5; I have not yet studied it:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10

> > It lacks a part of TAP infrastructure nowadays we have, but I
> > want to have the test (and it actually found a bug I made during this
> > work). So I added a patch to back-patch TestLib.pm, PostgresNode.pm
> > and RecursiveCopy.pm along with 018_wal_optimize.pl.
> > (0004-Add-TAP-test-for-WAL-skipping-feature.patch)
>
> That is a good idea. Rather than make it specific to this test, I would like
> to back-patch all applicable test files from 9.6 src/test/recovery. I'll plan
> to push that one part on Thursday.

That push did not cause failures.


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-21 19:04:50
Message-ID: 20200321190450.GL10066@momjian.us
Lists: pgsql-hackers


Wow, this thread started in 2015. :-O

Date: Fri, 3 Jul 2015 00:05:24 +0200

---------------------------------------------------------------------------

On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> On Sun, Mar 15, 2020 at 08:46:47PM -0700, Noah Misch wrote:
> > On Wed, Mar 04, 2020 at 04:29:19PM +0900, Kyotaro Horiguchi wrote:
> > > The attached is back-patches from 9.5 through master.
> >
> > Thanks. I've made some edits. I'll plan to push the attached patches on
> > Friday or Saturday.
>
> Pushed, after adding a missing "break" to gist_identify() and tweaking two
> more comments. However, a diverse minority of buildfarm members are failing
> like this, in most branches:
>
> Mar 21 13:16:37 # Failed test 'wal_level = minimal, SET TABLESPACE, hint bit'
> Mar 21 13:16:37 # at t/018_wal_optimize.pl line 231.
> Mar 21 13:16:37 # got: '1'
> Mar 21 13:16:37 # expected: '2'
> Mar 21 13:16:46 # Looks like you failed 1 test of 34.
> Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl ................
> -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05
>
> Since I run two of the failing animals, I expect to reproduce this soon.
>
> fairywren failed differently on 9.5; I have not yet studied it:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10
>
> > > It lacks a part of TAP infrastructure nowadays we have, but I
> > > want to have the test (and it actually found a bug I made during this
> > > work). So I added a patch to back-patch TestLib.pm, PostgresNode.pm
> > > and RecursiveCopy.pm along with 018_wal_optimize.pl.
> > > (0004-Add-TAP-test-for-WAL-skipping-feature.patch)
> >
> > That is a good idea. Rather than make it specific to this test, I would like
> > to back-patch all applicable test files from 9.6 src/test/recovery. I'll plan
> > to push that one part on Thursday.
>
> That push did not cause failures.
>
>

--
Bruce Momjian <bruce(at)momjian(dot)us> https://momjian.us
EnterpriseDB https://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-21 22:49:20
Message-ID: 20200321224920.GB1763544@rfd.leadboat.com
Lists: pgsql-hackers

On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> Pushed, after adding a missing "break" to gist_identify() and tweaking two
> more comments. However, a diverse minority of buildfarm members are failing
> like this, in most branches:
>
> Mar 21 13:16:37 # Failed test 'wal_level = minimal, SET TABLESPACE, hint bit'
> Mar 21 13:16:37 # at t/018_wal_optimize.pl line 231.
> Mar 21 13:16:37 # got: '1'
> Mar 21 13:16:37 # expected: '2'
> Mar 21 13:16:46 # Looks like you failed 1 test of 34.
> Mar 21 13:16:46 [13:16:46] t/018_wal_optimize.pl ................
> -- https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2020-03-21%2016%3A52%3A05
>
> Since I run two of the failing animals, I expect to reproduce this soon.

force_parallel_mode = regress was the setting needed to reproduce this:

printf '%s\n%s\n%s\n' 'log_statement = all' 'force_parallel_mode = regress' >/tmp/force_parallel.conf
make -C src/test/recovery check PROVE_TESTS=t/018_wal_optimize.pl TEMP_CONFIG=/tmp/force_parallel.conf

The proximate cause is the RelFileNodeSkippingWAL() call that we added to
MarkBufferDirtyHint(). MarkBufferDirtyHint() runs in parallel workers, but
parallel workers have zeroes for pendingSyncHash and rd_*Subid. I hacked up
the attached patch to understand the scope of the problem (not to commit). It
logs a message whenever a parallel worker uses pendingSyncHash or
RelationNeedsWAL(). Some of the cases happen often enough to make logs huge,
so the patch suppresses logging for them. You can see the lower-volume calls
like this:

printf '%s\n%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 'max_wal_senders = 0' 'force_parallel_mode = regress' >/tmp/minimal_parallel.conf
make check-world TEMP_CONFIG=/tmp/minimal_parallel.conf
find . -name log | xargs grep -rl 'nm0 invalid'

Not all are actual bugs. For example, get_relation_info() behaves fine:

/* Temporary and unlogged relations are inaccessible during recovery. */
if (!RelationNeedsWAL(relation) && RecoveryInProgress())

Kyotaro, can you look through the affected code and propose a strategy for
good coexistence of parallel query with the WAL skipping mechanism?

Since I don't expect one strategy to win clearly and quickly, I plan to revert
the main patch around 2020-03-22 17:30 UTC. That will give the patch about
twenty-four hours in the buildfarm, so more animals can report in. I will
leave the three smaller patches in place.

> fairywren failed differently on 9.5; I have not yet studied it:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10

This did not remain specific to 9.5. On platforms where SIZEOF_SIZE_T==4 or
SIZEOF_LONG==4, wal_skip_threshold cannot exceed 2GB. A simple s/1TB/1GB/ in
the test should fix this.

Attachment Content-Type Size
debug-parallel-skip-wal-v0.patch text/plain 5.1 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-23 08:20:27
Message-ID: 20200323.172027.2270553329883636814.horikyota.ntt@gmail.com
Lists: pgsql-hackers

Thanks for the labour on this.

At Sat, 21 Mar 2020 15:49:20 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Sat, Mar 21, 2020 at 12:01:27PM -0700, Noah Misch wrote:
> > Pushed, after adding a missing "break" to gist_identify() and tweaking two
..
> The proximate cause is the RelFileNodeSkippingWAL() call that we added to
> MarkBufferDirtyHint(). MarkBufferDirtyHint() runs in parallel workers, but
> parallel workers have zeroes for pendingSyncHash and rd_*Subid. I hacked up
> the attached patch to understand the scope of the problem (not to commit). It
> logs a message whenever a parallel worker uses pendingSyncHash or
> RelationNeedsWAL(). Some of the cases happen often enough to make logs huge,
> so the patch suppresses logging for them. You can see the lower-volume calls
> like this:
>
> printf '%s\n%s\n%s\n%s\n' 'log_statement = all' 'wal_level = minimal' 'max_wal_senders = 0' 'force_parallel_mode = regress' >/tmp/minimal_parallel.conf
> make check-world TEMP_CONFIG=/tmp/minimal_parallel.conf
> find . -name log | xargs grep -rl 'nm0 invalid'
>
> Not all are actual bugs. For example, get_relation_info() behaves fine:
>
> /* Temporary and unlogged relations are inaccessible during recovery. */
> if (!RelationNeedsWAL(relation) && RecoveryInProgress())

But the relcache entry shows wrong information about the newness of
its storage, and that is the root cause of all the other problems.

> Kyotaro, can you look through the affected code and propose a strategy for
> good coexistence of parallel query with the WAL skipping mechanism?

Bi-directional communication between leader and workers is too much.
It wouldn't be acceptable to inhibit the problematic operations in
workers, such as heap pruning or btree pin removal. And doing the
pending syncs just before worker start wouldn't fix the issue.

The attached patch passes a list of pending-sync relfilenodes at
worker start. Workers create an (immature) pending-sync hash from
the list and create relcache entries using the hash. Given that
parallel workers perform neither transactional operations nor DDL,
workers need only the list of relfilenodes. The list might be long,
but I don't think it is realistic that so many tables are truncated
or created and then scanned in parallel within a transaction while
wal_level = minimal.

> Since I don't expect one strategy to win clearly and quickly, I plan to revert
> the main patch around 2020-03-22 17:30 UTC. That will give the patch about
> twenty-four hours in the buildfarm, so more animals can report in. I will
> leave the three smaller patches in place.

Thank you for taking the trouble, and for the check code. Sorry for
not responding in time.

> > fairywren failed differently on 9.5; I have not yet studied it:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2020-03-21%2018%3A01%3A10
>
> This did not remain specific to 9.5. On platforms where SIZEOF_SIZE_T==4 or
> SIZEOF_LONG==4, wal_skip_threshold cannot exceed 2GB. A simple s/1TB/1GB/ in
> the test should fix this.

Oops. I felt that 2TB looked too large but didn't take it seriously.
1GB is 1048576 kB, which is less than the said limit of 2097151, so
the attached second patch does that.

The attached is a proposed fix for the issue on top of the reverted
commit.

- v36-0001-Skip-WAL-for-new-relfilenodes-under-wal_level-mi.patch
The reverted patch.

- v36-0002-Fix-GUC-value-in-TAP-test.patch
Change wal_skip_threshold from 2TB to 2GB in the TAP test.

- v36-0003-Fix-the-name-of-struct-pendingSyncs.patch
I found that the pending-sync hash entry struct is named differently
from the pending-delete hash entry. Changed it so that the two follow
a similar naming convention.

- v36-0004-Propagage-pending-sync-information-to-parallel-w.patch
The proposed fix for the parallel-worker problem.

The make check-world above didn't fail with this patch.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachment Content-Type Size
v36-0001-Skip-WAL-for-new-relfilenodes-under-wal_level-mi.patch text/x-patch 116.4 KB
v36-0002-Fix-GUC-value-in-TAP-test.patch text/x-patch 952 bytes
v36-0003-Fix-the-name-of-struct-pendingSyncs.patch text/x-patch 2.4 KB
v36-0004-Propagage-pending-sync-information-to-parallel-w.patch text/x-patch 9.6 KB

From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-30 04:41:01
Message-ID: 20200330044101.GA2324620@rfd.leadboat.com
Lists: pgsql-hackers

I think attached v41nm is ready for commit. Would anyone like to vote against
back-patching this? It's hard to justify lack of back-patch for a data-loss
bug, but this is atypically invasive. (I'm repeating the question, since some
folks missed my 2020-02-18 question.) Otherwise, I'll push this on Saturday.

On Mon, Mar 23, 2020 at 05:20:27PM +0900, Kyotaro Horiguchi wrote:
> At Sat, 21 Mar 2020 15:49:20 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > The proximate cause is the RelFileNodeSkippingWAL() call that we added to
> > MarkBufferDirtyHint(). MarkBufferDirtyHint() runs in parallel workers, but
> > parallel workers have zeroes for pendingSyncHash and rd_*Subid.

> > Kyotaro, can you look through the affected code and propose a strategy for
> > good coexistence of parallel query with the WAL skipping mechanism?
>
> Bi-directional communication between leader and workers is too-much.
> It wouldn't be acceptable to inhibit the problematic operations on
> workers such like heap-prune or btree pin removal. If we do pending
> syncs just before worker start, it won't fix the issue.
>
> The attached patch passes a list of pending-sync relfilenodes at
> worker start.

If you were to issue pending syncs and also cease skipping WAL for affected
relations, that would fix the issue. Your design is better, though. I made
two notable changes:

- The patch was issuing syncs or FPIs every time a parallel worker exited. I
changed it to skip most of smgrDoPendingSyncs() in parallel workers, like
AtEOXact_RelationMap() does.

- PARALLEL_KEY_PENDING_SYNCS is most similar to PARALLEL_KEY_REINDEX_STATE and
PARALLEL_KEY_COMBO_CID. parallel.c, not execParallel.c, owns those. I
moved PARALLEL_KEY_PENDING_SYNCS to parallel.c, which also called for style
changes in the associated storage.c functions.

Since pendingSyncHash is always NULL under XLogIsNeeded(), I also removed some
XLogIsNeeded() tests that immediately preceded !pendingSyncHash tests.

Attachment Content-Type Size
skip-wal-v41nm.tar.gz application/x-tar-gz 192.3 KB

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-30 05:56:11
Message-ID: 20200330.145611.1603373373605263450.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Sun, 29 Mar 2020 21:41:01 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in
> I think attached v41nm is ready for commit. Would anyone like to vote against
> back-patching this? It's hard to justify lack of back-patch for a data-loss
> bug, but this is atypically invasive. (I'm repeating the question, since some
> folks missed my 2020-02-18 question.) Otherwise, I'll push this on Saturday.
>
> On Mon, Mar 23, 2020 at 05:20:27PM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 21 Mar 2020 15:49:20 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > > The proximate cause is the RelFileNodeSkippingWAL() call that we added to
> > > MarkBufferDirtyHint(). MarkBufferDirtyHint() runs in parallel workers, but
> > > parallel workers have zeroes for pendingSyncHash and rd_*Subid.
>
> > > Kyotaro, can you look through the affected code and propose a strategy for
> > > good coexistence of parallel query with the WAL skipping mechanism?
> >
> > Bi-directional communication between leader and workers is too-much.
> > It wouldn't be acceptable to inhibit the problematic operations on
> > workers such like heap-prune or btree pin removal. If we do pending
> > syncs just before worker start, it won't fix the issue.
> >
> > The attached patch passes a list of pending-sync relfilenodes at
> > worker start.
>
> If you were to issue pending syncs and also cease skipping WAL for affected
> relations, that would fix the issue. Your design is better, though. I made
> two notable changes:
>
> - The patch was issuing syncs or FPIs every time a parallel worker exited. I
> changed it to skip most of smgrDoPendingSyncs() in parallel workers, like
> AtEOXact_RelationMap() does.

Exactly. Thank you for fixing it.

> - PARALLEL_KEY_PENDING_SYNCS is most similar to PARALLEL_KEY_REINDEX_STATE and
> PARALLEL_KEY_COMBO_CID. parallel.c, not execParallel.c, owns those. I
> moved PARALLEL_KEY_PENDING_SYNCS to parallel.c, which also called for style
> changes in the associated storage.c functions.

That sounds better.

Moving the responsibility for creating the pending-syncs array
reduces copying. RestorePendingSyncs() (and AddPendingSync()) look
better.

> Since pendingSyncHash is always NULL under XLogIsNeeded(), I also removed some
> XLogIsNeeded() tests that immediately preceded !pendingSyncHash tests.

Sounds reasonable. In AddPendingSync, shouldn't we put
Assert(!XLogIsNeeded()) instead of "Assert(pendingSyncHash == NULL)"?
The former guarantees the relationship between XLogIsNeeded() and
pendingSyncHash, and the existing latter assertion looks redundant,
as it is placed just after "if (pendingSyncHash)".

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-30 06:08:27
Message-ID: 20200330060827.GD2324620@rfd.leadboat.com
Lists: pgsql-hackers

On Mon, Mar 30, 2020 at 02:56:11PM +0900, Kyotaro Horiguchi wrote:
> At Sun, 29 Mar 2020 21:41:01 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > Since pendingSyncHash is always NULL under XLogIsNeeded(), I also removed some
> > XLogIsNeeded() tests that immediately preceded !pendingSyncHash tests.
>
> Sounds reasonable. In AddPendingSync, don't we put
> Assert(!XLogIsNeeded()) instead of "Assert(pendingSyncHash == NULL)"?
> The former guarantees the relationship between XLogIsNeeded() and
> pendingSyncHash, and the existing latter assertion looks redundant as
> it is placed just after "if (pendingSyncHash)".

The "Assert(pendingSyncHash == NULL)" is indeed useless; I will remove it. I
am not inclined to replace it with Assert(!XLogIsNeeded()). This static
function is not likely to get more callers, so the chance of accidentally
calling it under XLogIsNeeded() is too low.


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, michael(at)paquier(dot)xyz
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-30 06:22:32
Message-ID: 20200330.152232.1538775970021459822.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Sun, 29 Mar 2020 23:08:27 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Mon, Mar 30, 2020 at 02:56:11PM +0900, Kyotaro Horiguchi wrote:
> > At Sun, 29 Mar 2020 21:41:01 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > > Since pendingSyncHash is always NULL under XLogIsNeeded(), I also removed some
> > > XLogIsNeeded() tests that immediately preceded !pendingSyncHash tests.
> >
> > Sounds reasonable. In AddPendingSync, don't we put
> > Assert(!XLogIsNeeded()) instead of "Assert(pendingSyncHash == NULL)"?
> > The former guarantees the relationship between XLogIsNeeded() and
> > pendingSyncHash, and the existing latter assertion looks redundant as
> > it is placed just after "if (pendingSyncHash)".
>
> The "Assert(pendingSyncHash == NULL)" is indeed useless; I will remove it. I
> am not inclined to replace it with Assert(!XLogIsNeeded()). This static
> function is not likely to get more callers, so the chance of accidentally
> calling it under XLogIsNeeded() is too low.

Agreed.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-30 07:43:00
Message-ID: 20200330074300.GD43995@paquier.xyz
Lists: pgsql-hackers

On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> I think attached v41nm is ready for commit. Would anyone like to vote against
> back-patching this? It's hard to justify lack of back-patch for a data-loss
> bug, but this is atypically invasive. (I'm repeating the question, since some
> folks missed my 2020-02-18 question.) Otherwise, I'll push this on Saturday.

The invasiveness of the patch is a concern. Have you considered a
different strategy? For example, we are soon going to be in beta for
13, so you could consider committing the patch only on HEAD first.
If there are issues to take care of, you can then leverage the beta
testing to address any issues found. Finally, once some dust has
settled on the concept and we have gained enough confidence, we could
consider a back-patch. In short, my point is just that even if this
stuff has been discussed for years, I see no urgency in back-patching,
given the lack of complaints we have in -bugs or such.
--
Michael


From: Noah Misch <noah(at)leadboat(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-31 06:28:54
Message-ID: 20200331062854.GC3354091@rfd.leadboat.com
Lists: pgsql-hackers

On Mon, Mar 30, 2020 at 04:43:00PM +0900, Michael Paquier wrote:
> On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> > I think attached v41nm is ready for commit. Would anyone like to vote against
> > back-patching this? It's hard to justify lack of back-patch for a data-loss
> > bug, but this is atypically invasive. (I'm repeating the question, since some
> > folks missed my 2020-02-18 question.) Otherwise, I'll push this on Saturday.
>
> The invasiveness of the patch is a concern. Have you considered a
> different strategy? For example, we are soon going to be in beta for
> 13, so you could consider committing the patch only on HEAD first.
> If there are issues to take care of, you can then leverage the beta
> testing to address any issues found. Finally, once some dust has
> settled on the concept and we have gained enough confidence, we could
> consider a back-patch.

No. Does anyone favor this proposal more than back-patching normally?


From: Andres Freund <andres(at)anarazel(dot)de>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-03-31 06:37:57
Message-ID: 20200331063757.rvix7i3ttdpkod5a@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2020-03-30 23:28:54 -0700, Noah Misch wrote:
> On Mon, Mar 30, 2020 at 04:43:00PM +0900, Michael Paquier wrote:
> > On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> > > I think attached v41nm is ready for commit. Would anyone like to vote against
> > > back-patching this? It's hard to justify lack of back-patch for a data-loss
> > > bug, but this is atypically invasive. (I'm repeating the question, since some
> > > folks missed my 2020-02-18 question.) Otherwise, I'll push this on Saturday.
> >
> > The invasiveness of the patch is a concern. Have you considered a
> > different strategy? For example, we are soon going to be in beta for
> > 13, so you could consider committing the patch only on HEAD first.
> > If there are issues to take care of, you can then leverage the beta
> > testing to address any issues found. Finally, once some dust has
> > settled on the concept and we have gained enough confidence, we could
> > consider a back-patch.
>
> No. Does anyone favor this proposal more than back-patching normally?

I have not reviewed the patch, so I don't have a good feeling for its
riskiness. But it does sound fairly invasive. Given that we've lived
with this issue for many years by now, and that the rate of incidents
seems to have been fairly low, I think living with the issue for a bit
longer to gain confidence might be a good choice. But I'd not push back
if you, being much more informed, think the risk/reward balance favors
immediate backpatching.

Greetings,

Andres Freund


From: Noah Misch <noah(at)leadboat(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-04-02 03:51:29
Message-ID: 20200402035129.GA3376861@rfd.leadboat.com
Lists: pgsql-hackers

On Mon, Mar 30, 2020 at 11:37:57PM -0700, Andres Freund wrote:
> On 2020-03-30 23:28:54 -0700, Noah Misch wrote:
> > On Mon, Mar 30, 2020 at 04:43:00PM +0900, Michael Paquier wrote:
> > > On Sun, Mar 29, 2020 at 09:41:01PM -0700, Noah Misch wrote:
> > > > I think attached v41nm is ready for commit. Would anyone like to vote against
> > > > back-patching this? It's hard to justify lack of back-patch for a data-loss
> > > > bug, but this is atypically invasive. (I'm repeating the question, since some
> > > > folks missed my 2020-02-18 question.) Otherwise, I'll push this on Saturday.
> > >
> > > The invasiveness of the patch is a concern. Have you considered a
> > > different strategy? For example, we are soon going to be in beta for
> > > 13, so you could consider committing the patch only on HEAD first.
> > > If there are issues to take care of, you can then leverage the beta
> > > testing to address any issues found. Finally, once some dust has
> > > settled on the concept and we have gained enough confidence, we could
> > > consider a back-patch.
> >
> > No. Does anyone favor this proposal more than back-patching normally?
>
> I have not reviewed the patch, so I don't have a good feeling for its
> riskiness. But it does sound fairly invasive. Given that we've lived
> with this issue for many years by now, and that the rate of incidents
> seems to have been fairly low, I think living with the issue for a bit
> longer to gain confidence might be a good choice. But I'd not push back
> if you, being much more informed, think the risk/reward balance favors
> immediate backpatching.

I've translated the non-vote comments into estimated votes of -0.3, -0.6,
-0.4, +0.5, and -0.3. Hence, I revoke the plan to back-patch.
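[Editor's note: the tally above can be sketched as a simple sum of the estimated fractional votes. This is an illustrative reconstruction only; the mapping of comments to numbers and the "negative total means no back-patch" rule are assumptions, not a method stated in the thread.]

```python
# Hypothetical reconstruction of the vote tally described above.
# Each non-vote comment was translated into an estimated fractional vote;
# a negative total argues against back-patching.
votes = [-0.3, -0.6, -0.4, +0.5, -0.3]

total = round(sum(votes), 1)  # round to avoid float noise
decision = "back-patch" if total > 0 else "do not back-patch"
print(total, decision)  # total is -1.1, so the back-patch plan is revoked
```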


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-04-02 15:24:36
Message-ID: CA+TgmoZHsEN9c=mphT_7jg3AANDzWaz4AKD2kMkRRpcNtaWebg@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 1, 2020 at 11:51 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> I've translated the non-vote comments into estimated votes of -0.3, -0.6,
> -0.4, +0.5, and -0.3. Hence, I revoke the plan to back-patch.

FWIW, I also think that it would be better not to back-patch. The risk
of back-patching is that this will break things, whereas the risk of
not back-patching is that we will harm people who are affected by this
bug for a longer period of time than would otherwise be the case.
Because this patch is complex, the risk of breaking things seems
higher than normal. On the other hand, the number of users adversely
affected by the bug appears to be relatively low. Taken together,
these factors persuade me that we should not back-patch at this time.

It is possible that in the future things may look different. In the
happy event that this patch causes no more problems following commit,
while at the same time we have more complaints about the underlying
problem, we can make a decision to back-patch at a later time. This
brings me to another point: because this patch changes the WAL format,
a straight revert will be impossible once a release has occurred.
Therefore, if we hold off on back-patching for now and later decide
that we erred, we can proceed at that time and it will probably not be
much harder than it would be to do it now. On the other hand, if we
decide to back-patch now and later decide that we have erred, we will
have additional engineering work to do to cater to people who have
already installed the version containing the back-patched fix and now
need to upgrade again. Perhaps the WAL format changes are simple
enough that this isn't likely to be a huge issue even if it happens,
but it does seem better to avoid the chance that it might. A further
factor is that releases which break WAL compatibility are undesirable,
and should only be undertaken when necessary.

Last but not least, I would like to join with others in expressing my
thanks to you for your hard work on this problem. While the process of
developing a fix has not been without bumps, few people would have had
the time, patience, diligence, and skill to take this effort as far as
you have. Kyotaro Horiguchi and others likewise deserve credit for all
of the many hours that they have put into this work. The entire
PostgreSQL community owes all of you a debt of gratitude, and you have
my thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Noah Misch <noah(at)leadboat(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-04-04 22:24:34
Message-ID: 26351.1586039074@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Wed, Apr 1, 2020 at 11:51 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
>> I've translated the non-vote comments into estimated votes of -0.3, -0.6,
>> -0.4, +0.5, and -0.3. Hence, I revoke the plan to back-patch.

> FWIW, I also think that it would be better not to back-patch.

FWIW, I also concur with not back-patching; the risk/reward ratio
does not look favorable. Maybe later.

> Last but not least, I would like to join with others in expressing my
> thanks to you for your hard work on this problem.

+1 on that, too.

Shouldn't the CF entry get closed?

regards, tom lane


From: Noah Misch <noah(at)leadboat(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, Andrew Dunstan <andrew(dot)dunstan(at)2ndquadrant(dot)com>, hlinnaka <hlinnaka(at)iki(dot)fi>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-04-04 22:32:12
Message-ID: 20200404223212.GC3442685@rfd.leadboat.com
Lists: pgsql-hackers

On Sat, Apr 04, 2020 at 06:24:34PM -0400, Tom Lane wrote:
> Shouldn't the CF entry get closed?

Once the buildfarm is clean for a day, sure. The buildfarm has already
revealed a missing perl2host call.


From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: noah(at)leadboat(dot)com
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, michael(at)paquier(dot)xyz, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, andres(at)anarazel(dot)de
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-04-06 00:46:31
Message-ID: 20200406.094631.402837389573618941.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Sat, 4 Apr 2020 15:32:12 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in
> On Sat, Apr 04, 2020 at 06:24:34PM -0400, Tom Lane wrote:
> > Shouldn't the CF entry get closed?
>
> Once the buildfarm is clean for a day, sure. The buildfarm has already
> revealed a missing perl2host call.

Thank you for (re-) committing this and the following fix. I hope this
doesn't bring in another failure.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center


From: Noah Misch <noah(at)leadboat(dot)com>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: tgl(at)sss(dot)pgh(dot)pa(dot)us, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org, michael(at)paquier(dot)xyz, 9erthalion6(at)gmail(dot)com, andrew(dot)dunstan(at)2ndquadrant(dot)com, hlinnaka(at)iki(dot)fi, andres(at)anarazel(dot)de
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Date: 2020-04-06 07:04:25
Message-ID: 20200406070425.GA162712@rfd.leadboat.com
Lists: pgsql-hackers

On Mon, Apr 06, 2020 at 09:46:31AM +0900, Kyotaro Horiguchi wrote:
> At Sat, 4 Apr 2020 15:32:12 -0700, Noah Misch <noah(at)leadboat(dot)com> wrote in
> > On Sat, Apr 04, 2020 at 06:24:34PM -0400, Tom Lane wrote:
> > > Shouldn't the CF entry get closed?
> >
> > Once the buildfarm is clean for a day, sure. The buildfarm has already
> > revealed a missing perl2host call.
>
> Thank you for (re-) committing this and the following fix. I hope this
> doesn't bring in another failure.

I have closed the CF entry.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=jacana&dt=2020-04-05%2000%3A00%3A27
happened, but I doubt it is related. A wait_for_catchup that usually takes
<1s instead timed out after 397s. I can't reproduce it. In the past, another
animal on the same machine had the same failure:

sysname │ snapshot │ branch │ bfurl
───────────┼─────────────────────┼────────┼───────────────────────────────────────────────────────────────────────────────────────────────
bowerbird │ 2019-11-17 15:22:42 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2019-11-17%2015%3A22%3A42
bowerbird │ 2020-01-10 17:30:49 │ HEAD │ http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=bowerbird&dt=2020-01-10%2017%3A30%3A49