Assertion failure twophase.c (2) (testing HS/SR)

Lists: pgsql-hackers
From: "Erik Rijkers" <er(at)xs4all(dot)nl>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Assertion failure twophase.c (2) (testing HS/SR)
Date: 2010-03-04 16:00:27
Message-ID: 291ed9f985bd93bccc3644e71c569797.squirrel@webmail.xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30

With three patches:

new_smart_shutdown_20100201.patch
extend_format_of_recovery_info_funcs_v4.20100303.patch
fix-KnownAssignedXidsRemoveMany-1.patch

pg_dump -d $db8.4.2 | psql -d $db9.0devel-primary

FailedAssertion, File: "twophase.c", Line: 1201.

The standby was restarted and seems to catch up OK again.

LOG: database system was interrupted; last known up at 2010-03-04 01:35:23 CET
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000001':
No such file or directory
FATAL: the database system is starting up
LOG: entering standby mode
LOG: redo starts at 0/1000020
LOG: consistent recovery state reached at 0/2000000
LOG: database system is ready to accept read only connections
ERROR: cannot execute CREATE TABLE in a read-only transaction
STATEMENT: create table t (c text);
ERROR: cannot execute SELECT INTO in a read-only transaction
STATEMENT: create table t as select 1;
ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT: truncate wal;
ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT: truncate wal;
ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT: truncate wal;
ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT: truncate wal;
ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT: truncate wal;
ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT: truncate wal;
ERROR: cannot execute TRUNCATE TABLE in a read-only transaction
STATEMENT: truncate wal;
FATAL: database "ms" does not exist
TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File: "twophase.c", Line: 1201)
LOG: startup process (PID 18107) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: database system was interrupted while in recovery at log time 2010-03-04 05:00:24 CET
HINT: If this has occurred more than once some data might be corrupted and you might need to
choose an earlier recovery target.
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000001C0000007F':
No such file or directory
LOG: entering standby mode
LOG: redo starts at 1C/7800F8E0
LOG: consistent recovery state reached at 1C/ADB9C178
LOG: database system is ready to accept read only connections

The ERRORs (and FATAL) were accidentally issued commands; I can't tell if they were causing the
assertion. (database 'ms' indeed was not present on this instance)

see also:
http://archives.postgresql.org/pgsql-hackers/2010-02/msg02221.php

thanks,

Erik Rijkers


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: pgsql-hackers(at)postgresql(dot)org, Simon Riggs <simon(at)2ndQuadrant(dot)com>
Subject: Re: Assertion failure twophase.c (2) (testing HS/SR)
Date: 2010-03-11 09:29:21
Message-ID: 4B98B7F1.6010707@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Erik Rijkers wrote:
> in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30
>
> With three patches:
>
> new_smart_shutdown_20100201.patch
> extend_format_of_recovery_info_funcs_v4.20100303.patch

Got a link to these two patches? I couldn't find them with a quick search.

> fix-KnownAssignedXidsRemoveMany-1.patch
>
> pg_dump -d $db8.4.2 | psql -d $db9.0devel-primary
>
> FailedAssertion, File: "twophase.c", Line: 1201.
>
> The standby was restarted and seems to catch up OK again.
> ...
> see also:
> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02221.php

I'm still not any wiser on what's causing that, but I've fixed the bug
in KnownAssignedXidsMany() now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org, Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: Re: Assertion failure twophase.c (2) (testing HS/SR)
Date: 2010-03-11 09:39:28
Message-ID: 3f0b79eb1003110139l6d5771b1k74f478661e8b89cd@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Mar 11, 2010 at 6:29 PM, Heikki Linnakangas
<heikki(dot)linnakangas(at)enterprisedb(dot)com> wrote:
> Erik Rijkers wrote:
>> in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30
>>
>> With three patches:
>>
>>   new_smart_shutdown_20100201.patch

http://archives.postgresql.org/pgsql-hackers/2010-01/msg03116.php

>>   extend_format_of_recovery_info_funcs_v4.20100303.patch

http://archives.postgresql.org/pgsql-hackers/2010-03/msg00175.php

> Got a link to these two patches? I couldn't find them with a quick search.

For your convenience, I attached those patches in this post.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachment Content-Type Size
new_smart_shutdown_20100201.patch text/x-patch 2.1 KB
extend_format_of_recovery_info_funcs_v4.patch text/x-patch 13.9 KB

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Erik Rijkers <er(at)xs4all(dot)nl>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Assertion failure twophase.c (2) (testing HS/SR)
Date: 2010-03-12 08:39:54
Message-ID: 1268383194.3825.2594.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, 2010-03-11 at 11:29 +0200, Heikki Linnakangas wrote:

> I'm still not any wiser on what's causing that, but I've fixed the bug
> in KnownAssignedXidsMany() now.

Yeh, I've been mulling this over for a few days now and I can't see a
way that could have happened either.

I agree with your fix and the stronger placement of the Assertion.
Thanks.

I will be doing some further investigation in that area as well, over
next week or so.

--
Simon Riggs www.2ndQuadrant.com


From: "Erik Rijkers" <er(at)xs4all(dot)nl>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Assertion failure twophase.c (3) (testing HS/SR)
Date: 2010-03-14 23:37:08
Message-ID: 2e714d30db1cb6a8c47b2768eac589b7.squirrel@webmail.xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, March 4, 2010 17:00, Erik Rijkers wrote:
> in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30
>
> With three patches:
>
> new_smart_shutdown_20100201.patch
> extend_format_of_recovery_info_funcs_v4.20100303.patch
> fix-KnownAssignedXidsRemoveMany-1.patch
>
> pg_dump -d $db8.4.2 | psql -d $db9.0devel-primary
>
> FailedAssertion, File: "twophase.c", Line: 1201.
>

For the record, this still happens (FailedAssertion, File: "twophase.c", Line: 1201.)
(created 2010.03.13 23:49 cvs).

Unfortunately, it does not happen always, or predictably.

patches:
new_smart_shutdown_20100201.patch
extend_format_of_recovery_info_funcs_v4.20100303.patch
(both here: http://archives.postgresql.org/pgsql-hackers/2010-03/msg00446.php )

(fix-KnownAssignedXidsRemoveMany-1.patch has been committed, I think?)

I use commandlines like this to copy schemas across from 8.4.2 to 9.0devel:
pg_dump -c -h /tmp -p 5432 -n myschema --no-owner --no-privileges mydb \
| psql -1qtA -h /tmp -p 7575 -d replicas

(the copied schemas were together 175 GB)

As I seem to be the only one who finds this, I started looking what could be unique in this
install: and it would be postbio, which we use for its gist-indexing on ranges
(http://pgfoundry.org/projects/postbio/). We use postbio's int_interval type as a column type.
But keep in mind that sometimes the whole dump+restore+replication completes OK.

Other installed modules are:
contrib/btree_gist
contrib/seg
contrib/adminpack

log_line_prefix = '%t %p %d %u start=%s ' # slave

pgsql.sr_hotslave/logfile:

2010-03-13 23:54:59 CET 15765 start=2010-03-13 23:54:59 CET LOG: database system was
interrupted; last known up at 2010-03-13 23:54:31 CET
cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000001':
No such file or directory
2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: entering standby mode
2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: redo starts at 0/1000020
2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: consistent recovery state
reached at 0/2000000
2010-03-13 23:55:00 CET 15763 start=2010-03-13 23:54:59 CET LOG: database system is ready to
accept read only connections
TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File: "twophase.c", Line: 1201)
2010-03-14 05:28:59 CET 15763 start=2010-03-13 23:54:59 CET LOG: startup process (PID 15765)
was terminated by signal 6: Aborted
2010-03-14 05:28:59 CET 15763 start=2010-03-13 23:54:59 CET LOG: terminating any other active
server processes

Maybe I'll try now to setup a similar instance without postbio, to see if the crash still occurs.

hth,

Erik Rijkers


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Assertion failure twophase.c (3) (testing HS/SR)
Date: 2010-04-22 07:53:50
Message-ID: 4BD0008E.1030509@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Can you still reproduce this or has some of the changes since then fixed
it? We never quite figured out the cause..

Erik Rijkers wrote:
> On Thu, March 4, 2010 17:00, Erik Rijkers wrote:
>> in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30
>>
>> With three patches:
>>
>> new_smart_shutdown_20100201.patch
>> extend_format_of_recovery_info_funcs_v4.20100303.patch
>> fix-KnownAssignedXidsRemoveMany-1.patch
>>
>> pg_dump -d $db8.4.2 | psql -d $db9.0devel-primary
>>
>> FailedAssertion, File: "twophase.c", Line: 1201.
>>
>
> For the record, this still happens (FailedAssertion, File: "twophase.c", Line: 1201.)
> (created 2010.03.13 23:49 cvs).
>
> Unfortunately, it does not happen always, or predictably.
>
> patches:
> new_smart_shutdown_20100201.patch
> extend_format_of_recovery_info_funcs_v4.20100303.patch
> (both here: http://archives.postgresql.org/pgsql-hackers/2010-03/msg00446.php )
>
> (fix-KnownAssignedXidsRemoveMany-1.patch has been committed, I think?)
>
>
> I use commandlines like this to copy schemas across from 8.4.2 to 9.0devel:
> pg_dump -c -h /tmp -p 5432 -n myschema --no-owner --no-privileges mydb \
> | psql -1qtA -h /tmp -p 7575 -d replicas
>
> (the copied schemas were together 175 GB)
>
> As I seem to be the only one who finds this, I started looking what could be unique in this
> install: and it would be postbio, which we use for its gist-indexing on ranges
> (http://pgfoundry.org/projects/postbio/). We use postbio's int_interval type as a column type.
> But keep in mind that sometimes the whole dump+restore+replication completes OK.
>
>
> Other installed modules are:
> contrib/btree_gist
> contrib/seg
> contrib/adminpack
>
> log_line_prefix = '%t %p %d %u start=%s ' # slave
>
> pgsql.sr_hotslave/logfile:
>
> 2010-03-13 23:54:59 CET 15765 start=2010-03-13 23:54:59 CET LOG: database system was
> interrupted; last known up at 2010-03-13 23:54:31 CET
> cp: cannot stat `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000001':
> No such file or directory
> 2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: entering standby mode
> 2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: redo starts at 0/1000020
> 2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: consistent recovery state
> reached at 0/2000000
> 2010-03-13 23:55:00 CET 15763 start=2010-03-13 23:54:59 CET LOG: database system is ready to
> accept read only connections
> TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File: "twophase.c", Line: 1201)
> 2010-03-14 05:28:59 CET 15763 start=2010-03-13 23:54:59 CET LOG: startup process (PID 15765)
> was terminated by signal 6: Aborted
> 2010-03-14 05:28:59 CET 15763 start=2010-03-13 23:54:59 CET LOG: terminating any other active
> server processes
>
>
> Maybe I'll try now to setup a similar instance without postbio, to see if the crash still occurs.
>
> hth,
>
> Erik Rijkers
>
>
>

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Erik Rijkers" <er(at)xs4all(dot)nl>
To: "Heikki Linnakangas" <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Assertion failure twophase.c (3) (testing HS/SR)
Date: 2010-04-23 01:08:50
Message-ID: bc7caff150beb82261ecd4da46d7517d.squirrel@webmail.xs4all.nl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, April 22, 2010 09:53, Heikki Linnakangas wrote:
> Can you still reproduce this or has some of the changes since then fixed
> it? We never quite figured out the cause..

I don't know for sure:
>> Unfortunately, it does not happen always, or predictably.

The only thing that I established after that email was sent,
is that the error can also occur without the postbio package
being been installed (this has happened once).

It's a very easy test; I will probably run it a few more times.

> Erik Rijkers wrote:
>> On Thu, March 4, 2010 17:00, Erik Rijkers wrote:
>>> in a 9.0devel, primary+standby, cvs from 2010.03.04 01:30
>>>
>>> With three patches:
>>>
>>> new_smart_shutdown_20100201.patch
>>> extend_format_of_recovery_info_funcs_v4.20100303.patch
>>> fix-KnownAssignedXidsRemoveMany-1.patch
>>>
>>> pg_dump -d $db8.4.2 | psql -d $db9.0devel-primary
>>>
>>> FailedAssertion, File: "twophase.c", Line: 1201.
>>>
>>
>> For the record, this still happens (FailedAssertion, File: "twophase.c", Line: 1201.)
>> (created 2010.03.13 23:49 cvs).
>>
>> Unfortunately, it does not happen always, or predictably.
>>
>> patches:
>> new_smart_shutdown_20100201.patch
>> extend_format_of_recovery_info_funcs_v4.20100303.patch
>> (both here: http://archives.postgresql.org/pgsql-hackers/2010-03/msg00446.php )
>>
>> (fix-KnownAssignedXidsRemoveMany-1.patch has been committed, I think?)
>>
>>
>> I use commandlines like this to copy schemas across from 8.4.2 to 9.0devel:
>> pg_dump -c -h /tmp -p 5432 -n myschema --no-owner --no-privileges mydb \
>> | psql -1qtA -h /tmp -p 7575 -d replicas
>>
>> (the copied schemas were together 175 GB)
>>
>> As I seem to be the only one who finds this, I started looking what could be unique in this
>> install: and it would be postbio, which we use for its gist-indexing on ranges
>> (http://pgfoundry.org/projects/postbio/). We use postbio's int_interval type as a column type.
>> But keep in mind that sometimes the whole dump+restore+replication completes OK.
>>
>>
>> Other installed modules are:
>> contrib/btree_gist
>> contrib/seg
>> contrib/adminpack
>>
>> log_line_prefix = '%t %p %d %u start=%s ' # slave
>>
>> pgsql.sr_hotslave/logfile:
>>
>> 2010-03-13 23:54:59 CET 15765 start=2010-03-13 23:54:59 CET LOG: database system was
>> interrupted; last known up at 2010-03-13 23:54:31 CET
>> cp: cannot stat
>> `/var/data1/pg_stuff/dump/hotslave/replication_archive/000000010000000000000001':
>> No such file or directory
>> 2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: entering standby mode
>> 2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: redo starts at 0/1000020
>> 2010-03-13 23:55:00 CET 15765 start=2010-03-13 23:54:59 CET LOG: consistent recovery state
>> reached at 0/2000000
>> 2010-03-13 23:55:00 CET 15763 start=2010-03-13 23:54:59 CET LOG: database system is ready to
>> accept read only connections
>> TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File: "twophase.c", Line: 1201)
>> 2010-03-14 05:28:59 CET 15763 start=2010-03-13 23:54:59 CET LOG: startup process (PID 15765)
>> was terminated by signal 6: Aborted
>> 2010-03-14 05:28:59 CET 15763 start=2010-03-13 23:54:59 CET LOG: terminating any other active
>> server processes
>>
>>
>> Maybe I'll try now to setup a similar instance without postbio, to see if the crash still
>> occurs.
>>
>> hth,
>>
>> Erik Rijkers
>>
>>
>>
>
>
> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Erik Rijkers <er(at)xs4all(dot)nl>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Assertion failure twophase.c (3) (testing HS/SR)
Date: 2010-04-24 22:38:55
Message-ID: 1272148735.4161.1599.camel@ebony
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2010-04-23 at 03:08 +0200, Erik Rijkers wrote:

> It's a very easy test; I will probably run it a few more times.

Please share details of your system and hardware.

--
Simon Riggs www.2ndQuadrant.com