Errors on missing pg_subtrans/ files with 9.3

Lists: pgsql-hackers
From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-12 15:56:55
Message-ID: CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

G'day list. Didn't get any interest in pgsql-general, thought I'd try
my luck here, which perhaps would be more fitting in case I've
stumbled upon an edge case issue or something...

I've recently upgraded a number of servers from PostgreSQL 9.2.5 to
9.3.1 and have started getting the following errors every couple of
hours along with some failed transactions. I have been unable to track
down any sort of rhyme or reason for the errors yet, so I figured I'd
check with the mailing list to see if I've potentially hit into a
potential bug. The errors look like the following:

ERROR could not access status of transaction 4179979
Could not open file "pg_subtrans/003F": No such file or directory.

Indeed, pg_subtrans/003F does not exist on the file system.

I have found similar errors mentioned before a number of
years ago with a thread starting here, but couldn't find any real resolution.

http://www.postgresql.org/message-id/20111109150225.076c2f38@list.ru

A few details:

- the new database cluster was created by dumping global settings
using pg_dumpall and then dumping each database individually. Several
of the databases within this cluster have PostGIS installed and I
wanted to do a full PostGIS upgrade, which necessitated individual
dumps.

- this is the only database in a cluster of 14 databases that is
exhibiting the problem and it is sporadic at best. This database is
fairly write-intensive, and has been up since November 6th. In that
time, we've had 17 such errors.

- the pg_subtrans files mentioned in the log file do not exist on the
file system.

- I have rebuilt the cluster several times, as well as cleaned out the
pg_statistic table and run `VACUUM FULL ANALYZE` on every database in
case that was affecting it based on some similar threads back from the
PostgreSQL 8.4 days, but there was no affect and the errors still
occur.

Anyone have any suggestions or ideas? Thankfully these are dev systems
so I have some room to experiment and can post some more details as
necessary.

The following are the log file entries from one of the errors. Some
details have been sanitized for privacy reasons, but the overall lines
themselves are accurate.

Nov 10 10:14:02 dev-server postgres[29835]: [4-1] user=dev,db=dev
ERROR: could not access status of transaction 4179979
Nov 10 10:14:02 dev-server postgres[29835]: [4-2] user=dev,db=dev
DETAIL: Could not open file "pg_subtrans/003F": No such file or
directory.
Nov 10 10:14:02 dev-server postgres[29835]: [4-3] user=dev,db=dev
CONTEXT: SQL statement "SELECT 1 FROM ONLY
"typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1
FOR KEY SHARE OF x"
Nov 10 10:14:02 dev-server postgres[29835]: [4-4] user=dev,db=dev
STATEMENT: insert into raw (url, id) values ($1, $2)
Nov 10 10:14:02 dev-server postgres[29839]: [4-1] user=dev,db=dev
ERROR: could not access status of transaction 4179979
Nov 10 10:14:02 dev-server postgres[29839]: [4-2] user=dev,db=dev
DETAIL: Could not open file "pg_subtrans/003F": No such file or
directory.
Nov 10 10:14:02 dev-server postgres[29839]: [4-3] user=dev,db=dev
CONTEXT: SQL statement "SELECT 1 FROM ONLY
"typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1
FOR KEY SHARE OF x"
Nov 10 10:14:02 dev-server postgres[29839]: [4-4] user=dev,db=dev
STATEMENT: update listings set value_a=$1 where id=$2
Nov 10 10:14:02 dev-server postgres[29827]: [4-1] user=dev,db=dev
ERROR: could not access status of transaction 4179979
Nov 10 10:14:02 dev-server postgres[29827]: [4-2] user=dev,db=dev
DETAIL: Could not open file "pg_subtrans/003F": No such file or
directory.
Nov 10 10:14:02 dev-server postgres[29827]: [4-3] user=dev,db=dev
CONTEXT: SQL statement "SELECT 1 FROM ONLY
"typhon"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1
FOR KEY SHARE OF x"

I can provide additional details like my postgresql.conf if it would
help, although it's fairly standard outside of tuning as provided by
pgtune.

Cheers and thanks, List.


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-12 16:13:52
Message-ID: 20131112161352.GD23777@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2013-11-12 10:56:55 -0500, J Smith wrote:
> G'day list. Didn't get any interest in pgsql-general, thought I'd try
> my luck here, which perhaps would be more fitting in case I've
> stumbled upon an edge case issue or something...

Normally the bug report for/the -bugs mailing list is the right one for
things like this.

> ERROR could not access status of transaction 4179979
> Could not open file "pg_subtrans/003F": No such file or directory.

A couple of question:
* Did you have any OS, hardware crashes?
* Do you run with fsync=off or full_page_writes=off?
* Could you post the pg_controldata output?
* Is it always the same xid?
* Does SELECT count(*) FROM pg_prepared_xacts; return 0?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-12 16:25:03
Message-ID: 20131112162503.GK17272@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* J Smith (dark(dot)panda+lists(at)gmail(dot)com) wrote:
> I've recently upgraded a number of servers from PostgreSQL 9.2.5 to
> 9.3.1 and have started getting the following errors every couple of
> hours along with some failed transactions.

How was this upgrade done? If you used pg_upgrade, what version of the
pg_upgrade code did you use? As I recall, there was a bug in older
versions which could exhibit in this way..

http://wiki.postgresql.org/wiki/20110408pg_upgrade_fix

Thanks,

Stephen


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: J Smith <dark(dot)panda+lists(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-12 16:32:14
Message-ID: 20131112163214.GF23777@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-12 11:25:03 -0500, Stephen Frost wrote:
> * J Smith (dark(dot)panda+lists(at)gmail(dot)com) wrote:
> > I've recently upgraded a number of servers from PostgreSQL 9.2.5 to
> > 9.3.1 and have started getting the following errors every couple of
> > hours along with some failed transactions.
>
> How was this upgrade done? If you used pg_upgrade, what version of the
> pg_upgrade code did you use? As I recall, there was a bug in older
> versions which could exhibit in this way..

He referred to using pg_dumpall/pg_dump. But that bug was erroring out
on pg_clog, not pg_subtrans, right?

Getting such errors on pg_subtrans indicates corruption or a bug
somewhere, we don't do them if the to-be-looked up xid is older than
TransactionXmin and we zero pg_subtrans at restarts.

My gut feeling is thats it's related to foreign key locks doing
XactLockTableWait()s on too old transactions or something like that.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-12 16:32:34
Message-ID: CADFUPgdwREx1deU8vmJ8+K8BsypBjaDNgxa_LTYbXZhDG1RzHQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 12, 2013 at 11:25 AM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
>
> How was this upgrade done? If you used pg_upgrade, what version of the
> pg_upgrade code did you use? As I recall, there was a bug in older
> versions which could exhibit in this way..
>
> http://wiki.postgresql.org/wiki/20110408pg_upgrade_fix
>

As mentioned in the original post...

"the new database cluster was created by dumping global settings
using pg_dumpall and then dumping each database individually."

We haven't been able to use pg_upgrade as we rely heavily on PostGIS
and do hard upgrades via pg_dump and postgis_restore.pl when we
upgrade.

Cheers


From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-12 16:46:19
Message-ID: CADFUPgc4uVNCdCosptnJVVuz=yrcdPsW051C3Hx7OhdabwKgOA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

G'day Andres.

On Tue, Nov 12, 2013 at 11:13 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> On 2013-11-12 10:56:55 -0500, J Smith wrote:
>> G'day list. Didn't get any interest in pgsql-general, thought I'd try
>> my luck here, which perhaps would be more fitting in case I've
>> stumbled upon an edge case issue or something...
>
> Normally the bug report for/the -bugs mailing list is the right one for
> things like this.

Ah, okay. I can redirect there if that's the better location.
(Assuming it's a bug and not a misconfiguration on my part, although
the configuration on these servers has been consistent since 9.0, and
we've upgraded through each of the 9.x series while merging together
new settings accordingly.)

>> ERROR could not access status of transaction 4179979
>> Could not open file "pg_subtrans/003F": No such file or directory.
>
> A couple of question:
> * Did you have any OS, hardware crashes?

Haven't had any issues on this machine.

> * Do you run with fsync=off or full_page_writes=off?

Both fsync and full_page_writes are in their default "on"
configurations. I'll post our full postgresql.conf below, as it is
quite short.

> * Could you post the pg_controldata output?

Sure, posted below.

> * Is it always the same xid?

Seems to be different every time.

> * Does SELECT count(*) FROM pg_prepared_xacts; return 0?

Yes it does.

Here's the postgresql.conf and pg_controldata dump.

# postgresql.conf

listen_addresses = '*'
wal_level = hot_standby
checkpoint_completion_target = 0.7
max_wal_senders = 3
wal_keep_segments = 64
logging_collector = off
datestyle = 'iso, mdy'
timezone = 'Canada/Eastern'
lc_messages = 'en_US.UTF-8'
lc_monetary = 'en_US.UTF-8'
lc_numeric = 'en_US.UTF-8'
lc_time = 'en_US.UTF-8'
default_text_search_config = 'pg_catalog.english'
default_statistics_target = 100
maintenance_work_mem = 1GB
effective_cache_size = 44GB
work_mem = 320MB
wal_buffers = 4MB
checkpoint_segments = 64
shared_buffers = 15GB
max_connections = 200

# pg_controldata

pg_control version number: 937
Catalog version number: 201306121
Database system identifier: 5942827484423487452
Database cluster state: in production
pg_control last modified: Tue 12 Nov 2013 11:39:07 AM EST
Latest checkpoint location: 30/50F7A180
Prior checkpoint location: 30/4914C428
Latest checkpoint's REDO location: 30/4CB2C450
Latest checkpoint's REDO WAL file: 00000001000000300000004C
Latest checkpoint's TimeLineID: 1
Latest checkpoint's PrevTimeLineID: 1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/5138174
Latest checkpoint's NextOID: 3574036
Latest checkpoint's NextMultiXactId: 5762623
Latest checkpoint's NextMultiOffset: 11956824
Latest checkpoint's oldestXID: 1673
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 5138174
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Time of latest checkpoint: Tue 12 Nov 2013 11:35:37 AM EST
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
Current wal_level setting: hot_standby
Current max_connections setting: 200
Current max_prepared_xacts setting: 0
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0

Cheers and thanks for the help all.


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, J Smith <dark(dot)panda+lists(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-12 16:51:12
Message-ID: 20131112165112.GL17272@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* Andres Freund (andres(at)2ndquadrant(dot)com) wrote:
> He referred to using pg_dumpall/pg_dump. But that bug was erroring out
> on pg_clog, not pg_subtrans, right?

Yeah, that was pg_clog. Obviously responded before really looking at
it. :)

> My gut feeling is thats it's related to foreign key locks doing
> XactLockTableWait()s on too old transactions or something like that.

Ugh, that wouldn't be good..

Thanks,

Stephen


From: Stephen Frost <sfrost(at)snowman(dot)net>
To: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-12 16:54:48
Message-ID: 20131112165447.GM17272@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

* J Smith (dark(dot)panda+lists(at)gmail(dot)com) wrote:
> We haven't been able to use pg_upgrade as we rely heavily on PostGIS
> and do hard upgrades via pg_dump and postgis_restore.pl when we
> upgrade.

Did you also upgrade to PostGIS 2.x as part of this..? Seems like it'd
be unrelated, but one never knows. Any chance you could distill this
down into a small test case which exhibits the problem? I'm guessing
'no', but figured I'd ask anyway..

Thanks,

Stephen


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-12 16:55:38
Message-ID: 20131112165538.GG23777@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2013-11-12 11:46:19 -0500, J Smith wrote:
> > * Does SELECT count(*) FROM pg_prepared_xacts; return 0?
>
> Yes it does.

Could you show the output? Do you actually use prepared xacts actively?

Do you actively use row level locking? Is there high concurrency in that
environment? In short, could you describe your database usage a bit?

Did you restart the database since getting the errors?

> Latest checkpoint's NextXID: 0/5138174
> Latest checkpoint's NextOID: 3574036
> Latest checkpoint's NextMultiXactId: 5762623
> Latest checkpoint's NextMultiOffset: 11956824
> Latest checkpoint's oldestXID: 1673
> Latest checkpoint's oldestXID's DB: 1
> Latest checkpoint's oldestActiveXID: 5138174
> Latest checkpoint's oldestMultiXid: 1

Interesting. Pretty high multixact usage.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Stephen Frost <sfrost(at)snowman(dot)net>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-12 17:17:03
Message-ID: CADFUPgfqPiazOauuM7M-qfwA5SWqwQ9mHevRB5kh0A1Lm8-w=g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 12, 2013 at 11:54 AM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
>
> Did you also upgrade to PostGIS 2.x as part of this..? Seems like it'd
> be unrelated, but one never knows. Any chance you could distill this
> down into a small test case which exhibits the problem? I'm guessing
> 'no', but figured I'd ask anyway..
>

Yeah, we upgraded to PostGIS 2.1 as part of the upgrade. In the case
of the specific database that is exhibiting the problem, we only
really use PostGIS for storage and don't do a lot of spatial querying,
as this database is basically for an aggregation service we run. In
the case of the transactions that appear to be causing the problems,
we don't even use geometry fields at all.

Cheers


From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-12 18:45:01
Message-ID: CADFUPgdU62AgYisfFci9CJA1u15mpEtqZE2D375HrjGzp0VEdw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 12, 2013 at 11:55 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> On 2013-11-12 11:46:19 -0500, J Smith wrote:
>> > * Does SELECT count(*) FROM pg_prepared_xacts; return 0?
>>
>> Yes it does.
>
> Could you show the output? Do you actually use prepared xacts actively?

jay:dev(at)jagger=# select * from pg_prepared_xacts;
transaction | gid | prepared | owner | database
-------------+-----+----------+-------+----------
(0 rows)

We don't actively use prepared transactions, no. We use a lot of
prepared statements, but no two-phase committing.

> Do you actively use row level locking?

Not to my knowledge, no. I've enabled statement logging for a little
while so we can get a snapshot of what our average queries look like,
but I'm not seeing any PREPARE TRANSACTION or explicit LOCK
statements.

> Is there high concurrency in that environment? In short, could you describe your database usage a bit?

This particular database is used for an aggregation service. There are
a handful of Hibernate-based Java processes that manipulate and
massage the data from a number of data feeds we receive 24-7 From a
quick grepping of the statement log file we captured the following
over the course of approximately 13 minutes:

167452 SELECTs
72388 UPDATEs
6782 INSERTs
2 DELETEs

> Did you restart the database since getting the errors?

Yes, we've restarted several times.

>> Latest checkpoint's NextXID: 0/5138174
>> Latest checkpoint's NextOID: 3574036
>> Latest checkpoint's NextMultiXactId: 5762623
>> Latest checkpoint's NextMultiOffset: 11956824
>> Latest checkpoint's oldestXID: 1673
>> Latest checkpoint's oldestXID's DB: 1
>> Latest checkpoint's oldestActiveXID: 5138174
>> Latest checkpoint's oldestMultiXid: 1
>
> Interesting. Pretty high multixact usage.
>

Part of our process involves a collecting several records and
aggregating them within a single transaction. There will usually be a
handful of SELECTs to fetch the data, a bit of processing in the Java
backend, then a set of either UPDATEs or INSERTs into two tables to
store the results of the processing or in rarer cases a couple of
DELETEs.

Our Java guy would have more details, as I just have the output of the
log files and a high-level view of how our backend processes work, but
this is the general gist of things as I understand it.

Cheers


From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-13 17:29:37
Message-ID: CADFUPge4ySeoRNtNsG8S2hThNXFT+RD8CDkVgw7nb8BaPk1cfA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Looks like we got another set of errors overnight. Here's the log file
from the errors. (Log file scrubbed slightly to remove private data,
but still representative of the problem I believe.)

Nov 13 05:34:34 dev postgres[6084]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6084]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6084]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6084]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6076]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6076]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6076]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6076]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6087]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6087]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6087]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6087]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6086]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6086]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6086]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6086]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6088]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6088]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6088]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6088]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3
Nov 13 05:34:34 dev postgres[6085]: [4-1] user=dev,db=dev ERROR:
could not access status of transaction 6337381
Nov 13 05:34:34 dev postgres[6085]: [4-2] user=dev,db=dev DETAIL:
Could not open file "pg_subtrans/0060": No such file or directory.
Nov 13 05:34:34 dev postgres[6085]: [4-3] user=dev,db=dev CONTEXT:
SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
Nov 13 05:34:34 dev postgres[6085]: [4-4] user=dev,db=dev STATEMENT:
update listings set deactivated_at=$1 where id=$2 and lock_version=$3

Several processes all seemed to hit the problem at the same moment,
and all of them refer to the same transaction ID. Again, the file
pg_subtrans/0060 doesn't exist, and the only file that does exist is
pg_subtrans/005A which appears to be a zeroed-out file 245760 bytes in
length.

Still don't have a clue as to how I can reproduce the problem. It
seems that in all cases the error occurred during either an UPDATE to
a table_X or an INSERT to table_Y. In all cases, the error occurred in
a manner identical to those shown in the log above, the only
difference being either an UPDATE on table_X or an INSERT on table_Y.

Not sure what direction I should head to now. Perhaps some aggressive
logging would help, so we can see the queries surrounding the
problems? I could reconfigure things to capture all statements and set
up monit or something to send an alert when the problem resurfaces,
for instance.

Cheers all.


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-15 20:21:57
Message-ID: CA+TgmoYX-gRKoVjD42tpk5PfYmoHeey8qRoFUAjJ=EC-jPUXuA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Nov 13, 2013 at 12:29 PM, J Smith <dark(dot)panda+lists(at)gmail(dot)com> wrote:
> Looks like we got another set of errors overnight. Here's the log file
> from the errors. (Log file scrubbed slightly to remove private data,
> but still representative of the problem I believe.)
>
> Nov 13 05:34:34 dev postgres[6084]: [4-1] user=dev,db=dev ERROR:
> could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6084]: [4-2] user=dev,db=dev DETAIL:
> Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6084]: [4-3] user=dev,db=dev CONTEXT:
> SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
> WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6084]: [4-4] user=dev,db=dev STATEMENT:
> update listings set deactivated_at=$1 where id=$2 and lock_version=$3
> Nov 13 05:34:34 dev postgres[6076]: [4-1] user=dev,db=dev ERROR:
> could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6076]: [4-2] user=dev,db=dev DETAIL:
> Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6076]: [4-3] user=dev,db=dev CONTEXT:
> SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
> WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6076]: [4-4] user=dev,db=dev STATEMENT:
> update listings set deactivated_at=$1 where id=$2 and lock_version=$3
> Nov 13 05:34:34 dev postgres[6087]: [4-1] user=dev,db=dev ERROR:
> could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6087]: [4-2] user=dev,db=dev DETAIL:
> Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6087]: [4-3] user=dev,db=dev CONTEXT:
> SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
> WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6087]: [4-4] user=dev,db=dev STATEMENT:
> update listings set deactivated_at=$1 where id=$2 and lock_version=$3
> Nov 13 05:34:34 dev postgres[6086]: [4-1] user=dev,db=dev ERROR:
> could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6086]: [4-2] user=dev,db=dev DETAIL:
> Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6086]: [4-3] user=dev,db=dev CONTEXT:
> SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
> WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6086]: [4-4] user=dev,db=dev STATEMENT:
> update listings set deactivated_at=$1 where id=$2 and lock_version=$3
> Nov 13 05:34:34 dev postgres[6088]: [4-1] user=dev,db=dev ERROR:
> could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6088]: [4-2] user=dev,db=dev DETAIL:
> Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6088]: [4-3] user=dev,db=dev CONTEXT:
> SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
> WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6088]: [4-4] user=dev,db=dev STATEMENT:
> update listings set deactivated_at=$1 where id=$2 and lock_version=$3
> Nov 13 05:34:34 dev postgres[6085]: [4-1] user=dev,db=dev ERROR:
> could not access status of transaction 6337381
> Nov 13 05:34:34 dev postgres[6085]: [4-2] user=dev,db=dev DETAIL:
> Could not open file "pg_subtrans/0060": No such file or directory.
> Nov 13 05:34:34 dev postgres[6085]: [4-3] user=dev,db=dev CONTEXT:
> SQL statement "SELECT 1 FROM ONLY "typhon"."collection_batches" x
> WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
> Nov 13 05:34:34 dev postgres[6085]: [4-4] user=dev,db=dev STATEMENT:
> update listings set deactivated_at=$1 where id=$2 and lock_version=$3
>
> Several processes all seemed to hit the problem at the same moment,
> and all of them refer to the same transaction ID. Again, the file
> pg_subtrans/0060 doesn't exist, and the only file that does exist is
> pg_subtrans/005A which appears to be a zeroed-out file 245760 bytes in
> length.
>
> Still don't have a clue as to how I can reproduce the problem. It
> seems that in all cases the error occurred during either an UPDATE to
> a table_X or an INSERT to table_Y. In all cases, the error occurred in
> a manner identical to those shown in the log above, the only
> difference being either an UPDATE on table_X or an INSERT on table_Y.
>
> Not sure what direction I should head to now. Perhaps some aggressive
> logging would help, so we can see the queries surrounding the
> problems? I could reconfigure things to capture all statements and set
> up monit or something to send an alert when the problem resurfaces,
> for instance.
>
> Cheers all.

I think what would help the most is if you could arrange to obtain a
stack backtrace at the point when the error is thrown. Maybe put a
long sleep call in just before the error happens, and when it gets
stuck there, attach gdb and run bt full.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-15 21:01:07
Message-ID: CADFUPgf8R1_EBcnJJ323oGS4pHJr=Ob_MscOPYB4vKpcJNjBhw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 15, 2013 at 3:21 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>
> I think what would help the most is if you could arrange to obtain a
> stack backtrace at the point when the error is thrown. Maybe put a
> long sleep call in just before the error happens, and when it gets
> stuck there, attach gdb and run bt full.
>

That could potentially be doable. Perhaps I could use something like
google-coredumper or something similar to have a core dump generated
if the error comes up? Part of the problem is that the error is so
sporadic that it's going to be tough to say when the next one will
occur. For instance, we haven't changed our load on the server, yet
the error hasn't occurred since Nov 13, 15:01. I'd also like to avoid
blocking on the server with sleep or anything like that unless
absolutely necessary, as there are other services we have in
development that are using other databases on this cluster. (I can as
a matter of last resort, of course, but if google-coredumper can do
the job I'd like to give that a shot first.)

Any hints on where I could insert something like this? Should I try
putting it into the section of elog.c dealing with ENOENT errors, or
try to find a spot closer to where the file itself is being opened? I
haven't looked at Postgres internals for a while now so I'm not quite
sure of the best location for this sort of thing.

Cheers


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-19 14:14:24
Message-ID: CA+Tgmoa+S0RwO+736t8Unx8qydVOS3S3vhiHWxgqjGPSs4GQ8w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Nov 15, 2013 at 4:01 PM, J Smith <dark(dot)panda+lists(at)gmail(dot)com> wrote:
> On Fri, Nov 15, 2013 at 3:21 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> I think what would help the most is if you could arrange to obtain a
>> stack backtrace at the point when the error is thrown. Maybe put a
>> long sleep call in just before the error happens, and when it gets
>> stuck there, attach gdb and run bt full.
>>
>
> That could potentially be doable. Perhaps I could use something like
> google-coredumper or something similar to have a core dump generated
> if the error comes up? Part of the problem is that the error is so
> sporadic that it's going to be tough to say when the next one will
> occur. For instance, we haven't changed our load on the server, yet
> the error hasn't occurred since Nov 13, 15:01. I'd also like to avoid
> blocking on the server with sleep or anything like that unless
> absolutely necessary, as there are other services we have in
> development that are using other databases on this cluster. (I can as
> a matter of last resort, of course, but if google-coredumper can do
> the job I'd like to give that a shot first.)
>
> Any hints on where I could insert something like this? Should I try
> putting it into the section of elog.c dealing with ENOENT errors, or
> try to find a spot closer to where the file itself is being opened? I
> haven't looked at Postgres internals for a while now so I'm not quite
> sure of the best location for this sort of thing.

I'd look for the specific ereport() call that's firing, and put it
just before that.

(note that setting the error verbosity to 'verbose' will give you the
file and line number where the error is happening, which is useful if
the message can be generated from more than one place)

I'm not familiar with google-coredumper but it sounds like a promising
technique.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-19 15:16:29
Message-ID: CADFUPgeDA2fzX-NeiOFW39uSbvOC9aApv8y0BZJ8FY=1EJTHGQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alright, we'll look into doing that heading into the weekend.
Interestingly, we haven't experienced the issue since our main Java
developer made some modifications to our backend system. I'm not
entirely sure what the changes entail except that it's a one-liner
that involves re-SELECTing a table during a transaction. We'll
rollback this change and re-compile Postgres with google-coredumper
and let it run over the weekend and see where we stand.

Cheers

On Tue, Nov 19, 2013 at 9:14 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Fri, Nov 15, 2013 at 4:01 PM, J Smith <dark(dot)panda+lists(at)gmail(dot)com> wrote:
>> On Fri, Nov 15, 2013 at 3:21 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> I think what would help the most is if you could arrange to obtain a
>>> stack backtrace at the point when the error is thrown. Maybe put a
>>> long sleep call in just before the error happens, and when it gets
>>> stuck there, attach gdb and run bt full.
>>>
>>
>> That could potentially be doable. Perhaps I could use something like
>> google-coredumper or something similar to have a core dump generated
>> if the error comes up? Part of the problem is that the error is so
>> sporadic that it's going to be tough to say when the next one will
>> occur. For instance, we haven't changed our load on the server, yet
>> the error hasn't occurred since Nov 13, 15:01. I'd also like to avoid
>> blocking on the server with sleep or anything like that unless
>> absolutely necessary, as there are other services we have in
>> development that are using other databases on this cluster. (I can as
>> a matter of last resort, of course, but if google-coredumper can do
>> the job I'd like to give that a shot first.)
>>
>> Any hints on where I could insert something like this? Should I try
>> putting it into the section of elog.c dealing with ENOENT errors, or
>> try to find a spot closer to where the file itself is being opened? I
>> haven't looked at Postgres internals for a while now so I'm not quite
>> sure of the best location for this sort of thing.
>
> I'd look for the specific ereport() call that's firing, and put it
> just before that.
>
> (note that setting the error verbosity to 'verbose' will give you the
> file and line number where the error is happening, which is useful if
> the message can be generated from more than one place)
>
> I'm not familiar with google-coredumper but it sounds like a promising
> technique.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company


From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-23 18:27:15
Message-ID: CADFUPgfD_j1M3RP2QPKLZsE9MhAWN0L0LnY574C8KJXxd7TP_A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Tue, Nov 19, 2013 at 10:16 AM, J Smith <dark(dot)panda+lists(at)gmail(dot)com> wrote:
> Alright, we'll look into doing that heading into the weekend.
> Interestingly, we haven't experienced the issue since our main Java
> developer made some modifications to our backend system. I'm not
> entirely sure what the changes entail except that it's a one-liner
> that involves re-SELECTing a table during a transaction. We'll
> rollback this change and re-compile Postgres with google-coredumper
> and let it run over the weekend and see where we stand.
>

Okay, I've patched our Postgres install and added in a call for
coredumper to output a core dump. We'll let our processes run for the
next few days and hopefully we'll get something we can look at. I've
attached the patch just in case anyone would like to make sure I've
done it in a sane way and that I've inserted the call in the correct
location.

Cheers

Attachment Content-Type Size
postgresql-coredumper.patch application/octet-stream 1.3 KB

From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-24 21:56:26
Message-ID: CADFUPgcST8hVgcH-zjVXfYAaHsuCdUwbKizAMUD+5woZjajL6w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

coredumper worked like a charm. Useful tool, that is... although as a
bit of advice, I'd try not to run it on Postgres if your various
memory settings are tweaked towards production use -- the core dump
that was captured on my server weighed in at 16 GB.

Anyways, I've attached both the log entries that caused the error as
well as a backtrace using `bt full`.

Cheers, hope it helps.

Attachment Content-Type Size
postgresql-backtrace.txt text/plain 22.8 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: J Smith <dark(dot)panda+lists(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-25 11:47:12
Message-ID: 20131125114712.GI28145@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2013-11-24 16:56:26 -0500, J Smith wrote:
> coredumper worked like a charm. Useful tool, that is... although as a
> bit of advice, I'd try not to run it on Postgres if your various
> memory settings are tweaked towards production use -- the core dump
> that was captured on my server weighed in at 16 GB.

> Nov 23 14:38:32 dev postgres[23810]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 13514992
> Nov 23 14:38:32 dev postgres[23810]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/00CE": Success.
> Nov 23 14:38:32 dev postgres[23810]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "dev"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"

Ok, this is helpful. Do you rather longrunning transactions? The
transaction that does foreign key checks has an xid of 10260613, while
the row that's getting checked has 13514992.

#4 0x0000000000635dc7 in XactLockTableWait (xid=13514992) at lmgr.c:501
tag = {locktag_field1 = 13514992, locktag_field2 = 0, locktag_field3 = 0, locktag_field4 = 0, locktag_type = 4 '\004', locktag_lockmethodid = 1 '\001'}
#5 0x0000000000482223 in heap_lock_updated_tuple_rec (rel=0x2b20f050a8d0, tuple=<value optimized out>, ctid=<value optimized out>, xid=10260613, mode=LockTupleKeyShare) at heapam.c:4847

I am not sure whether that's the origin of the problem but at the very
least it seems to me that heap_lock_updated_tuple_rec() is missing
several important pieces:
a) do the priorXmax==xmin dance to check we're still following the same
ctid chain. Currently we could easily stumble across completely
unrelated tuples if a chain element aborted and got vacuumed.
b) Check whether a chain element actually aborted - currently we're
only doing that in the HEAP_KEYS_UPDATED updated case, but that seems
wrong (we can't check for committed tho!).
c) (reported separately as well) cope with failure of heap_fetch() to
return anything.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-25 16:38:53
Message-ID: CADFUPgcaq-yFxK++vHS3ECYmwBtD2Sz1dgeChtNRHKEGooEREQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 25, 2013 at 6:47 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> On 2013-11-24 16:56:26 -0500, J Smith wrote:
>
>> Nov 23 14:38:32 dev postgres[23810]: [4-1] user=dev,db=dev ERROR: could not access status of transaction 13514992
>> Nov 23 14:38:32 dev postgres[23810]: [4-2] user=dev,db=dev DETAIL: Could not open file "pg_subtrans/00CE": Success.
>> Nov 23 14:38:32 dev postgres[23810]: [4-3] user=dev,db=dev CONTEXT: SQL statement "SELECT 1 FROM ONLY "dev"."collection_batches" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
>
>
> Ok, this is helpful. Do you rather longrunning transactions? The
> transaction that does foreign key checks has an xid of 10260613, while
> the row that's getting checked has 13514992.

We did have some long-running transactions, yes. We refactored a bit
and removed them and the problem ceased on our end. We ended up
reverting our changes for the sake of running this experiment over the
weekend and the errors returned. We've since restored our fix and
haven't had any problems since, so yeah, long-running transactions
appear to be involved.

We can continue to experiment if you have any additional tests you'd
like us to run. We may have to keep the experiments to running over
the weekend, but they're definitely do-able.

Cheers


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-25 16:46:47
Message-ID: 20131125164647.GD6597@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

J Smith escribió:

> We did have some long-running transactions, yes. We refactored a bit
> and removed them and the problem ceased on our end. We ended up
> reverting our changes for the sake of running this experiment over the
> weekend and the errors returned. We've since restored our fix and
> haven't had any problems since, so yeah, long-running transactions
> appear to be involved.
>
> We can continue to experiment if you have any additional tests you'd
> like us to run. We may have to keep the experiments to running over
> the weekend, but they're definitely do-able.

I am working on patches to get these bugs fixed. Would you be up for
running a patched binary and see if the errors go away?

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-25 18:04:35
Message-ID: CADFUPgeuq1o1NJAdzqKK_3CbAGzR8SS_zVGHJ-4+97N3ggDR=w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Nov 25, 2013 at 11:46 AM, Alvaro Herrera
<alvherre(at)2ndquadrant(dot)com> wrote:
> J Smith escribió:
>
>> We did have some long-running transactions, yes. We refactored a bit
>> and removed them and the problem ceased on our end. We ended up
>> reverting our changes for the sake of running this experiment over the
>> weekend and the errors returned. We've since restored our fix and
>> haven't had any problems since, so yeah, long-running transactions
>> appear to be involved.
>>
>> We can continue to experiment if you have any additional tests you'd
>> like us to run. We may have to keep the experiments to running over
>> the weekend, but they're definitely do-able.
>
> I am working on patches to get these bugs fixed. Would you be up for
> running a patched binary and see if the errors go away?

Sure, just give me a git commit hash and I can do a test build towards
the weekend. I'll include coredumper again just in case things go awry
and we can go from there.

Cheers


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: J Smith <dark(dot)panda+lists(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-25 20:10:39
Message-ID: 20131125201039.GF6597@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund escribió:

> Ok, this is helpful. Do you rather longrunning transactions? The
> transaction that does foreign key checks has an xid of 10260613, while
> the row that's getting checked has 13514992.

Thanks for the analysis.

> #4 0x0000000000635dc7 in XactLockTableWait (xid=13514992) at lmgr.c:501
> tag = {locktag_field1 = 13514992, locktag_field2 = 0, locktag_field3 = 0, locktag_field4 = 0, locktag_type = 4 '\004', locktag_lockmethodid = 1 '\001'}
> #5 0x0000000000482223 in heap_lock_updated_tuple_rec (rel=0x2b20f050a8d0, tuple=<value optimized out>, ctid=<value optimized out>, xid=10260613, mode=LockTupleKeyShare) at heapam.c:4847
>
> I am not sure whether that's the origin of the problem but at the very
> least it seems to me that heap_lock_updated_tuple_rec() is missing
> several important pieces:
> a) do the priorXmax==xmin dance to check we're still following the same
> ctid chain. Currently we could easily stumble across completely
> unrelated tuples if a chain element aborted and got vacuumed.

This seems simple to handle by adding the check you propose to the loop.
Basically if the xmax doesn't match the xmin, we reached the end,
there's nothing more to lock and we can return success without any
further work:

/*
* Check the tuple XMIN against prior XMAX, if any. If we reached
* the end of the chain, we're done, so return success.
*/
if (TransactionIdIsValid(priorXmax) &&
!TransactionIdEquals(HeapTupleHeaderGetXmin(&mytup),
priorXmax))
{
UnlockReleaseBuffer(buf);
return HeapTupleMayBeUpdated;
}

> b) Check whether a chain element actually aborted - currently we're
> only doing that in the HEAP_KEYS_UPDATED updated case, but that seems
> wrong (we can't check for committed tho!).

Let me point out that this is exactly the same code that would be
affected by my proposed fix for #8434, which would have this check the
updateXid in all cases, not only in KEYS_UPDATED as currently.

I don't understand your comment about "can't check for committed". I
mean, if the updating transaction committed, then we need to return
failure to caller, HeapTupleUpdated. This signals to the caller that
the future version of the tuple is locked and the whole thing needs to
be failed. (In READ COMMITTED isolation level, this would cause
EvalPlanQual to fetch the updated version of the tuple and redo the
whole thing from there. In REPEATABLE READ and above, the whole
operation would be aborted.)

In short I propose to fix part this with the simple patch I proposed for
bug #8434.

> c) (reported separately as well) cope with failure of heap_fetch() to
> return anything.

Proposed patch for this was posted in the other thread.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: J Smith <dark(dot)panda+lists(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-25 20:52:25
Message-ID: 20131125205225.GB23284@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-25 17:10:39 -0300, Alvaro Herrera wrote:
> > I am not sure whether that's the origin of the problem but at the very
> > least it seems to me that heap_lock_updated_tuple_rec() is missing
> > several important pieces:
> > a) do the priorXmax==xmin dance to check we're still following the same
> > ctid chain. Currently we could easily stumble across completely
> > unrelated tuples if a chain element aborted and got vacuumed.
>
> This seems simple to handle by adding the check you propose to the loop.
> Basically if the xmax doesn't match the xmin, we reached the end,
> there's nothing more to lock and we can return success without any
> further work:

Right, that's what we do in other places (notably EvalPlanQualFetch()).

> > b) Check whether a chain element actually aborted - currently we're
> > only doing that in the HEAP_KEYS_UPDATED updated case, but that seems
> > wrong (we can't check for committed tho!).
>
> Let me point out that this is exactly the same code that would be
> affected by my proposed fix for #8434, which would have this check the
> updateXid in all cases, not only in KEYS_UPDATED as currently.

Hm. I don't see a patch in any of the mails about #8434 although I seem
to remember talking with you about one. Maybe that was on IM?

> I don't understand your comment about "can't check for committed". I
> mean, if the updating transaction committed, then we need to return
> failure to caller, HeapTupleUpdated.

I mean that in the !KEYS_UPDATED case we don't need to abort if we're
only acquiring a key share...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: J Smith <dark(dot)panda+lists(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-25 21:06:30
Message-ID: 20131125210630.GG6597@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres Freund escribió:
> On 2013-11-25 17:10:39 -0300, Alvaro Herrera wrote:

> > Let me point out that this is exactly the same code that would be
> > affected by my proposed fix for #8434, which would have this check the
> > updateXid in all cases, not only in KEYS_UPDATED as currently.
>
> Hm. I don't see a patch in any of the mails about #8434 although I seem
> to remember talking with you about one. Maybe that was on IM?

Ah, yeah, that's possible. The patch I proposed back then is attached
here. I haven't merged this to latest master; this applies cleanly on
top of 16a906f5350

> > I don't understand your comment about "can't check for committed". I
> > mean, if the updating transaction committed, then we need to return
> > failure to caller, HeapTupleUpdated.
>
> I mean that in the !KEYS_UPDATED case we don't need to abort if we're
> only acquiring a key share...

Hm, I think that's correct -- we don't need to abort. But we still need
to wait until the updater completes. So this proposed patch is not the
full story.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
patch-for-8434.patch text/x-diff 10.0 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: J Smith <dark(dot)panda+lists(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-25 21:11:41
Message-ID: 20131125211141.GC23284@alap2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-25 18:06:30 -0300, Alvaro Herrera wrote:
> > I mean that in the !KEYS_UPDATED case we don't need to abort if we're
> > only acquiring a key share...
>
> Hm, I think that's correct -- we don't need to abort. But we still need
> to wait until the updater completes. So this proposed patch is not the
> full story.

Hm. Why do we need to wait in that case? Isn't the entire point of KEY
SHARE locks *not* having to wait for !KEYS_UPDATED? ISTM in that case we
should only check whether the creating transaction has aborted because
in that case we don't need to take out a lock?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: J Smith <dark(dot)panda+lists(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-27 16:57:52
Message-ID: 20131127165750.GB14522@eldon.alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera escribió:
> Andres Freund escribió:

> This seems simple to handle by adding the check you propose to the loop.
> Basically if the xmax doesn't match the xmin, we reached the end,
> there's nothing more to lock and we can return success without any
> further work:

As mentioned in the thread for bug #8434, the complete working patch for
this is attached.

> > b) Check whether a chain element actually aborted - currently we're
> > only doing that in the HEAP_KEYS_UPDATED updated case, but that seems
> > wrong (we can't check for committed tho!).
>
> Let me point out that this is exactly the same code that would be
> affected by my proposed fix for #8434, which would have this check the
> updateXid in all cases, not only in KEYS_UPDATED as currently.

I posted a patch for this problem in the thread about #8434.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
check-xminxmax-while-traversing.patch text/x-diff 1.8 KB

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: J Smith <dark(dot)panda+lists(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-27 23:09:23
Message-ID: 20131127230923.GD2592@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-11-27 13:57:52 -0300, Alvaro Herrera wrote:
> Per bug report by J Smith in
> CADFUPgc5bmtv-yg9znxV-vcfkb+JPRqs7m2OesQXaM_4Z1JpdQ(at)mail(dot)gmail(dot)com
> diagnosed by Andres Freund.

Alvaro, do you see a way this could actually have caused J.'s problems?
I thought about a few, but each turned about to not really be
possible...
I can easily see why we would uselessly wait or worse, but the
pg_subtrans errors I can't really explain with it.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-11-28 10:15:22
Message-ID: 20131128101522.GQ31748@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2013-11-24 16:56:26 -0500, J Smith wrote:
> coredumper worked like a charm. Useful tool, that is... although as a
> bit of advice, I'd try not to run it on Postgres if your various
> memory settings are tweaked towards production use -- the core dump
> that was captured on my server weighed in at 16 GB.
>
> Anyways, I've attached both the log entries that caused the error as
> well as a backtrace using `bt full`.

Do you still have the core file around? If so could you 'p
*ShmemVariableCache' and 'p *ControlFile'?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: J Smith <dark(dot)panda+lists(at)gmail(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-12-05 17:18:17
Message-ID: CADFUPgcbSW6Fu1TYiC0OAo=+-0-4Bcww4aXkqjwJzFPwU558DQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Thu, Nov 28, 2013 at 5:15 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
> Hi,
>
> Do you still have the core file around? If so could you 'p
> *ShmemVariableCache' and 'p *ControlFile'?
>

So sorry, I didn't see this message until just today. Seems it was
accidentally archived before hitting my eyeballs.

I see that 9.3.2 was released today along with what appears to be some
fixes regarding this and similar issues. Sorry if my missing this
message held anything up.

We still have the core file kicking around, so here's the output at any rate.

(gdb) p *ShmemVariableCache

$1 = {nextOid = 8036795, oidCount = 6974, nextXid = 10260628,
oldestXid = 1673, xidVacLimit = 200001673, xidWarnLimit = 2136485320,
xidStopLimit = 2146485320, xidWrapLimit = 2147485320, oldestXidDB = 1,

latestCompletedXid = 10260621}

(gdb) p *ControlFile

$2 = {system_identifier = 5942827484423487452, pg_control_version =
937, catalog_version_no = 201306121, state = DB_IN_PRODUCTION, time =
1385234938, checkPoint = 381771294048,

prevCheckPoint = 381771293888, checkPointCopy = {redo =
381771293992, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites =
1 '\001', nextXidEpoch = 0, nextXid = 10217377, nextOid = 8035577,

nextMulti = 13448295, nextMultiOffset = 32161320, oldestXid =
1673, oldestXidDB = 1, oldestMulti = 1, oldestMultiDB = 1, time =
1385234938, oldestActiveXid = 10217377}, unloggedLSN = 1,

minRecoveryPoint = 0, minRecoveryPointTLI = 0, backupStartPoint = 0,
backupEndPoint = 0, backupEndRequired = 0 '\000', wal_level = 2,
MaxConnections = 200, max_prepared_xacts = 0, max_locks_per_xact = 64,

maxAlign = 8, floatFormat = 1234567, blcksz = 8192, relseg_size =
131072, xlog_blcksz = 8192, xlog_seg_size = 16777216, nameDataLen =
64, indexMaxKeys = 32, toast_max_chunk_size = 1996,

enableIntTimes = 1 '\001', float4ByVal = 1 '\001', float8ByVal = 1
'\001', data_checksum_version = 0, crc = 517534097}

(gdb)

Cheers?


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-12-10 18:38:32
Message-ID: 52A75FA8.2090508@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Andres, all:

We've just run across a case of this exact issue on 9.2.4. I thought it
was supposed to be 9.3-only?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-12-10 18:39:59
Message-ID: 20131210183959.GA30072@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

On 2013-12-10 10:38:32 -0800, Josh Berkus wrote:
> We've just run across a case of this exact issue on 9.2.4. I thought it
> was supposed to be 9.3-only?

Could you please describe "this exact issue"?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-12-10 18:44:30
Message-ID: 52A7610E.70400@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/10/2013 10:39 AM, Andres Freund wrote:
> Hi,
>
> On 2013-12-10 10:38:32 -0800, Josh Berkus wrote:
>> We've just run across a case of this exact issue on 9.2.4. I thought it
>> was supposed to be 9.3-only?
>
> Could you please describe "this exact issue"?

Fatal errors due to missing pg_subtrans files on creating a new replica.

Sequence:

1. Failed over from master to replica #1

2. Remastered other replicas

3. Tried to create a new replica.

4. New replica started failing with errors similar to the original report.

If this is fixed in 9.2.6, great, but I didn't get that impression from
the commits ...

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-12-10 18:48:41
Message-ID: 20131210184841.GB30072@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-12-10 10:44:30 -0800, Josh Berkus wrote:
> On 12/10/2013 10:39 AM, Andres Freund wrote:
> > Hi,
> >
> > On 2013-12-10 10:38:32 -0800, Josh Berkus wrote:
> >> We've just run across a case of this exact issue on 9.2.4. I thought it
> >> was supposed to be 9.3-only?
> >
> > Could you please describe "this exact issue"?
>
> Fatal errors due to missing pg_subtrans files on creating a new replica.

> If this is fixed in 9.2.6, great, but I didn't get that impression from
> the commits ...

I am pretty sure this thread isn't about the bug you're hitting. You get
errors during xid assignment, right? If so, upgrade do 9.2.6, that's
fixed (it's the bug in which wake the replication bug was introduced).

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-12-10 19:12:03
Message-ID: 52A76783.6010509@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 12/10/2013 10:48 AM, Andres Freund wrote:
> On 2013-12-10 10:44:30 -0800, Josh Berkus wrote:
>> On 12/10/2013 10:39 AM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2013-12-10 10:38:32 -0800, Josh Berkus wrote:
>>>> We've just run across a case of this exact issue on 9.2.4. I thought it
>>>> was supposed to be 9.3-only?
>>>
>>> Could you please describe "this exact issue"?
>>
>> Fatal errors due to missing pg_subtrans files on creating a new replica.
>
>> If this is fixed in 9.2.6, great, but I didn't get that impression from
>> the commits ...
>
> I am pretty sure this thread isn't about the bug you're hitting. You get
> errors during xid assignment, right? If so, upgrade do 9.2.6, that's
> fixed (it's the bug in which wake the replication bug was introduced).

I thought that only affected 9.2.5? This machine is 9.2.4, as was the
prior master.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Josh Berkus <josh(at)agliodbs(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Errors on missing pg_subtrans/ files with 9.3
Date: 2013-12-10 19:16:23
Message-ID: 20131210191623.GC30072@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2013-12-10 11:12:03 -0800, Josh Berkus wrote:
> On 12/10/2013 10:48 AM, Andres Freund wrote:
> > On 2013-12-10 10:44:30 -0800, Josh Berkus wrote:
> >> On 12/10/2013 10:39 AM, Andres Freund wrote:
> >>> Hi,
> >>>
> >>> On 2013-12-10 10:38:32 -0800, Josh Berkus wrote:
> >>>> We've just run across a case of this exact issue on 9.2.4. I thought it
> >>>> was supposed to be 9.3-only?
> >>>
> >>> Could you please describe "this exact issue"?
> >>
> >> Fatal errors due to missing pg_subtrans files on creating a new replica.
> >
> >> If this is fixed in 9.2.6, great, but I didn't get that impression from
> >> the commits ...
> >
> > I am pretty sure this thread isn't about the bug you're hitting. You get
> > errors during xid assignment, right? If so, upgrade do 9.2.6, that's
> > fixed (it's the bug in which wake the replication bug was introduced).
>
> I thought that only affected 9.2.5? This machine is 9.2.4, as was the
> prior master.

9.2.5 tried to fix the issue (errors because of missing pg_subtrans
files during xid assignments), but the fix had unintended consequences,
namely the HS init bug. 9.2.6 removed those unintended consequences.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services