Re: Error on vacuum: xmin before relfrozenxid

Lists: pgsql-generalpgsql-hackers
From: Paolo Crosato <paolo(dot)crosato(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Error on vacuum: xmin before relfrozenxid
Date: 2018-05-22 14:18:20
Message-ID: CAKMFJucqbuoDRfxPDX39WhA3vJyxweRg_zDVXzncr6+5wOguWA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hi,

we have an error happening on a catalog table on one of the dbs in the
instance.
This is the error report form:

A description of what you are trying to achieve and what results you
expect.:

Any kind of vacuum fails on pg_authid table, I would expect it to succeed.
This is occasionaly blocking autovacuums, so we must resort to manual
vacuum of busy tables, which is bearable but inconvenient.

PostgreSQL version number you are running:

PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
20150623 (Red Hat 4.8.5-28), 64-bit

How you installed PostgreSQL:

From the pgdg yum repositories.

Changes made to the settings in the postgresql.conf file: see Server
Configuration for a quick way to list them all.

name |
current_setting | source
------------------------------+------------------------------------------------------------------------------+----------------------
application_name | psql
| client
archive_command | rsync -a %p
barman(at)x(dot)x(dot)x(dot)x:/data/backup/barman/ubipgsql-10/incoming/%f
| configuration file
archive_mode | on
| configuration file
checkpoint_completion_target | 0.9
| configuration file
checkpoint_timeout | 25min
| configuration file
client_encoding | UTF8
| client
DateStyle | ISO, MDY
| configuration file
default_text_search_config | pg_catalog.english
| configuration file
dynamic_shared_memory_type | posix
| configuration file
effective_cache_size | 64GB
| configuration file
hot_standby | on
| configuration file
hot_standby_feedback | on
| configuration file
lc_messages | en_US.UTF-8
| configuration file
lc_monetary | en_US.UTF-8
| configuration file
lc_numeric | en_US.UTF-8
| configuration file
lc_time | en_US.UTF-8
| configuration file
listen_addresses | *
| configuration file
log_autovacuum_min_duration | 1s
| configuration file
log_checkpoints | on
| configuration file
log_destination | syslog
| configuration file
log_directory | log
| configuration file
log_filename | postgresql-%a.log
| configuration file
log_line_prefix | user=%u,db=%d,client=%h
| configuration file
log_lock_waits | on
| configuration file
log_min_duration_statement | 1s
| configuration file
log_min_messages | info
| configuration file
log_rotation_age | 1d
| configuration file
log_rotation_size | 0
| configuration file
log_statement | ddl
| configuration file
log_temp_files | 1MB
| configuration file
log_timezone | Europe/Rome
| configuration file
log_truncate_on_rotation | on
| configuration file
logging_collector | on
| configuration file
maintenance_work_mem | 1GB
| configuration file
max_connections | 1000
| configuration file
max_stack_depth | 2MB
| environment variable
max_wal_size | 12GB
| configuration file
min_wal_size | 80MB
| configuration file
password_encryption | scram-sha-256
| configuration file
pg_stat_statements.max | 10000
| configuration file
pg_stat_statements.track | all
| configuration file
shared_buffers | 32GB
| configuration file
shared_preload_libraries | pg_stat_statements
| configuration file
syslog_facility | local0
| configuration file
syslog_ident | postgres
| configuration file
TimeZone | Europe/Rome
| configuration file
track_activity_query_size | 2048
| configuration file
track_functions | all
| configuration file
track_io_timing | on
| configuration file
wal_keep_segments | 200
| configuration file
wal_level | logical
| configuration file
(51 rows)

Operating system and version:

CentOS Linux release 7.4.1708 (Core)
Linux xx.xx.com 3.10.0-693.11.1.el7.x86_64 #1 SMP Mon Dec 4 23:52:40 UTC
2017 x86_64 x86_64 x86_64 GNU/Linux

What program you're using to connect to PostgreSQL:

psql, jdbc driver. It's not an application side error.

Is there anything relevant or unusual in the PostgreSQL server logs?:

Yes, we see these kind of errors:

May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-1] user=,db=,client=
ERROR: found xmin 2889675859 from before relfrozenxid 400011439
May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-2] user=,db=,client=
CONTEXT: automatic vacuum of table "postgres.pg_catalog.pg_authid"

When these errors pop up, autovacuum repeatedly fails on this very same
table, and we have to resort to manual vacuums

For questions about any kind of error:
What you were doing when the error happened / how to cause the error:

The first time we saw the error, we found an idle transaction was left open
for several days. The user was not monitored for long running transactions.
We killed the session, deleted the offending row (the one with the xmin
value reported) and then vacuumed the table.
However, teh same error pops up every time we try to add a new user. We
would like to fix these error once and for all, so we can add new users.

The EXACT TEXT of the error message you're getting, if there is one: (Copy
and paste the message to the email, do not send a screenshot)

This is the error:

psql (10.4)
Type "help" for help.

postgres=# vacuum pg_authid;
ERROR: found xmin 3031994631 from before relfrozenxid 400011439

Some datas:

postgres=# select xmin from pg_authid ;
xmin
------------
1
1
1
1
1
557
7216348
110077819
110511334
3031994631
3032044199
3032044199
3032044199
3032070282
(14 rows)

postgres=# select relfrozenxid from pg_class where relname='pg_authid';
relfrozenxid
--------------
400011439
(1 row)

postgres=#

Is this a sympthom of data corruption or transaction wraparound due to the
long running transaction that we killed weeks ago? This is the only table
in the whole cluster that has this error. We are monitoring transactions
wraparound with the the check_postgres.pl script, the check is still
running fine and no alert was given at all since the cluster has been
running.

Thank you in advance for any answer.

Paolo Crosato


From: Andres Freund <andres(at)anarazel(dot)de>
To: Paolo Crosato <paolo(dot)crosato(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Error on vacuum: xmin before relfrozenxid
Date: 2018-05-22 16:49:11
Message-ID: 20180522164911.lczs6qsod23tx6eb@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hi,

On 2018-05-22 16:18:20 +0200, Paolo Crosato wrote:
> PostgreSQL version number you are running:
>
> PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
> 20150623 (Red Hat 4.8.5-28), 64-bit
>
> How you installed PostgreSQL:
>
> From the pgdg yum repositories.
>
> Changes made to the settings in the postgresql.conf file: see Server
> Configuration for a quick way to list them all.
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-1] user=,db=,client=
> ERROR: found xmin 2889675859 from before relfrozenxid 400011439
> May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-2] user=,db=,client=
> CONTEXT: automatic vacuum of table "postgres.pg_catalog.pg_authid"
>
> postgres=# select xmin from pg_authid ;
> xmin
> ------------
> 1
> 1
> 1
> 1
> 1
> 557
> 7216348
> 110077819
> 110511334
> 3031994631
> 3032044199
> 3032044199
> 3032044199
> 3032070282
> (14 rows)
>
> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> relfrozenxid
> --------------
> 400011439
> (1 row)
>
> postgres=#
>
> Is this a sympthom of data corruption or transaction wraparound due to the
> long running transaction that we killed weeks ago?

That shouldn't be the sole cause.

> This is the only table in the whole cluster that has this error. We
> are monitoring transactions wraparound with the the check_postgres.pl
> script, the check is still running fine and no alert was given at all
> since the cluster has been running.
>
> Thank you in advance for any answer.

Could you report the result of
select ctid, xmin, xmax from pg_authid ;
and
CREATE EXTENSION pageinspect;
SELECT * FROM heap_page_items(get_raw_page('pg_authid', 0));

Thanks.
Greetings,

Andres Freund


From: Paolo Crosato <paolo(dot)crosato(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Error on vacuum: xmin before relfrozenxid
Date: 2018-05-22 19:43:01
Message-ID: CAKMFJufB73W04jTYUVEdUgcsicW6F_wUrAv-dgWxinSoCTLBdw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hello,

2018-05-22 18:49 GMT+02:00 Andres Freund <andres(at)anarazel(dot)de>:

> Hi,
>
> On 2018-05-22 16:18:20 +0200, Paolo Crosato wrote:
> > PostgreSQL version number you are running:
> >
> > PostgreSQL 10.4 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5
> > 20150623 (Red Hat 4.8.5-28), 64-bit
> >
> > How you installed PostgreSQL:
> >
> > From the pgdg yum repositories.
> >
> > Changes made to the settings in the postgresql.conf file: see Server
> > Configuration for a quick way to list them all.
> > May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-1] user=,db=,client=
> > ERROR: found xmin 2889675859 from before relfrozenxid 400011439
> > May 20 16:06:04 ubipgsql01 postgres[26739]: [1380-2] user=,db=,client=
> > CONTEXT: automatic vacuum of table "postgres.pg_catalog.pg_authid"
> >
> > postgres=# select xmin from pg_authid ;
> > xmin
> > ------------
> > 1
> > 1
> > 1
> > 1
> > 1
> > 557
> > 7216348
> > 110077819
> > 110511334
> > 3031994631
> > 3032044199
> > 3032044199
> > 3032044199
> > 3032070282
> > (14 rows)
> >
> > postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> > relfrozenxid
> > --------------
> > 400011439
> > (1 row)
> >
> > postgres=#
> >
> > Is this a sympthom of data corruption or transaction wraparound due to
> the
> > long running transaction that we killed weeks ago?
>
> That shouldn't be the sole cause.
>
>
> > This is the only table in the whole cluster that has this error. We
> > are monitoring transactions wraparound with the the check_postgres.pl
> > script, the check is still running fine and no alert was given at all
> > since the cluster has been running.
> >
> > Thank you in advance for any answer.
>
> Could you report the result of
> select ctid, xmin, xmax from pg_authid ;
>

This is the result:

postgres=# select ctid, xmin, xmax from pg_authid ;
ctid | xmin | xmax
--------+------------+------
(0,1) | 1 | 0
(0,2) | 1 | 0
(0,3) | 1 | 0
(0,4) | 1 | 0
(0,5) | 1 | 0
(0,6) | 557 | 0
(0,11) | 7216348 | 0
(0,12) | 110077819 | 0
(0,13) | 110511334 | 0
(0,16) | 3031994631 | 0
(0,17) | 3032044199 | 0
(0,18) | 3032044199 | 0
(0,19) | 3032044199 | 0
(0,20) | 3032070282 | 0
(14 rows)

postgres=#

> and
> CREATE EXTENSION pageinspect;
> SELECT * FROM heap_page_items(get_raw_page('pg_authid', 0));
>

This is the output of the second query:

postgres=# CREATE EXTENSION pageinspect;
CREATE EXTENSION
postgres=# SELECT * FROM heap_page_items(get_raw_page('pg_authid', 0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid
| t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |
t_data
----+--------+----------+--------+------------+--------+----------+--------+-------------+------------+--------+------------------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 | 8080 | 1 | 108 | 1 | 0 | 0 | (0,1)
| 11 | 2313 | 32 | 1111111110000000 | 3373 |
\x70675f6d6f6e69746f720000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
2 | 7968 | 1 | 108 | 1 | 0 | 0 | (0,2)
| 11 | 2313 | 32 | 1111111110000000 | 3374 |
\x70675f726561645f616c6c5f73657474696e677300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
3 | 7856 | 1 | 108 | 1 | 0 | 0 | (0,3)
| 11 | 2313 | 32 | 1111111110000000 | 3375 |
\x70675f726561645f616c6c5f737461747300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
4 | 7744 | 1 | 108 | 1 | 0 | 0 | (0,4)
| 11 | 2313 | 32 | 1111111110000000 | 3377 |
\x70675f737461745f7363616e5f7461626c65730000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
5 | 7632 | 1 | 108 | 1 | 0 | 0 | (0,5)
| 11 | 2313 | 32 | 1111111110000000 | 4200 |
\x70675f7369676e616c5f6261636b656e6400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
6 | 7384 | 1 | 245 | 557 | 0 | 0 | (0,6)
| 11 | 11019 | 32 | 1111111111000000 | 24576 |
\x7265706c6963610000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001010000ffffffff24020000534352414d2d5348412d32353624343039363a305a6a336a39676d31314e473166455a6c75774a52413d3d2448394d446b2b445a4a723856544955702f684933664e4933376c777675696e49336731444a6672447863383d3a312f485442645265694c54306341724c47526f4d6d7739415a7351674c365a4c454a324e4e55306b5775413d
7 | 16 | 2 | 0 | | | |
| | | | | |
8 | 17 | 2 | 0 | | | |
| | | | | |
9 | 18 | 2 | 0 | | | |
| | | | | |
10 | 19 | 2 | 0 | | | |
| | | | | |
11 | 7136 | 1 | 245 | 7216348 | 0 | 0 | (0,11)
| 11 | 11019 | 32 | 1111111111000000 | 10 |
\x706f73746772657300000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101010101010100ffffffff24020000534352414d2d5348412d32353624343039363a72532f36303674706f523975465558797566392f6f413d3d2435412b4e745a444265333365397143556570422b324748664236567564717a65434b56427067576c2f41773d3a5a5863594e662f394f436744423345756f64514245564a526d39496361596d5a4b7856584e6a6d636554513d
12 | 6888 | 1 | 245 | 110077819 | 0 | 0 | (0,12)
| 11 | 11019 | 32 | 1111111111000000 | 432115 |
\x6261726d616e000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000101000001000000ffffffff24020000534352414d2d5348412d32353624343039363a7a716d737472687231416e376e39456f367a735677773d3d244d6351307473424e743466593169334d456e4c6a754866306e6c38774c44386f594236704630464e302b493d3a2b6336764a32372b68327832616d626c393554465168474867387270456a5a4a4952376d513044594e33553d
13 | 6640 | 1 | 245 | 110511334 | 0 | 0 | (0,13)
| 11 | 11019 | 32 | 1111111111000000 | 432127 |
\x73747265616d696e675f6261726d616e0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001010000ffffffff24020000534352414d2d5348412d32353624343039363a4e45496938767a714d7076752b594c585952377458413d3d245a39364b744d32316479687658716d3074616b7669717568794671574f45543647384174363954326e52413d3a5a4973674e71305537485344525168784f732b617364336d7559314d7365333373456956352b78694f52413d
14 | 20 | 2 | 0 | | | |
| | | | | |
15 | 0 | 3 | 0 | | | |
| | | | | |
16 | 6496 | 1 | 144 | 3031994631 | 0 | 0 | (0,16)
| 32779 | 10507 | 32 | 1111111111000000 | 675851 |
\x6e6167696f73000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff496d64356333633236616163636439616665346437383061396239613464663634653639
17 | 6384 | 1 | 108 | 3032044199 | 0 | 0 | (0,17)
| 32779 | 10505 | 32 | 1111111110000000 | 774033 |
\x726561646163636573730000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000000000000ffffffff
18 | 6136 | 1 | 245 | 3032044199 | 0 | 0 | (0,18)
| 32779 | 10507 | 32 | 1111111111000000 | 774134 |
\x74696572676173745f726f00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff24020000534352414d2d5348412d32353624343039363a43335379382b507849324d4e4f4a67306e6f467a5a773d3d24373575643339325154422b46672f364949396a32576e41326e544b62535636666572447662657142432b343d3a5074736777516278566a7336352f4833554e6b484e58664335793578422f552f6973636b426269746159633d
19 | 5888 | 1 | 245 | 3032044199 | 0 | 0 | (0,19)
| 32779 | 10507 | 32 | 1111111111000000 | 8608367 |
\x666c6565745f6d6f62696c655f6170700000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff24020000534352414d2d5348412d32353624343039363a5069564870613054594247703075746f6a47365061413d3d247951577151776e64574d317a4651787050644936505662516a4246534f5131424632374f7a6162616d36413d3a64504a5144572b4258454f632f526554566c4a7678654541714e4d365878314b77744c667942674745334d3d
20 | 5640 | 1 | 245 | 3032070282 | 0 | 0 | (0,20)
| 32779 | 10507 | 32 | 1111111111000000 | 1661808 |
\x746965726761737400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff24020000534352414d2d5348412d32353624343039363a7373427448544b574b68674232623265757634664e673d3d2453575a5253613578644a624951796649634244674875796e55684b436848676841475643767742504c50303d3a31656938464d354b30632b674a5a3233716e37314e46767978766d6978756a754c41666c4d445a676c63413d
(20 rows)

>
> Thanks.
> Greetings,
>
> Andres Freund
>

Thank you very much!

Best Regards

Paolo Crosato


From: Andres Freund <andres(at)anarazel(dot)de>
To: Paolo Crosato <paolo(dot)crosato(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Error on vacuum: xmin before relfrozenxid
Date: 2018-05-22 21:19:37
Message-ID: 20180522211937.bfotr476fkn3rav6@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hi,

On 2018-05-22 21:43:01 +0200, Paolo Crosato wrote:
> > Could you report the result of
> > select ctid, xmin, xmax from pg_authid ;
> >
>
> This is the result:
>
> postgres=# select ctid, xmin, xmax from pg_authid ;

> (0,16) | 3031994631 | 0
> 16 | 6496 | 1 | 144 | 3031994631 | 0 | 0 | (0,16)
> | 32779 | 10507 | 32 | 1111111111000000 | 675851 |
> \x6e6167696f73000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000000ffffffff496d64356333633236616163636439616665346437383061396239613464663634653639

> postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> relfrozenxid
> --------------
> 400011439

That tuple looks, to me, like it indeed shouldn't exist, given the
relfrozenxid. Decoding infomask (10507 / 0x290B), if I did it right,
shows:

HEAP_HASNULL
HEAP_HASVARWIDTH
HEAP_HASOID
HEAP_XMIN_COMMITTED
HEAP_XMAX_INVALID
HEAP_UPDATED

so it's not frozen. That suggests there's either a bug, or you had
corruption in your cluster.

Could you give a bit of "history" about that postgres instance? What
version of postgres was it run on earliest? Was it ever pg_upgraded?
Were there any OS crashes? Other hardware issues? Was the cluster ever
used with fsync=off or full_page_writes=off?

Greetings,

Andres Freund


From: Paolo Crosato <paolo(dot)crosato(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Error on vacuum: xmin before relfrozenxid
Date: 2018-05-22 22:04:26
Message-ID: CAKMFJucLdKK+Js1+o+z-aJ8qRuWMSivmpzPK_8zQNLLpysRsAA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Good evening,

2018-05-22 23:19 GMT+02:00 Andres Freund <andres(at)anarazel(dot)de>:

> Hi,
>
> On 2018-05-22 21:43:01 +0200, Paolo Crosato wrote:
> > > Could you report the result of
> > > select ctid, xmin, xmax from pg_authid ;
> > >
> >
> > This is the result:
> >
> > postgres=# select ctid, xmin, xmax from pg_authid ;
>
> > (0,16) | 3031994631 | 0
> > 16 | 6496 | 1 | 144 | 3031994631 | 0 | 0 |
> (0,16)
> > | 32779 | 10507 | 32 | 1111111111000000 | 675851 |
> > \x6e6167696f7300000000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 0000000000001000001000000ffffffff496d64356333633236616163636
> 439616665346437383061396239613464663634653639
>
> > postgres=# select relfrozenxid from pg_class where relname='pg_authid';
> > relfrozenxid
> > --------------
> > 400011439
>
> That tuple looks, to me, like it indeed shouldn't exist, given the
> relfrozenxid. Decoding infomask (10507 / 0x290B), if I did it right,
> shows:
>
> HEAP_HASNULL
> HEAP_HASVARWIDTH
> HEAP_HASOID
> HEAP_XMIN_COMMITTED
> HEAP_XMAX_INVALID
> HEAP_UPDATED
>
> so it's not frozen. That suggests there's either a bug, or you had
> corruption in your cluster.
>
>
> Could you give a bit of "history" about that postgres instance? What
> version of postgres was it run on earliest? Was it ever pg_upgraded?
> Were there any OS crashes? Other hardware issues? Was the cluster ever
> used with fsync=off or full_page_writes=off?
>
> Greetings,
>
> Andres Freund
>

The cluster is made of a primary master instance, and a secondary slave in
hot standby, with streaming replication. There is a barman server for the
backups.
The first version it ran on was 10.2, the cluster was promoted in
production in mid february. It was never pg_upgraded, we upgraded it
yesterday to 10.4 and restarted, in hope that this would resolve the issue.
The restart was performed by the upgrading process. There was never any
crash or hardware issue, the instance run without any problem. we never
enabled fsync=off or full_page_writes=off. This is the only real issue so
far. We have many nagios alerts monitoring the instance, the only problem
is we are creating many temporary files, most of them are caused by a query
that doesn't overflow work_mem when run alone with explain analyze buffers,
which is weird. I should mention that we had a major creation of temp files
in the first hours of the cluster history, however that was quickly
resolved.

I managed to recover the log of the first time we run into the issue, the
error was the same but on template1:

May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
found xmin 2600758304 from before relfrozenxid 400011439
May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
automatic vacuum of table "template1.pg_catalog.pg_authid"

Deleting the row worked, but, as you see, the problem with 400011439 was
there already. This happened when we added a user, shortly afterwards these
errors started popping up in the log. We found the issue because the
autovacuums stopped working, they were blocked on the pg_authid table. The
same time we found out the long running transaction, it was a multi session
kill that someway got stuck. Killing the transaction and deleting the
offending row was enough for the autovacuums to restart. We also had a
materialized view that was created months before and nobody used. Since i
found in the past there was a similar error related to materialized views,
I dropped it.

The instance is doing a lot of transactions every day, probably around 100
millions. I would need to do more checks to provide an accurate measure.

I still have the logs since the first time the error appeared, so I can
provide further details if needed.

Thanks,
Best Regards

Paolo Crosato


From: Andres Freund <andres(at)anarazel(dot)de>
To: Paolo Crosato <paolo(dot)crosato(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Error on vacuum: xmin before relfrozenxid
Date: 2018-05-22 23:39:58
Message-ID: 20180522233957.zbflihb5qrsh3vqp@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hi,

On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> I managed to recover the log of the first time we run into the issue, the
> error was the same but on template1:
>
> May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> found xmin 2600758304 from before relfrozenxid 400011439
> May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> automatic vacuum of table "template1.pg_catalog.pg_authid"

pg_authid (along with a few other tables) is shared between
databases. So that's just hte same error. At which rate are you
creating / updating database users / roles?

Greetings,

Andres Freund


From: Paolo Crosato <paolo(dot)crosato(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Error on vacuum: xmin before relfrozenxid
Date: 2018-05-23 07:39:54
Message-ID: CAKMFJuffEmarjLJY+8GQVXH=F=EWgxX=VgcYtub6_BG-Wfq4Zw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

2018-05-23 1:39 GMT+02:00 Andres Freund <andres(at)anarazel(dot)de>:

> Hi,
>
> On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> > I managed to recover the log of the first time we run into the issue, the
> > error was the same but on template1:
> >
> > May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> > found xmin 2600758304 from before relfrozenxid 400011439
> > May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> > automatic vacuum of table "template1.pg_catalog.pg_authid"
>
> pg_authid (along with a few other tables) is shared between
> databases. So that's just hte same error. At which rate are you
> creating / updating database users / roles?
>

Once or twice a month. Yesterday I updated some rows on the table with a

update pg_auth_id set rolname=rolname where rolname=...

for some users in the hope to work around the issue, but it didn't work.
That's why there are recent xmin on some rows.

Best Regards,

Paolo Crosato


From: Andres Freund <andres(at)anarazel(dot)de>
To: Paolo Crosato <paolo(dot)crosato(at)gmail(dot)com>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Error on vacuum: xmin before relfrozenxid
Date: 2018-05-23 18:32:38
Message-ID: 20180523183238.nfrzh35c7n4mhfuq@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

On 2018-05-22 16:39:58 -0700, Andres Freund wrote:
> Hi,
>
> On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> > I managed to recover the log of the first time we run into the issue, the
> > error was the same but on template1:
> >
> > May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> > found xmin 2600758304 from before relfrozenxid 400011439
> > May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client= CONTEXT:
> > automatic vacuum of table "template1.pg_catalog.pg_authid"
>
> pg_authid (along with a few other tables) is shared between
> databases. So that's just hte same error. At which rate are you
> creating / updating database users / roles?

Other questions:
- did you ever use VACUUM FULL or CLUSTER on pg_authid (or just on all
tables)?
- Did you have any failovers?
- Do you use logical replication?

Greetings,

Andres Freund


From: Paolo Crosato <paolo(dot)crosato(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Error on vacuum: xmin before relfrozenxid
Date: 2018-05-23 19:55:55
Message-ID: CAKMFJud2Vt1x0Kd2=K9NuomdWzJpZTpKWkBZ+iVs0j7sxfPb0g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

2018-05-23 20:32 GMT+02:00 Andres Freund <andres(at)anarazel(dot)de>:

> On 2018-05-22 16:39:58 -0700, Andres Freund wrote:
> > Hi,
> >
> > On 2018-05-23 00:04:26 +0200, Paolo Crosato wrote:
> > > I managed to recover the log of the first time we run into the issue,
> the
> > > error was the same but on template1:
> > >
> > > May 8 11:26:46 xxx postgres[32543]: [1154-1] user=,db=,client= ERROR:
> > > found xmin 2600758304 from before relfrozenxid 400011439
> > > May 8 11:26:46 xxx postgres[32543]: [1154-2] user=,db=,client=
> CONTEXT:
> > > automatic vacuum of table "template1.pg_catalog.pg_authid"
> >
> > pg_authid (along with a few other tables) is shared between
> > databases. So that's just hte same error. At which rate are you
> > creating / updating database users / roles?
>
> Other questions:
> - did you ever use VACUUM FULL or CLUSTER on pg_authid (or just on all
> tables)?
> - Did you have any failovers?
> - Do you use logical replication?
>

1) VACUUM FULL was issued after the first time the error occurred, and a
couple of times later. CLUSTER was never run.
2) Several failovers tests were perfomed before the cluster was moved to
production. However, before the move, the whole cluster was wiped,
including all the application and monitoring users. After the db was moved
to production, a couple of users were added without any problem.
3) No, even if the replication level is set to logical in postgresql.conf,
we only use streaming replication.

Best Regards,

Paolo Crosato


From: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
To: Paolo Crosato <paolo(dot)crosato(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-general(at)lists(dot)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Error on vacuum: xmin before relfrozenxid
Date: 2018-05-24 06:30:05
Message-ID: 7D158CD1-E6FC-4A49-9FC2-15F1AC2A7A05@yandex-team.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hi!

> 24 мая 2018 г., в 0:55, Paolo Crosato <paolo(dot)crosato(at)gmail(dot)com> написал(а):
>
> 1) VACUUM FULL was issued after the first time the error occurred, and a couple of times later. CLUSTER was never run.
> 2) Several failovers tests were perfomed before the cluster was moved to production. However, before the move, the whole cluster was wiped, including all the application and monitoring users. After the db was moved to production, a couple of users were added without any problem.
> 3) No, even if the replication level is set to logical in postgresql.conf, we only use streaming replication.

I've encountered seemingly similar ERROR:
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:ERROR: found xmin 747375134 from before relfrozenxid 2467346321
[ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:CONTEXT: automatic vacuum of table "postgres.pg_catalog.pg_database"

Table pg_database, probably, was not changed anyhow for a long period of database exploitation.
Unfortunately, I've found out this only there were million of xids left and had to vacuum freeze db in single user mode asap. But, probably, I will be able to restore database from backups and inspect it, if necessary. Though first occurrence of this error was beyond recovery window.

Best regards, Andrey Borodin.


From: Paolo Crosato <paolo(dot)crosato(at)gmail(dot)com>
To: Andrey Borodin <x4mmm(at)yandex-team(dot)ru>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-general(at)lists(dot)postgresql(dot)org, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Error on vacuum: xmin before relfrozenxid
Date: 2018-05-24 19:00:55
Message-ID: CAKMFJue7Cod47-MSc-2D4sK2Vhg5jsBqDH5u2VgBGm_LwSakzQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

2018-05-24 8:30 GMT+02:00 Andrey Borodin <x4mmm(at)yandex-team(dot)ru>:

> Hi!
>
> > 24 мая 2018 г., в 0:55, Paolo Crosato <paolo(dot)crosato(at)gmail(dot)com>
> написал(а):
> >
> > 1) VACUUM FULL was issued after the first time the error occurred, and a
> couple of times later. CLUSTER was never run.
> > 2) Several failovers tests were perfomed before the cluster was moved to
> production. However, before the move, the whole cluster was wiped,
> including all the application and monitoring users. After the db was moved
> to production, a couple of users were added without any problem.
> > 3) No, even if the replication level is set to logical in
> postgresql.conf, we only use streaming replication.
>
> I've encountered seemingly similar ERROR:
> [ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:ERROR: found xmin
> 747375134 from before relfrozenxid 2467346321
> [ 2018-05-22 15:04:03.270 MSK ,,,281756,XX001 ]:CONTEXT: automatic vacuum
> of table "postgres.pg_catalog.pg_database"
>
> Table pg_database, probably, was not changed anyhow for a long period of
> database exploitation.
> Unfortunately, I've found out this only there were million of xids left
> and had to vacuum freeze db in single user mode asap. But, probably, I will
> be able to restore database from backups and inspect it, if necessary.
> Though first occurrence of this error was beyond recovery window.
>
> Best regards, Andrey Borodin.

I could build a mirror instance with barman and see if the issue is present
as well, then try to vacuum freeze it in single mode, and see if it
disappears; but I would like to know why it happened in the first time. I
wonder if the autovacuum settings played a role, we kept the defaults, even
if the instance has a very heavy update workload.

Best Regards,

Paolo Crosato


From: "Wong, Yi Wen" <yiwong(at)amazon(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Paolo Crosato <paolo(dot)crosato(at)gmail(dot)com>
Cc: "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Error on vacuum: xmin before relfrozenxid
Date: 2018-05-24 20:25:04
Message-ID: 1527193504642.36340@amazon.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-general pgsql-hackers

Hello,

We have also seen this issue ourselves and are actively working on it. We do not have a definite proposed solution or full RCA yet, but we have managed to narrow down the cause to stale values of relfrozenxid in the relcache for shared relations.

We are in the process of coming up with a repro as well.

A proposed workaround to this issue is to delete global/pg_internal.init. That will allow a vacuum of the shared relation to proceed.

Thanks,

Yi Wen Wong