Re: repeatable system index corruption on 7.4.2 (SOLVED)

From: Joe Conway <mail(at)joeconway(dot)com>
To: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: repeatable system index corruption on 7.4.2
Date: 2004-08-19 17:59:22
Message-ID: 4124EA7A.4030706@joeconway.com
Lists: pgsql-hackers

I'm seeing the following errors after a few hours of fairly aggressive
bulk load of a database running on Postgres 7.4.2:

cyspec=# select version();
ERROR: root page 43 of "pg_proc_proname_args_nsp_index" has level 0,
expected 2
cyspec=# select * from pg_class;
ERROR: root page 3 of "pg_attribute_relid_attnum_index" has level 0,
expected 1

When I say aggressive, I mean up to 6 simultaneous COPY processes. It is
different from the issue Tom solved the other day in that we don't get
SIGABRT, just corrupt index pages. Here is a backtrace:

#0 errfinish (dummy=0) at elog.c:319
#1 0x081cbc26 in elog_finish (elevel=20, fmt=0x81e6ee0 "root page %u of
\"%s\" has level %u, expected %u")
at elog.c:853
#2 0x0808c632 in _bt_getroot (rel=0x82d58d4, access=1429534320) at
nbtpage.c:287
#3 0x080902f3 in _bt_search (rel=0x82d58d4, keysz=2, scankey=0x8307358,
bufP=0xfeffefa8, access=1) at nbtsearch.c:46
#4 0x08090fea in _bt_first (scan=0x8307198, dir=ForwardScanDirection)
at nbtsearch.c:575
#5 0x0808ed47 in btgettuple (fcinfo=0x82a65cc) at nbtree.c:326
#6 0x081ce5b6 in FunctionCall2 (flinfo=0x0, arg1=136996300,
arg2=136996300) at fmgr.c:993
#7 0x08088329 in index_getnext (scan=0x8307198,
direction=ForwardScanDirection) at indexam.c:503
#8 0x08087951 in systable_getnext (sysscan=0x82a65cc) at genam.c:253
#9 0x081c3c43 in RelationBuildTupleDesc (buildinfo=
{infotype = 2, i = {info_id = 136666169, info_name = 0x8255c39
"pg_index_indrelid_index"}},
relation=0x5a2cdea4) at relcache.c:548
#10 0x081c459f in RelationBuildDesc (buildinfo=
{infotype = 2, i = {info_id = 136666169, info_name = 0x8255c39
"pg_index_indrelid_index"}}, oldrelation=0x0)
at relcache.c:884
#11 0x081c56c0 in RelationSysNameGetRelation (relationName=0x8255c39
"pg_index_indrelid_index") at relcache.c:1637
#12 0x0807febe in relation_openr (sysRelationName=0x8255c39
"pg_index_indrelid_index", lockmode=0) at heapam.c:529
#13 0x08087ab5 in index_openr (sysRelationName=0x8255c39
"pg_index_indrelid_index") at indexam.c:179
#14 0x0808790a in systable_beginscan (heapRelation=0x82ee11c,
indexRelname=0x8255c39 "pg_index_indrelid_index",
indexOK=1 '\001', snapshot=0x0, nkeys=1, key=0xfefff2d0) at genam.c:192
#15 0x081c6f53 in RelationGetIndexList (relation=0x82d846c) at
relcache.c:2717
#16 0x08147bc4 in get_relation_info (relationObjectId=1259,
rel=0x83070d4) at plancat.c:81
#17 0x081492b4 in make_base_rel (root=0x82a65cc, relid=1) at relnode.c:159
#18 0x08148f91 in build_base_rel (root=0x8302164, relid=1) at relnode.c:70
#19 0x0813d5b8 in add_base_rels_to_query (root=0x8302164,
jtnode=0x83070b8) at initsplan.c:86
#20 0x0813e4c9 in query_planner (root=0x8302164, tlist=0x83067a0,
tuple_fraction=0, cheapest_path=0xfefff4e0,
sorted_path=0xfefff4e4) at planmain.c:119
#21 0x0813f03b in grouping_planner (parse=0x8302164, tuple_fraction=0)
at planner.c:897
#22 0x0813e9b4 in subquery_planner (parse=0x8302164, tuple_fraction=0)
at planner.c:315
#23 0x0813e69c in planner (parse=0x8302164, isCursor=0 '\0',
cursorOptions=0) at planner.c:119
#24 0x0816e8ba in pg_plan_query (querytree=0x8302164) at postgres.c:589
#25 0x0816e944 in pg_plan_queries (querytrees=0x830673c, needSnapshot=0
'\0') at postgres.c:656
#26 0x0816eafc in exec_simple_query (query_string=0x8301dbc "select *
from pg_class;\n") at postgres.c:814

Any thoughts? Do you think the patch in 7.4.5 will address this also, or
are we looking at something else?

Thanks,

Joe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: repeatable system index corruption on 7.4.2
Date: 2004-08-19 18:15:14
Message-ID: 1726.1092939314@sss.pgh.pa.us
Lists: pgsql-hackers

Joe Conway <mail(at)joeconway(dot)com> writes:
> I'm seeing the following errors after a few hours of fairly aggressive
> bulk load of a database running on Postgres 7.4.2:

> cyspec=# select version();
> ERROR: root page 43 of "pg_proc_proname_args_nsp_index" has level 0,
> expected 2

[ scratches head... ] Are you just doing bulk *data* load, or are you
doing extensive schema hacking too? What reason would there be for
index churn in pg_proc (or pg_class for that matter)?

Can you get a pg_filedump dump of that page of the index?

> When I say aggressive, I mean up to 6 simultaneous COPY processes. It is
> different from the issue Tom solved the other day in that we don't get
> SIGABRT, just corrupt index pages. Here is a backtrace:

That may have more to do with whether the error is detected inside a
critical section than anything else.

regards, tom lane


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: repeatable system index corruption on 7.4.2
Date: 2004-08-19 18:37:11
Message-ID: 4124F357.6020706@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
> Joe Conway <mail(at)joeconway(dot)com> writes:
>
>>I'm seeing the following errors after a few hours of fairly aggressive
>>bulk load of a database running on Postgres 7.4.2:
>
>>cyspec=# select version();
>>ERROR: root page 43 of "pg_proc_proname_args_nsp_index" has level 0,
>>expected 2
>
> [ scratches head... ] Are you just doing bulk *data* load, or are you
> doing extensive schema hacking too? What reason would there be for
> index churn in pg_proc (or pg_class for that matter)?

Here's more detail about what we're doing:

Done once
----------
create the database
create schemas
create "base" tables
create functions

Done in a loop
--------------
for year in X to Y
{
    create ~60 tables, inherited from the "base" tables
    for each table (up to 6 at a time in parallel)
    {
        load data via COPY
        create index
    }
}
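
In concrete terms, one table within one year's pass looks roughly like
this (a minimal sketch only -- the table, column, and file names below
are made up for illustration, not our actual schema):

    -- one of the ~60 per-year children, inheriting from a "base" table
    CREATE TABLE readings_1998 () INHERITS (readings_base);

    -- bulk load from the export file, then index the new child table
    COPY readings_1998 FROM '/data/export/readings_1998.dat';
    CREATE INDEX readings_1998_site_ts_idx
        ON readings_1998 (site_id, read_ts);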

I was just informed that the most recent attempt failed again, this time
with no more than 1 table loaded at a time.

> Can you get a pg_filedump dump of that page of the index?

Will do.
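
Roughly, the plan is to look up the damaged index's relfilenode and then
point pg_filedump at the matching file under $PGDATA/base/<database oid>/.
A hedged sketch of the lookup, assuming the catalogs are still readable
enough for a sequential scan:

    -- locate the on-disk file backing the damaged index
    SELECT d.oid AS database_oid, c.relfilenode
      FROM pg_class c, pg_database d
     WHERE c.relname = 'pg_proc_proname_args_nsp_index'
       AND d.datname = current_database();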

BTW, I also tried reindexing one of the corrupt databases and got this:

backend> reindex database cyspec;
2004-08-19 11:13:37 NOTICE: table "pg_class" was reindexed
2004-08-19 11:13:38 ERROR: invalid page header in block 6 of relation
"pg_attribute"

Joe


From: Joe Conway <mail(at)joeconway(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: repeatable system index corruption on 7.4.2
Date: 2004-08-19 21:49:13
Message-ID: 41252059.2060408@joeconway.com
Lists: pgsql-hackers

Simon Riggs wrote:
>>Joe Conway writes
>>I'm seeing the following errors after a few hours of fairly aggressive
>>bulk load of a database running on Postgres 7.4.2:
>
>>When I say aggressive, I mean up to 6 simultaneous COPY processes. It is
>>different from the issue Tom solved the other day in that we don't get
>>SIGABRT, just corrupt index pages.
>
> OK, problem accepted, but why would you run 6 simultaneous COPYs? Presumably
> on > 1 CPU? Sounds like you're hitting the right edge of the index really
> hard (as well as finding a hole in the logic).

This is fairly high end hardware -- 4 hyperthreaded CPUs (hence 8 CPUs
from the OS perspective), 8GB RAM.

But in any case, since last report we've reproduced the problem with a
single COPY at a time.

> Can I ask, are you also running simultaneous INSERTs or just COPYs? And
> presumably you're mixing that with SELECTs that are doing index scans? Does
> the table have random deletes on it, or just occasional regular bulk
> deletes? How often is it VACUUMed?

This is an initial data load, coming from an export from a large
cOmmeRciAl database. There is no activity other than loading, and no
inserts. The total number of rows to be imported into Postgres is ~900
million -- that's why we would like to do as much in parallel as
possible.

Joe


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "Joe Conway" <mail(at)joeconway(dot)com>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: repeatable system index corruption on 7.4.2
Date: 2004-08-19 21:56:41
Message-ID: NOEFLCFHBPDAFHEIPGBOMEOPCCAA.simon@2ndquadrant.com
Lists: pgsql-hackers

> Joe Conway writes
> I'm seeing the following errors after a few hours of fairly aggressive
> bulk load of a database running on Postgres 7.4.2:
>

> When I say aggressive, I mean up to 6 simultaneous COPY processes. It is
> different from the issue Tom solved the other day in that we don't get
> SIGABRT, just corrupt index pages.

OK, problem accepted, but why would you run 6 simultaneous COPYs? Presumably
on > 1 CPU? Sounds like you're hitting the right edge of the index really
hard (as well as finding a hole in the logic).

Can you stream the data to a shared pipe and run just a single COPY?
Could you "dilute" the index by adding an irrelevant number as the first
column on the index, so that the index hotspot moves to 6 places? Or use a
functional index to alter the distribution of values?
You'd be quite likely to see an improvement in performance by avoiding the
contention, and that may be a good workaround for now.
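
For instance (purely a sketch of the idea -- the table and column names
are hypothetical, and it assumes 7.4's expression indexes):

    -- a leading "bucket" expression spreads insertions across six index
    -- hotspots instead of one ever-growing right edge
    CREATE INDEX readings_spread_idx ON readings ((id % 6), id);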

Can I ask, are you also running simultaneous INSERTs or just COPYs? And
presumably you're mixing that with SELECTs that are doing index scans? Does
the table have random deletes on it, or just occasional regular bulk
deletes? How often is it VACUUMed?

Best Regards, Simon Riggs


From: Joe Conway <mail(at)joeconway(dot)com>
To:
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, "Hackers (PostgreSQL)" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: repeatable system index corruption on 7.4.2 (SOLVED)
Date: 2004-08-21 11:32:09
Message-ID: 412732B9.1030706@joeconway.com
Lists: pgsql-hackers

Joe Conway wrote:
> Simon Riggs wrote:
>>> Joe Conway writes
>>> I'm seeing the following errors after a few hours of fairly aggressive
>>> bulk load of a database running on Postgres 7.4.2:
>>
>>> When I say aggressive, I mean up to 6 simultaneous COPY processes. It is
>>> different from the issue Tom solved the other day in that we don't get
>>> SIGABRT, just corrupt index pages.
>>
>> OK, problem accepted, but why would you run 6 simultaneous COPYs?
>> Presumably
>> on > 1 CPU? Sounds like you're hitting the right edge of the index really
>> hard (as well as finding a hole in the logic).
>
> This is fairly high end hardware -- 4 hyperthreaded CPUs (hence 8 CPUs
> from the OS perspective), 8GB RAM.
>
> But in any case, since last report we've reproduced the problem with a
> single COPY at a time.

I just want to close the loop on this thread. In summary, the problem
turned out to be related to the logical volume, and NOT Postgres. For
those interested in the gory detail, read on -- others may safely move
on ;-)

Joe

----------------------
Here's what we had, in terms of storage layout (i.e. the layout
susceptible to Postgres system catalog corruption). Note that I'm
neither the unix admin nor the storage expert -- we had our own unix
admin, a storage expert from the local VAR, and a storage expert from
the SAN vendor involved. They decided how to lay this all out -- I'll do
my best to be accurate in my description.

[---------------- jfs filesystem ----------------]
[---------------- logical volume ----------------]
[----------------- volume group -----------------]
+------+------+------+------+------+------+------+
| LUN1 | LUN2 | LUN3 | LUN4 | LUN5 | LUN6 | LUN7 |
+------+------+------+------+------+------+------+

LUN[1-7] each actually consist of 14 x 73GB, 15K rpm SCSI drives,
configured (I think) as a RAID 5 array, totaling just under 1TB of
usable space. The SAN presents each of these arrays, via Fibre Channel,
to the OS as a single, large SCSI LUN. The LUNs are individually
partitioned using fdisk, with a single primary partition on each; in
this case the partition was offset by 128MB to allow for stripe
alignment (more on stripes later).

The volume group creates a block device. We used a 128MB physical
extent size to allow for up to an 8TB logical volume.

The logical volume is then created on top of the volume group. We
created a single, 6.4TB logical volume using 128MB stripes across the
LUNs in the volume group.

Finally, the filesystem is laid on top. We started out with jfs based on
vendor recommendations (I believe). I'll represent this topology a bit
more compactly as (SAN-SCSI->SLV->jfs), where SLV stands for striped
logical volume.

----------------------

In the above configuration we consistently got data corruption (as
evidenced by bad system indexes) regardless of the Postgres
configuration, and with 1 or 6 parallel data loads. We then changed the
filesystem to xfs. With an xfs filesystem, we got substantially farther
into our data load, but corruption did eventually occur.

After a variety of successful data loads (i.e. no corruption) on other
volumes with different topologies, we decided that the problem was
related either to the logical volume level or to the SAN hardware itself.

So we deleted the logical volume and the volume group. Then we
re-partitioned the LUNs -- this time without the offset for stripe
alignment. The volume group was rebuilt using 128MB extents again, but
the logical volume was build using concatenated, instead of striped,
LUNs. Finally we formatted with xfs. That is, (SAN-SCSI->CLV->xfs),
where CLV is concatenated logical volume.

(SAN-SCSI->CLV->xfs) worked with no data corruption. We're not entirely
sure why the original configuration was a problem, but the SAN vendor
has agreed to try to reproduce this scenario in their lab.

----------------------

Finally, some comparative times for the various volume types that we
tested while troubleshooting:

fail == system catalog data corruption
pass == no evident corruption

data1 -> (SAN-SCSI->SLV->jfs) -> fail
data1 -> (SAN-SCSI->SLV->xfs) -> fail
data3 -> (NFS-mounted-NAS) -> pass (122 minutes)
data2 -> (SAN-IDE->CLV->jfs) -> pass (103 minutes)
data1 -> (SAN-SCSI->CLV->xfs) -> pass (94 minutes)
data1 -> (SAN-SCSI->CLV->xfs) -> pass (93 minutes)

Times listed are the total clock time to complete the entire data load,
using 6 parallel processes doing bulk COPY.

Hope someone finds this useful.

Joe