Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory

Lists: pgsql-admin
From: Mike Williams <mike(dot)williams(at)comodo(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
Date: 2010-03-28 17:41:41
Message-ID: 201003281841.41865.mike.williams@comodo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

Hi guys,

We upgraded from 8.3.5 to 8.3.9 yesterday, after an extended period of
testing, to 8.3.10 today in hope of fixing it, and we're getting the following
error on occasion.

ERROR: could not open segment 1 of relation 1663/743352/743420 (target block
2171336): No such file or directory
CONTEXT: SQL statement "SELECT day FROM min_stats WHERE day<= $1 AND
unique_ip=-1 ORDER BY day DESC LIMIT 1"
PL/pgSQL function "update_unique_ips" line 13 at SQL statement

update_unique_ips() is called once per minute, but we get anywhere from one
error per minute, to one error per hour.

Relation 743420 in database 743352 is 441270272 bytes long and growing by a
few thousand bytes here and there.
Obviously, as we've gone from 8.3.9 to 8.3.10, we're restarted postgres a few
times. We have also REINDEXd each table, the database, dropped and re-created
all the indexes.
I've even pg_dump'd the database, and loaded it into a fresh database. The
relation numbers changed, but the 1663 bit remained the same.

Best I can tell it's this bit of code from the beginning of
update_unique_ips() that's doing it, but it seems completely innocuous to me.

CREATE FUNCTION update_unique_ips() RETURNS timestamp without time zone
AS $$

DECLARE
last_day timestamp;
day_for_update timestamp;

BEGIN
-- declare last_day
SELECT INTO last_day day-interval '30 min' as day
FROM min_stats
ORDER BY day DESC
LIMIT 1;

SELECT INTO day_for_update day
FROM min_stats
WHERE day<=last_day AND unique_ip=-1
ORDER BY day DESC
LIMIT 1;

Could anyone help us out here please?
Thanks

--
Mike Williams


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Mike Williams <mike(dot)williams(at)comodo(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
Date: 2010-03-28 18:20:27
Message-ID: 25082.1269800427@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

Mike Williams <mike(dot)williams(at)comodo(dot)com> writes:
> We upgraded from 8.3.5 to 8.3.9 yesterday, after an extended period of
> testing, to 8.3.10 today in hope of fixing it, and we're getting the following
> error on occasion.

> ERROR: could not open segment 1 of relation 1663/743352/743420 (target block
> 2171336): No such file or directory
> CONTEXT: SQL statement "SELECT day FROM min_stats WHERE day<= $1 AND
> unique_ip=-1 ORDER BY day DESC LIMIT 1"
> PL/pgSQL function "update_unique_ips" line 13 at SQL statement

> update_unique_ips() is called once per minute, but we get anywhere from one
> error per minute, to one error per hour.

> Relation 743420 in database 743352 is 441270272 bytes long and growing by a
> few thousand bytes here and there.
> Obviously, as we've gone from 8.3.9 to 8.3.10, we're restarted postgres a few
> times. We have also REINDEXd each table, the database, dropped and re-created
> all the indexes.
> I've even pg_dump'd the database, and loaded it into a fresh database. The
> relation numbers changed, but the 1663 bit remained the same.

Wow. The symptom seems to be what you'd get from a corrupt tuple
pointer in an index, but it's hard to believe that such a thing would
persist over reindexes and even reloads. Is the "target block" number
always the same, or does that change? Can you reproduce the problem on
a different machine? What platform are you running on?

regards, tom lane


From: "Mike Williams" <mike(dot)williams(at)comodo(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
Date: 2010-03-28 19:29:52
Message-ID: 3c7898fc579b37bfec2f5ff971df7532.squirrel@mail.brad.office.comodo.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

On Sun, March 28, 2010 7:20 pm, Tom Lane wrote:
> Mike Williams <mike(dot)williams(at)comodo(dot)com> writes:
>> ERROR: could not open segment 1 of relation 1663/743352/743420 (target
>> block
>> 2171336): No such file or directory
>> CONTEXT: SQL statement "SELECT day FROM min_stats WHERE day<= $1 AND
>> unique_ip=-1 ORDER BY day DESC LIMIT 1"
>> PL/pgSQL function "update_unique_ips" line 13 at SQL statement

> Wow. The symptom seems to be what you'd get from a corrupt tuple
> pointer in an index, but it's hard to believe that such a thing would
> persist over reindexes and even reloads. Is the "target block" number
> always the same, or does that change? Can you reproduce the problem on
> a different machine? What platform are you running on?

While it's always pleasing to find an error condition that makes an expert
go "Wow", it is also quite troubling! :)

The "target block" stays the same for extended periods, but does change.
Oddly it's been quite for a while now, half an hour or so ago it had been
going nearly every minute for a good hour. For the last 30 times the block
was 11387925, before then 6634333 once, then 15520041 for 4 times, 6667190
before that for a while.
And just as I'm about to hit send, 3 more errors. All block 14528541.

Where does the 1663 number come from?
<datadir/base>/743352/743420 is a real file, but I don't have any file 1663.

The server is Gentoo, and I don't have an identical server to use. This
machines HA partner is staying at 8.3.5 until we're completely and
entirely sure 8.3.10 works right.

--
Mike Williams


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Mike Williams" <mike(dot)williams(at)comodo(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
Date: 2010-03-28 19:49:50
Message-ID: 4588.1269805790@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

"Mike Williams" <mike(dot)williams(at)comodo(dot)com> writes:
> The "target block" stays the same for extended periods, but does change.

Fascinating. Don't know what it means, but it's fascinating.

> Where does the 1663 number come from?
> <datadir/base>/743352/743420 is a real file, but I don't have any file 1663.

1663 is the OID of the pg_default tablespace; that's never going to
change unless you move the table to a non-default tablespace.

> The server is Gentoo, and I don't have an identical server to use. This
> machines HA partner is staying at 8.3.5 until we're completely and
> entirely sure 8.3.10 works right.

I wasn't really suggesting that you need an *identical* machine. If
this is a software problem, as seems moderately likely at this point,
it should be possible to reproduce it on a different machine. Even more
to the point, it's going to be difficult to find the problem without
tracing through the fault case with a debugger --- which isn't something
you're going to want to do on a production server. Have you got a test
machine you can load the data onto? Preferably one you could give a PG
developer access to?

<aside>Using gentoo for production servers is generally viewed with
suspicion around here; gentoo is a fine distro but, um, not especially
focused on stability. Still, there's no obvious reason to blame this on
the distro.</aside>

regards, tom lane


From: "Mike Williams" <mike(dot)williams(at)comodo(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
Date: 2010-03-28 22:27:33
Message-ID: 719e433b1c118d4d8de72f171199678a.squirrel@mail.brad.office.comodo.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

On Sun, March 28, 2010 8:49 pm, Tom Lane wrote:
>> The server is Gentoo, and I don't have an identical server to use. This
>> machines HA partner is staying at 8.3.5 until we're completely and
>> entirely sure 8.3.10 works right.
>
> I wasn't really suggesting that you need an *identical* machine. If
> this is a software problem, as seems moderately likely at this point,
> it should be possible to reproduce it on a different machine. Even more
> to the point, it's going to be difficult to find the problem without
> tracing through the fault case with a debugger --- which isn't something
> you're going to want to do on a production server. Have you got a test
> machine you can load the data onto? Preferably one you could give a PG
> developer access to?

We do have a test system, and the only significant difference from
production is the kernel.
Test are para-virt VMs with "regular" kernels, production are real
machines with hardened kernels (grsec+pax).
I got prod onto a regular kernel and nothing in nearly an hour.

8.3.5 has worked forever and a day on this type of setup.

--
Mike Williams


From: ALEXANDER JOSE <aangelj(at)hotmail(dot)com>
To: <tgl(at)sss(dot)pgh(dot)pa(dot)us>, <mike(dot)williams(at)comodo(dot)com>
Cc: <pgsql-admin(at)postgresql(dot)org>
Subject: Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
Date: 2010-03-29 13:49:36
Message-ID: SNT114-W53A0EA56F0BDE515EBC445A5200@phx.gbl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin


Good
morning

You
have to revise the base folder to create postgres when you perform the
initial installation and verify the relation fallanado that I spend And
it turned out he had a folder in C: \ Program Files \ PostgreSQL \ 8.2 \
data \ base folder exists the number 1, this
is where the file belongs to the database that lost the connection does
not exist, determine why and try to copy again, delete the database that
you are creating and then think again.

Alexander Angel
Venezuela

_________________________________________________________________
Invite your mail contacts to join your friends list with Windows Live Spaces. It's easy!
http://spaces.live.com/spacesapi.aspx?wx_action=create&wx_url=/friends.aspx&mkt=en-us


From: Alex Hunsaker <badalex(at)gmail(dot)com>
To: Mike Williams <mike(dot)williams(at)comodo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-admin(at)postgresql(dot)org
Subject: Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
Date: 2010-03-29 19:26:08
Message-ID: 34d269d41003291226n4bdc8f57t96132a22df64dea1@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

On Sun, Mar 28, 2010 at 16:27, Mike Williams <mike(dot)williams(at)comodo(dot)com> wrote:
> Test are para-virt VMs with "regular" kernels, production are real
> machines with hardened kernels (grsec+pax).

Ive seen this error on a few boxes around here, using a non grsec
kernel fixes it. I never bothered to report it because I cant
reproduce it and it goes away without grsec. (2.6.30-32 IIRC, have
not tired the latest 33). Interestingly I have the *exact* same
hardware and setup (kernel/userspace/postgres etc) on a test machine
that I've never been able to trigger it on.


From: Mike Williams <mike(dot)williams(at)comodo(dot)com>
To: Alex Hunsaker <badalex(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-admin(at)postgresql(dot)org
Subject: Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
Date: 2010-03-30 10:16:28
Message-ID: 201003301116.28805.mike.williams@comodo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

On Monday 29 March 2010 20:26:08 Alex Hunsaker wrote:
> > Test are para-virt VMs with "regular" kernels, production are real
> > machines with hardened kernels (grsec+pax).
>
> Ive seen this error on a few boxes around here, using a non grsec
> kernel fixes it. I never bothered to report it because I cant
> reproduce it and it goes away without grsec. (2.6.30-32 IIRC, have
> not tired the latest 33). Interestingly I have the *exact* same
> hardware and setup (kernel/userspace/postgres etc) on a test machine
> that I've never been able to trigger it on.

Thanks Alex, good to know I've not screwed up the kernel somehow.

I've been using 2.6.32 with grsecurity-2.1.14-2.6.32.9-201002231820 applied.

--
Mike Williams


From: Alex Hunsaker <badalex(at)gmail(dot)com>
To: Mike Williams <mike(dot)williams(at)comodo(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-admin(at)postgresql(dot)org
Subject: Re: ERROR: could not open segment 1 of relation 1663/743352/743420 (target block 6407642): No such file or directory
Date: 2010-03-30 20:28:09
Message-ID: 34d269d41003301328p760d4cci74da558baf1576b9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-admin

On Tue, Mar 30, 2010 at 04:16, Mike Williams <mike(dot)williams(at)comodo(dot)com> wrote:
> Thanks Alex, good to know I've not screwed up the kernel somehow.
>
> I've been using 2.6.32 with grsecurity-2.1.14-2.6.32.9-201002231820 applied.

Looks like the first instance I had of this problem was with
2.6.31.1-rc1-grsec. I know I tried 2.6.32-grsec and various
2.6.32.X-grsecs but all those had this issue at some point. Currently
im on a mostly stock 2.6.33.1 with no problems. I have not had the
nerve to try a -grsec kernel on it again.

For reference here are the errors I got:

could not open segment 3 of relation base/4440720/8003730

COPY public.page_loads (cgi, content_length, date_created, defunct,
host, ip, page_load_id, protocol, proxy_ip, referrer, request_method,
sessionid, url, user_id, audit_tid, user_agent_id, action, server) TO
'/tmp/blah.sql';
ERROR: invalid memory alloc request size 18446744073709551613

There were more could not open segment errors... but I seem to have lost them.

Normally I would think the above is corrupt data, but it would
sometimes work. It *always* worked on the non grsec kernel. So
instead it smells like bad ram, well its got ecc ram and survived
multiple runs of memtest, memtest86+ various versions. [ Yeah I know
people including me have seen ram that passes all that and is still
bad ]

Since you are having similar problems with a -grsec kernel sounds like
there might be some kind of memory corruption bug with it. I would
recommend trying a stock kernel and seeing if the problem goes away.
I also think the general attitude here is if you run crazy security
patches you get to keep both pieces. :)

Another fact that seemed to point to bad ram or some kind of kernel
corruption was trying to find the bad row COPY reported above:

SELECT count(*) from (select * from page_loads order by page_load_id
desc limit 937980) as foo;
ERROR: could not open segment 3 of relation base/4440720/8003730
(target block 4680336): No such file or directory

SELECT count(*) from ( select * from page_loads order by page_load_id
desc limit 937970) as foo;
count
--------
937970

<70-79 snipped all worked>

SELECT count(*) from (select * from page_loads order by page_load_id
desc limit 937979) as foo;
count
--------
937979

-- Uhh this was just broken...
SELECT count(*) from (select * order by page_load_id desc limit 937980) as foo;
count
--------
937980

I thought I had some stacktraces... but they are not in my notes... I
do remember tracing through them and coming to the conclusion that its
most likely some kind of kernel bug. (That error can only happen if we
try to open a file that does not exist, but we can only get that far
if the file existed or some such) Sorry Im a bit hazy this was back
in September.