Re: vacuum analyze corrupts database

Lists: pgsql-hackers
From: Michael Brusser <michael(at)synchronicity(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum analyze corrupts database
Date: 2003-05-25 22:55:38
Message-ID: IGEFLDJHFEOIFILGOFDJMEBBCAAA.michael@synchronicity.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>... Again, how about a debugger stack trace?
> Also, the pg_stats rows for the table would be interesting.
>

---------
Tom,
thanks for feedback.

This time I started postgres directly and ran 'analyze nla;'
followed by 'select ... where ... like..' clause with debugger attached -
I got this trace:

=================================================================
(dbx)
signal SEGV (no mapping at the fault address) in pfree at 0x2535e4
pfree+0x1c: ld [%o1 + 0x4], %o0
(dbx) where 30
=>[1] pfree(0x489420, 0xffbee890, 0x489420, 0xffbee880, 0x489628,
0xffbee888), at 0x2535e4
[2] convert_to_scalar(0x489078, 0x19, 0xffbee890, 0x489008, 0x488fc0,
0x413), at 0x1fc6b4
[3] scalarineqsel(0x484608, 0x42a, 0x0, 0x488a88, 0x489078, 0x19), at
0x1f94e4
[4] scalarltsel(0xffbeea30, 0x1f9864, 0x413, 0x413, 0xb, 0xff0000), at
0x1f9a00
[5] DirectFunctionCall4(0x1f9864, 0x484608, 0x42a, 0x4892d0, 0x0,
0x488a08), at 0x242de8
[6] prefix_selectivity(0x484608, 0x488a88, 0x488e28, 0xffbeec08,
0xffbeec23, 0xffbeec90), at 0x1fe8fc
[7] patternsel(0xffbeed38, 0x0, 0x42e9b0, 0x0, 0x0, 0xffbeed10), at
0x1f9e3c
[8] likesel(0xffbeed38, 0xffbeed38, 0x1f9ff4, 0x20, 0x0, 0x28), at
0x1fa004
[9] OidFunctionCall4(0x71b, 0x484608, 0x4bd, 0x488b70, 0x1, 0x492c18), at
0x2447c0
[10] restriction_selectivity(0x484608, 0x4bd, 0x488b70, 0x1, 0x0,
0xffbeee80), at 0x169720
[11] clauselist_selectivity(0x484608, 0x488d98, 0x1, 0x0, 0x0,
0xff141c4c), at 0x14d990
[12] restrictlist_selectivity(0x484608, 0x488d50, 0x1, 0x20, 0x0,
0xff18e9d4), at 0x14d7f4
[13] set_baserel_size_estimates(0x484608, 0x488be0, 0x1000000, 0x10, 0xc0,
0x488a08), at 0x1503d4
[14] set_plain_rel_pathlist(0x484608, 0x488be0, 0x484a10, 0x0, 0x0, 0x0),
at 0x14cb94
[15] set_base_rel_pathlists(0x484608, 0x0, 0x1, 0x0, 0x0, 0x488cb0), at
0x14cb58
[16] make_one_rel(0x484608, 0x484608, 0xffbef268, 0xffbef260, 0x4885d8,
0x28), at 0x14ca00
[17] subplanner(0x484608, 0x0, 0x0, 0x0, 0x29, 0x492c18), at 0x15c600
[18] query_planner(0x484608, 0x0, 0x0, 0x0, 0x4, 0x0), at 0x15c4f0
[19] grouping_planner(0x484608, 0xbff00000, 0x0, 0x5, 0x0, 0xff141c4c), at
0x15e4c8
[20] subquery_planner(0x484608, 0xbff00000, 0x0, 0x2b14c0, 0x225e0,
0xff18e9d4), at 0x15cc28
[21] planner(0x484608, 0x0, 0x0, 0xa, 0x4354, 0x0), at 0x15c8d0
[22] pg_plan_query(0x484608, 0x1, 0x0, 0xffbef590, 0x0, 0x0), at 0x1a5afc
[23] pg_exec_query_string(0x4844e8, 0x1, 0x42e8a0, 0x2b8f6c, 0x73,
0x484500), at 0x1a618c
[24] PostgresMain(0x2, 0x323428, 0x323440, 0x400, 0x3251c4, 0x175), at
0x1a8038
[25] main(0x2, 0xffbef7b4, 0xffbef7c0, 0x2ea310, 0x0, 0x0), at 0x1282e0

===================================================================

At some point I also ran this case under truss, here's the tail of the truss
log.
... ... ...
read(0, " s e l e c t c o u n".., 1024) = 68
write(2, " D E B U G : S t a r".., 32) = 32
write(2, " L O G : q u e r y :".., 82) = 82
write(2, " L O G : p a r s e ".., 1246) = 1246
write(2, " L O G : r e w r i t".., 1258) = 1258
llseek(24, 81920, SEEK_SET) = 81920
read(24, "\0\0\0\0\0\0\010\0\0\001".., 8192) = 8192
llseek(25, 49152, SEEK_SET) = 49152
read(25, "\0\0\0\0\0\0\010\0\0\001".., 8192) = 8192
brk(0x00499AE8) = 0
brk(0x004D9AE8) = 0
open64("/home/tmichael/sean/postgresql/base/100423/16645", O_RDWR) = 32
read(32, "\0\0\0\0\0 C 6E0\0\0\0\t".., 8192) = 8192
read(32, "\0\0\0\001 Y 7C8\0\0\0 '".., 8192) = 8192
open64("/home/tmichael/sean/postgresql/base/100423/16408", O_RDWR) = 33
llseek(33, 24576, SEEK_SET) = 24576
read(33, "\0\0\0\001 Y17 h\0\0\0 '".., 8192) = 8192
llseek(24, 57344, SEEK_SET) = 57344
read(24, "\0\0\0\0\0\0\010\0\0\001".., 8192) = 8192
Incurred fault #6, FLTBOUNDS %pc = 0x002541D4
siginfo: SIGSEGV SEGV_MAPERR addr=0x0161F3EC
Received signal #11, SIGSEGV [default]
siginfo: SIGSEGV SEGV_MAPERR addr=0x0161F3EC
*** process killed ***

===========================================

I am attaching the dump of pg_statistic table,
I believe the record for the nla table is the last one in the file.

Thank you,
please let me know what else can I do.
Mike.

Attachment Content-Type Size
pg_statistic.dump application/octet-stream 17.6 KB

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: michael(at)synchronicity(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum analyze corrupts database
Date: 2003-05-26 01:03:04
Message-ID: 24660.1053910984@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Michael Brusser <michael(at)synchronicity(dot)com> writes:
> (dbx) where 30
> =>[1] pfree(0x489420, 0xffbee890, 0x489420, 0xffbee880, 0x489628,
> 0xffbee888), at 0x2535e4
> [2] convert_to_scalar(0x489078, 0x19, 0xffbee890, 0x489008, 0x488fc0,
> 0x413), at 0x1fc6b4
> [3] scalarineqsel(0x484608, 0x42a, 0x0, 0x488a88, 0x489078, 0x19), at
> 0x1f94e4

Hmm. Are you running the database in a non-C locale? (pg_controldata
would tell you.) If so, this pfree is trying to pfree one of three
strings that were filled with strxfrm().

I am wondering if strxfrm() on your platform sometimes writes more bytes
than it is supposed to. I have seen vsnprintf() overrun its output
buffer on some flavors of Solaris (according to FAQ_Solaris, the 64-bit
libc in Solaris 7 had such a problem). Could there be a similar bug in
their strxfrm?

regards, tom lane


From: Michael Brusser <michael(at)synchronicity(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum analyze corrupts database
Date: 2003-05-28 03:22:39
Message-ID: IGEFLDJHFEOIFILGOFDJMEDCCAAA.michael@synchronicity.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Hmm. Are you running the database in a non-C locale?
> (pg_controldata would tell you.)
-----------

Here's the output of pg_controldata:

pg_control version number: 72
Catalog version number: 200211021
Database cluster state: in production
pg_control last modified: Sun May 25 18:38:06 2003
Current log file ID: 0
Next log file segment: 2
Latest checkpoint location: 0/15EF7A8
Prior checkpoint location: 0/15ED2D8
Latest checkpoint's REDO location: 0/15EF7A8
Latest checkpoint's UNDO location: 0/0
Latest checkpoint's StartUpID: 47
Latest checkpoint's NextXID: 3563
Latest checkpoint's NextOID: 118086
Time of latest checkpoint: Sun May 25 18:38:02 2003
Database block size: 8192
Blocks per segment of large relation: 131072
Maximum length of identifiers: 64
Maximum number of function arguments: 32
Date/time type storage: Floating point
Maximum length of locale name: 128
LC_COLLATE: en_US.ISO8859-1
LC_CTYPE: en_US.ISO8859-1
-----------------------------
locale settings on the host:
tomkins% locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=
-----------------------------
> I am wondering if strxfrm() on your platform sometimes writes more bytes
> than it is supposed to. I have seen vsnprintf() overrun its output
> buffer on some flavors of Solaris (according to FAQ_Solaris, the 64-bit
> libc in Solaris 7 had such a problem). Could there be a similar bug in
> their strxfrm?

We're on Solaris 8. I'll try to find information on strxfrm bugs,
but do you rule out any problems in Postgres code?
Is there a good explanation to why the same table loaded into another
Postgres
installation on the same host can be analyzed without any problems?
Also in my database I can drop/create database, load table and reproduce the
error.
Not sure what to make out of this.

Mike.

-----Original Message-----
From: pgsql-hackers-owner(at)postgresql(dot)org
[mailto:pgsql-hackers-owner(at)postgresql(dot)org]On Behalf Of Tom Lane
Sent: Sunday, May 25, 2003 9:03 PM
To: michael(at)synchronicity(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [HACKERS] vacuum analyze corrupts database

Michael Brusser <michael(at)synchronicity(dot)com> writes:
> (dbx) where 30
> =>[1] pfree(0x489420, 0xffbee890, 0x489420, 0xffbee880, 0x489628,
> 0xffbee888), at 0x2535e4
> [2] convert_to_scalar(0x489078, 0x19, 0xffbee890, 0x489008, 0x488fc0,
> 0x413), at 0x1fc6b4
> [3] scalarineqsel(0x484608, 0x42a, 0x0, 0x488a88, 0x489078, 0x19), at
> 0x1f94e4

Hmm. Are you running the database in a non-C locale? (pg_controldata
would tell you.) If so, this pfree is trying to pfree one of three
strings that were filled with strxfrm().

I am wondering if strxfrm() on your platform sometimes writes more bytes
than it is supposed to. I have seen vsnprintf() overrun its output
buffer on some flavors of Solaris (according to FAQ_Solaris, the 64-bit
libc in Solaris 7 had such a problem). Could there be a similar bug in
their strxfrm?

regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 3: if posting/reading through Usenet, please send an appropriate
subscribe-nomail command to majordomo(at)postgresql(dot)org so that your
message can get through to the mailing list cleanly


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: michael(at)synchronicity(dot)com
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum analyze corrupts database
Date: 2003-05-28 03:34:21
Message-ID: 8199.1054092861@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Michael Brusser <michael(at)synchronicity(dot)com> writes:
> but do you rule out any problems in Postgres code?

Never ... but when you're the only one reporting a problem, a local
issue is something to consider.

> Is there a good explanation to why the same table loaded into another
> Postgres
> installation on the same host can be analyzed without any problems?

Well, first thing I'd ask is whether the other installation is using
the same locale settings.

regards, tom lane


From: Michael Brusser <michael(at)synchronicity(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum analyze corrupts database
Date: 2003-05-28 13:04:18
Message-ID: IGEFLDJHFEOIFILGOFDJCEDMCAAA.michael@synchronicity.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Well, first thing I'd ask is whether the other installation is using
> the same locale settings.
Oh, yes, I am very sure of that.
Mike


From: Alvaro Herrera Munoz <alvherre(at)dcc(dot)uchile(dot)cl>
To: Michael Brusser <michael(at)synchronicity(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum analyze corrupts database
Date: 2003-05-28 14:16:59
Message-ID: 20030528141659.GA9896@dcc.uchile.cl
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, May 28, 2003 at 09:04:18AM -0400, Michael Brusser wrote:
> > Well, first thing I'd ask is whether the other installation is using
> > the same locale settings.
> Oh, yes, I am very sure of that.

Note that while the machine reports the C locale, pg_controldata
actually said the database was en_US.iso8859-1 or something like that.
Make sure pg_controldata reports the same on the other machine...

--
Alvaro Herrera (<alvherre[(at)]dcc(dot)uchile(dot)cl>)
"The eagle never lost so much time as
when he submitted to learn from the crow." (William Blake, citado por Nobody)