Re: Lost rows/data corruption?

Lists: pgsql-general
From: "Andrew Hall" <temp02(at)bluereef(dot)com(dot)au>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Lost rows/data corruption?
Date: 2005-02-15 05:15:24
Message-ID: 00b101c5131d$5a7259b0$5001010a@bluereef.local

>> It sounds like a mess, all right. Do you have a procedure to follow to
>> replicate this havoc? Are you sure there's not a hardware problem
>> underlying it all?
>>
>> regards, tom lane
>>

We haven't been able to isolate what causes it, but it's unlikely to be
hardware as it happens on quite a few of our customers' boxes. We also use
XFS on Linux 2.6 as a file system, so the FS should be fairly tolerant of
power outages. Any ideas as to how I might go about isolating this? Have
you heard any other reports of this kind, or of suggested remedies?


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Andrew Hall" <temp02(at)bluereef(dot)com(dot)au>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Lost rows/data corruption?
Date: 2005-02-15 05:58:43
Message-ID: 8678.1108447123@sss.pgh.pa.us

"Andrew Hall" <temp02(at)bluereef(dot)com(dot)au> writes:
> We haven't been able to isolate what causes it but it's unlikely to be
> hardware as it happens on quite a few of our customer's boxes.

Okay, then not hardware; but it seems like you ought to be in a position
to create a test case for other people to poke at. I don't insist on
a 100% reproducible case, but something that will show the problem if
run for a while would be a great help.

regards, tom lane


From: Geoffrey <esoteric(at)3times25(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Lost rows/data corruption?
Date: 2005-02-15 10:56:51
Message-ID: 4211D573.5050304@3times25.net

Tom Lane wrote:
> "Andrew Hall" <temp02(at)bluereef(dot)com(dot)au> writes:
>
>> We haven't been able to isolate what causes it but it's unlikely to be
>> hardware as it happens on quite a few of our customer's boxes.
>
>
> Okay, then not hardware; but it seems like you ought to be in a position
> to create a test case for other people to poke at. I don't insist on
> a 100% reproducible case, but something that will show the problem if
> run for awhile would be a great help.

His original statement prompts a question in my mind. I may be wrong
here, but when he noted:

'We also use XFS on linux 2.6 as a file system, so the FS should be
fairly tolerant to power-outages.'

Is Andrew indicating here that there might be some issues with power
loss on some of these boxes? If so, is it reasonable to assume that the
filesystem is able to maintain the database integrity in such a power
loss? I understand that XFS is quite a robust file system, but I can't
see relying on such robustness for database integrity (or any file
integrity for that matter). UPSes might be a better solution.

So the actual question in my mind that I didn't see anyone touch on is,
is it safe to assume that a power outage will not affect the database
integrity based on the robustness of the file system type?

Personally, I would not rely on such, but I'd like to hear what the
PostgreSQL experts think about this issue.

Then again, I may have read too much into Andrew's post. Andrew, do you
assume there have been power issues with any of these machines? Are you
comfortable relying on the filesystem to deal with such issues?

Ideally, I would research any correlation between power outages and the
database problems. If there are no power outages to speak of, then
sorry for yapping up the wrong woody perennial plant.

Really just fishing for some insights here folks.

--
Until later, Geoffrey


From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: Geoffrey <esoteric(at)3times25(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Lost rows/data corruption?
Date: 2005-02-15 14:29:08
Message-ID: 1108477748.11967.157.camel@state.g2switchworks.com

On Tue, 2005-02-15 at 04:56, Geoffrey wrote:
> Tom Lane wrote:
> > "Andrew Hall" <temp02(at)bluereef(dot)com(dot)au> writes:
> >
> >> We haven't been able to isolate what causes it but it's unlikely to be
> >> hardware as it happens on quite a few of our customer's boxes.
> >
> >
> > Okay, then not hardware; but it seems like you ought to be in a position
> > to create a test case for other people to poke at. I don't insist on
> > a 100% reproducible case, but something that will show the problem if
> > run for awhile would be a great help.
>
> His original statement prompts a question in my mind. I may be wrong
> here, but when he noted:
>
> 'We also use XFS on linux 2.6 as a file system, so the FS should be
> fairly tolerant to power-outages.'
>
> Is Andrew indicating here that there might be some issues with power
> loss on some of these boxes? If so, is it reasonable to assume that the
> filesystem is able to maintain the database integrity in such a power
> loss? I understand that XFS is quite a robust file system, but I can't
> see relying on such robustness for database integrity (or any file
> integrity for that matter). UPS's might be a better solution.

If I were him I'd try running my database on a different file system to
see if his version of XFS might be causing these problems.

While I agree that frequent power loss is NOT something a database
should be exposed to, a properly set up machine with a properly
functioning journalling file system should not experience these
problems. Might be time to check the drive subsystem to make sure it's
properly fsyncing data.


From: Marco Colombo <pgsql(at)esiway(dot)net>
To: Andrew Hall <temp02(at)bluereef(dot)com(dot)au>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Lost rows/data corruption?
Date: 2005-02-15 15:58:09
Message-ID: Pine.LNX.4.61.0502151458430.18326@Megathlon.ESI

On Tue, 15 Feb 2005, Andrew Hall wrote:

>
>
>>> It sounds like a mess, all right. Do you have a procedure to follow to
>>> replicate this havoc? Are you sure there's not a hardware problem
>>> underlying it all?
>>>
>>> regards, tom lane
>>>
>
> We haven't been able to isolate what causes it but it's unlikely to be
> hardware as it happens on quite a few of our customer's boxes. We also use
> XFS on linux 2.6 as a file system, so the FS should be fairly tolerant to
> power-outages. Any ideas as to how I might go about isolating this? Have you
> heard any other reports of this kind and suggested remedies?

Are you running with fsync = off? And did the hosts experience any
power outages recently?

.TM.
--
____/ ____/ /
/ / / Marco Colombo
___/ ___ / / Technical Manager
/ / / ESI s.r.l.
_____/ _____/ _/ Colombo(at)ESI(dot)it


From: "Andrew Hall" <temp02(at)bluereef(dot)com(dot)au>
To: "Marco Colombo" <pgsql(at)esiway(dot)net>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Lost rows/data corruption?
Date: 2005-02-16 00:27:54
Message-ID: 021e01c513be$5b20d700$5001010a@bluereef.local

fsync is on for all these boxes. Our customers run their own hardware, with
many different hardware specifications in use. Many of our customers don't
have UPS, although their power is probably pretty reliable (normal city
based utilities), but of course I can't guarantee they don't get an outage
once in a while with a thunderstorm etc.

The problem here is that we are consistently seeing the same kind of
corruption and symptoms across a fairly large number of customers (52 have
reported this problem), so there is something endemic happening here that to
be honest, I'm surprised no one else is seeing. Fundamentally there is
nothing particularly abnormal about our application or data, but regardless,
I would have thought these kinds of things (application design, data
representation etc.) irrelevant to whether the database reliably refuses
duplicate data on a primary key. Something is causing this corruption,
and one thing we do know is that it doesn't happen immediately with a new
installation; it takes time (several months of usage) before we start to see
this condition. I'd be really surprised if XFS is the problem, as I know
there are plenty of other people across the world using it reliably with
PG.
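
The symptom described above (more than one row carrying the same primary key value) can be looked for with a plain GROUP BY / HAVING query. The following is only a minimal sketch in Python: it uses an in-memory SQLite database so that it runs anywhere, and the table name `sessions` and column name `id` are hypothetical, not taken from this thread. The table is deliberately created without a primary key constraint to simulate the corrupted state.

```python
import sqlite3


def find_duplicate_keys(conn, table, key_column):
    """Return (key, count) pairs for key values that occur more than once."""
    # Identifiers cannot be bound as query parameters, so pass trusted names only.
    sql = (
        f"SELECT {key_column}, COUNT(*) FROM {table} "
        f"GROUP BY {key_column} HAVING COUNT(*) > 1 "
        f"ORDER BY {key_column}"
    )
    return conn.execute(sql).fetchall()


def demo():
    conn = sqlite3.connect(":memory:")
    # Hypothetical table; no PRIMARY KEY on purpose, to mimic a table whose
    # uniqueness guarantee has failed.
    conn.execute("CREATE TABLE sessions (id INTEGER, note TEXT)")
    conn.executemany(
        "INSERT INTO sessions VALUES (?, ?)",
        [(1, "a"), (2, "b"), (2, "b again"), (3, "c")],
    )
    return find_duplicate_keys(conn, "sessions", "id")
```

Running a query of this shape periodically on an affected installation might help narrow down when the duplicates first appear.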

We're going to see if we can build a test environment that can forcibly
cause this but I don't hold much hope, as we've tried to isolate it before
with little success. Here's what we tried changing when we originally went
searching for the problem, and it is still here:

- the hardware (tried single CPU instead of dual - though that may be an
issue with the OS)
- the OS version (tried Linux 2.6.5, 2.6.6, 2.6.7, 2.6.8.1, 2.6.10 and
2.4.22) - all using XFS
- the database table layout (tried changing the way the data is stored)
- the version of Jetty (servlet engine)
- the DB pool manager and PG JDBC driver versions
- the version of PG (tried two or three back from the latest)
- various vacuum regimes



From: Marco Colombo <pgsql(at)esiway(dot)net>
To: Andrew Hall <temp02(at)bluereef(dot)com(dot)au>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Lost rows/data corruption?
Date: 2005-02-16 12:46:55
Message-ID: Pine.LNX.4.61.0502161245020.18326@Megathlon.ESI

On Wed, 16 Feb 2005, Andrew Hall wrote:

> fsync is on for all these boxes. Our customers run their own hardware with
> many different specification of hardware in use. Many of our customers don't
> have UPS, although their power is probably pretty reliable (normal city based
> utilities), but of course I can't guarantee they don't get an outage once in
> a while with a thunderstorm etc.

I see. Well I can't help much, then, I don't run PG on XFS. I suggest testing
on a different FS, to exclude XFS problems. But with fsync on, the FS has
very little to do with reliability, unless it _lies_ about fsync(). Any
FS should return from fsync only after data is on disc, journal or not
(there might be issues with meta-data, but it's hardly a problem with PG).

It's more likely that the hardware (IDE disks) lies about data being on the
platter. But again, that matters only in case of sudden power-offs.
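
A rough way to probe the "lying disk" hypothesis is to time a run of small fsync'ed rewrites of the same block. A rotating disk that really waits for the platter should not complete many more of these per second than it has revolutions per second (120 for a 7200 rpm drive), so a rate far above that bound suggests that a write cache somewhere is acknowledging early. The sketch below is only an illustration in Python; the function names and the 512-byte write size are assumptions, not anything taken from this thread.

```python
import os
import tempfile
import time


def fsync_rate(directory, writes=200):
    """Return the number of fsync'ed 512-byte rewrites completed per second."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(writes):
            os.lseek(fd, 0, os.SEEK_SET)   # rewrite the same block each time
            os.write(fd, b"x" * 512)
            os.fsync(fd)                   # ask the OS to push it to the device
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)
    return writes / elapsed if elapsed > 0 else float("inf")


def max_honest_rate(rpm):
    """Rough upper bound on same-block rewrites per second for a rotating disk."""
    return rpm / 60.0
```

Comparing `fsync_rate("/path/on/the/data/disk")` against `max_honest_rate(7200)` gives only a hint, not proof: battery-backed controller caches and non-rotating storage legitimately exceed the bound.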

[...]
> this condition. I'd be really surprised if XFS is the problem as I know there
> are plenty of other people across the world using it reliability with PG.

This is kind of OT, but I don't follow your logic here.

I don't see why plenty of success stories of XFS+PG suggest to you
the culprit is PG. To me it's still 50% - 50%. :-)

Moreover, XFS is continuously updated (as it follows the normal fast release
cycle of the Linux kernel, like any other Linux FS), so it's hard to make a
data point unless someone else is using _exactly_ the same versions as
you do.

For example, in kernel changelog from 2.6.7 to 2.6.10 you can read:

"[XFS] Fix a race condition in the undo-delayed-write buffer routine."

"[XFS] Fix up memory allocators to be more resilient."

"[XFS] Fix a possible data loss issue after an unaligned unwritten
extent write."

"[XFS] handle inode creating race"

(only a few of them)

Now, I don't have even the faintest idea whether any of that might have
affected you or not, but still the point is that the Linux kernel changes a
lot. And vendors tend to customize their kernels a lot, too. On the PostgreSQL
side, releases are slowly paced, so it's easier.

Anyway, I agree your problem is weird, and that it must be something
on the server side.
No matter what you do on the client side (pool manager, JDBC driver,
servlet engine), in no way should the DB get corrupted with duplicated
primary keys.

I know this is a silly question, but when you write 'We do nothing with
any indexes' do you mean indexes are never, _never_ touched (I mean
explicitly, as in drop/create index), i.e. they are created at schema
creation time and then left alone? Just to make sure...

.TM.
--
____/ ____/ /
/ / / Marco Colombo
___/ ___ / / Technical Manager
/ / / ESI s.r.l.
_____/ _____/ _/ Colombo(at)ESI(dot)it


From: Alban Hertroys <alban(at)magproductions(dot)nl>
To: Marco Colombo <pgsql(at)esiway(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Lost rows/data corruption?
Date: 2005-02-16 13:14:56
Message-ID: 42134750.3050001@magproductions.nl

Marco Colombo wrote:
> On Wed, 16 Feb 2005, Andrew Hall wrote:
>
>> fsync is on for all these boxes. Our customers run their own hardware
>> with many different specification of hardware in use. Many of our
>> customers don't have UPS, although their power is probably pretty
>> reliable (normal city based utilities), but of course I can't
>> guarantee they don't get an outage once in a while with a thunderstorm
>> etc.
>
>
> I see. Well I can't help much, then, I don't run PG on XFS. I suggest
> testing
> on a different FS, to exclude XFS problems. But with fsync on, the FS has
> very little to do with reliability, unless it _lies_ about fsync(). Any
> FS should return from fsync only after data is on disc, journal or not
> (there might be issues with meta-data, but it's hardly a problem with PG).
>
> It's more likely the hardware (IDE disks) lies about data being on plate.
> But again that's only in case of sudden poweroffs.

Do you happen to have the same type of disks in all these systems? That
could point to a disk cache "problem" (e.g. the disks lying about having
written data from the cache to disk).

Or do you use the same disk parameters on all these machines? Have you
tried using the disks w/o write caching and/or in synchronous mode
(as opposed to "async")?

--
Alban Hertroys
MAG Productions

T: +31(0)53 4346874
F: +31(0)53 4346876
E: alban(at)magproductions(dot)nl
W: http://www.magproductions.nl


From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: Alban Hertroys <alban(at)magproductions(dot)nl>
Cc: Marco Colombo <pgsql(at)esiway(dot)net>, pgsql-general(at)postgresql(dot)org
Subject: Re: Lost rows/data corruption?
Date: 2005-02-16 14:18:01
Message-ID: 1108563480.11967.214.camel@state.g2switchworks.com

On Wed, 2005-02-16 at 07:14, Alban Hertroys wrote:
> Marco Colombo wrote:
> > On Wed, 16 Feb 2005, Andrew Hall wrote:
> >
> >> fsync is on for all these boxes. Our customers run their own hardware
> >> with many different specification of hardware in use. Many of our
> >> customers don't have UPS, although their power is probably pretty
> >> reliable (normal city based utilities), but of course I can't
> >> guarantee they don't get an outage once in a while with a thunderstorm
> >> etc.
> >
> >
> > I see. Well I can't help much, then, I don't run PG on XFS. I suggest
> > testing
> > on a different FS, to exclude XFS problems. But with fsync on, the FS has
> > very little to do with reliability, unless it _lies_ about fsync(). Any
> > FS should return from fsync only after data is on disc, journal or not
> > (there might be issues with meta-data, but it's hardly a problem with PG).
> >
> > It's more likely the hardware (IDE disks) lies about data being on plate.
> > But again that's only in case of sudden poweroffs.
>
> Do you happen to have the same type disks in all these systems? That
> could point to a disk cache "problem" (f.e. the disks lying about having
> written data from the cache to disk).
>
> Or do you use the same disk parameters on all these machines? Have you
> tried using the disks w/o write caching and/or in synchronous mode
> (contrary to "async").

I was wondering whether this problem had ever shown up on a machine that
HADN'T lost power abruptly. IFF the only machines that experience
corruption have lost power sometime beforehand, then I would look
towards the drives, the controller, the file system, or somewhere
in there.

I know there are write modes in ext3 that will allow corruption on power
loss (I think it's writeback). I know little of XFS in a production
environment, as I run ext3, warts and all.


From: Marco Colombo <pgsql(at)esiway(dot)net>
To: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Lost rows/data corruption?
Date: 2005-02-16 16:29:13
Message-ID: Pine.LNX.4.61.0502161705290.18326@Megathlon.ESI

On Wed, 16 Feb 2005, Scott Marlowe wrote:

> I know there are write modes in ext3 that will allow corruption on power
> loss (I think it's writeback). I know little of XFS in a production
> environment, as I run ext3, warts and all.

Yeah, but even in writeback mode, ext3 doesn't lie on fsync. No FS does.

Since PG can't expect any data to be on disk _before_ fsync completes,
it doesn't really make a difference. You can lose data in writeback mode
_if_ the application is not fsync-ing it (XFS's only "mode" is similar to
writeback). I'm not aware of any case in which the system can lie about
fsync(), unless the hardware is lying in turn.

One question for gurus: does PG use fsync() on dirty data pages when
they are flushed to disk at checkpoint time? Does it fsync() the
directory in case of file creation/deletion/rename?
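
To make the two questions concrete, here is what "fsync the data, then fsync the directory" looks like from an application's point of view. This is a generic POSIX-style sketch in Python, not a description of what PostgreSQL itself does; the function name `durable_write` and the `.tmp` suffix are made up for the illustration, and fsync on a directory descriptor is assumed to be supported (as it is on Linux).

```python
import os


def durable_write(path, data):
    """Replace `path` with `data`, returning only after both the file
    contents and the directory entry have been fsync'ed."""
    tmp_path = path + ".tmp"          # hypothetical temporary name
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()                     # push Python's buffer to the OS
        os.fsync(f.fileno())          # push the OS cache to the device
    os.replace(tmp_path, path)        # atomic rename over the old file
    # The rename lives in the directory, so the directory needs its own fsync.
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Without the final directory fsync, a crash shortly after the rename could leave the old name in place even though the new contents are safely on disk.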

.TM.
--
____/ ____/ /
/ / / Marco Colombo
___/ ___ / / Technical Manager
/ / / ESI s.r.l.
_____/ _____/ _/ Colombo(at)ESI(dot)it


From: "Andrew Hall" <temp02(at)bluereef(dot)com(dot)au>
To: "Marco Colombo" <pgsql(at)esiway(dot)net>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Lost rows/data corruption?
Date: 2005-02-17 08:25:34
Message-ID: 011701c514ca$409ce2e0$5001010a@bluereef.local

> I know this is a silly question, but when you write 'We do nothing with
> any indexes' do you mean indeces are never, _never_ touched (I mean
> explicitly, as in drop/create index), i.e. they are created at schema
> creation time and then left alone? Just to make sure...

Hi and thanks for your feedback,

Yes we never touch them, as in, they are implicitly created at schema create
time and then we don't touch them.


From: "Andrew Hall" <temp02(at)bluereef(dot)com(dot)au>
To: "Alban Hertroys" <alban(at)magproductions(dot)nl>, "Marco Colombo" <pgsql(at)esiway(dot)net>
Cc: <pgsql-general(at)postgresql(dot)org>
Subject: Re: Lost rows/data corruption?
Date: 2005-02-17 08:31:17
Message-ID: 011c01c514cb$0d139170$5001010a@bluereef.local

> Do you happen to have the same type disks in all these systems? That could
> point to a disk cache "problem" (f.e. the disks lying about having written
> data from the cache to disk).
>
> Or do you use the same disk parameters on all these machines? Have you
> tried using the disks w/o write caching and/or in synchronous mode
> (contrary to "async").

It's all pretty common stuff, quite a few customers use standard IDE
(various flavours of controller/disk), some now use SATA (again various
brands) and the rest use SCSI. The kernel we use is the standard Linus
approved kernel with the inbuilt drivers as part of the kernel. We don't
supply any non-default parameters to the disk controllers.

Thanks for your suggestion on write caching, I'll look into this, I'm also
tempted to try a different journalling FS too.


From: "Andrew Hall" <temp02(at)bluereef(dot)com(dot)au>
To: "Scott Marlowe" <smarlowe(at)g2switchworks(dot)com>, "Alban Hertroys" <alban(at)magproductions(dot)nl>
Cc: "Marco Colombo" <pgsql(at)esiway(dot)net>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Lost rows/data corruption?
Date: 2005-02-17 08:40:25
Message-ID: 012101c514cc$537b7190$5001010a@bluereef.local

> I was wondering if this problem had ever shown up on a machine that
> HADN'T lost power abrubtly or not. IFF the only machines that
> experience corruption have lost power beforehand sometime, then I would
> look towards either the drives, controller or file system or somewhere
> in there.

I can't be sure. We have an automated maintenance process that reboots all
our customers' machines every 10 days at 2am. Having said this I'm now
wondering if this may have something to do with the issue. This automated
process issues a 'shutdown' to the database (and all other processes), waits
20 seconds and then issues a 'reboot' to the kernel. If the database was
still processing, the active postmaster process may wait for the client to
complete the query before allowing it to close, but I'm assuming that if
this exceeds 20 seconds, the kernel will issue a 'sigquit' to the process
tree and reboot immediately. Could this cause corruption?
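
Independently of the answer, the fixed 20-second wait in the procedure described above could be replaced by a wait that ends only when the database server has actually exited (or a much longer deadline passes). A minimal sketch in Python, assuming the maintenance script knows the server's process id, for example from the first line of `postmaster.pid` in the data directory; the function names and the polling interval are illustrative only.

```python
import os
import time


def process_alive(pid):
    """Return True if a process with this pid exists (POSIX signal-0 probe)."""
    try:
        os.kill(pid, 0)               # signal 0 checks existence, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True                   # exists, but owned by another user
    return True


def wait_for_exit(is_alive, timeout, poll_interval=0.5):
    """Poll is_alive() until it returns False or `timeout` seconds pass.

    Returns True if the process exited in time, False on timeout."""
    deadline = time.monotonic() + timeout
    while is_alive():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

A maintenance script could call `wait_for_exit(lambda: process_alive(pid), timeout=600)` after requesting the shutdown and skip the reboot, or raise an alert, when it returns False.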


From: Michael Fuhr <mike(at)fuhr(dot)org>
To: Andrew Hall <temp02(at)bluereef(dot)com(dot)au>
Cc: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>, Alban Hertroys <alban(at)magproductions(dot)nl>, Marco Colombo <pgsql(at)esiway(dot)net>, pgsql-general(at)postgresql(dot)org
Subject: Re: Lost rows/data corruption?
Date: 2005-02-17 14:56:02
Message-ID: 20050217145602.GB25761@winnie.fuhr.org

On Thu, Feb 17, 2005 at 07:40:25PM +1100, Andrew Hall wrote:
>
> We have an automated maintenance process that reboots all our
> customers machines every 10 days at 2am.

What's the purpose of doing this? If it's necessary then the reboots
aren't really fixing anything. Is whatever problem that prompted
this procedure being investigated so a permanent fix can be applied?

--
Michael Fuhr
http://www.fuhr.org/~mfuhr/


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Andrew Hall" <temp02(at)bluereef(dot)com(dot)au>
Cc: "Scott Marlowe" <smarlowe(at)g2switchworks(dot)com>, "Alban Hertroys" <alban(at)magproductions(dot)nl>, "Marco Colombo" <pgsql(at)esiway(dot)net>, pgsql-general(at)postgresql(dot)org
Subject: Re: Lost rows/data corruption?
Date: 2005-02-17 15:26:46
Message-ID: 7097.1108654006@sss.pgh.pa.us

"Andrew Hall" <temp02(at)bluereef(dot)com(dot)au> writes:
> I can't be sure. We have an automated maintenance process that reboots all
> our customers machines every 10 days at 2am.

Why? Sounds like a decision made by someone who is used to Windows.
I've never seen any variant of Unix that needed that.

regards, tom lane


From: "Andrew Hall" <temp02(at)bluereef(dot)com(dot)au>
To: "Michael Fuhr" <mike(at)fuhr(dot)org>
Cc: "Scott Marlowe" <smarlowe(at)g2switchworks(dot)com>, "Alban Hertroys" <alban(at)magproductions(dot)nl>, "Marco Colombo" <pgsql(at)esiway(dot)net>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Lost rows/data corruption?
Date: 2005-02-17 23:11:35
Message-ID: 019401c51546$06c73620$5001010a@bluereef.local

We do the maintenance reboot (and other various log cleanups etc) as part of
our normal maintenance practice. We don't really 'need' to do this, however
we've traditionally found that operating systems perform better with an
occasional reboot (cleanup of fragmented memory etc.).



From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: Andrew Hall <temp02(at)bluereef(dot)com(dot)au>
Cc: Michael Fuhr <mike(at)fuhr(dot)org>, Alban Hertroys <alban(at)magproductions(dot)nl>, Marco Colombo <pgsql(at)esiway(dot)net>, pgsql-general(at)postgresql(dot)org
Subject: Re: Lost rows/data corruption?
Date: 2005-02-17 23:23:25
Message-ID: 1108682605.10956.15.camel@state.g2switchworks.com

A unix box or a mainframe that needs rebooting is generally considered
broken.



From: "Keith C(dot) Perry" <netadmin(at)vcsn(dot)com>
To: Andrew Hall <temp02(at)bluereef(dot)com(dot)au>
Cc: Alban Hertroys <alban(at)magproductions(dot)nl>, Marco Colombo <pgsql(at)esiway(dot)net>, pgsql-general(at)postgresql(dot)org
Subject: Re: Lost rows/data corruption?
Date: 2005-02-25 19:02:21
Message-ID: 1109358141.421f763de57fe@webmail.vcsn.com

Quoting Andrew Hall <temp02(at)bluereef(dot)com(dot)au>:

> > Do you happen to have the same type disks in all these systems? That could
> > point to a disk cache "problem" (f.e. the disks lying about having written
> > data from the cache to disk).
> >
> > Or do you use the same disk parameters on all these machines? Have you
> > tried using the disks w/o write caching and/or in synchronous mode
> > (contrary to "async").
>
> It's all pretty common stuff, quite a few customers use standard IDE
> (various flavours of controller/disk), some now use SATA (again various
> brands) and the rest use SCSI. The kernel we use is the standard Linus
> approved kernel with the inbuilt drivers as part of the kernel. We don't
> supply any non-default parameters to the disk controllers.
>
> Thanks for your suggestion on write caching, I'll look into this, I'm also
> tempted to try a different journalling FS too.

I'm a little late on this thread, but in regard to the SATA support: 2.4.29 in
my experience is really the first kernel that has decent SATA support (i.e. much
better data throughput). I think that would correspond to 2.6.9 or .10. But
even before you get into all that, I am curious to know what you mean by
"standard Linus kernel". Do you not compile your own kernels for the hardware
platform being used?

--
Keith C. Perry, MS E.E.
Director of Networks & Applications
VCSN, Inc.
http://vcsn.com

____________________________________
This email account is being host by:
VCSN, Inc : http://vcsn.com


From: "Andrew Hall" <temp02(at)bluereef(dot)com(dot)au>
To: "Keith C(dot) Perry" <netadmin(at)vcsn(dot)com>
Cc: "Alban Hertroys" <alban(at)magproductions(dot)nl>, "Marco Colombo" <pgsql(at)esiway(dot)net>, <pgsql-general(at)postgresql(dot)org>
Subject: Re: Lost rows/data corruption?
Date: 2005-02-26 05:01:15
Message-ID: 009001c51bc0$32e50db0$5001010a@bluereef.local

Yes, we compile our own kernel based on the "standardised" stable release
available at the time. Everything we need is compiled in. This is what I
mean by standard Linus approved kernel release (as opposed to an AC/MM
modified release etc.)
