Re: Out of space situation and WAL log pre-allocation (was Tablespaces)

Lists: pgsql-hackers
From: "Zeugswetter Andreas SB SD" <ZeugswetterA(at)spardat(dot)at>
To: "Gavin Sherry" <swm(at)linuxworld(dot)com(dot)au>, "Alex J(dot) Avriette" <alex(at)posixnap(dot)net>
Cc: "Dennis Bjorklund" <db(at)zigo(dot)dhs(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Tablespaces
Date: 2004-02-27 11:39:06
Message-ID: 46C15C39FEB2C44BA555E356FBCD6FA40184CFF2@m0114.s-mxs.net


> I do not intend to undertake raw disk tablespaces for 7.5. I'd be
> interested if anyone could provide some real world benchmarking of file
> system vs. raw disk. Postgres benefits a lot from kernel file system cache
> at the moment.

Yes, and don't forget that pg also relies on the OS for grouping and
sorting the physical writes and doing readahead where appropriate.

The use of raw disks is usually paired with the use of kernel aio.
The difference is said to be up to 30% on Solaris. I can assert that
it made the difference between a bogged-down system and a much better behaved
DB on Sun here.

My experience with kaio on AIX Informix is that kaio is faster as long as IO
is not the bottleneck (disk 100% busy is the metric to watch, not Mb/s), while
for an IO-bound system the Informix builtin IO threads that can be used instead
win (since they obviously do better at grouping, sorting and readahead
than the AIX kernel does for kaio).

Overall I think the price and complexity are too high, especially since there are
enough platforms where the kernel does a pretty good job at grouping, sorting and
readahead. Additionally the kernel takes non PostgreSQL IO into account.

Andreas


From: tswan(at)idigx(dot)com
To: "Zeugswetter Andreas SB SD" <ZeugswetterA(at)spardat(dot)at>
Cc: "Gavin Sherry" <swm(at)linuxworld(dot)com(dot)au>, "Alex J(dot) Avriette" <alex(at)posixnap(dot)net>, "Dennis Bjorklund" <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Tablespaces
Date: 2004-02-27 19:34:41
Message-ID: 51667.199.222.14.2.1077910481.squirrel@www.idigx.com
Lists: pgsql-hackers

>
>> I do not intend to undertake raw disk tablespaces for 7.5. I'd be
>> interested if anyone could provide some real world benchmarking of file
>> system vs. raw disk. Postgres benefits a lot from kernel file system
>> cache
>> at the moment.
>
> Yes, and don't forget that pg also relies on the OS for grouping and
> sorting the physical writes and doing readahead where appropriate.
>
>

Most people I know want tablespaces in order to limit or preallocate the
disk space used by a table or database in addition to controlling the
physical location of a table or database.

I know on Linux, there is the option of creating an empty file of a
specific size using dd, mounting it through loopback, formatting it,
symlinking the appropriate OID/TID (or mounting the loop device in the
appropriate directory) and then you control how much space that
directory/mount point can contain.

Of course, with MVCC you would have to vacuum frequently, as you could
miss some updates if there weren't enough tuples marked as free. If there
were "in-place" updates, the preallocation and limitation would be much
easier, but that's not how PG works.

If the tablespace disk space allocation is exceeded there would need to be
some graceful reporting condition back to the client. "UPDATE/INSERT
failed (tablespace size exceeded)", "(tablespace full)", "(disk full)" or
some other error may need to be handled/reported.


From: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To: tswan(at)idigx(dot)com
Cc: Zeugswetter Andreas SB SD <ZeugswetterA(at)spardat(dot)at>, "Alex J(dot) Avriette" <alex(at)posixnap(dot)net>, Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Tablespaces
Date: 2004-02-28 00:03:12
Message-ID: Pine.LNX.4.58.0402281052160.29841@linuxworld.com.au
Lists: pgsql-hackers

On Fri, 27 Feb 2004 tswan(at)idigx(dot)com wrote:

> >
> >> I do not intend to undertake raw disk tablespaces for 7.5. I'd be
> >> interested if anyone could provide some real world benchmarking of file
> >> system vs. raw disk. Postgres benefits a lot from kernel file system
> >> cache
> >> at the moment.
> >
> > Yes, and don't forget that pg also relies on the OS for grouping and
> > sorting the physical writes and doing readahead where appropriate.
> >
> >
>
> Most people I know want tablespaces in order to limit or preallocate the
> disk space used by a table or database in addition to controlling the
> physical location of a table or database.
>
> I know on linux, there is the option of creating an empty file or a
> specific size using dd, mounting it through loopback, formatting it,
> symlinking the appropriate OID/TID (or mounting the lpb device in the
> appropriate directory) and then you control how much space that
> directory/mount point can contain.
>
> Of course, with MVCC you would have to vacuum frequently, as you could
> miss some updates if there weren't enough tuples marked as free. If there
> were "in-place" updates, the preallocation and limitation much easier, but
> that's not how PG works.

I do not intend to work on such a system for the initial introduction of
table spaces. The problem is, of course, knowing when you're actually out
of space in a table space in any given transaction. Given that WAL is on a
different partition (at least for the moment) the table space will not
have transaction X's data written to it until after transaction X is
finished. And we cannot error out a transaction which is already committed.

The solution is to keep track of free space and error out at some
percentage of free space remaining. But I don't want to complicate
tablespaces too much in 7.5.
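
A minimal sketch of the "error out at some percentage of free space
remaining" idea, written as a standalone check rather than as anything
inside the backend. The function names and the 10% default threshold are
assumptions made for illustration; the thread does not specify them.

```python
import shutil


def has_headroom(total_bytes: int, free_bytes: int,
                 min_free_fraction: float = 0.10) -> bool:
    """Return True while at least min_free_fraction of the space is free."""
    if total_bytes <= 0:
        return False
    return free_bytes / total_bytes >= min_free_fraction


def tablespace_has_headroom(path: str,
                            min_free_fraction: float = 0.10) -> bool:
    """Check the filesystem that holds `path` (a hypothetical tablespace dir)."""
    usage = shutil.disk_usage(path)
    return has_headroom(usage.total, usage.free, min_free_fraction)
```

A caller would refuse new writes to a tablespace once
tablespace_has_headroom() returns False, which trades some unusable
space for an error that arrives before the disk is actually full.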

Thanks,

Gavin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc: tswan(at)idigx(dot)com, Zeugswetter Andreas SB SD <ZeugswetterA(at)spardat(dot)at>, "Alex J(dot) Avriette" <alex(at)posixnap(dot)net>, Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Tablespaces
Date: 2004-02-28 03:49:39
Message-ID: 25130.1077940179@sss.pgh.pa.us
Lists: pgsql-hackers

Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> I do not intend to work on such a system for the initial introduction of
> table spaces. The problem is, of course, knowing when you're actually out
> of space in a table space in any given transaction.

It should not be that hard, at least not on local filesystems. When PG
realizes that a new page must be added to a table, it does a write() to
append a page of zeroes to the physical table. This happens
immediately. It's true that actual data may not be written into that
section of the file till long after commit, but the kernel should do
space allocation checking upon the first write.
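
The behaviour described above can be sketched outside the backend as
follows. This is only an illustration of "append a page of zeroes and
let the kernel report out-of-space now", not PostgreSQL's actual code;
the 8192-byte page size is the usual default and the function and
exception names are hypothetical.

```python
import errno
import os

PAGE_SIZE = 8192  # assumed: PostgreSQL's usual default block size


class OutOfSpace(Exception):
    """Raised when the filesystem refuses to allocate another page."""


def extend_by_one_page(path: str) -> int:
    """Append one zero-filled page; return the new file size in bytes."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        written = os.write(fd, bytes(PAGE_SIZE))
        if written != PAGE_SIZE:
            # A short write to a regular file usually means the disk filled up.
            raise OutOfSpace(f"short write while extending {path}")
        return os.fstat(fd).st_size
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            # The failure surfaces here, during the transaction,
            # long before real data is flushed into the page.
            raise OutOfSpace(f"no space left while extending {path}") from exc
        raise
    finally:
        os.close(fd)
```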

I have heard tell that this may not happen when you are dealing with NFS
(yet another reason not to run databases across NFS) but on all local
filesystems I know of, out-of-space should result in a failure before
transaction commit.

I say "should" because I suspect this isn't a very heavily tested code
path in Postgres. But in theory it should work. Feel free to submit
bug reports if you find it doesn't.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "'Gavin Sherry'" <swm(at)linuxworld(dot)com(dot)au>, <tswan(at)idigx(dot)com>
Cc: <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Tablespaces
Date: 2004-03-02 00:27:51
Message-ID: 003701c3ffed$360587d0$5baa87d9@LaptopDellXP
Lists: pgsql-hackers

>Gavin Sherry
> On Fri, 27 Feb 2004 tswan(at)idigx(dot)com wrote:
> > Most people I know want tablespaces in order to limit or preallocate the
> > disk space used by a table or database in addition to controlling the
> > physical location of a table or database.

> I do not intend to work on such a system for the initial introduction of
> table spaces. The problem is, of course, knowing when you're actually out
> of space in a table space in any given transaction. Given that WAL is on a
> different partition (at least for the moment) the table space will not
> have transaction X's data written to it until after transaction X is
> finished. And we cannot error out a transaction which is already committed.
>
> The solution is to keep track of free space and error out at some
> percentage of free space remaining. But I don't want to complicate
> tablespaces too much in 7.5.

You're absolutely right about the not-knowing when you're out of space
issue. However, if the xlog has been written then it is not desirable,
but at least acceptable that the checkpoint/bgwriter cannot complete on
an already committed txn. It's not the txn which is getting the error,
that's all.

Hmmm...I'm not sure that we'll be able to, or should, avoid the out of space
situation completely. The question is...what will we do when we hit it?
It doesn't matter whether you stop at 100% or 90% or whatever, you still
have to stop and then what? Stay up as long as possible hopefully: If
there wasn't enough space to write to the tablespace, going into
recovery won't help the situation either; you're still out of space until
you fix that. We now have the option not to crash, since it might be
perfectly viable to keep on chugging away on one Tablespace even though
all txn work on the out-of-space tablespace is frozen/barred etc. Sounds
like a refinement, but something to keep in mind at the design stage if
we can.

The problem is that tablespaces do complicate space management (that's
what people want though, so that's OK). That complicates admin and so pg
will hit many more out of space errors than we've seen previously.
Trying to work out how to spot these ahead of time, accept user defined
limits on each tablespace etc sounds like extra complexity for the
initial drop. I guess my own suggested approach is to start by handling
the error cases, then go back and try to avoid some of them.

All of this exposes for me the complication that doing PITR and
tablespaces at the same time is likely to be more complex for us both
than either had envisaged. The reduced complexity for PITR was what I
was shooting for, also! I'm happy to work together on any issues that
arise.

For PITR, I think we would need:
- a very accessible list of tablespace locations, so taking a full
physical database backup can be easily accomplished using OS utilities.
Hopefully a list maintained external to the database? We have the
equivalent now with env variables.
- decisions about what occurs when for-whatever-reason one or more
tablespaces are not recoverable from backup?
- it might be desirable to allow recovery with less than all of the
original tablespaces
- it might also be desirable to allow recovery when the tablespaces' txn
IDs don't match (though that is forbidden on many other dbms)

Best Regards, Simon Riggs


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: simon(at)2ndquadrant(dot)com
Cc: "'Gavin Sherry'" <swm(at)linuxworld(dot)com(dot)au>, tswan(at)idigx(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Tablespaces
Date: 2004-03-02 01:22:37
Message-ID: 26637.1078190557@sss.pgh.pa.us
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> Gavin Sherry wrote:
>> I do not intend to work on such a system for the initial introduction of
>> table spaces. The problem is, of course, knowing when you're actually out
>> of space in a table space in any given transaction. Given that WAL is on a
>> different partition (at least for the moment) the table space will not
>> have transaction X's data written to it until after transaction X is
>> finished. And we cannot error out a transaction which is already
>> committed.

As long as the kernel doesn't lie about file extension, we will not
commit any transaction that requires a disallowed increase in the
allocated size of data files, because allocation of another table page
is checked with the kernel during the transaction. So on most
filesystems (maybe not NFS) the problem Gavin is worried about doesn't
exist.

> You're absolutely right about the not-knowing when you're out of space
> issue. However, if the xlog has been written then it is not desirable,
> but at least acceptable that the checkpoint/bgwriter cannot complete on
> an already committed txn. It's not the txn which is getting the error,
> that's all.

Right. This is in fact not a fatal situation, as long as you don't run
out of preallocated WAL space. For a recent practical example of our
behavior under zero-free-space conditions, see this thread:
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00530.php
particularly the post-mortem here:
http://archives.postgresql.org/pgsql-hackers/2004-01/msg00606.php
Barring one small bug, the database would likely have stayed up, and
continued to service at least the read-only transactions, until Chris
got around to freeing some disk space.

I think it is sufficient (at least in the near term) to expect people to
use partition size limits if they want to control database size --- that
is, make a partition of the desired size and put the database directory
in there. Tablespaces as per the design we are discussing would make it
easier to apply such a policy to a sub-area of a database cluster than
it is today, but they needn't in themselves implement the restriction.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "'Gavin Sherry'" <swm(at)linuxworld(dot)com(dot)au>, <tswan(at)idigx(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: 2004-03-02 22:53:09
Message-ID: 004a01c400a9$225e3990$5baa87d9@LaptopDellXP
Lists: pgsql-hackers

Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
> "Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> > You're absolutely right about the not-knowing when you're out of space
> > issue. However, if the xlog has been written then it is not desirable,
> > but at least acceptable that the checkpoint/bgwriter cannot complete on
> > an already committed txn. It's not the txn which is getting the error,
> > that's all.
>
> Right. This is in fact not a fatal situation, as long as you don't run
> out of preallocated WAL space.

...following on also from thoughts on [PERFORM] list...

Clearly running out of pre-allocated WAL space is likely to be the next
issue. Running out of space in the first place is likely to be because
of an intense workload, which is exactly the thing which also makes you
run out of pre-allocated WAL space. Does that make sense?

Best regards, Simon Riggs


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: simon(at)2ndquadrant(dot)com
Cc: "'Gavin Sherry'" <swm(at)linuxworld(dot)com(dot)au>, tswan(at)idigx(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: 2004-03-03 04:12:23
Message-ID: 8088.1078287143@sss.pgh.pa.us
Lists: pgsql-hackers

"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:
> Tom Lane wrote:
>> Right. This is in fact not a fatal situation, as long as you don't
>> run out of preallocated WAL space.

> Clearly running out of pre-allocated WAL space is likely to be the next
> issue. Running out of space in the first place is likely to be because
> of an intense workload, which is exactly the thing which also makes you
> run out of pre-allocated WAL space. Does that make sense?

I think one of the first things people would do with tablespaces is
stick the data files onto a separate partition from the WAL and clog
files. (Actually you can do this today with a simple symlink hack, but
tablespaces will make it easier and clearer.) The space usage for WAL
is really pretty predictable, because of the checkpoint-at-least-
every-N-segments setting. clog is not exactly a space hog either.
Once you have that separation established, out-of-disk-space can kill
individual transactions but never the database as a whole.
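
As a rough illustration of why WAL space is predictable, the sketch
below computes an upper bound from the checkpoint setting. The formula
(normally no more than 2 * checkpoint_segments + 1 segment files of
16 MB each) is taken from the PostgreSQL documentation of that era, not
from this thread, so treat it as an assumption; the function name is
hypothetical.

```python
WAL_SEGMENT_BYTES = 16 * 1024 * 1024  # assumed default segment size


def max_wal_bytes(checkpoint_segments: int) -> int:
    """Approximate steady-state ceiling on WAL disk usage, in bytes."""
    if checkpoint_segments < 1:
        raise ValueError("checkpoint_segments must be at least 1")
    # Documented bound: normally no more than 2 * N + 1 segment files.
    return (2 * checkpoint_segments + 1) * WAL_SEGMENT_BYTES
```

With checkpoint_segments = 3 this gives 7 segments, or 112 MB, which is
the kind of figure one would use to size a dedicated WAL partition.
Note that the bound no longer holds once segments must be retained for
an external archiver.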

One of the things that bothers me about the present PITR design is that
it presumes that individual WAL log segments can be kept until the
external archiver process feels like writing them somewhere. If there's
no guarantee that that happens within X amount of time, then you can't
bound the amount of space needed on the WAL drive, and so you are back
facing the possibility of an out-of-WAL-space panic. I suspect that we
cannot really do anything about that, but it's annoying. Any bright
ideas out there?

regards, tom lane


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: simon(at)2ndquadrant(dot)com, 'Gavin Sherry' <swm(at)linuxworld(dot)com(dot)au>, tswan(at)idigx(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: 2004-03-03 04:31:59
Message-ID: 40455FBF.5000702@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
> One of the things that bothers me about the present PITR design is that
> it presumes that individual WAL log segments can be kept until the
> external archiver process feels like writing them somewhere. If there's
> no guarantee that that happens within X amount of time, then you can't
> bound the amount of space needed on the WAL drive, and so you are back
> facing the possibility of an out-of-WAL-space panic. I suspect that we
> cannot really do anything about that, but it's annoying. Any bright
> ideas out there?

Maybe specify an archive location (that of course could be on a separate
partition) that the external archiver should check in addition to the
normal WAL location. At some predetermined interval, push WAL log
segments no longer needed to the archive location.

Joe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: simon(at)2ndquadrant(dot)com, "'Gavin Sherry'" <swm(at)linuxworld(dot)com(dot)au>, tswan(at)idigx(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: 2004-03-03 04:55:03
Message-ID: 8478.1078289703@sss.pgh.pa.us
Lists: pgsql-hackers

Joe Conway <mail(at)joeconway(dot)com> writes:
> Tom Lane wrote:
>> facing the possibility of an out-of-WAL-space panic. I suspect that we
>> cannot really do anything about that, but it's annoying. Any bright
>> ideas out there?

> Maybe specify an archive location (that of course could be on a separate
> partition) that the external archiver should check in addition to the
> normal WAL location. At some predetermined interval, push WAL log
> segments no longer needed to the archive location.

Does that really help? The panic happens when you fill the "normal" and
"archive" partitions, how's that different from one partition?

regards, tom lane


From: Joe Conway <mail(at)joeconway(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: simon(at)2ndquadrant(dot)com, 'Gavin Sherry' <swm(at)linuxworld(dot)com(dot)au>, tswan(at)idigx(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: 2004-03-03 05:11:29
Message-ID: 40456901.2060007@joeconway.com
Lists: pgsql-hackers

Tom Lane wrote:
> Joe Conway <mail(at)joeconway(dot)com> writes:
>>Maybe specify an archive location (that of course could be on a separate
>>partition) that the external archiver should check in addition to the
>>normal WAL location. At some predetermined interval, push WAL log
>>segments no longer needed to the archive location.
>
> Does that really help? The panic happens when you fill the "normal" and
> "archive" partitions, how's that different from one partition?

I see your point. But it would allow you to use a relatively modest
local partition for WAL segments, while you might be using a 1TB netapp
tray over NFS for the archive segments. I guess if the archive partition
fills up, I would err on the side of dropping archive segments on the
floor. That would mean a new full backup would be needed, but at least
it wouldn't result in a corrupt, or shut down, database.

Joe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: simon(at)2ndquadrant(dot)com, "'Gavin Sherry'" <swm(at)linuxworld(dot)com(dot)au>, tswan(at)idigx(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: 2004-03-03 05:33:51
Message-ID: 8824.1078292031@sss.pgh.pa.us
Lists: pgsql-hackers

Joe Conway <mail(at)joeconway(dot)com> writes:
> Tom Lane wrote:
>> Joe Conway <mail(at)joeconway(dot)com> writes:
>>> Maybe specify an archive location (that of course could be on a separate
>>> partition) that the external archiver should check in addition to the
>>> normal WAL location. At some predetermined interval, push WAL log
>>> segments no longer needed to the archive location.
>>
>> Does that really help? The panic happens when you fill the "normal" and
>> "archive" partitions, how's that different from one partition?

> I see your point. But it would allow you to use a relatively modest
> local partition for WAL segments, while you might be using a 1TB netapp
> tray over NFS for the archive segments.

Fair enough, but it seems to me that that sort of setup really falls in
the category of a user-defined archiving process --- that is, the hook
that Postgres calls will push WAL segments from the local partition to
the NFS server, and then pushing them off NFS to tape is the
responsibility of some other user-defined subprocess. Database panic
happens if and only if the local partition overflows. I don't see that
making Postgres explicitly aware of the secondary NFS arrangement will
buy anything.
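
A minimal sketch of such a user-defined archiving step, assuming the
database hands the script the path of one completed segment and expects
a success/failure answer back. That interface is hypothetical (the API
had not been designed at this point in the thread), as are the function
name and the ".tmp" naming convention.

```python
import os
import shutil


def archive_segment(segment_path: str, archive_dir: str) -> bool:
    """Copy one completed WAL segment into archive_dir.

    Returns True only if the segment is fully stored under its final name.
    """
    dest = os.path.join(archive_dir, os.path.basename(segment_path))
    if os.path.exists(dest):
        # Refuse to overwrite a segment that was already archived.
        return False
    tmp = dest + ".tmp"
    try:
        shutil.copyfile(segment_path, tmp)
        os.replace(tmp, dest)  # rename within the archive filesystem
    except OSError:
        # Archive area full, NFS error, missing source, and so on:
        # remove any partial copy and report failure to the caller.
        if os.path.exists(tmp):
            os.remove(tmp)
        return False
    return True
```

Whatever moves segments from the archive directory on to tape would be a
separate job; the database only needs to know whether the first copy
succeeded so it can decide when the local segment may be recycled.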

> I guess if the archive partition fills up, I would err on the side of
> dropping archive segments on the floor.

That should be user-scriptable policy, in my worldview.

We haven't yet talked much about what the WAL-segment-archiving API
should look like, but if it cannot support implementing the above kind
of arrangement outside the database, then we've dropped the ball.
IMHO anyway.

regards, tom lane


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Joe Conway'" <mail(at)joeconway(dot)com>
Cc: "'Gavin Sherry'" <swm(at)linuxworld(dot)com(dot)au>, <tswan(at)idigx(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: 2004-03-03 21:40:09
Message-ID: 006601c40168$1a838530$5baa87d9@LaptopDellXP
Lists: pgsql-hackers

>Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
> Joe Conway <mail(at)joeconway(dot)com> writes:
> > Tom Lane wrote:
> >> Joe Conway <mail(at)joeconway(dot)com> writes:
> >>> Maybe specify an archive location (that of course could be on a separate
> >>> partition) that the external archiver should check in addition to the
> >>> normal WAL location. At some predetermined interval, push WAL log
> >>> segments no longer needed to the archive location.
> >>
> >> Does that really help? The panic happens when you fill the "normal" and
> >> "archive" partitions, how's that different from one partition?
>
> > I see your point. But it would allow you to use a relatively modest
> > local partition for WAL segments, while you might be using a 1TB netapp
> > tray over NFS for the archive segments.
>
> Fair enough, but it seems to me that that sort of setup really falls in
> the category of a user-defined archiving process --- that is, the hook
> that Postgres calls will push WAL segments from the local partition to
> the NFS server, and then pushing them off NFS to tape is the
> responsibility of some other user-defined subprocess. Database panic
> happens if and only if the local partition overflows. I don't see that
> making Postgres explicitly aware of the secondary NFS arrangement will
> buy anything.

Tom's last sentence there summarises the design I was working with. I
had considered Joe's suggested approach (which was Oracle's also).

However, the PITR design will come with a usable low-function program
which can easily copy logs from pg_xlog to another archive directory.
That's needed as a test harness anyway, so it may as well be part of the
package. You'd be able to use that in production to copy xlogs to
another larger directory as a staging area to tape/failover on another
system: effectively Joe's idea is catered for in the basic package.

Anyway I'm answering questions before publishing the design as
stands...though people do keep spurring me to refine it as I'm writing
it down! That's why it's good to document it I guess.

> > I guess if the archive partition fills up, I would err on the side of
> > dropping archive segments on the floor.
>
> That should be user-scriptable policy, in my worldview.

Hmmm. Very difficult that one.

My experience is in commercial systems. Dropping archive segments on the
floor is just absolutely NOT GOOD, if that is the only behaviour. The
whole purpose of having a dbms is so that you can protect your business
data, while using it. Such behaviour would most likely be a barrier to
wider commercial adoption. [Oracle and other dbms will freeze when this
situation is hit, rather than continue and drop archive logs.]

User-selectable behaviour? OK. That's how we deal with fsync; I can
relate to that. That hadn't been part of my thinking because of the
importance I'd attached to the log files themselves, but I can go with
that, if that's what was meant.

So, if we had a parameter called Wal_archive_policy that has 3 settings:
None = no archiving
Optimistic = archive, but if for some reason log space runs out then
make space by dropping the oldest archive logs
Strict = if log space runs out, stop further write transactions from
committing, by whatever means, even if this takes down dbms.

That way, we've got something akin to transaction isolation level with
various levels of protection.
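
The three proposed settings can be sketched as a small decision table.
This only restates the proposal in code; the enum, the function name and
the returned action strings are hypothetical, and the "recycle" action
for None is an assumption (with no archiving there is nothing to keep).

```python
from enum import Enum


class WalArchivePolicy(Enum):
    NONE = "none"              # no archiving
    OPTIMISTIC = "optimistic"  # archive, but sacrifice old logs if needed
    STRICT = "strict"          # never lose a log, even at the cost of uptime


def on_log_space_exhausted(policy: WalArchivePolicy) -> str:
    """Return the action to take when log space runs out."""
    if policy is WalArchivePolicy.NONE:
        # Assumed: segments are simply reused, nothing is retained.
        return "recycle"
    if policy is WalArchivePolicy.OPTIMISTIC:
        return "drop_oldest_archive_logs"
    if policy is WalArchivePolicy.STRICT:
        return "block_write_commits"
    raise ValueError(f"unknown policy: {policy!r}")
```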

Best Regards, Simon Riggs


From: Joe Conway <mail(at)joeconway(dot)com>
To: simon(at)2ndquadrant(dot)com
Cc: 'Tom Lane' <tgl(at)sss(dot)pgh(dot)pa(dot)us>, 'Gavin Sherry' <swm(at)linuxworld(dot)com(dot)au>, tswan(at)idigx(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: 2004-03-03 21:52:06
Message-ID: 40465386.7030204@joeconway.com
Lists: pgsql-hackers

Simon Riggs wrote:
>> Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us]
>> That should be user-scriptable policy, in my worldview.

> O... and other dbms will freeze when this situation is hit, rather
> than continue and drop archive logs.]

Been there, done that, don't see how it's any better. I hesitate to be
real specific here, but let's just say the end result was restore from
backup :-(

> So, if we had a parameter called Wal_archive_policy that has 3 settings:
> None = no archiving
> Optimistic = archive, but if for some reason log space runs out then
> make space by dropping the oldest archive logs
> Strict = if log space runs out, stop further write transactions from
> committing, by whatever means, even if this takes down dbms.

That sounds good to me. For the "Optimistic" case, we need to yell
loudly if we do find ourselves needing to drop segments. For the
"Strict" case, we just need to be sure it works correctly ;-)

Joe


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joe Conway <mail(at)joeconway(dot)com>
Cc: simon(at)2ndquadrant(dot)com, "'Gavin Sherry'" <swm(at)linuxworld(dot)com(dot)au>, tswan(at)idigx(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: 2004-03-03 22:10:01
Message-ID: 18411.1078351801@sss.pgh.pa.us
Lists: pgsql-hackers

Joe Conway <mail(at)joeconway(dot)com> writes:
> Simon Riggs wrote:
>> O... and other dbms will freeze when this situation is hit, rather
>> than continue and drop archive logs.]

> Been there, done that, don't see how it's any better. I hesitate to be
> real specific here, but let's just say the end result was restore from
> backup :-(

It's hard for me to imagine a situation in which killing the database
would be considered a more attractive option than dropping old log
data. You may or may not ever need the old log data, but you darn well
do need a functioning database. (If you don't, you wouldn't be going to
all this work.)

I think also that Simon completely misunderstood my intent in saying
that this could be "user-scriptable policy". By that I meant that the
*user* could write the code to behave whichever way he liked. Not that
we were going to go into a mad rush of feature invention and try to
support every combination we could think of. I repeat: code that pushes
logs into a secondary area is not ours to write. We should concentrate
on providing an API that lets users write it. We have only limited
manpower for this project and we need to spend it on getting the core
functionality done right, not on inventing frammishes.

regards, tom lane


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: simon(at)2ndquadrant(dot)com
Cc: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Joe Conway'" <mail(at)joeconway(dot)com>, "'Gavin Sherry'" <swm(at)linuxworld(dot)com(dot)au>, tswan(at)idigx(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: 2004-03-03 22:28:52
Message-ID: 200403032228.i23MSqn26002@candle.pha.pa.us
Lists: pgsql-hackers

Simon Riggs wrote:
> User-selectable behaviour? OK. That's how we deal with fsync; I can
> relate to that. That hadn't been part of my thinking because of the
> importance I'd attached to the log files themselves, but I can go with
> that, if that's what was meant.
>
> So, if we had a parameter called Wal_archive_policy that has 3 settings:
> None = no archiving
> Optimistic = archive, but if for some reason log space runs out then
> make space by dropping the oldest archive logs
> Strict = if log space runs out, stop further write transactions from
> committing, by whatever means, even if this takes down dbms.
>
> That way, we've got something akin to transaction isolation level with
> various levels of protection.

Yep, we will definitely need something like that. Basically whenever
the logs are being archived, you have to stop the database if you can't
archive, no?

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Shridhar Daithankar <shridhar(at)frodo(dot)hserus(dot)net>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: 2004-03-04 07:09:14
Message-ID: 4046D61A.20409@frodo.hserus.net
Lists: pgsql-hackers

Tom Lane wrote:
> I think also that Simon completely misunderstood my intent in saying
> that this could be "user-scriptable policy". By that I meant that the
> *user* could write the code to behave whichever way he liked. Not that
> we were going to go into a mad rush of feature invention and try to
> support every combination we could think of. I repeat: code that pushes
> logs into a secondary area is not ours to write. We should concentrate
> on providing an API that lets users write it. We have only limited
> manpower for this project and we need to spend it on getting the core
> functionality done right, not on inventing frammishes.

Hmm... I totally agree. I think the backend could just offer a shared memory
segment and send a marker message to another process so that it can copy from
it. Then it is the application's business to do the rest.

Of course there has to be a two-way agreement about it, but an API is a much
nicer thing to provide than an application.
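
As a rough illustration of the "API rather than application" idea, the sketch
below shows a backend-side loop that only offers completed log segments to a
user-written hook and leaves all copying policy to that hook. The names
`ArchiveHook` and `offer_segments` are hypothetical and are not PostgreSQL
interfaces; stopping at the first failure is also an assumption, made so that
segments are never handed over out of order.

```python
from typing import Callable, List, Sequence

# Hypothetical hook type: receives the name of a completed log segment and
# returns True once the user's own code has safely copied it elsewhere.
ArchiveHook = Callable[[str], bool]


def offer_segments(completed: Sequence[str], hook: ArchiveHook) -> List[str]:
    """Offer segments (oldest first) to the user hook.

    Returns the segments that are still pending. Processing stops at the
    first failure so that later segments are not archived ahead of it.
    """
    for index, segment in enumerate(completed):
        try:
            archived = bool(hook(segment))
        except Exception:
            # A hook that raises is treated the same as one that reports failure.
            archived = False
        if not archived:
            return list(completed[index:])
    return []
```

What to do with the pending list (retry, drop, or stop) is exactly the policy
question debated in the rest of this thread, which is why the sketch leaves it
to the caller.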

Shridhar


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Joe Conway'" <mail(at)joeconway(dot)com>
Cc: "'Gavin Sherry'" <swm(at)linuxworld(dot)com(dot)au>, <tswan(at)idigx(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: 2004-03-08 23:28:25
Message-ID: 004a01c40565$15402e60$f3bd87d9@LaptopDellXP
Lists: pgsql-hackers


Please excuse the delay in replying.

>Tom Lane
> Joe Conway <mail(at)joeconway(dot)com> writes:
> > Simon Riggs wrote:
> >> O... and other dbms will freeze when this situation is hit, rather
> >> than continue and drop archive logs.]
>
> > Been there, done that, don't see how it's any better. I hesitate to be
> > real specific here, but let's just say the end result was restore from
> > backup :-(

Myself also. I accept your experience and insight, I apologise if my own
seemed overblown. My take on that is that if you're in a situation that
has a high probability of going bad, the last thing you would want is to
drop xlogs. Same technical experience, different viewpoint on what to
learn from it.

> It's hard for me to imagine a situation in which killing the database
> would be considered a more attractive option than dropping old log
> data. You may or may not ever need the old log data, but you darn well
> do need a functioning database. (If you don't, you wouldn't be going to
> all this work.)

The main point here for me is that the choice of keeping archived (not
old) log files against keeping the database up isn't actually mine to
make; that choice belongs to the owner of the database, not me as
developer or administrator, consultant or whatever.

Although I admit I did not at first comprehend that such a view was
possible, I did flex to allow yours and Joe's perspective when that was
voiced.

The point is one of risk: does the owner wish to risk the possibility
that a transaction may be lost in order to keep the database up? The
possibility of lost rows must be balanced against the probably higher
possibility of being unable to write new data. But which is worse? Who
can say?

In some environments where I have worked, (again forgive any seeming
personal arrogance or posturing), such as banks or finance generally, it
has been desirable to stop the system rather than risk losing even a
single row. In other situations, lost rows must be balanced against the
money lost through downtime. Guess it depends whether you've got a
contract for uptime or for data integrity?? ;)

> I repeat: code that pushes
> logs into a secondary area is not ours to write. We should concentrate
> on providing an API that lets users write it.

Agreed.

> We have only limited
> manpower for this project and we need to spend it on getting the core
> functionality done right, not on inventing frammishes.

Love that word "frammish"...seriously, I understand and agree.

My understanding is that existing logic will cause a PANIC if the xlog
directory cannot be written to. Helping the database stay up by dropping
logs would require extra code...

This was an edge case anyhow...

Best Regards, Simon Riggs


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "'Joe Conway'" <mail(at)joeconway(dot)com>
Cc: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Gavin Sherry'" <swm(at)linuxworld(dot)com(dot)au>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: 2004-03-08 23:28:25
Message-ID: 004801c40565$12c78d40$f3bd87d9@LaptopDellXP
Lists: pgsql-hackers

>Joe Conway [mailto:mail(at)joeconway(dot)com]
> Simon Riggs wrote:
> >> Tom Lane [mailto:tgl(at)sss(dot)pgh(dot)pa(dot)us] That should be user-scriptable
> >> policy, in my worldview.
>
> > O... and other dbms will freeze when this situation is hit, rather
> > than continue and drop archive logs.]
>
> Been there, done that, don't see how it's any better. I hesitate to be
> real specific here, but let's just say the end result was restore from
> backup :-(
>
> > So, if we had a parameter called Wal_archive_policy that has 3
> > settings: None = no archiving Optimistic = archive, but if for some
> > reason log space runs out then make space by dropping the oldest
> > archive logs Strict = if log space runs out, stop further write
> > transactions from committing, by whatever means, even if this takes
> > down dbms.
>
> That sounds good to me. For the "Optimistic" case, we need to yell
> loudly if we do find ourselves needing to drop segments. For the
> "Strict" case, we just need to be sure it works correctly ;-)

Good.

The loud yelling really needs to happen somewhat earlier, which, as Gavin
originally thought, has something to do with tablespaces.

Strict behaviour is fairly straightforward, you just PANIC!

I'd think we could rename these to
Fail Operational rather than Optimistic
Fail Safe rather than Strict
...the other names were a bit like "I'm right" and "but I'll do yours
too" ;}
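
To make the three proposed settings concrete, here is a minimal sketch of how
a decision point for "log space has run out" might look. Everything in it is
an assumption for illustration: `WalArchivePolicy`, `LogSpaceExhausted` and
`on_log_space_exhausted` are hypothetical names, the exception merely stands
in for stopping write transactions (a PANIC), and the behaviour for "None" is
a guess, since the thread does not define it.

```python
import logging
from enum import Enum
from typing import List

log = logging.getLogger("wal_archive_sketch")


class WalArchivePolicy(Enum):
    NONE = "none"                          # no archiving
    FAIL_OPERATIONAL = "fail_operational"  # "Optimistic": stay up, drop oldest archive log
    FAIL_SAFE = "fail_safe"                # "Strict": stop further write transactions


class LogSpaceExhausted(RuntimeError):
    """Stands in for stopping further write transactions (e.g. a PANIC)."""


def on_log_space_exhausted(policy: WalArchivePolicy, archive_logs: List[str]) -> List[str]:
    """Return the archive logs that remain after applying the policy.

    `archive_logs` is ordered oldest first. Raises LogSpaceExhausted when the
    policy, or an empty list, leaves no way to free space.
    """
    if policy is WalArchivePolicy.NONE:
        # Assumption: nothing is retained for archiving, so nothing to decide.
        return list(archive_logs)
    if policy is WalArchivePolicy.FAIL_SAFE:
        raise LogSpaceExhausted("log space exhausted; refusing further writes")
    if not archive_logs:
        raise LogSpaceExhausted("log space exhausted and no archive logs left to drop")
    oldest, remaining = archive_logs[0], list(archive_logs[1:])
    # The "yell loudly" requirement for the optimistic case.
    log.warning("log space exhausted: dropping oldest archive log %s", oldest)
    return remaining
```

For example, `on_log_space_exhausted(WalArchivePolicy.FAIL_OPERATIONAL,
["a", "b", "c"])` logs a warning and returns `["b", "c"]`, while the same call
with `FAIL_SAFE` raises instead of dropping anything.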

Best Regards, Simon Riggs


From: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
To: "'Bruce Momjian'" <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Joe Conway'" <mail(at)joeconway(dot)com>, "'Gavin Sherry'" <swm(at)linuxworld(dot)com(dot)au>, <tswan(at)idigx(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Out of space situation and WAL log pre-allocation (was
Date: 2004-03-08 23:28:25
Message-ID: 004901c40565$1406eb10$f3bd87d9@LaptopDellXP
Lists: pgsql-hackers

>Bruce Momjian
> Simon Riggs wrote:
> > User-selectable behaviour? OK. That's how we deal with fsync; I can
> > relate to that. That hadn't been part of my thinking because of the
> > importance I'd attached to the log files themselves, but I can go with
> > that, if that's what was meant.
> >
> > So, if we had a parameter called Wal_archive_policy that has 3 settings:
> > None = no archiving
> > Optimistic = archive, but if for some reason log space runs out then
> > make space by dropping the oldest archive logs
> > Strict = if log space runs out, stop further write transactions from
> > committing, by whatever means, even if this takes down dbms.
> >
> > That way, we've got something akin to transaction isolation level with
> > various levels of protection.
>
> Yep, we will definitely need something like that. Basically whenever
> the logs are being archived, you have to stop the database if you can't
> archive, no?

That certainly was my initial feeling, though I believe it is possible
to accommodate both viewpoints. I would not want to have only the
alternative viewpoint, I must confess.

Best Regards, Simon Riggs


From: Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To: simon(at)2ndquadrant(dot)com
Cc: "'Tom Lane'" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "'Joe Conway'" <mail(at)joeconway(dot)com>, "'Gavin Sherry'" <swm(at)linuxworld(dot)com(dot)au>, tswan(at)idigx(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Out of space situation and WAL log pre-allocation (was
Date: 2004-03-08 23:50:53
Message-ID: 200403082350.i28NorK29881@candle.pha.pa.us
Lists: pgsql-hackers

Simon Riggs wrote:
> >Bruce Momjian
> > Simon Riggs wrote:
> > > User-selectable behaviour? OK. That's how we deal with fsync; I can
> > > relate to that. That hadn't been part of my thinking because of the
> > > importance I'd attached to the log files themselves, but I can go with
> > > that, if that's what was meant.
> > >
> > > So, if we had a parameter called Wal_archive_policy that has 3 settings:
> > > None = no archiving
> > > Optimistic = archive, but if for some reason log space runs out then
> > > make space by dropping the oldest archive logs
> > > Strict = if log space runs out, stop further write transactions from
> > > committing, by whatever means, even if this takes down dbms.
> > >
> > > That way, we've got something akin to transaction isolation level with
> > > various levels of protection.
> >
> > Yep, we will definitely need something like that. Basically whenever
> > the logs are being archived, you have to stop the database if you can't
> > archive, no?
>
> That certainly was my initial feeling, though I believe it is possible
> to accommodate both viewpoints. I would not want to have only the
> alternative viewpoint, I must confess.
>

Added to PITR TODO list. Anything else to add?

http://momjian.postgresql.org/main/writings/pgsql/project

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073


From: Greg Stark <gsstark(at)mit(dot)edu>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Out of space situation and WAL log pre-allocation (was Tablespaces)
Date: 2004-03-10 03:28:09
Message-ID: 87ptbl8jp2.fsf@stark.xeocode.com
Lists: pgsql-hackers


"Simon Riggs" <simon(at)2ndquadrant(dot)com> writes:

> Strict behaviour is fairly straightforward, you just PANIC!

There is another mode possible as well. Oracle for example neither panics nor
continues, it just freezes. It keeps retrying the transaction until it finds
it has space.

The sysadmin or dba just has to somehow create additional space, by removing
old files or by whatever other means, and the database will continue where it
left off. That seems a bit nicer than panicking.

When I first heard that I was shocked. It means implementing archive logs
*created* a new failure mode where there was none before. I thought that was
the dumbest idea in the world: who needed a backup process that increased the
chances of an outage? Now I can see the logic, but I'm still not sure which
mode I would pick if it was up to me. As others have said, I guess it would
depend on the situation.
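
The freeze-and-retry mode described above can be sketched as a simple polling
loop: the writer neither fails nor drops anything, it just waits until enough
space appears. This is only an illustration of the idea, not how Oracle or
PostgreSQL implement it. The function name `wait_for_space` and its
parameters are hypothetical; `free_bytes` and `sleep` are injectable purely so
the loop can be exercised without a real full disk.

```python
import shutil
import time
from typing import Callable, Optional


def wait_for_space(
    path: str,
    needed_bytes: int,
    poll_seconds: float = 1.0,
    free_bytes: Optional[Callable[[str], int]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until `path` has at least `needed_bytes` free.

    Returns the number of times the caller had to wait. There is no timeout:
    like the behaviour described above, it freezes until space is created.
    """
    if free_bytes is None:
        # Default probe: ask the filesystem that holds `path`.
        free_bytes = lambda p: shutil.disk_usage(p).free
    polls = 0
    while free_bytes(path) < needed_bytes:
        polls += 1
        sleep(poll_seconds)
    return polls
```

A writer would call this before each log write and proceed once it returns.
The trade-off is the one noted above: the database stays consistent and
resumes where it left off, but it is unavailable for as long as nobody frees
space.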

--
greg