Re: Hard limit on WAL space used (because PANIC sucks)

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: MauMau <maumau307(at)gmail(dot)com>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Hard limit on WAL space used (because PANIC sucks)
Date: 2013-06-22 15:03:49
Message-ID: 20130622150349.GF20417@momjian.us
Lists: pgsql-hackers

On Mon, Jun 10, 2013 at 07:28:24AM +0800, Craig Ringer wrote:
> (I'm still learning the details of Pg's WAL, WAL replay and recovery, so
> the below's just my understanding):
>
> The problem is that WAL for all tablespaces is mixed together in the
> archives. If you lose your tablespace then you have to keep *all* WAL
> around and replay *all* of it again when the tablespace comes back
> online. This would be very inefficient and would require a lot of tricks to
> cope with applying WAL to a database that has an on-disk state in the
> future as far as the archives are concerned. It's not as simple as just
> replaying all WAL all over again - as I understand it, things like
> CLUSTER or TRUNCATE will result in relfilenodes not being where they're
> expected to be as far as old WAL archives are concerned. Selective
> replay would be required, and that leaves the door open to all sorts of
> new and exciting bugs in areas that'd hardly ever get tested.
>
> To solve the massive disk space explosion problem I imagine we'd have to
> have per-tablespace WAL. That'd cause a *huge* increase in fsync costs
> and loss of the rather nice property that WAL writes are nice sequential
> writes. It'd be complicated and probably cause nightmares during
> recovery, for archive-based replication, etc.
>
> The only other thing I can think of is: When a tablespace is offline,
> write WAL records to a separate "tablespace recovery log" as they're
> encountered. Replay this log when the tablespace is restored,
> before applying any other new WAL to the tablespace. This wouldn't
> affect archive-based recovery since it'd already have the records from
> the original WAL.
>
> None of these options seem exactly simple or pretty, especially given
> the additional complexities that'd be involved in allowing WAL records
> to be applied out-of-order, something that AFAIK _never_ happens at the
> moment.
>
> The key problem, of course, is that this all sounds like a lot of
> complicated work for a case that's not really supposed to happen. Right
> now, the answer is "your database is unrecoverable, switch to your
> streaming warm standby and re-seed it from the standby". Not pretty, but
> at least there's the option of using a sync standby and avoiding data loss.
>
> How would you approach this?

Sorry to be replying late. You are right that you could record/apply
WAL separately for offline tablespaces. The problem is that you could
have logical ties from the offline tablespace to online tablespaces.
For example, what happens if data in an online tablespace references a
primary key in an offline tablespace? What if the system catalogs are
stored in an offline tablespace? Right now, we allow logical bindings
across physical tablespaces. To do what you want, you would really need
to store each database in its own tablespace.
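
A minimal sketch of the concern, as a toy model rather than anything
resembling PostgreSQL's actual WAL code. Every name here (ToyCluster,
the deferred log, dangling_refs, the tablespace names) is hypothetical
and exists only for illustration. Records for an offline tablespace are
parked in a separate log and replayed on restore, as proposed above;
meanwhile the online tablespace can end up holding rows that reference
keys which exist only in the parked log.

```python
from collections import defaultdict


class ToyCluster:
    """Toy model: each tablespace is a dict; a WAL record is (ts, op, key, value)."""

    def __init__(self, tablespaces):
        self.data = {ts: {} for ts in tablespaces}
        self.offline = set()
        # Hypothetical per-tablespace "recovery log" for deferred records.
        self.deferred = defaultdict(list)

    def apply(self, ts, op, key, value=None):
        """Apply one record, or park it if its tablespace is offline."""
        if ts in self.offline:
            self.deferred[ts].append((op, key, value))
            return
        self._redo(ts, op, key, value)

    def _redo(self, ts, op, key, value):
        if op == "put":
            self.data[ts][key] = value
        elif op == "delete":
            self.data[ts].pop(key, None)

    def take_offline(self, ts):
        self.offline.add(ts)

    def restore(self, ts):
        """Replay parked records in their original order, then go online."""
        self.offline.discard(ts)
        for op, key, value in self.deferred.pop(ts, []):
            self._redo(ts, op, key, value)


def dangling_refs(cluster, child_ts, parent_ts):
    """Child keys whose referenced parent key is not present in parent_ts."""
    parents = cluster.data[parent_ts]
    return sorted(k for k, ref in cluster.data[child_ts].items()
                  if ref not in parents)


def demo():
    c = ToyCluster(["ts_parent", "ts_child"])
    c.apply("ts_parent", "put", 1, "customer A")
    c.take_offline("ts_parent")
    c.apply("ts_parent", "put", 2, "customer B")   # parked, not applied
    c.apply("ts_child", "put", "order-10", 2)      # references parent key 2
    before = dangling_refs(c, "ts_child", "ts_parent")
    c.restore("ts_parent")
    after = dangling_refs(c, "ts_child", "ts_parent")
    return before, after
```

Until restore() runs, the child row points at a parent key that the
data files do not contain, which is the kind of logical tie across
tablespaces I mean. The toy ignores catalogs entirely; a real system
would hit the same problem there too.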

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +
