Quick Links

Re: Hard limit on WAL space used (because PANIC sucks)

From:	"MauMau" <maumau307(at)gmail(dot)com>
To:	"Craig Ringer" <craig(at)2ndquadrant(dot)com>
Cc:	"Josh Berkus" <josh(at)agliodbs(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Hard limit on WAL space used (because PANIC sucks)
Date:	2013-06-10 14:02:59
Message-ID:	16DAE49BCBAF4C03B9CB50A3CB8FF677@maumau
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

From: "Craig Ringer" <craig(at)2ndquadrant(dot)com>
> The problem is that WAL for all tablespaces is mixed together in the
> archives. If you lose your tablespace then you have to keep *all* WAL
> around and replay *all* of it again when the tablespace comes back
> online. This would be very inefficient, would require a lot of tricks to
> cope with applying WAL to a database that has an on-disk state in the
> future as far as the archives are concerned. It's not as simple as just
> replaying all WAL all over again - as I understand it, things like
> CLUSTER or TRUNCATE will result in relfilenodes not being where they're
> expected to be as far as old WAL archives are concerned. Selective
> replay would be required, and that leaves the door open to all sorts of
> new and exciting bugs in areas that'd hardly ever get tested.

Although I still lack understanding of PostgreSQL implementation, I have an
optimistic feeling that such complexity would not be required. While a
tablespace is offline, subsequent access from new or existing transactions
is rejected with an error "tablespace is offline". So new WAL records would
not be generated for the offline tablespace. To take the tablespace back
online, the DBA performs per-tablespace archive recovery. Per-tablespace
archive recovery restores tablespace data files from the backup, then read
through archive and pg_xlog/ WAL as usual, and selectively applies WAL
records for the tablespace.

I don't think it's a "must-be-fixed" problem that the WAL for all
tablespaces is mixed in one location. I suppose we can tolerate that
archive recovery takes a long time.

> To solve the massive disk space explosion problem I imagine we'd have to
> have per-tablespace WAL. That'd cause a *huge* increase in fsync costs
> and loss of the rather nice property that WAL writes are nice sequential
> writes. It'd be complicated and probably cause nightmares during
> recovery, for archive-based replication, etc.

Per-tablespace WAL is very interesting for another reason -- massive-scale
OLTP for database consolidation. This feature would certainly be a
breakthrough for amazing performance, because WAL is usually the last
bottleneck in OLTP. Yes, I can imagine recovery would be much, much more
complicated,.

> None of these options seem exactly simple or pretty, especially given
> the additional complexities that'd be involved in allowing WAL records
> to be applied out-of-order, something that AFAIK _never_h happens at the
> moment.

As I mentioned above, in my shallow understanding, it seems that the
additional complexities can be controlled.

> The key problem, of course, is that this all sounds like a lot of
> complicated work for a case that's not really supposed to happen. Right
> now, the answer is "your database is unrecoverable, switch to your
> streaming warm standby and re-seed it from the standby". Not pretty, but
> at least there's the option of using a sync standby and avoiding data
> loss.

Sync standby... maybe. Let me consider this.

> How would you approach this?

Thanks Craig, you gave me some interesting insights. All of these topics
are interesting, and I'd like to work on them when I have acquired enough
knowledge and experience in PostgreSQL development.

Regards
MauMau

In response to

Re: Hard limit on WAL space used (because PANIC sucks) at 2013-06-09 23:28:24 from Craig Ringer

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Dimitri Fontaine	2013-06-10 14:06:46	Re: ALTER TABLE ... ALTER CONSTRAINT
Previous Message	Greg Stark	2013-06-10 14:02:22	Re: erroneous restore into pg_catalog schema