Unlogged tables, persistent kind

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Unlogged tables, persistent kind
Date: 2011-04-24 17:22:43
Message-ID: BANLkTi=9eR4C4To9PdJD0ztGvR335PQ6HA@mail.gmail.com
Lists: pgsql-hackers

Unlogged tables are a good new feature.

I noticed Bruce had mentioned they were the equivalent of NoSQL, which
I don't really accept. I guess it depends upon whether you mean NoSQL
for caches (e.g. memcached) or NoSQL for useful datastores (e.g.
Mongo). It seems worth discussing now before we get too far into the
marketing hype around Beta.

If you don't log changes to tables, you have two choices after a crash:
1) truncate the table and any indexes
2) rebuild any indexes damaged by the crash

Currently, we do (1). That certainly has its place, but most data
stores don't do this when they crash, since it would lead to data
loss. Not just a couple of rows either - serious, major data loss if
you put the wrong kind of data in it. We even delete data that has
been safely on disk for weeks or months, which IMHO could easily
confuse people.

In the future, I would like to work on (2), which preserves as much
data as possible, while recognising indexes may be damaged. I don't
really have any name for this, since the current naming seems to
assume there is only one kind of unlogged table.

My implementation path for that would be to add a crash_number onto
pg_control and pg_index. Any index marked as "unlogged, persistent"
would only be usable if its crash number is the same as the current
system crash number.

REINDEX would update the index crash number to the current value. That
also allows us to imagine a "repair index" command in the future.
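
In rough C, the check might look something like the sketch below. All
field and function names here are hypothetical, invented purely for
illustration; neither a crash_number on pg_control nor a crash number
column on pg_index exists today.

    /*
     * Hypothetical sketch of the crash-number check. "crash_number"
     * and "indcrashnum" are invented names, not real fields.
     */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct ControlFileSketch
    {
        uint32_t crash_number;   /* bumped each time crash recovery runs */
    } ControlFileSketch;

    typedef struct IndexInfoSketch
    {
        bool     ind_unlogged_persistent; /* "unlogged, persistent"? */
        uint32_t indcrashnum;    /* crash number at last (re)build */
    } IndexInfoSketch;

    /*
     * An "unlogged, persistent" index is usable only if it was built
     * or REINDEXed within the current crash epoch.
     */
    static bool
    index_is_usable(const IndexInfoSketch *idx,
                    const ControlFileSketch *control)
    {
        if (!idx->ind_unlogged_persistent)
            return true;         /* ordinary index: always usable */
        return idx->indcrashnum == control->crash_number;
    }

    /* REINDEX would stamp the current crash number onto the index. */
    static void
    reindex_set_crashnum(IndexInfoSketch *idx,
                         const ControlFileSketch *control)
    {
        idx->indcrashnum = control->crash_number;
    }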

Heap blocks would be zeroed if they were found to be damaged, following a crash.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged tables, persistent kind
Date: 2011-04-24 18:41:37
Message-ID: BANLkTin8gg542dH+pegubTqqsSAuJnEayw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Apr 24, 2011 at 6:22 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> My implementation path for that would be to add a crash_number onto
> pg_control and pg_index. Any index marked as "unlogged, persistent"
> would only be usable if its crash number is the same as the current
> system crash number.
>
> REINDEX would update the index crash number to the current value. That
> also allows us to imagine a "repair index" command in the future.

This seems useful for non-crash-safe indexes in general.

> Heap blocks would be zeroed if they were found to be damaged, following a crash.
>

How do you propose to detect that? Until we solve the whole checksum
story I don't think we have a reliable way to detect bad pages. And in
some cases where we do detect them, we detect them by crashing.
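
For concreteness, "the whole checksum story" would mean something like
a per-page checksum verified on read. A rough sketch, with an invented
page-header layout and a toy checksum (not any real or proposed
algorithm):

    /*
     * Sketch of per-page damage detection via a checksum. BLCKSZ
     * matches the default PostgreSQL page size, but the header layout
     * and checksum below are invented for illustration only.
     */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define BLCKSZ 8192

    typedef struct PageHeaderSketch
    {
        uint16_t pd_checksum;    /* covers the rest of the page */
        /* ... remaining header fields elided ... */
    } PageHeaderSketch;

    /* Toy rotate-and-xor checksum; a real one would be a proper CRC. */
    static uint16_t
    page_checksum(const unsigned char *page)
    {
        uint32_t sum = 0;

        /* Skip the stored checksum field itself. */
        for (size_t i = sizeof(uint16_t); i < BLCKSZ; i++)
            sum = ((sum << 1) | (sum >> 31)) ^ page[i];

        return (uint16_t) (sum ^ (sum >> 16));
    }

    /* Returns true if the page passes its checksum. */
    static bool
    page_is_valid(const unsigned char *page)
    {
        const PageHeaderSketch *hdr = (const PageHeaderSketch *) page;

        return hdr->pd_checksum == page_checksum(page);
    }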

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged tables, persistent kind
Date: 2011-04-24 21:27:50
Message-ID: 17903.1303680470@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
> If you don't log changes to tables, you have two choices after a crash:
> 1) truncate the table and any indexes
> 2) rebuild any indexes damaged by the crash

No, you have only one choice, and that's (1), because there's no
guarantee that what's in the table file is meaningful.

> Heap blocks would be zeroed if they were found to be damaged, following a crash.

This is sheerest fantasy. And even if you could implement it, what sort
of feature would you be offering? "Your data is preserved except when
it isn't"? People who want that can go use mysql.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged tables, persistent kind
Date: 2011-04-24 22:13:18
Message-ID: BANLkTikQF_CV5dM7P_fi1=MaaCDGhk0P_A@mail.gmail.com
Lists: pgsql-hackers

On Sun, Apr 24, 2011 at 7:41 PM, Greg Stark <gsstark(at)mit(dot)edu> wrote:
> On Sun, Apr 24, 2011 at 6:22 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> My implementation path for that would be to add a crash_number onto
>> pg_control and pg_index. Any index marked as "unlogged, persistent"
>> would only be usable if its crash number is the same as the current
>> system crash number.
>>
>> REINDEX would update the index crash number to the current value. That
>> also allows us to imagine a "repair index" command in the future.
>
> This seems useful for non-crash-safe indexes in general.
>
>> Heap blocks would be zeroed if they were found to be damaged, following a crash.
>>
>
> How do you propose to detect that? Until we solve the whole checksum
> story I don't think we have a reliable way to detect bad pages. And in
> some cases where we do detect them, we detect them by crashing.

That should be changed.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged tables, persistent kind
Date: 2011-04-24 22:15:44
Message-ID: BANLkTi=gVYWOcw4rpyvfSdJPH6PHo0Rs1w@mail.gmail.com
Lists: pgsql-hackers

On Sun, Apr 24, 2011 at 10:27 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Simon Riggs <simon(at)2ndQuadrant(dot)com> writes:
>> If you don't log changes to tables, you have two choices after a crash:
>> 1) truncate the table and any indexes
>> 2) rebuild any indexes damaged by the crash
>
> No, you have only one choice, and that's (1), because there's no
> guarantee that what's in the table file is meaningful.

>> Heap blocks would be zeroed if they were found to be damaged, following a crash.
>
> This is sheerest fantasy.  And even if you could implement it, what sort
> of feature would you be offering?  "Your data is preserved except when
> it isn't"?  People who want that can go use mysql.

AFAIUI, a great many people do.

I am proposing a non-default mode, requiring explicit activation by
the user, which preserves as much data as possible. I am fully aware
that what is proposed is not an optimisation, but a downgrading of
normal resilience in exchange for performance, accepting some data
loss in the event of a crash.

Yes, many other systems support this and people are becoming persuaded
that such risk/reward choices make sense for them.
I see no reason not to provide an option to do this, so people can
make informed choices.

For large sets of low value data, it makes sense. Deleting all the
data simply because some of it might be damaged is not the only
option. IMHO deleting all the data is a surprising option that will
cause many people to curse us. I don't see preserving some of the data
as being worse.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Christopher Browne <cbbrowne(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged tables, persistent kind
Date: 2011-04-24 22:31:55
Message-ID: BANLkTi=mrDxa6PYYC6L44pe5=-eWeQj1wA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Apr 24, 2011 at 6:15 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> For large sets of low value data, it makes sense. Deleting all the
> data simply because some of it might be damaged is not the only
> option. IMHO deleting all the data is a surprising option that will
> cause many people to curse us. I don't see preserving some of the data
> as being worse.

For the "cache table" case, it is *certainly* reasonable to delete
everything upon discovering the risk that some might be damaged.

I have seen cases "in production" where the fix to 'oh, looks like
something's corrupted' is indeed "truncate the cache," and that has
become a Standard Operating Procedure.

Sure, to have Postgres do this is a bit heavy-handed, but it's not as
if this isn't going to be heavily documented with warnings like:
"Contains live plague bacteria. Beware the Rabid Hippopotami. May
cause bleeding at the eyes. If your database crashes, this table WILL
get truncated at startup time."
If the docs are short on warnings, that should probably get
rectified. (I'm allowed to be volunteered to do so :-).)

I'd actually find it unsurprising for such tables to get truncated
even on a clean restart; I'd favor that being an option, as long as
making it so were generally unproblematic. If the application using
such a table can't cope with that logic, better to induce an
understanding of that sooner rather than later ;-).

It seems like a losing battle to try terribly hard to keep the data
around when, by marking it unlogged, the data definition specifically
warned that this was risky.

I'd not go so far as to suggest having autovac TRUNCATE such tables at
random intervals, but that's a pathology that's not completely
incompatible with the DDL declaration :-)
--
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged tables, persistent kind
Date: 2011-04-25 01:36:01
Message-ID: BANLkTi=5=N6+ync7b+ng1tvhzY5HNGvtZA@mail.gmail.com
Lists: pgsql-hackers

On Sun, Apr 24, 2011 at 11:15 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> IMHO deleting all the data is a surprising option that will
> cause many people to curse us. I don't see preserving some of the data
> as being worse.

What possible damage do you want to recover from?

Without WAL logging after a software crash it's possible for update
chains to be broken, for multiple copies of the same tuple to be
visible, for some tuples to disappear but not others, etc.

And without checksums after a hardware crash it'll be possible for
pages to be torn resulting in tuple pointers that land in the middle
of nowhere or tuples that start off fine but are half overwritten with
unrelated garbage.

--
greg


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndQuadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged tables, persistent kind
Date: 2011-04-25 07:14:16
Message-ID: 7E068A07-31F6-481F-9F23-8F6C9BA190D8@gmail.com
Lists: pgsql-hackers

On Apr 24, 2011, at 1:22 PM, Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
> Unlogged tables are a good new feature.

Thanks.

> I noticed Bruce had mentioned they were the equivalent of NoSQL, which
> I don't really accept.

Me neither. I thought that was poorly said.

> Heap blocks would be zeroed if they were found to be damaged, following a crash.

The problem is not so much the blocks that are damaged (e.g. half-written, torn page) but the ones that were never written at all. For example, read page A, read page B, update tuple on page A putting new version on page B, write one but not both of A and B out to the O/S, crash. Everything on disk is a valid page, but they are not coherent taken as a whole. It's normally XLOG replay that fixes this type of situation...

I thought about this problem a bit and I think you could perhaps deal with it by having some sort of partially logged table, where we would XLOG just enough to know which blocks or relations had been modified and only nuke enough data to be certain of being safe. But it isn't clear that there is much use case for this, especially because I think it would give up nearly all the performance benefit.

I do think it might be useful to have an unlogged index on a logged table, somehow frobnicated so that on a crash the index is known invalid and not used until a REINDEX is performed.

...Robert


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged tables, persistent kind
Date: 2011-04-25 09:04:22
Message-ID: BANLkTinDSXW60tkzB4Ubetuy57_NE3YiAA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 25, 2011 at 8:14 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Apr 24, 2011, at 1:22 PM, Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
>> Unlogged tables are a good new feature.
>
> Thanks.
>
>> I noticed Bruce had mentioned they were the equivalent of NoSQL, which
>> I don't really accept.
>
> Me neither. I thought that was poorly said.
>
>> Heap blocks would be zeroed if they were found to be damaged, following a crash.
>
> The problem is not so much the blocks that are damaged (e.g. half-written, torn page) but the ones that were never written at all. For example, read page A, read page B, update tuple on page A putting new version on page B, write one but not both of A and B out to the O/S, crash.  Everything on disk is a valid page, but they are not coherent taken as a whole.  It's normally XLOG replay that fixes this type of situation...

Not really sure it matters what the cause of data loss is, does it?
The zeroing of the blocks definitely causes data loss but the
intention is to bring the table back to a consistent physical state,
not to in any way repair the data loss.

Repeating my words above, this proposed option trades potential minor
data loss for performance.

The amount of data loss on a big table will be <1% of the data loss
caused by truncating the whole table.

This is important on big tables where reloading from a backup might
take a long time.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged tables, persistent kind
Date: 2011-04-25 12:42:54
Message-ID: BANLkTin+i01J_BSX=QZ77+NBr9CRdmQr8g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 25, 2011 at 5:04 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
> On Mon, Apr 25, 2011 at 8:14 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> On Apr 24, 2011, at 1:22 PM, Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
>>> Unlogged tables are a good new feature.
>>
>> Thanks.
>>
>>> I noticed Bruce had mentioned they were the equivalent of NoSQL, which
>>> I don't really accept.
>>
>> Me neither. I thought that was poorly said.
>>
>>> Heap blocks would be zeroed if they were found to be damaged, following a crash.
>>
>> The problem is not so much the blocks that are damaged (e.g. half-written, torn page) but the ones that were never written at all. For example, read page A, read page B, update tuple on page A putting new version on page B, write one but not both of A and B out to the O/S, crash.  Everything on disk is a valid page, but they are not coherent taken as a whole.  It's normally XLOG replay that fixes this type of situation...
>
> Not really sure it matters what the cause of data loss is, does it?
> The zeroing of the blocks definitely causes data loss but the
> intention is to bring the table back to a consistent physical state,
> not to in any way repair the data loss.

Right, but the trick is how you identify which blocks you need to
zero. You used the word "damaged", which to me implied that the block
had been modified in some way but ended up with other than the
expected contents, so that something like a CRC check might detect the
problem. My point (as perhaps you already understand) is that you
could easily have a situation where every block in the table passes a
hypothetical block-level CRC check, but the table as a whole is still
damaged because update chains aren't coherent. So you need some kind
of mechanism for identifying which portions of the table you need to
zero to get back to a guaranteed-coherent state.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged tables, persistent kind
Date: 2011-04-25 18:21:19
Message-ID: BANLkTikaiJi8jVhyNoUAnmm7h7DH8etwhg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 25, 2011 at 1:42 PM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Apr 25, 2011 at 5:04 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> On Mon, Apr 25, 2011 at 8:14 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>>> On Apr 24, 2011, at 1:22 PM, Simon Riggs <simon(at)2ndQuadrant(dot)com> wrote:
>>>> Unlogged tables are a good new feature.
>>>
>>> Thanks.
>>>
>>>> I noticed Bruce had mentioned they were the equivalent of NoSQL, which
>>>> I don't really accept.
>>>
>>> Me neither. I thought that was poorly said.
>>>
>>>> Heap blocks would be zeroed if they were found to be damaged, following a crash.
>>>
>>> The problem is not so much the blocks that are damaged (e.g. half-written, torn page) but the ones that were never written at all. For example, read page A, read page B, update tuple on page A putting new version on page B, write one but not both of A and B out to the O/S, crash.  Everything on disk is a valid page, but they are not coherent taken as a whole.  It's normally XLOG replay that fixes this type of situation...
>>
>> Not really sure it matters what the cause of data loss is, does it?
>> The zeroing of the blocks definitely causes data loss but the
>> intention is to bring the table back to a consistent physical state,
>> not to in any way repair the data loss.
>
> Right, but the trick is how you identify which blocks you need to
> zero.  You used the word "damaged", which to me implied that the block
> had been modified in some way but ended up with other than the
> expected contents, so that something like a CRC check might detect the
> problem.  My point (as perhaps you already understand) is that you
> could easily have a situation where every block in the table passes a
> hypothetical block-level CRC check, but the table as a whole is still
> damaged because update chains aren't coherent.  So you need some kind
> of mechanism for identifying which portions of the table you need to
> zero to get back to a guaranteed-coherent state.

That sounds like progress.

The current mechanism is "truncate complete table". There are clearly
other mechanisms that would not remove all data.

Probably the common case would be insert-only data.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged tables, persistent kind
Date: 2011-04-25 18:39:10
Message-ID: BANLkTi=ysNC0RFTAoWd+oMeZ-kUTSUn2RQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Apr 25, 2011 at 2:21 PM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>> Right, but the trick is how you identify which blocks you need to
>> zero.  You used the word "damaged", which to me implied that the block
>> had been modified in some way but ended up with other than the
>> expected contents, so that something like a CRC check might detect the
>> problem.  My point (as perhaps you already understand) is that you
>> could easily have a situation where every block in the table passes a
>> hypothetical block-level CRC check, but the table as a whole is still
>> damaged because update chains aren't coherent.  So you need some kind
>> of mechanism for identifying which portions of the table you need to
>> zero to get back to a guaranteed-coherent state.
>
> That sounds like progress.
>
> The current mechanism is "truncate complete table". There are clearly
> other mechanisms that would not remove all data.

No doubt. Consider a block B. If the system crashes when block B is
dirty either in the OS cache or shared_buffers, then you must zero B,
or truncate it away. If it was clean in both places, however, it's
good data and you can keep it.

So you can imagine, for example, a scheme where the relation is
divided into 8MB chunks, and we WAL-log the first operation after each
checkpoint that touches a chunk. Replay zeroes
the chunk, and we also invalidate all the indexes (the user must
REINDEX to get them working again). I think that would be safe, and
certainly the WAL-logging overhead would be far less than WAL-logging
every change, since we'd need to emit only ~16 bytes of WAL for every
8MB written, rather than ~8MB of WAL for every 8MB written. It
wouldn't allow some of the optimizations that the current unlogged
tables can get away with only because they WAL-log exactly nothing -
and selectively zeroing chunks of a large table might slow down
startup quite a bit - but it might still be useful to someone.
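
The bookkeeping for that could be quite small. A minimal sketch of
tracking which 8MB chunks have been touched since the last checkpoint,
with all names invented for illustration (none of this is existing
code):

    /*
     * Sketch of per-chunk first-touch WAL logging: only the first
     * modification of a chunk after each checkpoint emits a (tiny)
     * WAL record naming the chunk.
     */
    #include <stdint.h>
    #include <string.h>

    #define BLCKSZ       8192
    #define CHUNK_BLOCKS ((8 * 1024 * 1024) / BLCKSZ) /* 1024 blocks/chunk */
    #define MAX_CHUNKS   4096

    /* One bit per chunk: logged since the last checkpoint? */
    static uint8_t chunk_logged[MAX_CHUNKS / 8];

    static void
    emit_chunk_touched_record(uint32_t chunkno)
    {
        /* WAL insertion elided; the record would be ~16 bytes. */
        (void) chunkno;
    }

    /* Called before any modification of block 'blkno'. */
    static void
    log_first_touch(uint32_t blkno)
    {
        uint32_t chunkno = blkno / CHUNK_BLOCKS;

        if (!(chunk_logged[chunkno / 8] & (1u << (chunkno % 8))))
        {
            emit_chunk_touched_record(chunkno);
            chunk_logged[chunkno / 8] |= (uint8_t) (1u << (chunkno % 8));
        }
    }

    /* At each checkpoint, forget which chunks were logged. */
    static void
    reset_at_checkpoint(void)
    {
        memset(chunk_logged, 0, sizeof(chunk_logged));
    }

Crash recovery would then zero every chunk named by a "touched" record
since the last checkpoint, and mark the table's indexes invalid
pending REINDEX.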

However, I think that the "logged table, unlogged index" idea is
probably the most promising thing to think about doing first. It's
easy to imagine all sorts of uses for that sort of thing even in cases
where people can't afford to have any data get zeroed, and it would
provide a convenient building block for something like the above if we
eventually wanted to go that way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unlogged tables, persistent kind
Date: 2011-04-25 19:05:29
Message-ID: 18604.1303758329@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> However, I think that the "logged table, unlogged index" idea is
> probably the most promising thing to think about doing first.

+1 for that --- it's clean, has a clear use-case, and would allow us
to manage the current mess around hash indexes more cleanly.
That is, hash indexes would always be treated as unlogged.

(Or of course we could fix the lack of WAL logging for hash indexes,
but I notice a lack of people stepping up to do that.)

regards, tom lane