Re: Proposal for CSN based snapshots

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Bruce Momjian <bruce(at)momjian(dot)us>, Rajeev rastogi <rajeev(dot)rastogi(at)huawei(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Markus Wanner <markus(at)bluegap(dot)ch>
Subject: Re: Proposal for CSN based snapshots
Date: 2014-05-30 18:36:28
Message-ID: 5388CFAC.8020700@vmware.com
Lists: pgsql-hackers

On 05/30/2014 06:27 PM, Andres Freund wrote:
> On 2014-05-30 17:59:23 +0300, Heikki Linnakangas wrote:
>> One thorny issue came up in discussions with other hackers on this in PGCon:
>>
>> When a transaction is committed asynchronously, it becomes visible to other
>> backends before the commit WAL record is flushed. With CSN-based snapshots,
>> the order that transactions become visible is always based on the LSNs of
>> the WAL records. This is a problem when there is a mix of synchronous and
>> asynchronous commits:
>>
>> If transaction A commits synchronously with commit LSN 1, and transaction B
>> commits asynchronously with commit LSN 2, B cannot become visible before A.
>> And we cannot acknowledge B as committed to the client until it's visible to
>> other transactions. That means that B will have to wait for A's commit
>> record to be flushed to disk, before it can return, even though it was an
>> asynchronous commit.
>
>> I personally think that's annoying, but we can live with it. The most common
>> usage of synchronous_commit=off is to run a lot of transactions in that
>> mode, setting it in postgresql.conf. And it wouldn't completely defeat the
>> purpose of mixing synchronous and asynchronous commits either: an
>> asynchronous commit still only needs to wait for any already-logged
>> synchronous commits to be flushed to disk, not the commit record of the
>> asynchronous transaction itself.
>
> I have a hard time believing that users won't hate us for such a
> regression. It's pretty common to mix both sorts of transactions and
> this will - by my guesstimate - dramatically reduce throughput for the
> async backends.

Yeah, it probably would. Not sure how many people would care.

For an asynchronous commit, we could store the current WAL flush
location as the commit LSN, instead of the location of the commit
record. That would break the property that LSN == commit order, but that
property is fundamentally incompatible with having async commits become
visible without flushing previous transactions. Or we could even make it
configurable; it would be fairly easy to support both behaviors.
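
Roughly like this (a hypothetical sketch, not actual patch code;
GetFlushRecPtr() is the existing xlog.c function, the rest is invented
for illustration):

/*
 * Pick the CSN for a committing transaction.  For an async commit, use
 * the current WAL flush pointer: everything up to that point is already
 * durable, so the backend never waits on other transactions' commit
 * records before becoming visible.
 */
XLogRecPtr
AssignCommitCSN(XLogRecPtr commitRecPtr, bool synchronousCommit)
{
    if (!synchronousCommit)
        return GetFlushRecPtr();

    /* Synchronous commit: the commit record's own LSN is the CSN. */
    return commitRecPtr;
}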

>> * Logical decoding is broken. I hacked on it enough that it looks roughly
>> sane and it compiles, but didn't spend more time to debug.
>
> I think we can live with it not working for the first few
> iterations. I'll look into it once the patch has stabilized a bit.

Thanks!

>> * I expanded pg_clog to 64-bits per XID, but people suggested keeping
>> pg_clog as is, with two bits per commit, and adding a new SLRU for the
>> commit LSNs beside it. Probably will need to do something like that to avoid
>> bloating the clog.
>
> It also influences how on-disk compatibility is dealt with. So: How are
> you planning to deal with on-disk compatibility?
>
>> * Add some kind of backend-private caching of clog, to make it faster to
>> access. The visibility checks are now hitting the clog a lot more heavily
>> than before, as you need to check the clog even if the hint bits are set, if
>> the XID falls between xmin and xmax of the snapshot.
>
> That'll hurt a lot in concurrent scenarios :/. Have you measured how
> 'wide' xmax-xmin usually is?

That depends entirely on the workload. The worst case is a mix of a
long-running transaction and a lot of short transactions. It could grow
to millions of transactions or more in that case.
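
To illustrate why the lookups get so much heavier: with CSNs, even a
tuple whose xmin already carries the committed hint bit needs a clog
lookup if the XID falls in the snapshot's window. A rough sketch
(invented names; CLogGetCommitLSN and snapshotlsn are assumptions, not
the patch's actual API):

/*
 * Visibility test for an XID already known to have committed
 * (e.g. via hint bits).
 */
bool
XidVisibleInSnapshot(TransactionId xid, Snapshot snapshot)
{
    XLogRecPtr  commitlsn;

    if (TransactionIdPrecedes(xid, snapshot->xmin))
        return true;        /* committed before the snapshot was taken */
    if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
        return false;       /* started after the snapshot was taken */

    /*
     * In the xmin..xmax window, the hint bit alone can't tell us
     * whether the commit happened before or after the snapshot, so we
     * must fetch the commit LSN from the clog.
     */
    commitlsn = CLogGetCommitLSN(xid);
    return commitlsn <= snapshot->snapshotlsn;
}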

> I wonder if we could just copy a range of
> values from the clog when we start scanning....

I don't think that's practical, if the xmin-xmax gap is wide.

Perhaps we should take the bull by the horns and make clog faster to
look up. If we e.g. mmapped the clog file into backend-private address
space, we could avoid all the locking overhead of an SLRU. On platforms with
atomic 64-bit instructions, you could read the clog with just a memory
barrier. Even on other architectures, you'd only need a spinlock.
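
For example (just a sketch, assuming 8 bytes of commit LSN per XID,
atomic aligned 64-bit loads, and ignoring segment/offset arithmetic and
XID wraparound; names invented):

static volatile uint64 *csnlog_map;    /* mmap()ed commit-LSN file */

static XLogRecPtr
CSNLogReadCommitLSN(TransactionId xid)
{
    XLogRecPtr  lsn;

    /*
     * A plain aligned 64-bit load is atomic on these platforms, so no
     * SLRU buffer lock is taken.  The read barrier keeps subsequent
     * reads from being reordered ahead of this one.
     */
    lsn = csnlog_map[xid];
    pg_read_barrier();
    return lsn;
}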

- Heikki
