From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-26 21:47:39
Message-ID: 544D6BFB.4060906@fuzzy.cz
Lists: pgsql-hackers

Hi all,

currently, CREATE DATABASE forces an immediate checkpoint (actually, it
forces two, but the time interval between them is usually rather small).
For traditional deployments this is not a big deal, because creating a
database is a rare event and can be planned for off-peak times.

However, for shared (cloud-like) deployments this is not the case. E.g.
we're hosting hundreds (or even thousands) of customer databases on some
clusters, and creating a new database is quite common.

This turns the checkpoints into a significant pain point for us, because
each one forces a write of all the dirty buffers from all the databases.
No matter how well we tune the spread checkpoints, this makes them
inefficient and causes significant I/O spikes (especially with larger
shared_buffers values). It also makes the CREATE DATABASE command take a
long time, rendering it rather impractical for 'interactive' use (a
user hitting a button in a UI or something).

Based on the talks from pgconf.eu, where I've seen this mentioned in at
least two talks (and in the hallway track), we're not alone. I'd like to
address this, if possible.

The usual workaround for this is "create the databases in advance", but
that's not always possible (e.g. when having more than a handful of
templates, or when the template evolves over time).

After eyeballing the code for an hour or two, I think CREATE DATABASE
should be fine with performing only a 'partial checkpoint' on the
template database - calling FlushDatabaseBuffers and processing unlink
requests, as suggested by the comment in createdb().

It's not exactly a trivial change, but it does not seem like
frighteningly difficult coding either.

The templates are usually static, so this would minimize both the CREATE
DATABASE duration and disruption to the cluster it causes.

My fear, however, is that while the code will work, it will break
recovery in some subtle way (as illustrated by the comments about 8.0
PITR bugs in createdb).

Am I missing something that makes this dead in the water?

regards
Tomas


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 09:47:05
Message-ID: 544E1499.6070105@vmware.com
Lists: pgsql-hackers

On 10/26/2014 11:47 PM, Tomas Vondra wrote:
> After eyeballing the code for an hour or two, I think CREATE DATABASE
> should be fine with performing only a 'partial checkpoint' on the
> template database - calling FlushDatabaseBuffers and processing unlink
> requests, as suggested by the comment in createdb().

Hmm. You could replace the first checkpoint with that, but I don't think
that's enough for the second. To get any significant performance
benefit, you need to get rid of both checkpoints, because doing two
checkpoints one after another is almost as fast as doing a single
checkpoint; the second checkpoint has very little work to do because the
first checkpoint already flushed out everything.

The second checkpoint, after copying but before commit, is done because
(from the comments in createdb function):

> * #1: When PITR is off, we don't XLOG the contents of newly created
> * indexes; therefore the drop-and-recreate-whole-directory behavior
> * of DBASE_CREATE replay would lose such indexes.
>
> * #2: Since we have to recopy the source database during DBASE_CREATE
> * replay, we run the risk of copying changes in it that were
> * committed after the original CREATE DATABASE command but before the
> * system crash that led to the replay. This is at least unexpected
> * and at worst could lead to inconsistencies, eg duplicate table
> * names.

Doing only FlushDatabaseBuffers would not prevent these issues - you
need a full checkpoint. These issues are better explained here:
http://www.postgresql.org/message-id/28884.1119727671@sss.pgh.pa.us

To solve #1, we could redesign CREATE DATABASE so that replaying the
DBASE_CREATE record doesn't zap the old directory, and also doesn't copy
any files. We could instead just assume that if the transaction commits,
all the files have been copied and fsync'd already, like we assume that
if a CREATE INDEX commits in wal_level=minimal, the underlying file was
fsync'd before the commit.

That would also solve #2, when doing crash recovery. But it would remain
when doing archive recovery. I guess we could still redo the copy when
in archive recovery mode. I believe it would be the first time we have a
WAL record that's replayed differently in crash recovery than in archive
recovery, so here be dragons...

> It's not exactly trivial change, but it does not seem frighteningly
> difficult coding either.
>
> The templates are usually static, so this would minimize both the CREATE
> DATABASE duration and disruption to the cluster it causes.

I wonder if we should bite the bullet and start WAL-logging all the
files that are copied from the template database to the new database.
When the template database is small (template0 is 6.4MB currently), that
wouldn't generate too much WAL. We could perhaps do that only if the
template database is small, and do the checkpoints otherwise, although I
wouldn't like to have subtly different behavior depending on database
size like that.

- Heikki


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 11:06:07
Message-ID: CAOeZVieSJsQwmoz0J40vu-EYRC6GY6aRy_5kpFgLOMk9u3aZ6A@mail.gmail.com
Lists: pgsql-hackers

> To solve #1, we could redesign CREATE DATABASE so that replaying the
> DBASE_CREATE record doesn't zap the old directory, and also doesn't copy
> any files. We could instead just assume that if the transaction commits,
> all the files have been copied and fsync'd already, like we assume that if
> a CREATE INDEX commits in wal_level=minimal, the underlying file was
> fsync'd before the commit.
>

Do you mean that during a recovery, we just let the database directory be
and assume that it is in good shape since the transaction committed
originally?

> I wonder if we should bite the bullet and start WAL-logging all the files
> that are copied from the template database to the new database. When the
> template database is small (template0 is 6.4MB currently), that wouldn't
> generate too much WAL. We could perhaps do that only if the template
> database is small, and do the checkpoints otherwise, although I wouldn't
> like to have subtly different behavior depending on database size like that.

For the sort of workload Tomas described above (creating a lot of databases
on the fly), we may end up with a lot of WAL eventually if we do this.

Regards,

Atri


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 11:14:54
Message-ID: 544E292E.1020605@vmware.com
Lists: pgsql-hackers

On 10/27/2014 01:06 PM, Atri Sharma wrote:
>> To solve #1, we could redesign CREATE DATABASE so that replaying the
>> DBASE_CREATE record doesn't zap the old directory, and also doesn't copy
>> any files. We could instead just assume that if the transaction commits,
>> all the files have been copied and fsync'd already, like we assume that if
>> a CREATE INDEX commits in wal_level=minimal, the underlying file was
>> fsync'd before the commit.
>>
>
> Do you mean that during a recovery, we just let the database directory be
> and assume that it is in good shape since the transaction committed
> originally?

Right.

>> I wonder if we should bite the bullet and start WAL-logging all the files
>> that are copied from the template database to the new database. When the
>> template database is small (template0 is 6.4MB currently), that wouldn't
>> generate too much WAL. We could perhaps do that only if the template
>> database is small, and do the checkpoints otherwise, although I wouldn't
>> like to have subtly different behavior depending on database size like that.
>
> For the sort of workload Tomas described above (creating a lot of databases
> on the fly), we may end up with a lot of WAL eventually if we do this.

I think writing 6 MB for each CREATE DATABASE would be OK. For
comparison, a pg_switch_xlog() call will waste on average half a
segment, i.e. 8MB. It's a lot cheaper than a checkpoint on a busy
system, for sure. But of course, if the template database is larger,
then you will generate more WAL.

- Heikki


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 11:23:21
Message-ID: CAOeZVicDgcFgwHB99ma9MpQ03HJMPY2ridF-n8GNkRtgk7FEgw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 27, 2014 at 4:44 PM, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote:

> On 10/27/2014 01:06 PM, Atri Sharma wrote:
>
>>> To solve #1, we could redesign CREATE DATABASE so that replaying the
>>> DBASE_CREATE record doesn't zap the old directory, and also doesn't copy
>>> any files. We could instead just assume that if the transaction commits,
>>> all the files have been copied and fsync'd already, like we assume that
>>> if
>>> a CREATE INDEX commits in wal_level=minimal, the underlying file was
>>> fsync'd before the commit.
>>>
>>>
>> Do you mean that during a recovery, we just let the database directory be
>> and assume that it is in good shape since the transaction committed
>> originally?
>>
>
> Right.

It does make sense. However, with the checkpoint after creating the files
gone, the window between the creation of the files and the actual commit
might grow, increasing the possibility of a crash during that period
leaving an orphaned database directory behind. Then again, my
understanding of the consequences of removing the checkpoint might be
incorrect, so my fears might be unfounded.

Regards,

Atri


From: "Tomas Vondra" <tv(at)fuzzy(dot)cz>
To: "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>
Cc: "Tomas Vondra" <tv(at)fuzzy(dot)cz>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 12:00:29
Message-ID: 22147331d79514289d417eac645ed588.squirrel@2.emaily.eu
Lists: pgsql-hackers

On 27 October 2014 10:47, Heikki Linnakangas wrote:
> On 10/26/2014 11:47 PM, Tomas Vondra wrote:
>> After eyeballing the code for an hour or two, I think CREATE DATABASE
>> should be fine with performing only a 'partial checkpoint' on the
>> template database - calling FlushDatabaseBuffers and processing unlink
>> requests, as suggested by the comment in createdb().
>
> Hmm. You could replace the first checkpoint with that, but I don't think
> that's enough for the second. To get any significant performance
> benefit, you need to get rid of both checkpoints, because doing two
> checkpoints one after another is almost as fast as doing a single
> checkpoint; the second checkpoint has very little work to do because the
> first checkpoint already flushed out everything.

Yes, that's why I wrote that two checkpoints not too far away are
effectively a single checkpoint. OTOH if the template database is not
small, it may take a while to copy the data, increasing the distance
between the checkpoints.

While our template databases are small (in the order of ~10-100MB), there
are probably people using this to clone much larger databases.

> The second checkpoint, after copying but before commit, is done because
> (from the comments in createdb function):
>
>> * #1: When PITR is off, we don't XLOG the contents of newly created
>> * indexes; therefore the drop-and-recreate-whole-directory behavior
>> * of DBASE_CREATE replay would lose such indexes.
>>
>> * #2: Since we have to recopy the source database during DBASE_CREATE
>> * replay, we run the risk of copying changes in it that were
>> * committed after the original CREATE DATABASE command but before the
>> * system crash that led to the replay. This is at least unexpected
>> * and at worst could lead to inconsistencies, eg duplicate table
>> * names.
>
> Doing only FlushDatabaseBuffers would not prevent these issues - you
> need a full checkpoint. These issues are better explained here:
> http://www.postgresql.org/message-id/28884.1119727671@sss.pgh.pa.us
>
> To solve #1, we could redesign CREATE DATABASE so that replaying the
> DBASE_CREATE record doesn't zap the old directory, and also doesn't copy
> any files. We could instead just assume that if the transaction commits,
> all the files have been copied and fsync'd already, like we assume that
> if a CREATE INDEX commits in wal_level=minimal, the underlying file was
> fsync'd before the commit.
>
> That would also solve #2, when doing crash recovery. But it would remain
> when doing archive recovery. I guess we could still redo the copy when
> in archive recovery mode. I believe it would be the first time we have a
> WAL record that's replayed differently in crash recovery than in archive
> recovery, so here be dragons...

Yeah ...

>> It's not exactly trivial change, but it does not seem frighteningly
>> difficult coding either.
>>
>> The templates are usually static, so this would minimize both the CREATE
>> DATABASE duration and disruption to the cluster it causes.
>
> I wonder if we should bite the bullet and start WAL-logging all the
> files that are copied from the template database to the new database.
> When the template database is small (template0 is 6.4MB currently), that
> wouldn't generate too much WAL. We could perhaps do that only if the
> template database is small, and do the checkpoints otherwise, although I
> wouldn't like to have subtly different behavior depending on database
> size like that.

IMHO writing all the data into the WAL would be the cleanest solution.

Also, what is a small database? I don't think a static value will work,
because the sweet spot between the current approach (forcing two
checkpoints) and writing everything into the WAL depends on the amount of
dirty buffers that need to be checkpointed, which is mostly driven by the
size of shared buffers and the write activity. For small shared buffers
and/or a mostly-read workload, checkpoints are cheap, so the 'small
database' threshold (below which the WAL approach wins) is much higher
than for large shared buffers or write-heavy workloads.

So maybe if we could determine the amount of data to be checkpointed, and
then base the decision on that, that'd work better? This would also have
to take into account that writing into WAL is sequential, while
checkpoints usually cause random writes all over the datafiles (which is
more expensive).

Another option might be forcing just a "spread" checkpoint, not the
immediate one (which is what we do now). That would not fix the CREATE
DATABASE duration (actually, it would make it longer), but it would lower
the impact on other activity on the machine.

regards
Tomas


From: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 12:50:39
Message-ID: CAOeZVicHLJBD7nAJynMq8QaFfAjy-pyXcL9ZXAcuaKX=RrBVaA@mail.gmail.com
Lists: pgsql-hackers

> IMHO writing all the data into a WAL would be the cleanest solution.
>
> Also, what is a small database? I don't think a static value will work,
> because the sweet spot between the current approach (forcing two
> checkpoints) and writing everything in WAL depends on the amount of dirty
> buffers that need to be checkpointed. Which is mostly driven by the size
> of shared buffers and write activity - for small shared buffers and/or
> mostly-read workload, checkpoints are cheap, so the 'small database'
> threshold (when choosing the WAL approach) is much higher than for large
> shared buffers or write-heavy workloads.
>

So are you proposing having a heuristic based on the amount of data in
shared buffers and write activity? Do you have something in mind that works
for general workloads as well?

> So maybe if we could determine the amount of data to be checkpointed, and
> then base the decision on that, that'd work better? This would also have
> to take into account that writing into WAL is sequential, while
> checkpoints usually cause random writes all over the datafiles (which is
> more expensive).
>
> Another option might be forcing just a "spread" checkpoint, not the
> immediate one (which is what we do now). That would not fix the CREATE
> DATABASE duration (actually, it would make it longer), but it would lower
> the impact on other activity on the machine.
I believe this to be the cleanest way to reduce the amount of I/O
generated. If I understand correctly, the original problem you mentioned
was not the time CREATE DATABASE is taking but rather the amount of I/O
each one is generating.

This also makes me wonder whether it makes sense to explore group commits
around the creation of files for a new database (within the same backend,
of course). This could be opt-in, if the user knows he/she is going to
create a lot of databases in the near future and is fine with a large
spike in I/O at one go. Again, it might be even more broken than the
current scenario, but it depends on what the user wants...

Regards,

Atri



From: "Tomas Vondra" <tv(at)fuzzy(dot)cz>
To: "Atri Sharma" <atri(dot)jiit(at)gmail(dot)com>
Cc: "Tomas Vondra" <tv(at)fuzzy(dot)cz>, "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 13:09:25
Message-ID: 9ca60241895f4cede3269ea33a70d9bd.squirrel@2.emaily.eu
Lists: pgsql-hackers

On 27 October 2014 13:50, Atri Sharma wrote:
>> IMHO writing all the data into a WAL would be the cleanest solution.
>>
>> Also, what is a small database? I don't think a static value will work,
>> because the sweet spot between the current approach (forcing two
>> checkpoints) and writing everything in WAL depends on the amount of
>> dirty
>> buffers that need to be checkpointed. Which is mostly driven by the size
>> of shared buffers and write activity - for small shared buffers and/or
>> mostly-read workload, checkpoints are cheap, so the 'small database'
>> threshold (when choosing the WAL approach) is much higher than for large
>> shared buffers or write-heavy workloads.
>>
>
> So are you proposing having a heuristic based on the amount of data in
> shared buffers and write activity? Do you have something in mind that
> works
> for general workloads as well?
>
>
>> So maybe if we could determine the amount of data to be checkpointed,
>> and
>> then base the decision on that, that'd work better? This would also have
>> to take into account that writing into WAL is sequential, while
>> checkpoints usually cause random writes all over the datafiles (which is
>> more expensive).
>>
>> Another option might be forcing just a "spread" checkpoint, not the
>> immediate one (which is what we do now). That would not fix the CREATE
>> DATABASE duration (actually, it would make it longer), but it would
>> lower
>> the impact on other activity on the machine.
> I believe this to be the cleanest way to reduce the amount of I/O
> generated. If I understand correctly, the original problem you mentioned
> was not the time CREATE DATABASE is taking but rather the amount of I/O
> each one is generating.

Not exactly. There are two related issues, both caused by the I/O activity
from the CHECKPOINT.

(a) I/O spike, because of checkpointing everything
(b) long CREATE DATABASE durations

This "spread CREATE DATABASE" only fixes (a), and makes (b) a bit worse.

It would, however, make the 'create databases in advance' workaround a
bit more flexible, because it would allow more frequent runs of the cron
job that actually creates them. Right now we're forced to schedule it
rather rarely (say, once a day) to minimize the impact, which has
downsides (a higher probability of running out of spare databases, slow
response to an updated template, etc.).

If we can fix both (a) and (b), that'd be great. If we can fix only (a),
so be it.

> This also leads me to think if it makes sense to explore group commits
> around the creation of files for a new database (for a same backend, of
> course). This might be on call, if the user knows he/she is going to
> create
> a lot of databases in the near future and is fine with a large spike in
> I/O
> at one go. Again, might be even more broken than the current scenario, but
> depends on what the user wants...

That seems much more complex than what I proposed to do, but maybe I'm
wrong. If we could get rid of the "checkpoint everything" altogether, we
don't really need the group commit, no?

regards
Tomas


From: "Tomas Vondra" <tv(at)fuzzy(dot)cz>
To: "Heikki Linnakangas" <hlinnakangas(at)vmware(dot)com>
Cc: "Tomas Vondra" <tv(at)fuzzy(dot)cz>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 13:21:58
Message-ID: 429fe6fc1aa1d3a91804af032bbf1b5d.squirrel@2.emaily.eu
Lists: pgsql-hackers

On 27 October 2014 10:47, Heikki Linnakangas wrote:
> On 10/26/2014 11:47 PM, Tomas Vondra wrote:
>> After eyeballing the code for an hour or two, I think CREATE DATABASE
>> should be fine with performing only a 'partial checkpoint' on the
>> template database - calling FlushDatabaseBuffers and processing unlink
>> requests, as suggested by the comment in createdb().
>
> Hmm. You could replace the first checkpoint with that, but I don't think
> that's enough for the second. To get any significant performance
> benefit, you need to get rid of both checkpoints, because doing two
> checkpoints one after another is almost as fast as doing a single
> checkpoint; the second checkpoint has very little work to do because the
> first checkpoint already flushed out everything.
>
> The second checkpoint, after copying but before commit, is done because
> (from the comments in createdb function):
>
>> * #1: When PITR is off, we don't XLOG the contents of newly created
>> * indexes; therefore the drop-and-recreate-whole-directory behavior
>> * of DBASE_CREATE replay would lose such indexes.
>>
>> * #2: Since we have to recopy the source database during DBASE_CREATE
>> * replay, we run the risk of copying changes in it that were
>> * committed after the original CREATE DATABASE command but before the
>> * system crash that led to the replay. This is at least unexpected
>> * and at worst could lead to inconsistencies, eg duplicate table
>> * names.
>
> Doing only FlushDatabaseBuffers would not prevent these issues - you
> need a full checkpoint. These issues are better explained here:
> http://www.postgresql.org/message-id/28884.1119727671@sss.pgh.pa.us

Thinking about this a bit more, do we really need a full checkpoint, that
is, a checkpoint of all the databases in the cluster? Why is
checkpointing the source database not enough?

I mean, when we use database A as a template, why do we need to checkpoint
B, C, D and F too? (Apologies if this is somehow obvious, I'm way out of
my comfort zone in this part of the code.)

regards
Tomas


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 13:28:03
Message-ID: 544E4863.2080407@vmware.com
Lists: pgsql-hackers

On 10/27/2014 03:21 PM, Tomas Vondra wrote:
> On 27 October 2014 10:47, Heikki Linnakangas wrote:
>> On 10/26/2014 11:47 PM, Tomas Vondra wrote:
>>> After eyeballing the code for an hour or two, I think CREATE DATABASE
>>> should be fine with performing only a 'partial checkpoint' on the
>>> template database - calling FlushDatabaseBuffers and processing unlink
>>> requests, as suggested by the comment in createdb().
>>
>> Hmm. You could replace the first checkpoint with that, but I don't think
>> that's enough for the second. To get any significant performance
>> benefit, you need to get rid of both checkpoints, because doing two
>> checkpoints one after another is almost as fast as doing a single
>> checkpoint; the second checkpoint has very little work to do because the
>> first checkpoint already flushed out everything.
>>
>> The second checkpoint, after copying but before commit, is done because
>> (from the comments in createdb function):
>>
>>> * #1: When PITR is off, we don't XLOG the contents of newly created
>>> * indexes; therefore the drop-and-recreate-whole-directory behavior
>>> * of DBASE_CREATE replay would lose such indexes.
>>>
>>> * #2: Since we have to recopy the source database during DBASE_CREATE
>>> * replay, we run the risk of copying changes in it that were
>>> * committed after the original CREATE DATABASE command but before the
>>> * system crash that led to the replay. This is at least unexpected
>>> * and at worst could lead to inconsistencies, eg duplicate table
>>> * names.
>>
>> Doing only FlushDatabaseBuffers would not prevent these issues - you
>> need a full checkpoint. These issues are better explained here:
>> http://www.postgresql.org/message-id/28884.1119727671@sss.pgh.pa.us
>
> Thinking about this a bit more, do we really need a full checkpoint? That
> is a checkpoint of all the databases in the cluster? Why checkpointing the
> source database is not enough?
>
> I mean, when we use database A as a template, why do we need to checkpoint
> B, C, D and F too? (Apologies if this is somehow obvious, I'm way out of
> my comfort zone in this part of the code.)

A full checkpoint ensures that you always begin recovery *after* the
DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
crash recovery (except when you crash before the transaction commits, in
which case it doesn't matter if the new database's directory is borked).

- Heikki


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 13:46:41
Message-ID: 18678.1414417601@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> On 10/27/2014 03:21 PM, Tomas Vondra wrote:
>> Thinking about this a bit more, do we really need a full checkpoint? That
>> is a checkpoint of all the databases in the cluster? Why checkpointing the
>> source database is not enough?

> A full checkpoint ensures that you always begin recovery *after* the
> DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
> crash recovery (except when you crash before the transaction commits, in
> which case it doesn't matter if the new database's directory is borked).

Yeah. After re-reading the 2005 thread, I wonder if we shouldn't just
bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
all the copied files instead of doing a "cp -r"-equivalent directory copy.
That would fix a number of existing replay hazards as well as making it
safe to do what Tomas wants. In the small scale this would cause more I/O
(2 copies of the template database's data) but in production situations
we might well come out ahead by avoiding a forced checkpoint of the rest
of the cluster. Also I guess we could skip WAL-logging if WAL archiving
is off, similarly to the existing optimization for CREATE INDEX etc.

regards, tom lane


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 13:51:41
Message-ID: 20141027135141.GB27636@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-10-27 09:46:41 -0400, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
> > On 10/27/2014 03:21 PM, Tomas Vondra wrote:
> >> Thinking about this a bit more, do we really need a full checkpoint? That
> >> is a checkpoint of all the databases in the cluster? Why checkpointing the
> >> source database is not enough?
>
> > A full checkpoint ensures that you always begin recovery *after* the
> > DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
> > crash recovery (except when you crash before the transaction commits, in
> > which case it doesn't matter if the new database's directory is borked).
>
> Yeah. After re-reading the 2005 thread, I wonder if we shouldn't just
> bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
> all the copied files instead of doing a "cp -r"-equivalent directory copy.
> That would fix a number of existing replay hazards as well as making it
> safe to do what Tomas wants. In the small scale this would cause more I/O
> (2 copies of the template database's data) but in production situations
> we might well come out ahead by avoiding a forced checkpoint of the rest
> of the cluster. Also I guess we could skip WAL-logging if WAL archiving
> is off, similarly to the existing optimization for CREATE INDEX etc.

+1.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 16:24:25
Message-ID: 544E71B9.6060107@vmware.com
Lists: pgsql-hackers

On 10/27/2014 03:46 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
>> On 10/27/2014 03:21 PM, Tomas Vondra wrote:
>>> Thinking about this a bit more, do we really need a full checkpoint? That
>>> is a checkpoint of all the databases in the cluster? Why checkpointing the
>>> source database is not enough?
>
>> A full checkpoint ensures that you always begin recovery *after* the
>> DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
>> crash recovery (except when you crash before the transaction commits, in
>> which case it doesn't matter if the new database's directory is borked).
>
> Yeah. After re-reading the 2005 thread, I wonder if we shouldn't just
> bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
> all the copied files instead of doing a "cp -r"-equivalent directory copy.
> That would fix a number of existing replay hazards as well as making it
> safe to do what Tomas wants. In the small scale this would cause more I/O
> (2 copies of the template database's data) but in production situations
> we might well come out ahead by avoiding a forced checkpoint of the rest
> of the cluster. Also I guess we could skip WAL-logging if WAL archiving
> is off, similarly to the existing optimization for CREATE INDEX etc.

That would be a nasty surprise for anyone who's using CREATE DATABASE as
a fast way to clone a large database. But I would be OK with that, at
least if we can skip the WAL-logging with wal_level=minimal.

- Heikki


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 21:58:31
Message-ID: 544EC007.40702@fuzzy.cz
Lists: pgsql-hackers

On 27.10.2014 17:24, Heikki Linnakangas wrote:
> On 10/27/2014 03:46 PM, Tom Lane wrote:
>> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
>>> On 10/27/2014 03:21 PM, Tomas Vondra wrote:
>>>> Thinking about this a bit more, do we really need a full checkpoint?
>>>> That
>>>> is a checkpoint of all the databases in the cluster? Why
>>>> checkpointing the
>>>> source database is not enough?
>>
>>> A full checkpoint ensures that you always begin recovery *after* the
>>> DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
>>> crash recovery (except when you crash before the transaction commits, in
>>> which case it doesn't matter if the new database's directory is borked).
>>
>> Yeah. After re-reading the 2005 thread, I wonder if we shouldn't just
>> bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
>> all the copied files instead of doing a "cp -r"-equivalent directory
>> copy.
>> That would fix a number of existing replay hazards as well as making it
>> safe to do what Tomas wants. In the small scale this would cause more
>> I/O
>> (2 copies of the template database's data) but in production situations
>> we might well come out ahead by avoiding a forced checkpoint of the rest
>> of the cluster. Also I guess we could skip WAL-logging if WAL archiving
>> is off, similarly to the existing optimization for CREATE INDEX etc.
>
> That would be a nasty surprise for anyone who's using CREATE DATABASE
> as a fast way to clone a large database. But I would be OK with that,
> at least if we can skip the WAL-logging with wal_level=minimal.

That's true. Sadly, I can't think of a solution that would address both
use cases at the same time :-(

The only thing I can think of is having two CREATE DATABASE "flavors":
one keeping the current approach (suitable for fast cloning) and one
with the WAL logging (minimizing the CREATE DATABASE duration and the
impact on other backends).

It will probably make the code significantly more complex, which is not
exactly desirable, I guess. Also, if we keep the current code (even if
only as a special case) it won't eliminate the existing replay hazards
(which was one of Tom's arguments for biting the bullet).

I'm also thinking that for wal_level=archive and large databases, this
won't really eliminate the checkpoint as it will likely generate enough
WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?

That being said, our CREATE DATABASE docs currently say this

Although it is possible to copy a database other than template1 by
specifying its name as the template, this is not (yet) intended as
a general-purpose "COPY DATABASE" facility. The principal
limitation is that no other sessions can be connected to the
template database while it is being copied. CREATE DATABASE will
fail if any other connection exists when it starts; otherwise, new
connections to the template database are locked out until CREATE
DATABASE completes. See Section 21.3 for more information.

I think that this limitation pretty much means no one should use CREATE
DATABASE for cloning live databases in a production environment (because
of the locking).

It also seems to me that the "general-purpose COPY DATABASE" described
in the docs is exactly what we're describing in this thread.

regards
Tomas


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 22:52:44
Message-ID: 31997.1414450364@sss.pgh.pa.us
Lists: pgsql-hackers

Tomas Vondra <tv(at)fuzzy(dot)cz> writes:
> That being said, our CREATE DATABASE docs currently say this

> Although it is possible to copy a database other than template1 by
> specifying its name as the template, this is not (yet) intended as
> a general-purpose "COPY DATABASE" facility. The principal
> limitation is that no other sessions can be connected to the
> template database while it is being copied. CREATE DATABASE will
> fail if any other connection exists when it starts; otherwise, new
> connections to the template database are locked out until CREATE
> DATABASE completes. See Section 21.3 for more information.

> I think that this limitation pretty much means no one should use CREATE
> DATABASE for cloning live databases in production environment (because
> of the locking).

Good point. But the other side of that coin is that if somebody *was*
doing this, the locking would mean that slowing it down would be even
more painful than you might think. Still, personally I'm willing to
accept that downside, given that we've pretty much always had the above
caveat in the docs.

> It also seems to me the "general-purpose COPY DATABASE" described in the
> docs is what we're describing in this thread.

Well, no, because the restriction "nobody can be connected to the source
database" would still apply; relaxing that would be enormously more
complicated, and probably fragile, than what we're talking about here.

regards, tom lane


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 22:57:27
Message-ID: 544ECDD7.1040500@dunslane.net
Lists: pgsql-hackers


On 10/27/2014 05:58 PM, Tomas Vondra wrote:
> On 27.10.2014 17:24, Heikki Linnakangas wrote:
>> On 10/27/2014 03:46 PM, Tom Lane wrote:
>>> Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> writes:
>>>> On 10/27/2014 03:21 PM, Tomas Vondra wrote:
>>>>> Thinking about this a bit more, do we really need a full checkpoint?
>>>>> That
>>>>> is a checkpoint of all the databases in the cluster? Why
>>>>> checkpointing the
>>>>> source database is not enough?
>>>> A full checkpoint ensures that you always begin recovery *after* the
>>>> DBASE_CREATE record. I.e. you never replay a DBASE_CREATE record during
>>>> crash recovery (except when you crash before the transaction commits, in
>>>> which case it doesn't matter if the new database's directory is borked).
>>> Yeah. After re-reading the 2005 thread, I wonder if we shouldn't just
>>> bite the bullet and redesign CREATE DATABASE as you suggest, ie, WAL-log
>>> all the copied files instead of doing a "cp -r"-equivalent directory
>>> copy.
>>> That would fix a number of existing replay hazards as well as making it
>>> safe to do what Tomas wants. In the small scale this would cause more
>>> I/O
>>> (2 copies of the template database's data) but in production situations
>>> we might well come out ahead by avoiding a forced checkpoint of the rest
>>> of the cluster. Also I guess we could skip WAL-logging if WAL archiving
>>> is off, similarly to the existing optimization for CREATE INDEX etc.
>> That would be a nasty surprise for anyone who's using CREATE DATABASE
>> as a fast way to clone a large database. But I would be OK with that,
>> at least if we can skip the WAL-logging with wal_level=minimal.
> That's true. Sadly, I can't think of a solution that would address both
> use cases at the same time :-(
>
> The only thing I can think of is having two CREATE DATABASE "flavors".
> One keeping the current approach (suitable for fast cloning) and one
> with the WAL logging (minimizing the CREATE DATABASE duration the impact
> on other backends).
>
> It will probably make the code significantly more complex, which is not
> exactly desirable, I guess. Also, if we keep the current code (even if
> only as a special case) it won't eliminate the existing replay hazards
> (which was one of the Tom's arguments for biting the bullet).
>
> I'm also thinking that for wal_level=archive and large databases, this
> won't really eliminate the checkpoint as it will likely generate enough
> WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?
>
> That being said, our CREATE DATABASE docs currently say this
>
> Although it is possible to copy a database other than template1 by
> specifying its name as the template, this is not (yet) intended as
> a general-purpose "COPY DATABASE" facility. The principal
> limitation is that no other sessions can be connected to the
> template database while it is being copied. CREATE DATABASE will
> fail if any other connection exists when it starts; otherwise, new
> connections to the template database are locked out until CREATE
> DATABASE completes. See Section 21.3 for more information.
>
> I think that this limitation pretty much means no one should use CREATE
> DATABASE for cloning live databases in production environment (because
> of the locking).
>
> It also seems to me the "general-purpose COPY DATABASE" described in the
> docs is what we're describing in this thread.
>

Notwithstanding what the docs say, I have seen CREATE DATABASE used
plenty of times, and quite effectively, to clone databases. I don't
think making it do twice the IO in the general case is going to go down
well.

cheers

andrew


From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Andrew Dunstan <andrew(at)dunslane(dot)net>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 23:01:47
Message-ID: 20141027230147.GB2639@awork2.anarazel.de
Lists: pgsql-hackers

On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote:
>
> On 10/27/2014 05:58 PM, Tomas Vondra wrote:
> >On 27.10.2014 17:24, Heikki Linnakangas wrote:
> >I'm also thinking that for wal_level=archive and large databases, this
> >won't really eliminate the checkpoint as it will likely generate enough
> >WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?
> >
> >That being said, our CREATE DATABASE docs currently say this
> >
> > Although it is possible to copy a database other than template1 by
> > specifying its name as the template, this is not (yet) intended as
> > a general-purpose "COPY DATABASE" facility. The principal
> > limitation is that no other sessions can be connected to the
> > template database while it is being copied. CREATE DATABASE will
> > fail if any other connection exists when it starts; otherwise, new
> > connections to the template database are locked out until CREATE
> > DATABASE completes. See Section 21.3 for more information.
> >
> >I think that this limitation pretty much means no one should use CREATE
> >DATABASE for cloning live databases in production environment (because
> >of the locking).
> >
> >It also seems to me the "general-purpose COPY DATABASE" described in the
> >docs is what we're describing in this thread.
> >
>
>
> Notwithstanding what the docs say, I have seen CREATE DATABASE used plenty
> of times, and quite effectively, to clone databases. I don't think making it
> do twice the IO in the general case is going to go down well.

I think they're actually more likely to be happy that we wouldn't need
to do an immediate checkpoint anymore. The performance penalty from that
is likely to be much more severe than the actual I/O.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 23:06:20
Message-ID: 544ECFEC.2000306@dunslane.net
Lists: pgsql-hackers


On 10/27/2014 07:01 PM, Andres Freund wrote:
> On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote:
>> On 10/27/2014 05:58 PM, Tomas Vondra wrote:
>>> On 27.10.2014 17:24, Heikki Linnakangas wrote:
>>> I'm also thinking that for wal_level=archive and large databases, this
>>> won't really eliminate the checkpoint as it will likely generate enough
>>> WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?
>>>
>>> That being said, our CREATE DATABASE docs currently say this
>>>
>>> Although it is possible to copy a database other than template1 by
>>> specifying its name as the template, this is not (yet) intended as
>>> a general-purpose "COPY DATABASE" facility. The principal
>>> limitation is that no other sessions can be connected to the
>>> template database while it is being copied. CREATE DATABASE will
>>> fail if any other connection exists when it starts; otherwise, new
>>> connections to the template database are locked out until CREATE
>>> DATABASE completes. See Section 21.3 for more information.
>>>
>>> I think that this limitation pretty much means no one should use CREATE
>>> DATABASE for cloning live databases in production environment (because
>>> of the locking).
>>>
>>> It also seems to me the "general-purpose COPY DATABASE" described in the
>>> docs is what we're describing in this thread.
>>>
>>
>> Notwithstanding what the docs say, I have seen CREATE DATABASE used plenty
>> of times, and quite effectively, to clone databases. I don't think making it
>> do twice the IO in the general case is going to go down well.
> I think they're actually more likely to be happy that we wouldn't need
> do a immediate checkpoint anymore. The performance penalty from that
> likely to be much more severe than the actual IO.
>

At the very least that needs to be benchmarked.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Andres Freund <andres(at)2ndquadrant(dot)com>
Cc: Andrew Dunstan <andrew(at)dunslane(dot)net>, Tomas Vondra <tv(at)fuzzy(dot)cz>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-27 23:14:40
Message-ID: 32576.1414451680@sss.pgh.pa.us
Lists: pgsql-hackers

Andres Freund <andres(at)2ndquadrant(dot)com> writes:
> On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote:
>> Notwithstanding what the docs say, I have seen CREATE DATABASE used plenty
>> of times, and quite effectively, to clone databases. I don't think making it
>> do twice the IO in the general case is going to go down well.

> I think they're actually more likely to be happy that we wouldn't need
> do a immediate checkpoint anymore. The performance penalty from that
> likely to be much more severe than the actual IO.

Note that currently, CREATE DATABASE requires fsync'ing each file written
into the new database. With the proposed new implementation, we'd write
out that data to the kernel *but not have to fsync it*. Instead, we'd
fsync just the WAL. At least on spinning rust, that could be a
considerable win, for exactly the same reasons that we don't fsync
anything but WAL for ordinary transaction commits (ie, way fewer seeks).
Maybe not by enough to counteract doubling the write volume, but I think
we'd need some benchmarks before concluding that it's completely horrid.
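To illustrate the seek argument (a sketch with hypothetical numbers, not a claim about the actual implementation):

```python
# Hypothetical comparison of fsync counts: the current copy fsyncs every
# file written into the new database (scattered, seek-heavy on spinning
# rust), while a WAL-logged copy would hand the data to the kernel
# unsynced and fsync only the sequentially written WAL.

WAL_SEGMENT_MB = 16

def fsyncs_current(n_relation_files):
    # simplification: one fsync per copied relation file
    return n_relation_files

def fsyncs_wal_logged(template_mb):
    # only the WAL segments carrying the copied data get fsynced
    return -(-template_mb // WAL_SEGMENT_MB)   # ceiling division

# e.g. a 1 GB template spread over 500 relation files
print(fsyncs_current(500))      # 500 scattered fsyncs
print(fsyncs_wal_logged(1024))  # 64 sequential WAL fsyncs
```

Whether the fewer, sequential fsyncs outweigh the doubled write volume is exactly what a benchmark would need to show.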

regards, tom lane


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-28 00:01:24
Message-ID: 544EDCD4.1030005@fuzzy.cz
Lists: pgsql-hackers

On 28.10.2014 00:06, Andrew Dunstan wrote:
>
> On 10/27/2014 07:01 PM, Andres Freund wrote:
>> On 2014-10-27 18:57:27 -0400, Andrew Dunstan wrote:
>>> On 10/27/2014 05:58 PM, Tomas Vondra wrote:
>>>> On 27.10.2014 17:24, Heikki Linnakangas wrote:
>>>> I'm also thinking that for wal_level=archive and large databases, this
>>>> won't really eliminate the checkpoint as it will likely generate enough
>>>> WAL to hit checkpoint_segments and trigger a checkpoint anyway. No?
>>>>
>>>> That being said, our CREATE DATABASE docs currently say this
>>>>
>>>> Although it is possible to copy a database other than template1 by
>>>> specifying its name as the template, this is not (yet) intended as
>>>> a general-purpose "COPY DATABASE" facility. The principal
>>>> limitation is that no other sessions can be connected to the
>>>> template database while it is being copied. CREATE DATABASE will
>>>> fail if any other connection exists when it starts; otherwise, new
>>>> connections to the template database are locked out until CREATE
>>>> DATABASE completes. See Section 21.3 for more information.
>>>>
>>>> I think that this limitation pretty much means no one should use CREATE
>>>> DATABASE for cloning live databases in production environment (because
>>>> of the locking).
>>>>
>>>> It also seems to me the "general-purpose COPY DATABASE" described in
>>>> the
>>>> docs is what we're describing in this thread.
>>>>
>>>
>>> Notwithstanding what the docs say, I have seen CREATE DATABASE used
>>> plenty
>>> of times, and quite effectively, to clone databases. I don't think
>>> making it
>>> do twice the IO in the general case is going to go down well.
>> I think they're actually more likely to be happy that we wouldn't need
>> do a immediate checkpoint anymore. The performance penalty from that
>> likely to be much more severe than the actual IO.
>>
>
>
> At the very least that needs to be benchmarked.

The question is what workload we are going to benchmark. It's unlikely
that either approach will be a clear winner in all cases, so we'll have
to decide which cases are more common / important, or somehow combine
both approaches (thus losing some of the WAL-only benefits).

I'm pretty sure we'll see about three main cases:

(1) read-mostly workloads

Current approach wins, because checkpoint is cheap and the
WAL-based approach results in 2x the I/O.

The difference is proportional to template database size. For
small databases it's negligible, for large databases it's more
significant.

(2) write-heavy workloads / small template database

WAL-based approach wins, because it does not require an explicit
checkpoint, and for small databases the I/O generated by WAL-logging
everything is lower than that of a checkpoint (which is more random).

This is the case of our PostgreSQL clusters.

(3) write-heavy workloads / large template database

Current approach wins, for two reasons: (a) for large databases the
WAL-logging overhead may generate much more I/O than a checkpoint,
and (b) it may generate so many WAL segments it eventually triggers
a checkpoint anyway (even repeatedly).

The exact boundary between the cases really depends on multiple things:

(a) shared_buffers size (the larger the more expensive checkpoint)
(b) read-write activity (more writes => more expensive checkpoint)
(c) hardware (especially how well it handles random I/O)

I'm not sure how to decide which case is more important. I agree that
there are people using CREATE DATABASE to clone databases - maybe not in
production, but e.g. for testing purposes - and it'd be rather
unfortunate to make it considerably slower for them. Not sure how to
balance this :-/
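One way to see the trade-off between the three cases above is a toy cost model (all constants are made up for illustration; real behavior depends on hardware and the factors (a)-(c)):

```python
# Toy cost model: "current" pays a full checkpoint (proportional to the
# dirty portion of shared_buffers, with a penalty for random I/O) plus
# one copy of the template; "WAL-logged" pays two copies of the template
# (data + WAL) but no checkpoint.

RANDOM_IO_PENALTY = 4   # assumption: random writes cost ~4x sequential

def cost_current(template_mb, dirty_buffers_mb):
    return dirty_buffers_mb * RANDOM_IO_PENALTY + template_mb

def cost_wal_logged(template_mb):
    return 2 * template_mb

# case (1): read-mostly, few dirty buffers -> current approach wins
print(cost_current(2048, 64) < cost_wal_logged(2048))   # True
# case (2): write-heavy, small template -> WAL-logged approach wins
print(cost_wal_logged(10) < cost_current(10, 4096))     # True
```

Case (3) adds the repeated-checkpoint effect on top of the doubled copy, which this simple model doesn't capture.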

So maybe we shouldn't cling to the WAL-logging approach too much. Maybe
Heikki's earlier idea - abandon the full checkpoint and instead assume
that once the transaction commits, all the files were fsynced OK - would
work. Of course, this does nothing about the replay hazards.

regards
Tomas


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-28 00:17:27
Message-ID: 1787.1414455447@sss.pgh.pa.us
Lists: pgsql-hackers

Tomas Vondra <tv(at)fuzzy(dot)cz> writes:
> So maybe we shouldn't cling to the WAL-logging approach too much. Maybe
> Heikki's idea from to abandon the full checkpoint and instead assume
> that once the transaction commits, all the files were fsynced OK. Of
> couse, this will do nothing about the replay hazards.

Well, I'm not insisting on any particular method of getting there, but
if we're going to touch this area at all then I think "fix the replay
hazards" should be a non-negotiable requirement. We'd never have accepted
such hazards if CREATE DATABASE were being introduced for the first time;
it's only like this because nobody felt like rewriting a Berkeley-era
kluge.

regards, tom lane


From: David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-28 00:33:41
Message-ID: 1414456421505-5824522.post@n5.nabble.com
Lists: pgsql-hackers

Tomas Vondra wrote
> I mean, when we use database A as a template, why do we need to checkpoint
> B, C, D and F too? (Apologies if this is somehow obvious, I'm way out of
> my comfort zone in this part of the code.)

IIUC you have to checkpoint the whole cluster because it is not possible
to checkpoint individual databases. There is only a single WAL stream,
and while it could carry per-database markers, I don't believe it does,
so there is no way to record separate checkpoint locations for different
databases.

Adding such markers seems to introduce a lot of book-keeping, and both
reload and file-size overhead, for little practical gain.

David J.

--
View this message in context: http://postgresql.1045698.n5.nabble.com/proposal-CREATE-DATABASE-vs-partial-CHECKPOINT-tp5824343p5824522.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.


From: David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-28 00:56:51
Message-ID: 1414457811023-5824526.post@n5.nabble.com
Lists: pgsql-hackers

Tom Lane-2 wrote
> Tomas Vondra <tv(at)fuzzy(dot)cz> writes:
>> So maybe we shouldn't cling to the WAL-logging approach too much. Maybe
>> Heikki's idea from to abandon the full checkpoint and instead assume
>> that once the transaction commits, all the files were fsynced OK. Of
>> couse, this will do nothing about the replay hazards.
>
> Well, I'm not insisting on any particular method of getting there, but
> if we're going to touch this area at all then I think "fix the replay
> hazards" should be a non-negotiable requirement. We'd never have accepted
> such hazards if CREATE DATABASE were being introduced for the first time;
> it's only like this because nobody felt like rewriting a Berkeley-era
> kluge.

Maybe adding a ToDo:

"Fix replay hazards in CREATE DATABASE"

and listing them explicitly would be a good start.

Not sure if WAL or CREATE would be more appropriate but WAL seems like a
better fit.

To the topic at hand: would "CREATE DATABASE name WITH LOGGED = true" work?
As with UNLOGGED tables, giving the user the choice of WAL/fsync/checkpoint
behavior seems reasonable. As Tomas said, there are a couple of diametrically
opposed use-cases here, and it seems like we've already implemented the more
difficult one.

Addressing the replay hazards in the existing implementation could be
considered separately. What I'm not sure about is whether the logged
version in fact already addresses them and, in the interest of safe and
sane defaults, whether we should implement it and make it the default,
having the user specify LOGGED = false to get the old behavior, with
warnings emitted about the replay hazards that could occur should the
database crash during the create.

Is it possible to emit a warning before the command completes (or, rather,
begins)?

David J.

--


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com>, <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-28 11:36:03
Message-ID: 544F7FA3.5090006@vmware.com
Lists: pgsql-hackers

On 10/28/2014 02:56 AM, David G Johnston wrote:
> Tom Lane-2 wrote
>>> So maybe we shouldn't cling to the WAL-logging approach too much. Maybe
>>> Heikki's idea from to abandon the full checkpoint and instead assume
>>> that once the transaction commits, all the files were fsynced OK. Of
>>> couse, this will do nothing about the replay hazards.
>>
>> Well, I'm not insisting on any particular method of getting there, but
>> if we're going to touch this area at all then I think "fix the replay
>> hazards" should be a non-negotiable requirement. We'd never have accepted
>> such hazards if CREATE DATABASE were being introduced for the first time;
>> it's only like this because nobody felt like rewriting a Berkeley-era
>> kluge.
>
> Maybe adding a ToDo:
>
> "Fix replay hazards in CREATE DATABASE"
>
> and listing them explicitly would be a good start.
>
> Not sure if WAL or CREATE would be more appropriate but WAL seems like a
> better fit.

My opinion is that we should do the WAL-logging, unless
wal_level=minimal, in which case we just fsync everything before commit.
That would hurt people who are currently using CREATE DATABASE with a
large template database, with WAL archiving enabled, but for everyone
else it would be a win or not noticeable.

If that's not acceptable, perhaps we could assume that a database with
datallowconn=false can be copied without WAL-logging, i.e. assume that a
database that doesn't allow connections won't be modified. Of course,
that's a shaky assumption, because someone might just switch datallowconn
back on after the CREATE DATABASE. Maybe we could force a checkpoint when
you do that, instead..

> To the topic at hand would "CREATE DATABASE name WITH LOGGED = true" work?
> As with UNLOGGED tables giving the user the choice of WAL/fsync/checkpoint
> behavior seems reasonable. As Thomas said there a couple of diametrically
> opposed use-cases here and it seems like we've already implemented the more
> difficult one.

Yeah, that would be another way to provide an escape hatch if someone
absolutely needs to avoid the extra I/O. I wouldn't like to maintain two
different ways of doing the same thing, though. It would be a feature
that would be used rarely, so even if we get it right, subtle bugs could
easily creep into it over the years.

- Heikki


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-29 17:31:27
Message-ID: CA+TgmobAzgsoGcu1Kf+zzSnd0oS+xLicP+UtteMAFVS_0Wur9w@mail.gmail.com
Lists: pgsql-hackers

On Mon, Oct 27, 2014 at 8:01 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
> (3) write-heavy workloads / large template database
>
> Current approach wins, for two reasons: (a) for large databases the
> WAL-logging overhead may generate much more I/O than a checkpoint,
> and (b) it may generate so many WAL segments it eventually triggers
> a checkpoint anyway (even repeatedly).

I would tend not to worry too much about this case. I'm skeptical
that there are a lot of people using large template databases. But if
there are, or if some particular one of those people hits this
problem, then they can raise checkpoint_segments to avoid it. The
reverse problem, which you are encountering, cannot be fixed by
adjusting settings.

(This reminds me, yet again, that it would be really nice to have something
smarter than checkpoint_segments. If there is little WAL activity
between one checkpoint and the next, we should reduce the number of
segments we're keeping around to free up disk space and ensure that
we're recycling a file new enough that it's likely to still be in
cache. Recycling files long-since evicted from cache is poor. But
then we should also let the number of WAL files ratchet back up if the
system again becomes busy. Isn't this more or less what Heikki's
soft-WAL-limit patch did? Why did we reject that, again?)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Tomas Vondra <tv(at)fuzzy(dot)cz>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-29 18:36:07
Message-ID: 54513397.8060509@fuzzy.cz
Lists: pgsql-hackers

On 29.10.2014 18:31, Robert Haas wrote:
> On Mon, Oct 27, 2014 at 8:01 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> (3) write-heavy workloads / large template database
>>
>> Current approach wins, for two reasons: (a) for large databases the
>> WAL-logging overhead may generate much more I/O than a checkpoint,
>> and (b) it may generate so many WAL segments it eventually triggers
>> a checkpoint anyway (even repeatedly).
>
> I would tend not to worry too much about this case. I'm skeptical
> that there are a lot of people using large template databases. But
> if there are, or if some particular one of those people hits this
> problem, then they can raise checkpoint_segments to avoid it. The
> reverse problem, which you are encountering, cannot be fixed by
> adjusting settings.

That however solves "only" the checkpoint, not the doubled I/O from
writing both the files and the WAL, no? But maybe that's OK.

Also, all this is a concern only with 'wal_level != minimal', and ISTM
"it's fast with wal_level=minimal" is a rather poor argument.

>
> (This reminds me, yet again, that it would be really nice to something
> smarter than checkpoint_segments. If there is little WAL activity
> between one checkpoint and the next, we should reduce the number of
> segments we're keeping around to free up disk space and ensure that
> we're recycling a file new enough that it's likely to still be in
> cache. Recycling files long-since evicted from cache is poor. But
> then we should also let the number of WAL files ratchet back up if the
> system again becomes busy. Isn't this more or less what Heikki's
> soft-WAL-limit patch did? Why did we reject that, again?)

What about simply reusing the files in a different way? Instead of
looping through the files in a round-robin manner, couldn't we just
reuse the most recently used file, instead of going all the way back to
the first one? This won't free the disk space, but IMHO that's not a
problem, because no one is going to use that space anyway (it would be a
risk once all the segments are in use again).

Tomas


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-30 18:51:54
Message-ID: CA+TgmobUTo6h8wA5-A8jiXeDtXj8tEmGysFvJ98bSjSn5kvQvQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Wed, Oct 29, 2014 at 2:36 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>> I would tend not to worry too much about this case. I'm skeptical
>> that there are a lot of people using large template databases. But
>> if there are, or if some particular one of those people hits this
>> problem, then they can raise checkpoint_segments to avoid it. The
>> reverse problem, which you are encountering, cannot be fixed by
>> adjusting settings.
>
> That, however, solves only the checkpoint problem, not the doubled I/O
> from writing both the data files and the WAL, no? But maybe that's OK.

I mean, it's not unimaginable that it's going to hurt somebody, but
the current situation is pretty bad too. You don't have to be the
world's foremost PostgreSQL performance expert to know that extra
checkpoints are really bad for performance. Write volume is of course
also a problem, but I bet there are a lot more people using small
template databases (where the write volume isn't really an issue,
because as Heikki points out the checkpoint wastes half a segment
anyway, but the checkpoint may very well be an issue) than large ones
(where either could be an issue).

> Also, all this is a concern only with 'wal_level != minimal', and ISTM
> "it's fast with wal_level=minimal" is a rather poor argument.

I think the argument here is more that there's no such thing as a free
lunch. If someone wants to come up with a way to make this work
without WAL-logging every block -AND- without triggering a checkpoint,
great. If that works well, perhaps we can apply it to other cases
like ALTER TABLE .. SET TABLESPACE which are currently quite painful
and which seem to me to have more or less the same problem. But I
don't really know why we should expect that such a solution exists at
all or is easy to engineer correctly. All we're doing here is
applying the same solution that's been found to be robust in other
situations to some old, crufty code that isn't.

>> (This reminds me, yet again, that it would be really nice to have something
>> smarter than checkpoint_segments. If there is little WAL activity
>> between one checkpoint and the next, we should reduce the number of
>> segments we're keeping around to free up disk space and ensure that
>> we're recycling a file new enough that it's likely to still be in
>> cache. Recycling files long-since evicted from cache is poor. But
>> then we should also let the number of WAL files ratchet back up if the
>> system again becomes busy. Isn't this more or less what Heikki's
>> soft-WAL-limit patch did? Why did we reject that, again?)
>
> What about simply reusing the files in a different way? Instead of
> looping through the files in a round-robin manner, couldn't we just
> reuse the most recently used file, instead of going all the way back to
> the first one? This won't free the disk space, but IMHO that's not a
> problem, because no one is going to use that space anyway (freeing it
> would be a risk once all the segments are needed again).

Hmm, interesting idea. I have no idea whether that would work out to
a win or not.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-30 18:56:43
Message-ID: 20141030185643.GA28295@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-10-30 14:51:54 -0400, Robert Haas wrote:
> On Wed, Oct 29, 2014 at 2:36 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
> >> I would tend not to worry too much about this case. I'm skeptical
> >> that there are a lot of people using large template databases. But
> >> if there are, or if some particular one of those people hits this
> >> problem, then they can raise checkpoint_segments to avoid it. The
> >> reverse problem, which you are encountering, cannot be fixed by
> >> adjusting settings.
> >
> > That, however, solves only the checkpoint problem, not the doubled I/O
> > from writing both the data files and the WAL, no? But maybe that's OK.
>
> I mean, it's not unimaginable that it's going to hurt somebody, but
> the current situation is pretty bad too. You don't have to be the
> world's foremost PostgreSQL performance expert to know that extra
> checkpoints are really bad for performance. Write volume is of course
> also a problem, but I bet there are a lot more people using small
> template databases (where the write volume isn't really an issue,
> because as Heikki points out the checkpoint wastes half a segment
> anyway, but the checkpoint may very well be an issue) than large ones
> (where either could be an issue).

Agreed. The current behaviour is a pretty ugly hack that has simply
failed to fail so far. I actually think we should *always* use the new
code and not add a separate wal_level=minimal branch. Maintaining this
twice just isn't worth the effort; minimal is used *far* less these days.

Greetings,

Andres Freund


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-30 19:10:43
Message-ID: 54528D33.1000702@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/30/2014 08:51 PM, Robert Haas wrote:
> On Wed, Oct 29, 2014 at 2:36 PM, Tomas Vondra <tv(at)fuzzy(dot)cz> wrote:
>>> I would tend not to worry too much about this case. I'm skeptical
>>> that there are a lot of people using large template databases. But
>>> if there are, or if some particular one of those people hits this
>>> problem, then they can raise checkpoint_segments to avoid it. The
>>> reverse problem, which you are encountering, cannot be fixed by
>>> adjusting settings.
>>
>> That, however, solves only the checkpoint problem, not the doubled I/O
>> from writing both the data files and the WAL, no? But maybe that's OK.
>
> I mean, it's not unimaginable that it's going to hurt somebody, but
> the current situation is pretty bad too. You don't have to be the
> world's foremost PostgreSQL performance expert to know that extra
> checkpoints are really bad for performance. Write volume is of course
> also a problem, but I bet there are a lot more people using small
> template databases (where the write volume isn't really an issue,
> because as Heikki points out the checkpoint wastes half a segment
> anyway, but the checkpoint may very well be an issue) than large ones
> (where either could be an issue).

Nitpick: I didn't say that a checkpoint wastes half a segment. An xlog
switch does, but a checkpoint doesn't automatically cause an xlog switch.

But I agree with the sentiment in general.

- Heikki


From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-30 19:13:41
Message-ID: 54528DE5.90804@vmware.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/30/2014 08:56 PM, Andres Freund wrote:
> I actually think we should *always* use the new code and not
> add a separate wal_level=minimal branch. Maintaining this twice just
> isn't worth the effort. minimal is used *far* less these days.

I wouldn't go that far. Doing the wal_level=minimal optimization should
be pretty straightforward. Note that it would be implemented more like
CREATE INDEX et al with wal_level=minimal, not the way CREATE DATABASE
currently works. It would not involve any extra checkpoints.

- Heikki


From: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Tomas Vondra <tv(at)fuzzy(dot)cz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-30 23:06:02
Message-ID: 5452C45A.1060607@BlueTreble.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 10/30/14, 2:13 PM, Heikki Linnakangas wrote:
> On 10/30/2014 08:56 PM, Andres Freund wrote:
>> I actually think we should *always* use the new code and not
>> add a separate wal_level=minimal branch. Maintaining this twice just
>> isn't worth the effort. minimal is used *far* less these days.
>
> I wouldn't go that far. Doing the wal_level=minimal optimization should be pretty straightforward. Note that it would be implemented more like CREATE INDEX et al with wal_level=minimal, not the way CREATE DATABASE currently works. It would not involve any extra checkpoints.

+1

At my previous job, we used createdb -T copy_from_production new_dev_database, because that was far faster than re-loading the raw SQL dump all the time. It'd be a shame for that to have to write the copied data 2x. IIRC that database was around 20MB.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com


From: Andres Freund <andres(at)anarazel(dot)de>
To: Jim Nasby <Jim(dot)Nasby(at)BlueTreble(dot)com>
Cc: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Tomas Vondra <tv(at)fuzzy(dot)cz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: proposal: CREATE DATABASE vs. (partial) CHECKPOINT
Date: 2014-10-30 23:34:52
Message-ID: 20141030233452.GM17724@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 2014-10-30 18:06:02 -0500, Jim Nasby wrote:
> On 10/30/14, 2:13 PM, Heikki Linnakangas wrote:
> >On 10/30/2014 08:56 PM, Andres Freund wrote:
> >>I actually think we should *always* use the new code and not
> >>add a separate wal_level=minimal branch. Maintaining this twice just
> >>isn't worth the effort. minimal is used *far* less these days.
> >
> >I wouldn't go that far. Doing the wal_level=minimal optimization should be pretty straightforward. Note that it would be implemented more like CREATE INDEX et al with wal_level=minimal, not the way CREATE DATABASE currently works. It would not involve any extra checkpoints.

It's probably not that hard, I agree. IMO it's up to the person doing
the conversion; we shouldn't require them to develop both versions, but
if they're interested in doing it, fine with me.

> At my previous job, we used createdb -T copy_from_production new_dev_database, because that was far faster than re-loading the raw SQL dump all the time. It'd be a shame to have that need to write the copied data 2x. IIRC that database was around 20MB.

At that size, not doing two immediate checkpoints will still be an
order of magnitude or so bigger a win.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services