Re: [PATCH] pg_upgrade: support for btrfs copy-on-write clones

From: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To: Oskari Saarenmaa <os(at)ohmu(dot)fi>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Larry Rosenman <ler(at)lerctr(dot)org>, Josh Berkus <josh(at)agliodbs(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: [PATCH] pg_upgrade: support for btrfs copy-on-write clones
Date: 2013-11-15 08:40:20
Message-ID: 5285DDF4.9050402@vmware.com
Lists: pgsql-hackers

On 05.10.2013 16:57, Oskari Saarenmaa wrote:
> On 05.10.2013 16:38, Bruce Momjian wrote:
>> On Fri, Oct 4, 2013 at 10:42:46PM +0300, Oskari Saarenmaa wrote:
>>> Thanks for the offers, but it looks like ZFS doesn't actually implement
>>> a similar file level clone operation. See
>>> https://github.com/zfsonlinux/zfs/issues/405 for discussion on a feature
>>> request for it.
>>>
>>> ZFS does support cloning entire datasets which seem to be similar to
>>> btrfs subvolume snapshots and could be used to set up a new data
>>> directory for a new $PGDATA. This would require the original $PGDATA
>>> to be a dataset/subvolume of its own and quite a bit different logic
>>> (than just another file copy method in pg_upgrade) to initialize the new
>>> version's $PGDATA as a snapshot/clone of the original. The way this
>>> would work is that the original $PGDATA dataset/subvolume gets cloned to
>>> a new location after which we move the files out of the way of the new
>>> PG installation and run pg_upgrade in link mode. I'm not sure if
>>> there's a good way to integrate this into pg_upgrade or if it's just
>>> something that could be documented as a fast way to run pg_upgrade
>>> without touching original files.
>>>
>>> With btrfs tooling the sequence would be something like this:
>>>
>>> btrfs subvolume snapshot /srv/pg92 /srv/pg93
>>> mv /srv/pg93/data /srv/pg93/data92
>>> initdb /srv/pg93/data
>>> pg_upgrade --link \
>>>     --old-datadir=/srv/pg93/data92 \
>>>     --new-datadir=/srv/pg93/data
>>
>> Does btrfs support file system snapshots? If so, shouldn't people just
>> take a snapshot of the old data directory before the upgrade, rather
>> than using cloning?
>
> Yeah, it's possible to clone an existing subvolume, but this requires
> that $PGDATA is a subvolume of its own and would be a bit difficult to
> integrate into existing pg_upgrade scripts.
>
> The BTRFS_IOC_CLONE ioctl operates on file level and can be used to
> clone files anywhere in a btrfs filesystem.

Hmm, you can also do

cp --reflink -r data92 data-tmp
pg_upgrade --link --old-datadir=data-tmp --new-datadir=data93
rm -rf data-tmp

That BTRFS_IOC_CLONE ioctl seems so hacky that I'd rather not get that
in our source tree. cp --reflink is much more likely to get that magic
incantation right, since it gets a lot more attention and testing than
pg_upgrade.
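
As a rough sketch of the clone step, assuming GNU coreutils cp: the
--reflink=auto form asks for a copy-on-write clone where the filesystem
supports it and falls back to an ordinary copy otherwise. The function
name clone_datadir is hypothetical and only for illustration; it is not
part of pg_upgrade.

```shell
#!/bin/sh
# Hypothetical helper (not part of pg_upgrade): clone a data directory,
# using copy-on-write where the filesystem supports it.
# Assumes GNU cp; --reflink=auto falls back to a regular copy on
# filesystems without clone support.
clone_datadir() {
    src=$1
    dst=$2
    if [ ! -d "$src" ]; then
        echo "clone_datadir: source '$src' is not a directory" >&2
        return 1
    fi
    if [ -e "$dst" ]; then
        echo "clone_datadir: destination '$dst' already exists" >&2
        return 1
    fi
    # -a preserves ownership, permissions and timestamps recursively.
    cp -a --reflink=auto -- "$src" "$dst"
}
```

The resulting clone could then be handed to pg_upgrade --link as the old
data directory and removed once the upgrade has finished, leaving the
original files untouched.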

I'm not in favor of adding filesystem-specific tricks into pg_upgrade.
It would be nice to list these tricks in the docs, though.

- Heikki
