Re: Proposal: Incremental Backup

From: Claudio Freire <klaussfreire(at)gmail(dot)com>
To: Marco Nenciarini <marco(dot)nenciarini(at)2ndquadrant(dot)it>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Proposal: Incremental Backup
Date: 2014-07-29 16:35:46
Message-ID: CAGTBQpbhaPdKaWG_MHmzA3vawN1zfYFU7KMu-MUq8UR9EDTDTg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Jul 29, 2014 at 1:24 PM, Marco Nenciarini
<marco(dot)nenciarini(at)2ndquadrant(dot)it> wrote:
>> On Fri, Jul 25, 2014 at 10:14 AM, Marco Nenciarini
>> <marco(dot)nenciarini(at)2ndquadrant(dot)it> wrote:
>>> 1. Proposal
>>> =================================
>>> Our proposal is to introduce the concept of a backup profile. The backup
>>> profile consists of a file with one line per file detailing tablespace,
>>> path, modification time, size and checksum.
>>> Using that file the BASE_BACKUP command can decide which file needs to
>>> be sent again and which is not changed. The algorithm should be very
>>> similar to rsync, but since our files are never bigger than 1 GB per
>>> file that is probably granular enough not to worry about copying parts
>>> of files, just whole files.
>>
>> That wouldn't nearly as useful as the LSN-based approach mentioned before.
>>
>> I've had my share of rsyncing live databases (when resizing
>> filesystems, not for backup, but the anecdotal evidence applies
>> anyhow) and with moderately write-heavy databases, even if you only
>> modify a tiny portion of the records, you end up modifying a huge
>> portion of the segments, because the free space choice is random.
>>
>> There have been patches going around to change the random nature of
>> that choice, but none are very likely to make a huge difference for
>> this application. In essence, file-level comparisons get you only a
>> mild speed-up, and are not worth the effort.
>>
>> I'd go for the hybrid file+lsn method, or nothing. The hybrid avoids
>> the I/O of inspecting the LSN of entire segments (necessary
>> optimization for huge multi-TB databases) and backups only the
>> portions modified when segments do contain changes, so it's the best
>> of both worlds. Any partial implementation would either require lots
>> of I/O (LSN only) or save very little (file only) unless it's an
>> almost read-only database.
>>
>
> From my experience, if a database is big enough and there is any kind of
> historical data in the database, the "file only" approach works well.
> Moreover it has the advantage of being simple and easily verifiable.

I don't see how that would be true if it's not full of read-only or
append-only tables.

Furthermore, even in that case, you need to have the database locked
while performing the file-level backup, and computing all the
checksums means processing the whole thing. That's a huge amount of
time to be locked for multi-TB databases, so how is that good enough?

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Marco Nenciarini 2014-07-29 16:35:51 Re: Proposal: Incremental Backup
Previous Message Robert Haas 2014-07-29 16:35:35 Re: Re: [GENERAL] pg_dump behaves differently for different archive formats