Orphaned files in base/[oid]
From: Chris Travers <chris(dot)travers(at)adjust(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Orphaned files in base/[oid]
Date: 2017-08-14 12:12:22
Message-ID: CAN-RpxDBA7HbTsJPq4t4VznmRFJkssP2SNEMuG=oNJ+=sxLQew@mail.gmail.com
Lists: pgsql-hackers

Hi all;

I am trying to track down a problem we are seeing that looks very similar
to bug #12050, and I would certainly consider trying to contribute a fix if
we can agree on one. (I am not sure we can, so absent that, the next question
is whether it makes sense to create a utility that fixes the problem when it
comes up, so that a dump/restore is not needed.)

The system:
PostgreSQL 9.6.3
Gentoo Linux.

Problem:
The system this came up on is PostgreSQL 9.6.3 and has had repeated trouble
with disk space. Querying pg_database_size, as well as running du on the
subdirectory of base/, shows total usage to be around 3.8 TB. Summing up the
sizes of the relations in pg_class, though, gives around 2.1 TB.

Initial troubleshooting found around 150 GB of space in pg_temp which had
never been cleared and was at least several days old. Restarting the
server cleared these up.

Poking around the base/[oid] directory, I found a large number of files
which did not correspond to a pg_class entry. One of the apparent
relations was nearly 1 TB in size.
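
(For anyone who wants to run the same check, a rough sketch follows; the
database name is a placeholder, pg_ls_dir requires superuser, and the
orphan listing is only indicative, since it races against concurrent DDL:)

psql -d mydb <<'SQL'
-- approximate: main forks only, so it understates the real relation total
SELECT pg_size_pretty(pg_database_size(current_database())) AS db_size,
       pg_size_pretty(sum(pg_relation_size(oid))::bigint) AS relation_total
FROM pg_class;

-- all-digit file names in base/<dboid> that no pg_class row accounts for
SELECT f AS orphan_candidate
FROM pg_ls_dir('base/' || (SELECT oid FROM pg_database
                           WHERE datname = current_database())::text) AS f
WHERE f ~ '^[0-9]+$'
  AND f::oid NOT IN (SELECT relfilenode FROM pg_class)
  AND f::oid NOT IN (SELECT oid FROM pg_class);
SQL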

What I think happened:
I think various pg_temp/* and orphaned relation files (in base/[oid]) were
created when PostgreSQL crashed due to running out of space in various
operations, including creating materialised views.

So my question is: is there a way we can safely clean these up on server
restart? If not, does it make sense to try to create a utility that can
connect to PostgreSQL, seek out valid files, and delete the rest?

--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


From: Chris Travers <chris(dot)travers(at)adjust(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Orphaned files in base/[oid]
Date: 2017-08-14 16:28:48
Message-ID: CAN-RpxBRXdwzzuUHgp8ORR8yy4_BSUZYVz65xQREA5=SUG15Tw@mail.gmail.com
Lists: pgsql-hackers

On Aug 14, 2017 14:12, "Chris Travers" <chris(dot)travers(at)adjust(dot)com> wrote:

Hi all;

[...]

So my question is: is there a way we can safely clean these up on server
restart? If not, does it make sense to try to create a utility that can
connect to PostgreSQL, seek out valid files, and delete the rest?

OK, I have identified one case where the symptoms I am seeing can be
reproduced. I am currently working on a Mac, so there may be quirks in my
repro. However....

When the WAL writer runs out of disk space, no cleanup is done.

So I will be looking at possible solutions next.

--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


From: Andres Freund <andres(at)anarazel(dot)de>
To: Chris Travers <chris(dot)travers(at)adjust(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Orphaned files in base/[oid]
Date: 2017-08-14 16:33:48
Message-ID: 20170814163348.u4562wuiy34idfrx@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2017-08-14 14:12:22 +0200, Chris Travers wrote:
> [...]
>
> So my question is if there is a way we can safely clean these up on server
> restart? If not does it make sense to try to create a utility that can
> connect to PostgreSQL, seek out valid files, and delete the rest?

I think the fix here is to call RemovePgTempFiles() during
crash-restarts, instead of just full starts. The previously stated need
to be able to inspect temp files after a crash can be less impactfully
fulfilled with restart_after_crash = false.
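
(For reference, that workaround is just a GUC change; a sketch, with the
data directory path assumed:)

# With restart_after_crash = off the postmaster exits instead of
# auto-restarting after a backend crash, so temp files can be inspected
# before a manual full restart cleans them up.
echo "restart_after_crash = off" >> "$PGDATA/postgresql.conf"
pg_ctl -D "$PGDATA" reload   # the setting is reloadable (sighup context)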

Greetings,

Andres Freund


From: Chris Travers <chris(dot)travers(at)adjust(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Orphaned files in base/[oid]
Date: 2017-08-14 18:18:50
Message-ID: CAN-RpxAshr3Nmsht7Bva6cgBXJdFg6amXeBetZ9L4hVq=-y+vQ@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 14, 2017 at 6:33 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

> Hi,
>
> On 2017-08-14 14:12:22 +0200, Chris Travers wrote:
> > [...]
>
> I think the fix here is to call RemovePgTempFiles() during
> crash-restarts, instead of just full starts. The previously stated need
> to be able to inspect temp files after a crash can be less impactfully
> fulfilled with restart_after_crash = false.
>
But that only clears temp files, right?

I am less concerned about the temp files because a restart clears them.

The bigger issue I see is with the orphaned base files. It looks like
files in base/[oid] don't get cleaned up either, if I read my output
correctly, and that would explain why we saw 1.7 TB of discrepancy between
the relations and the database size. Safety-wise it seems like the best way
out of that is a dump/restore, but doing that with a 2.1 TB database is
annoying.

> Greetings,
>
> Andres Freund
>

--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Chris Travers <chris(dot)travers(at)adjust(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Orphaned files in base/[oid]
Date: 2017-08-14 18:40:46
Message-ID: 29046.1502736046@sss.pgh.pa.us
Lists: pgsql-hackers

Chris Travers <chris(dot)travers(at)adjust(dot)com> writes:
> On Mon, Aug 14, 2017 at 6:33 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
>> I think the fix here is to call RemovePgTempFiles() during
>> crash-restarts, instead of just full starts. The previously stated need
>> to be able to inspect temp files after a crash can be less impactfully
>> fulfilled with restart_after_crash = false.

> But that only clears temp files, right?
> I am less concerned about the temp files because a restart clears them.

It will clear temp files, and also temp tables.

> The bigger issue I see are with the orphaned base files.

It would be possible to have orphaned non-temp tables if you'd suffered
a crash during the transaction that created those tables. Ordinarily
a newly-created table file wouldn't be that large, but if your workflow
created tables and shoved boatloads of data into them in the same
transaction, it's not so hard to see this becoming an issue.
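
(A minimal sketch of exactly that scenario; the table definition, sizes,
and using an immediate-mode stop to stand in for a real crash are all
invented for illustration:)

psql -d mydb <<'SQL' &
BEGIN;
CREATE TABLE orphan_demo AS
  SELECT g, repeat('x', 1000) AS filler
  FROM generate_series(1, 5000000) AS g;
SELECT pg_sleep(600);   -- hold the transaction open; never COMMIT
SQL
sleep 30                                # let the relation file grow
pg_ctl -D "$PGDATA" stop -m immediate   # simulated crash
pg_ctl -D "$PGDATA" start               # crash recovery runs

After recovery, the file for orphan_demo's relfilenode is still present in
base/[oid], but no committed pg_class row references it.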

The core problem with zapping non-temp table files is that you can't
do that unless you're sure you have consistent, up-to-date pg_class
data that nobody else is busy adding to. It's hard to see an external
application being able to do that safely. You certainly can't do it
at the point in the postmaster startup cycle where we currently do
the other things --- for those, we rely only on filesystem naming
conventions to identify what to zap.

regards, tom lane


From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Chris Travers <chris(dot)travers(at)adjust(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Orphaned files in base/[oid]
Date: 2017-08-14 18:56:32
Message-ID: 20170814185632.zodm5qykgss7ud32@alap3.anarazel.de
Lists: pgsql-hackers

On 2017-08-14 14:40:46 -0400, Tom Lane wrote:
> The core problem with zapping non-temp table files is that you can't
> do that unless you're sure you have consistent, up-to-date pg_class
> data that nobody else is busy adding to. It's hard to see an external
> application being able to do that safely. You certainly can't do it
> at the point in the postmaster startup cycle where we currently do
> the other things --- for those, we rely only on filesystem naming
> conventions to identify what to zap.

I think there are some possibilities to close the gap here. We could
e.g. have <relfilenode>.delete_on_crash marker files that get installed
when creating a new persistent relfilenode. If we set up things so they
get deleted post commit, but inside the critical section, we could rely
on them being present in case of crash, but consistently removed during
WAL replay. At the end of recovery, iterate over the whole datadir and
nuke all relations with marker files present.

I first thought that'd cost an additional fsync per relation
created. But I think we actually can delay that to a pre-commit phase,
if we have XLOG_SMGR_CREATE create those markers via a flag, and fsync
them just before checkpoint (via the usual delayed fsync mechanism).
That'd still require an XLogFlush(), but that seems hard to avoid unless
we just don't create relations on FS level until buffers are
evicted and/or BufferSync().

Alternatively we could do something without marker files, with some
added complexity: Keep track of all "uncommitted new files" in memory,
and log them every checkpoint. Commit/abort records clear elements of
that list. Since we always start replay at the beginning of a
checkpoint, we'd always reach a moment with such an up2date list of
pending-action files before reaching end-of-recovery. At end-of-recovery
we can delete all unconfirmed files. To avoid out-of-memory due to too
many tracked relations, we'd possibly still have to have marker files...

Regards,

Andres


From: Chris Travers <chris(dot)travers(at)adjust(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Orphaned files in base/[oid]
Date: 2017-08-14 19:27:26
Message-ID: CAN-RpxCtEQjzdfopZiQvo2bE1=OYCReenFjXjXupXaaxz4Xk3g@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 14, 2017 at 8:40 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

>
>
> It would be possible to have orphaned non-temp tables if you'd suffered
> a crash during the transaction that created those tables. Ordinarily
> a newly-created table file wouldn't be that large, but if your workflow
> created tables and shoved boatloads of data into them in the same
> transaction, it's not so hard to see this becoming an issue.
>

I think the working theory is that these were very likely a number of very
large (multi-hundred-GB) materialised views.

>
> The core problem with zapping non-temp table files is that you can't
> do that unless you're sure you have consistent, up-to-date pg_class
> data that nobody else is busy adding to. It's hard to see an external
> application being able to do that safely. You certainly can't do it
> at the point in the postmaster startup cycle where we currently do
> the other things --- for those, we rely only on filesystem naming
> conventions to identify what to zap.

Yeah, that occurred to me. At this point I would settle for something I
could run with Postgres in single-user mode, although that is very far from
ideal. So what I wonder is whether at least a short-term solution might be
a utility that starts Postgres in single-user mode, where we insist that
PostgreSQL is otherwise not running before the run.

I am certainly not feeling qualified at present to attempt the more
advanced solutions, but that much I might be able to do.

>
> regards, tom lane
>

--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


From: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Chris Travers <chris(dot)travers(at)adjust(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Orphaned files in base/[oid]
Date: 2017-08-15 02:57:42
Message-ID: CAB7nPqT3__WF0BN5hZYKVch4HrvOeTzi+9B7c33kqZnGTbcDNw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 15, 2017 at 3:56 AM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> I think there are some possibilities to close the gap here. We could
> e.g. have <relfilenode>.delete_on_crash marker files that get installed
> when creating a new persistent relfilenode. If we set up things so they
> get deleted post commit, but inside the critical section, we could rely
> on them being present in case of crash, but consistently removed during
> WAL replay. At the end of recovery, iterate over the whole datadir and
> nuke all relations with marker files present.

I agree that an approach like that has value for the problem defined
in this thread.

> I first thought that'd cost an additional fsync per relation
> created. But I think we actually can delay that to a pre-commit phase,
> if we have XLOG_SMGR_CREATE create those markers via a flag, and fsync
> them just before checkpoint (via the usual delayed fsync mechanism).
> That'd still require an XLogFlush(), but that seems hard to avoid unless
> we just don't create relations on FS level until buffers are
> evicted and/or BufferSync().

Yeah, that should work as well.
--
Michael


From: Chris Travers <chris(dot)travers(at)adjust(dot)com>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Orphaned files in base/[oid]
Date: 2017-08-15 06:24:37
Message-ID: CAN-RpxBNY5ra19Z2YLQ=v3wA17Tmed9f9_=yaXBAua_CxsGkhg@mail.gmail.com
Lists: pgsql-hackers

There's another side to this: I am not sure it is a backend crash.

Here is what I did to reproduce:

Two virtual disk images: 100 MB for main data, 40 MB for WAL. work_mem set
to 256 MB. The idea is to test different out-of-space conditions.
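
(For the record, one way to set up such an environment; this sketch uses
macOS disk images, per the earlier message, and the hdiutil details and
mount points are assumptions:)

hdiutil create -size 100m -fs HFS+ -volname pgdata /tmp/pgdata.dmg
hdiutil create -size 40m  -fs HFS+ -volname pgwal  /tmp/pgwal.dmg
hdiutil attach /tmp/pgdata.dmg    # mounts at /Volumes/pgdata
hdiutil attach /tmp/pgwal.dmg     # mounts at /Volumes/pgwal
initdb -D /Volumes/pgdata/data
# put pg_xlog (the 9.6 WAL directory) on the small volume so WAL can run
# out of space independently of the main data volume
mv /Volumes/pgdata/data/pg_xlog /Volumes/pgwal/pg_xlog
ln -s /Volumes/pgwal/pg_xlog /Volumes/pgdata/data/pg_xlog
pg_ctl -D /Volumes/pgdata/data start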

CREATE TABLE ... AS ...; DROP TABLE ...;
SELECT pg_size_pretty(pg_current_xlog_location() - '0/0');
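
(Spelled out as a runnable loop; a sketch only, with the table shape and
row count invented:)

while true; do
  psql -d test <<'SQL'
CREATE TABLE fill AS
  SELECT g, repeat('x', 1000) AS filler
  FROM generate_series(1, 1000000) AS g;
DROP TABLE fill;
-- bytes of WAL generated so far, pretty-printed
SELECT pg_size_pretty(pg_current_xlog_location() - '0/0');
SQL
done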

I played around with parameters to determine how different kinds of
out-of-space errors were handled.

1. Running out of temp space was cleaned up without a server restart being
needed.
2. A relation running out of disk space *seemed* to get cleaned up.
3. Running out of WAL space left *both* temp and non-temp files.

I wonder about a different solution. Would it be possible to special case
vacuum to check for and remove (or just move to where they can be removed)
files when vacuuming pg_class? At the point we are vacuuming pg_class, we
ought to be able to know that a relfilenode shouldn't be used anymore,
right?

--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Chris Travers <chris(dot)travers(at)adjust(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Orphaned files in base/[oid]
Date: 2017-08-15 13:32:30
Message-ID: 25537.1502803950@sss.pgh.pa.us
Lists: pgsql-hackers

Chris Travers <chris(dot)travers(at)adjust(dot)com> writes:
> I wonder about a different solution. Would it be possible to special case
> vacuum to check for and remove (or just move to where they can be removed)
> files when vacuuming pg_class? At the point we are vacuuming pg_class, we
> ought to be able to know that a relfilenode shouldn't be used anymore,
> right?

I don't think so. It's not clear to me whether you have in mind "scan
pg_class, collect relfilenodes from all live tuples, then zap all files
not in that set" or "when removing a dead tuple, zap the relfilenode
it mentions". But neither one works. The first case has a race condition
against new pg_class entries. As for the second, the existence of a dead
tuple bearing relfilenode N isn't evidence that some other live tuple
can't have relfilenode N.

Another problem for the second solution is that in the case you're worried
about (ie, PANIC due to out-of-WAL-space during relation's creating
transaction), there's no very good reason to expect that the relation's
pg_class tuple ever made it to disk at all.

A traditional low-tech answer to this has been to keep the WAL on a
separate volume from the main data store, so that it's protected from
out-of-space conditions in the main store and temp areas. The space
needs for WAL proper are generally much more predictable than the main
store, so it's easier to keep the dedicated space from overflowing.
(Stalled replication/archiving processes can be hazardous to your
health in this scenario, though, if they prevent prompt recycling of
WAL files.)
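
(That low-tech answer in shell form, for completeness; a sketch, where the
mount point is an assumption and the directory is pg_xlog in 9.6 rather
than the later pg_wal:)

pg_ctl -D "$PGDATA" stop
mv "$PGDATA/pg_xlog" /mnt/wal/pg_xlog
ln -s /mnt/wal/pg_xlog "$PGDATA/pg_xlog"
pg_ctl -D "$PGDATA" start

New clusters can get the same layout from the start with
initdb --xlogdir=/mnt/wal/pg_xlog.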

regards, tom lane


From: Chris Travers <chris(dot)travers(at)adjust(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Orphaned files in base/[oid]
Date: 2017-08-15 14:03:41
Message-ID: CAN-RpxAAjKwCMb6iQ7uvOReZOXqTzPs3jT_3zo1xmWXBerWM7g@mail.gmail.com
Lists: pgsql-hackers

On Tue, Aug 15, 2017 at 3:32 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Chris Travers <chris(dot)travers(at)adjust(dot)com> writes:
> > I wonder about a different solution. Would it be possible to special case
> > vacuum to check for and remove (or just move to where they can be removed)
> > files when vacuuming pg_class? At the point we are vacuuming pg_class, we
> > ought to be able to know that a relfilenode shouldn't be used anymore,
> > right?
>
> I don't think so. It's not clear to me whether you have in mind "scan
> pg_class, collect relfilenodes from all live tuples, then zap all files
> not in that set" or "when removing a dead tuple, zap the relfilenode
> it mentions". But neither one works. The first case has a race condition
> against new pg_class entries. As for the second, the existence of a dead
> tuple bearing relfilenode N isn't evidence that some other live tuple
> can't have relfilenode N.
>

Ah, because if the file never made it onto disk, the number could be
reused.

>
> Another problem for the second solution is that in the case you're worried
> about (ie, PANIC due to out-of-WAL-space during relation's creating
> transaction), there's no very good reason to expect that the relation's
> pg_class tuple ever made it to disk at all.
>
> A traditional low-tech answer to this has been to keep the WAL on a
> separate volume from the main data store, so that it's protected from
> out-of-space conditions in the main store and temp areas. The space
> needs for WAL proper are generally much more predictable than the main
> store, so it's easier to keep the dedicated space from overflowing.
> (Stalled replication/archiving processes can be hazardous to your
> health in this scenario, though, if they prevent prompt recycling of
> WAL files.)
>

Yeah, most of our dbs here have WAL on a separate volume, but not this
system. This system is also unusual in that disk usage varies wildly (and
I am not 100% sure that this is the only case which causes it, though I can
reproduce it consistently when the WAL writer runs out of disk space, with
symptoms exactly matching what I found).

So for now that leaves my fallback approach as a way to fix it when I see
it.

I have written a shell script which does as follows:
1. Starts Postgres in single-user mode with a data directory, or dies
(it won't run if Postgres seems to be already running).
2. Gets the oid of the current database.
3. Lists all files consisting of only digits in the base/[dboid] directory.
4. Asks Postgres (in single-user mode again) for all relfilenodes and oids
of tables (in my testing both were required because there were some cases
where relfilenodes were not set for some system catalogs).
5. Loops through the file nodes gathered, checks them against the
relfilenode entries, and zaps $f, $f_*, and $f.*. Currently, for testing,
"zapping" means moving to a lostnfound folder for inspection after the
script runs. The logic here is not perfect and is very slightly
under-inclusive, but better that than the other way.

Then we can start Postgres again. I cannot find a better way to avoid race
conditions, I guess. At any rate, it sounds like preventing the problem
more generally may be beyond what I would feel comfortable trying to do as
a patch at my current level of familiarity with the source code.

The full script is included inline below my signature in case it is of
interest to anyone on the list.

> regards, tom lane
>

--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com

Saarbrücker Straße 37a, 10405 Berlin

---

#!/bin/bash
# Usage: cleanup_orphans.sh <datadir> <database>
# Moves files in base/<dboid> that match no pg_class oid or relfilenode
# into ./lostnfound for later inspection.

datadir=$1
database=$2

pg_ctl -D "$datadir" stop

dboidfile="$PWD/cleanupdb.oid"
reloidfile="$PWD/relfilenodes.list"

# Get the oid of the target database, via single-user mode.
echo "COPY (select oid from pg_database where datname = current_database())
TO '$dboidfile'" | postgres --single -D "$datadir" "$database" > /dev/null

if (($?))
then
    echo "FATAL: Could not start Postgres in single user mode"
    exit 1
fi

dboid=$(cat "$dboidfile")
# Candidate relation files: all-digit names, no fork or segment suffix.
filenodes=$(cd "$datadir/base/$dboid"; ls [0-9]*[0-9] | grep -v '\.' | sort -n)

# Every known relfilenode and oid from pg_class (both are needed, since
# relfilenode is 0 for mapped system catalogs).
echo "COPY (select relfilenode from pg_class union select oid as
relfilenode from pg_class) TO '$reloidfile'" | postgres --single \
-D "$datadir" "$database" > /dev/null
relfilenodes=$(cat "$reloidfile")

if [[ -z $relfilenodes ]]
then
    echo "FATAL: did not get any relfilenodes"
    exit 2
fi

mkdir -p lostnfound
for f in $filenodes
do
    # Move any file node pg_class does not know about, along with its
    # fork files (${f}_*) and additional 1 GB segments (${f}.*).
    if [[ -z $(echo "$relfilenodes" | grep -w "$f") ]]
    then
        echo "moving $f to lostnfound"
        mv "$datadir/base/$dboid/$f" lostnfound
        mv "$datadir/base/$dboid/${f}_"* lostnfound 2> /dev/null
        mv "$datadir/base/$dboid/${f}."* lostnfound 2> /dev/null
    fi
done

rm "$dboidfile" "$reloidfile"


From: Chris Travers <chris(dot)travers(at)adjust(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Orphaned files in base/[oid]
Date: 2017-08-16 12:20:02
Message-ID: CAN-RpxDPCVRUcams6i1ALTDQdy31578KeAECMKBbfEDeiAOCFw@mail.gmail.com
Lists: pgsql-hackers

So having thought about this a bit more, and having had some real-world
experience with the script now, I have an idea that might work, and some
questions to make it succeed.

My thinking is to add a new form of vacuum called VACUUM FSCK.

This would:
1. Lock pg_class in exclusive mode (or do I need access exclusive?), as
this is needed to solve the race conditions. As I see it, this brings the
database to a screeching halt concurrency-wise (but unlike my script it
would allow other databases to be accessed normally).
2. Read the files whose names consist of only digits out of the
filesystem and compare them with the oids and relfilenodes from pg_class.
3. Any file not found in that list would then be unlinked, along with any
files with that pattern followed by an underscore or period.

This would mean that the following cases would not be handled:

If the first extent is gone but later extents are present, we check only
for the first extent and so would not see the later ones. The same goes
for visibility maps and other helper files.

If you add a file to the directory with a name like 34F3A222BC, it would
never get cleaned up, because the name contains non-digits.

So this leads to the following questions:

1. Is locking pg_class enough to avoid race conditions? Is exclusive mode
sufficient, or do I need access exclusive mode?
2. Would it be preferable to move the files to a directory rather than
unlinking them?
3. Should I perform any sort of check on the tables at the end to make
sure everything is OK?

--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Chris Travers <chris(dot)travers(at)adjust(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Orphaned files in base/[oid]
Date: 2017-08-16 14:59:30
Message-ID: CA+Tgmoadvr3tQOG-EttHDaLpYPY10YNF8knAzTxYYTuPiKt7Qw@mail.gmail.com
Lists: pgsql-hackers

On Mon, Aug 14, 2017 at 2:56 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> I think there are some possibilities to close the gap here. We could
> e.g. have <relfilenode>.delete_on_crash marker files that get installed
> when creating a new persistent relfilenode. If we set up things so they
> get deleted post commit, but inside the critical section, we could rely
> on them being present in case of crash, but consistently removed during
> WAL replay. At the end of recovery, iterate over the whole datadir and
> nuke all relations with marker files present.

At the risk of being predictable, I think we should add an undo
subsystem instead of continuing to create ad-hoc solutions to problems
like this. (Of course, that's being worked on by Thomas, Amit, and
others.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


From: Andres Freund <andres(at)anarazel(dot)de>
To: Chris Travers <chris(dot)travers(at)adjust(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Orphaned files in base/[oid]
Date: 2017-08-16 17:15:51
Message-ID: 20170816171551.moi5hzgomtmvql6q@alap3.anarazel.de
Lists: pgsql-hackers

On 2017-08-16 14:20:02 +0200, Chris Travers wrote:
> So having thought about this a bit more, and having had some real-world
> experience with the script now, I have an idea that might work, and some
> questions to make it succeed.
>
> My thinking is to add a new form of vacuum called VACUUM FSCK.
>
> [...]

I think this is entirely the wrong approach. We shouldn't add weird
check commands that require locks on pg_class; we should avoid leaving
the orphaned files in the first place. I've outlined approaches for how to
do so upthread.

Greetings,

Andres Freund


From: Chris Travers <chris(dot)travers(at)adjust(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Orphaned files in base/[oid]
Date: 2017-08-17 05:09:24
Message-ID: CAN-RpxA1NwmfQ_MboT=v=KuSBvc=1xwdp76trr4VxvGH48bizA@mail.gmail.com
Lists: pgsql-hackers

On Wed, Aug 16, 2017 at 7:15 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:

>
>
> I think this is entirely the wrong approach. We shouldn't add weird
> check commands that require locks on pg_class; we should avoid leaving
> the orphaned files in the first place. I've outlined approaches for how to
> do so upthread.
>

And in the meantime, I suppose there is no reason the approach I have
outlined cannot be used by an external program. I think it is sane to draw
the line at having the database responsible for cleaning up after itself
when something goes wrong, leaving external programs to do after-the-fact
review and cleanup.

>
> Greetings,
>
> Andres Freund
>

--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com
Saarbrücker Straße 37a, 10405 Berlin


From: Alexey Gordeev <goa(at)arenadata(dot)io>
To: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Chris Travers <chris(dot)travers(at)adjust(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Orphaned files in base/[oid]
Date: 2023-05-12 10:06:07
Message-ID: b34ff9a3-55f3-6875-4bf7-45fdff896949@arenadata.io
Lists: pgsql-hackers

On 14.08.2017 23:56, Andres Freund wrote:
> Alternatively we could do something without marker files, with some
> added complexity: Keep track of all "uncommitted new files" in memory,
> and log them every checkpoint. Commit/abort records clear elements of
> that list. Since we always start replay at the beginning of a
> checkpoint, we'd always reach a moment with such an up2date list of
> pending-action files before reaching end-of-recovery. At end-of-recovery
> we can delete all unconfirmed files. To avoid out-of-memory due to too
> many tracked relations, we'd possibly still have to have marker files...
>

Hi, hackers.

I'm sorry, but I want to bump this thread, because there is still no
good solution to the problem. I see there are a few threads with
undo-based approaches, which look preferable but have some pitfalls.
Is there any chance of returning to the non-undo approaches partially
discussed here? What do you think about the following solutions?
1) Make `pendingDeletes` shared and let the postmaster clean up all
garbage when a child process dies. Cons: does not work if the postmaster
itself dies, and we would have to take care of `pendingDeletes` pointer
validity.
2) Catch and store all records carrying a relfilenode during WAL replay,
and delete all orphaned nodes at the end of replay. Cons: the final
delete may use an incomplete list of nodes, since something may have been
created before the latest checkpoint. There is also a general opacity: we
would remove files without a corresponding WAL record (or possibly do it
in the wrong place in general).
3) This is close to the approach I quoted, and a combination of the two
above. `pendingDeletes` is shared; each checkpoint creates a WAL record
with a list of open transactions and created nodes. WAL replay can use
this list as a base, adding nodes to it from newer records, so the final
delete operation has a complete list of orphaned nodes. Cons:
complexity(?). Others(?).

Can this work? Are any of these approaches still relevant?


--
Regards,
Alex Go, C developer
goa(at)arenadata(dot)io, www.arenadata.tech