New Linux xfs/reiser file systems

Lists:	pgsql-hackers

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Subject:	New Linux xfs/reiser file systems
Date:	2001-05-02 17:35:37
Message-ID:	200105021735.f42HZbY16618@candle.pha.pa.us

I was talking to a Linux user yesterday, and he said that performance
using the xfs file system is pretty bad. He believes it has to do with
the fact that fsync() on log-based file systems requires more writes.

With a standard BSD/ext2 file system, WAL writes can stay on the same
cylinder to perform fsync. Is that true of log-based file systems?

I know xfs and reiser are both log based. Do we need to be concerned
about PostgreSQL performance on these file systems? I use BSD FFS with
soft updates here, so it doesn't affect me.
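
One way to put numbers on the question is to time write+fsync pairs on each mount. The following is only a rough sketch in Python: the function name `time_fsync_appends`, the block size, and the iteration count are illustrative assumptions, and it measures a single appending file rather than anything PostgreSQL-specific.

```python
import os
import tempfile
import time


def time_fsync_appends(directory, count=50, size=8192):
    """Append `count` blocks of `size` bytes, calling fsync() after each,
    and return the average seconds spent per write+fsync pair."""
    fd, path = tempfile.mkstemp(dir=directory)
    block = b"\0" * size
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, block)
            os.fsync(fd)  # ask the kernel to push this file to stable storage
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.unlink(path)  # leave nothing behind on the tested mount
    return elapsed / count


if __name__ == "__main__":
    # "." is a placeholder; point it at a directory on the mount to test.
    print("avg seconds per write+fsync:", time_fsync_appends("."))
```

Running it once against a directory on each file system of interest gives a crude comparison; results will also depend on the drive and its write cache.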

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026

From:	Alfred Perlstein <bright(at)wintelcom(dot)net>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-02 21:28:07
Message-ID:	20010502142807.T18676@fw.wintelcom.net

* Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> [010502 14:01] wrote:
> I was talking to a Linux user yesterday, and he said that performance
> using the xfs file system is pretty bad. He believes it has to do with
> the fact that fsync() on log-based file systems requires more writes.
>
> With a standard BSD/ext2 file system, WAL writes can stay on the same
> cylinder to perform fsync. Is that true of log-based file systems?
>
> I know xfs and reiser are both log based. Do we need to be concerned
> about PostgreSQL performance on these file systems? I use BSD FFS with
> soft updates here, so it doesn't affect me.

The "problem" with log based filesystems is that they most likely
do not know the consequences of a write so an fsync on a file may
require double writing to both the log and the "real" portion of
the disk. They can also exhibit the problem that an fsync may
cause all pending writes to require scheduling unless the log is
constructed on the fly rather than incrementally.

There was also the problem that was brought up recently that
certain versions (maybe all?) of Linux perform fsync() in a very
non-optimal manner. If the user is able to use the O_FSYNC option
rather than fsync(), he may see a performance increase.
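
To make the two options concrete, here is a minimal Python sketch of both styles. The function names are hypothetical, and it assumes the platform exposes the POSIX flag `os.O_SYNC` (as far as I know, O_FSYNC is the BSD spelling of the same idea). It shows the calling pattern only and says nothing about which one is faster on a given kernel.

```python
import os
import tempfile

# Python exposes the POSIX name O_SYNC where the platform has it.
# Fall back to 0 (no flag) if it is missing, which disables the sync behaviour.
SYNC_FLAG = getattr(os, "O_SYNC", 0)


def write_then_fsync(path, blocks):
    """Ordinary writes, each followed by an explicit fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        for block in blocks:
            os.write(fd, block)
            os.fsync(fd)
    finally:
        os.close(fd)


def write_synchronously(path, blocks):
    """Open with O_SYNC so each write() is itself synchronous."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | SYNC_FLAG
    fd = os.open(path, flags, 0o600)
    try:
        for block in blocks:
            os.write(fd, block)  # no separate fsync() call
    finally:
        os.close(fd)
```

Both functions leave the same bytes in the file; the difference is only in how the kernel is asked to flush them.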

But his guess is probably nearly as good as mine. :)

--
-Alfred Perlstein - [alfred(at)freebsd(dot)org]
http://www.egr.unlv.edu/~slumos/on-netbsd.html

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Alfred Perlstein <bright(at)wintelcom(dot)net>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-02 21:36:45
Message-ID:	200105022136.f42Lajl01886@candle.pha.pa.us

> The "problem" with log based filesystems is that they most likely
> do not know the consequences of a write so an fsync on a file may
> require double writing to both the log and the "real" portion of
> the disk. They can also exhibit the problem that an fsync may
> cause all pending writes to require scheduling unless the log is
> constructed on the fly rather than incrementally.

Yes, this double-writing is a problem. Suppose you have your WAL on a
separate drive. You can fsync() WAL with zero head movement. With a
log based file system, you need two head movements, so you have gone
from zero movements to two.

From:	Alfred Perlstein <bright(at)wintelcom(dot)net>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-02 23:06:02
Message-ID:	20010502160602.X18676@fw.wintelcom.net

* Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> [010502 15:20] wrote:
> > The "problem" with log based filesystems is that they most likely
> > do not know the consequences of a write so an fsync on a file may
> > require double writing to both the log and the "real" portion of
> > the disk. They can also exhibit the problem that an fsync may
> > cause all pending writes to require scheduling unless the log is
> > constructed on the fly rather than incrementally.
>
> Yes, this double-writing is a problem. Suppose you have your WAL on a
> separate drive. You can fsync() WAL with zero head movement. With a
> log based file system, you need two head movements, so you have gone
> from zero movements to two.

It may be worse depending on how the filesystem actually does
journalling. I wonder if an fsync() may cause ALL pending
meta-data to be updated (even metadata not related to the
postgresql files).

Do you know if reiser or xfs have this problem?

--
-Alfred Perlstein - [alfred(at)freebsd(dot)org]
Daemon News Magazine in your snail-mail! http://magazine.daemonnews.org/

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Alfred Perlstein <bright(at)wintelcom(dot)net>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-03 00:18:21
Message-ID:	200105030018.f430ILA02066@candle.pha.pa.us

> > Yes, this double-writing is a problem. Suppose you have your WAL on a
> > separate drive. You can fsync() WAL with zero head movement. With a
> > log based file system, you need two head movements, so you have gone
> > from zero movements to two.
>
> It may be worse depending on how the filesystem actually does
> journalling. I wonder if an fsync() may cause ALL pending
> meta-data to be updated (even metadata not related to the
> postgresql files).
>
> Do you know if reiser or xfs have this problem?

I don't know, but the Linux user reported xfs was really slow.

From:	mlw <markw(at)mohawksoft(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-03 12:09:01
Message-ID:	3AF14A5D.3CC8C031@mohawksoft.com

Bruce Momjian wrote:
>
> I was talking to a Linux user yesterday, and he said that performance
> using the xfs file system is pretty bad. He believes it has to do with
> the fact that fsync() on log-based file systems requires more writes.
>
> With a standard BSD/ext2 file system, WAL writes can stay on the same
> cylinder to perform fsync. Is that true of log-based file systems?
>
> I know xfs and reiser are both log based. Do we need to be concerned
> about PostgreSQL performance on these file systems? I use BSD FFS with
> soft updates here, so it doesn't affect me.

I did see poor performance on reiserfs; I have not yet ventured into using
xfs.

It occurs to me that journalizing file systems will almost always be slower on
an application such as postgres. The journalizing file system is trying to
maintain data integrity for an application which is also trying to maintain
data integrity. There will always be extra work involved.

This behavior raises the question about file system usage in Postgres. Many
databases, such as Oracle, create table space files and operate directly on the
raw blocks, bypassing the file system altogether.

On one hand, Postgres is easy to use and maintain because it cooperates with
the native file system; on the other hand, it incurs the overhead of whatever
silliness the file system wants to do.

I would bet it is a huge amount of work to use a "table space" system and no
one wants that. lol. However, it should be noted that a bit more control over
database layout would make some great performance improvements.

The ability to put indexes on a separate volume from data.
The ability to put different tables on different volumes.
And so on.

In the short term, I think poor performance on a journalizing file system is to
be expected, unless there is an IOCTL to tell the FS to leave the files alone
(and postgres calls it). A Linux HOWTO which informs people that certain file
systems will have performance issues and why should handle the problem.

Perhaps we can convince the Linux community to create a "dbfs" which is a
stripped down simple no nonsense file system designed for applications like
databases?

--
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com

From:	Matthew Kirkwood <matthew(at)hairy(dot)beasts(dot)org>
To:	mlw <markw(at)mohawksoft(dot)com>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-03 12:23:02
Message-ID:	Pine.LNX.4.30.0105031316200.20478-100000@sphinx.mythic-beasts.com

On Thu, 3 May 2001, mlw wrote:

> I would bet it is a huge amount of work to use a "table space" system
> and no one wants that.

From some stracing of 7.1, the most common syscall issued by
postgres is an lseek() to the end of the file, presumably to
find its length, which seems to happen up to about a dozen
times per (pgbench) transaction.

Tablespaces would solve this (not that lseek is a particularly
expensive operation, of course).

> Perhaps we can convince the Linux community to create a "dbfs" which
> is a stripped down simple no nonsense file system designed for
> applications like databases?

Sync-metadata ext2 should be fine. Filesystems fsck pretty
quick when they contain only a few large files.

Otherwise, something like "smugfs" (now obsolete) might do.

Matthew.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Matthew Kirkwood <matthew(at)hairy(dot)beasts(dot)org>
Cc:	Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-03 13:33:11
Message-ID:	23904.988896791@sss.pgh.pa.us

Matthew Kirkwood <matthew(at)hairy(dot)beasts(dot)org> writes:
> From some stracing of 7.1, the most common syscall issued by
> postgres is an lseek() to the end of the file, presumably to
> find its length, which seems to happen up to about a dozen
> times per (pgbench) transaction.

> Tablespaces would solve this (not that lseek is a particularly
> expensive operation, of course).

No, they wouldn't; or at least they'd just create a different problem.
The reason for the lseek is that the file length may have changed since
the current backend last checked it. To avoid lseek we'd need some
shared data structure that maintains the current length of every active
table, which would be a nuisance to maintain and probably a source of
contention delays.
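
The behaviour described above can be illustrated with a small Python sketch. It is a toy model, not PostgreSQL code: two descriptors on one file stand in for two backends, and `current_length` and `demo` are hypothetical names.

```python
import os
import tempfile


def current_length(fd):
    """Return the file's length by seeking to its end."""
    return os.lseek(fd, 0, os.SEEK_END)


def demo():
    fd_writer, path = tempfile.mkstemp()
    fd_reader = os.open(path, os.O_RDONLY)  # stands in for a second backend
    try:
        before = current_length(fd_reader)
        os.write(fd_writer, b"x" * 8192)  # the other "backend" extends the file
        after = current_length(fd_reader)  # only a fresh lseek sees the growth
        return before, after
    finally:
        os.close(fd_reader)
        os.close(fd_writer)
        os.unlink(path)
```

A length remembered from the first call would be stale after the write, which is why the reader has to ask the kernel again unless the length is shared some other way.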

(Of course, such a data structure would just be the tip of the iceberg
of what we'd have to maintain for ourselves if we couldn't depend on the
kernel to do it for us. Reimplementing a filesystem doesn't strike me
as a profitable use of our time.)

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	mlw <markw(at)mohawksoft(dot)com>
Cc:	Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-03 15:41:24
Message-ID:	200105031541.f43FfOi27094@candle.pha.pa.us

> > I know xfs and reiser are both log based. Do we need to be concerned
> > about PostgreSQL performance on these file systems? I use BSD FFS with
> > soft updates here, so it doesn't affect me.
>
> I did see poor performance on reiserfs, I have not as yet ventured into using
> xfs.
>
> It occurs to me that journalizing file systems will almost always be slower on
> an application such as postgres. The journalizing file system is trying to
> maintain data integrity for an application which is also trying to maintain
> data integrity. There will always be extra work involved.

Yes, the problem is that extra work is required on PostgreSQL's part.
Log-based file systems make sure all the changes get onto the disk in an
orderly way, but I believe it can delay what gets written to the drive.
PostgreSQL wants to be sure all the data is on the disk, period.
Unfortunately, the _orderly_ part makes the _fsync_ part do more work.
By going from ext2 to a log-based file system, we are getting _farther_
from a raw device than if we just stayed with ext2.

ext2 has serious problems with corrupt file systems after a crash, so I
understand the need to move to another file system type. I have been
waiting for Linux to get a more modern file system. Unfortunately, the
new ones seem to be worse for PostgreSQL.

OK, we have considered this, but frankly, the new, modern file systems
like FFS/softupdates have i/o rates near raw speed, with all the
advantages a file system gives us. I believe most commercial dbs are
moving away from raw devices and toward file systems. In the old days
the SysV file system was pretty bad at i/o & fragmentation, so they used
raw devices.

> The ability to put indexes on a separate volume from data.
> The ability to put different tables on different volumes.
> And so on.

We certainly need that, but raw devices would not make this any easier,
I think.

> In the short term, I think poor performance on a journalizing file system is to
> be expected, unless there is an IOCTL to tell the FS to leave the files alone
> (and postgres calls it). A Linux HOWTO which informs people that certain file
> systems will have performance issues and why should handle the problem.
>
> Perhaps we can convince the Linux community to create a "dbfs" which is a
> stripped down simple no nonsense file system designed for applications like
> databases?

It could become a serious problem as people start using reiser/xfs for
their file systems and don't understand the performance problems. Even
more likely is that they will turn off fsync, thinking reiser doesn't
need it, when in fact, I think it does.

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Matthew Kirkwood <matthew(at)hairy(dot)beasts(dot)org>, Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-03 15:42:18
Message-ID:	200105031542.f43FgIX27124@candle.pha.pa.us

> Matthew Kirkwood <matthew(at)hairy(dot)beasts(dot)org> writes:
> > From some stracing of 7.1, the most common syscall issued by
> > postgres is an lseek() to the end of the file, presumably to
> > find its length, which seems to happen up to about a dozen
> > times per (pgbench) transaction.
>
> > Tablespaces would solve this (not that lseek is a particularly
> > expensive operation, of course).
>
> No, they wouldn't; or at least they'd just create a different problem.
> The reason for the lseek is that the file length may have changed since
> the current backend last checked it. To avoid lseek we'd need some
> shared data structure that maintains the current length of every active
> table, which would be a nuisance to maintain and probably a source of
> contention delays.

Seems we should cache the file lengths somehow. Not sure how to do it
because our file system cache is local to each backend.

> (Of course, such a data structure would just be the tip of the iceberg
> of what we'd have to maintain for ourselves if we couldn't depend on the
> kernel to do it for us. Reimplementing a filesystem doesn't strike me
> as a profitable use of our time.)

Ditto. The database is complicated enough.

From:	bpalmer <bpalmer(at)crimelabs(dot)net>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-03 18:20:50
Message-ID:	Pine.BSO.4.30.0105031415360.27393-100000@mizer.crimelabs.net

> > This behavior raises the question about file system usage in Postgres. Many
> > databases, such as Oracle, create table space files and operate directly on the
> > raw blocks, bypassing the file system altogether.
>
> OK, we have considered this, but frankly, the new, modern file systems
> like FFS/softupdates have i/o rates near raw speed, with all the
> advantages a file system gives us. I believe most commercial dbs are
> moving away from raw devices and toward file systems. In the old days
> the SysV file system was pretty bad at i/o & fragmentation, so they used
> raw devices.

I'm starting to like the idea of raw FS for a few reasons:

1) Considering that postgresql now does WAL, the need for a logging FS
for the database doesn't seem as needed (is it needed at all?).

2) Given the fact that postgresql is trying to support many OSs,
depending on, for example, XFS on a linux system will cause many
problems. What about solaris? How about BSD? Etc.. Using raw db MAY be
easier than dealing with the problems that will arise from supporting
multiple filesystems.

That said, the ability to use the system's FS does have its advantages
(backup, moving files, etc).

Just some thoughts..

- Brandon

b. palmer, bpalmer(at)crimelabs(dot)net
pgp: www.crimelabs.net/bpalmer.pgp5

From:	Kaare Rasmussen <kar(at)webline(dot)dk>
To:	Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-03 19:07:17
Message-ID:	01050321071702.22526@bering

> > kernel to do it for us. Reimplementing a filesystem doesn't strike me
> > as a profitable use of our time.)
> Ditto. The database is complicated enough.

Maybe some kind of recommendation would be a good thing. That is, if the
PostgreSQL community has enough knowledge.

A section in the docs that discusses various file systems, so people can make
an intelligent choice.

--
Kaare Rasmussen --Linux, spil,-- Tlf: 3816 2582
Kaki Data tshirts, merchandize Fax: 3816 2501
Howitzvej 75 Åben 14.00-18.00 Web: www.suse.dk
2000 Frederiksberg Lørdag 11.00-17.00 Email: kar(at)webline(dot)dk

From:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To:	mlw <markw(at)mohawksoft(dot)com>
Cc:	Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-03 23:37:14
Message-ID:	Pine.LNX.4.21.0105040931540.24297-100000@linuxworld.com.au

On Thu, 3 May 2001, mlw wrote:

> This behavior raises the question about file system usage in Postgres. Many
> databases, such as Oracle, create table space files and operate directly on the
> raw blocks, bypassing the file system altogether.
>
> On one hand, Postgres is easy to use and maintain because it cooperates with
> the native file system, on the other hand it incurs the overhead of whatever
> silliness the file system wants to do.

It is not *that* hard to write a 'postgresfs' but you have to look at
the problems it creates. One of the biggest problems facing sys admins of
large sites is that the Oracle/DB2/etc DBA, having created the
purpose-built database filesystem, has not allowed enough room for
growth. Like I said, a basic file system is not difficult, but volume
management tools and the maintenance of the whole thing is. Currently,
postgres administrators are not faced with such a problem.

There is, of course, the argument that pgfs need not be enforced. The
problem is that many people would probably use it so as to have a
'superior' installation. This then entails the problems above, creating
more work for core developers.

Gavin

From:	"Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
To:	"mlw" <markw(at)mohawksoft(dot)com>, "Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>, "Hackers List" <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: Re: New Linux xfs/reiser file systems
Date:	2001-05-04 01:08:56
Message-ID:	ECEHIKNFIMMECLEBJFIGAEHFCAAA.chriskl@familyhealth.com.au

Just put a note in the installation docs that the place where the database
is initialised to should be on a non-Reiser, non-XFS mount...

Chris

-----Original Message-----
From: pgsql-hackers-owner(at)postgresql(dot)org
[mailto:pgsql-hackers-owner(at)postgresql(dot)org]On Behalf Of mlw
Sent: Thursday, 3 May 2001 8:09 PM
To: Bruce Momjian; Hackers List
Subject: [HACKERS] Re: New Linux xfs/reiser file systems

I did see poor performance on reiserfs, I have not as yet ventured into
using
xfs.

It occurs to me that journalizing file systems will almost always be slower
on
an application such as postgres. The journalizing file system is trying to
maintain data integrity for an application which is also trying to maintain
data integrity. There will always be extra work involved.

This behavior raises the question about file system usage in Postgres. Many
databases, such as Oracle, create table space files and operate directly on
the
raw blocks, bypassing the file system altogether.

On one hand, Postgres is easy to use and maintain because it cooperates with
the native file system, on the other hand it incurs the overhead of whatever
silliness the file system wants to do.

I would bet it is a huge amount of work to use a "table space" system and no
one wants that. lol. However, it should be noted that a bit more control
over
database layout would make some great performance improvements.

The ability to put indexes on a separate volume from data.
The ability to put different tables on different volumes.
And so on.

In the short term, I think poor performance on a journalizing file system is
to
be expected, unless there is an IOCTL to tell the FS to leave the files
alone
(and postgres calls it). A Linux HOWTO which informs people that certain
file
systems will have performance issues and why should handle the problem.

Perhaps we can convince the Linux community to create a "dbfs" which is a
stripped down simple no nonsense file system designed for applications like
databases?

--
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com

From:	<john(at)mwk(dot)co(dot)nz>
To:	"Hackers List" <pgsql-hackers(at)postgresql(dot)org>
Subject:	Reiser and XFS -- tell the maintainers
Date:	2001-05-04 01:39:35
Message-ID:	004b01c0d43b$140e14c0$1401a8c0@MWK.co.nz

There might be a problem, but if no one mentions it to the maintainers of
those fs's, it will not get fixed...

Regards
John

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc:	mlw <markw(at)mohawksoft(dot)com>, Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-04 01:42:17
Message-ID:	200105040142.f441gH113060@candle.pha.pa.us

> Just put a note in the installation docs that the place where the database
> is initialised to should be on a non-Reiser, non-XFS mount...

Sure, we can do that now. What do we do when these are the default file
systems for Linux? We can tell them to create other types of file
systems, but that is a pretty big hurdle. I wonder if it would be
easier to get reiser/xfs to make some modifications.

From:	"Christopher Kings-Lynne" <chriskl(at)familyhealth(dot)com(dot)au>
To:	"Bruce Momjian" <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	"mlw" <markw(at)mohawksoft(dot)com>, "Hackers List" <pgsql-hackers(at)postgresql(dot)org>
Subject:	RE: Re: New Linux xfs/reiser file systems
Date:	2001-05-04 01:49:39
Message-ID:	ECEHIKNFIMMECLEBJFIGAEHGCAAA.chriskl@familyhealth.com.au

Well, arguably if you're setting up a database server then a reasonable DBA
should think about such things...

(My 2c)

Chris

-----Original Message-----
From: Bruce Momjian [mailto:pgman(at)candle(dot)pha(dot)pa(dot)us]
Sent: Friday, 4 May 2001 9:42 AM
To: Christopher Kings-Lynne
Cc: mlw; Hackers List
Subject: Re: [HACKERS] Re: New Linux xfs/reiser file systems

> Just put a note in the installation docs that the place where the database
> is initialised to should be on a non-Reiser, non-XFS mount...

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc:	mlw <markw(at)mohawksoft(dot)com>, Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-04 01:55:43
Message-ID:	200105040155.f441thG13609@candle.pha.pa.us

> Well, arguably if you're setting up a database server then a reasonable DBA
> should think about such things...

Yes, but people have trouble installing PostgreSQL. I can't imagine
walking them through a newfs.

>
> (My 2c)
>
> Chris
>
> -----Original Message-----
> From: Bruce Momjian [mailto:pgman(at)candle(dot)pha(dot)pa(dot)us]
> Sent: Friday, 4 May 2001 9:42 AM
> To: Christopher Kings-Lynne
> Cc: mlw; Hackers List
> Subject: Re: [HACKERS] Re: New Linux xfs/reiser file systems
>
>
> > Just put a note in the installation docs that the place where the database
> > is initialised to should be on a non-Reiser, non-XFS mount...
>
> Sure, we can do that now. What do we do when these are the default file
> systems for Linux? We can tell them to create other types of file
> systems, but that is a pretty big hurdle. I wonder if it would be
> easier to get reiser/xfs to make some modifications.
>
> --
> Bruce Momjian | http://candle.pha.pa.us
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 853-3000
> + If your life is a hard drive, | 830 Blythe Avenue
> + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
>
>

From:	mlw <markw(at)mohawksoft(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-04 03:20:45
Message-ID:	3AF2200D.922E5723@mohawksoft.com

Bruce Momjian wrote:
>
> > Just put a note in the installation docs that the place where the database
> > is initialised to should be on a non-Reiser, non-XFS mount...
>
> Sure, we can do that now. What do we do when these are the default file
> systems for Linux? We can tell them to create other types of file
> systems, but that is a pretty big hurdle. I wonder if it would be
> easier to get reiser/xfs to make some modifications.

I have looked at Reiser, and I don't think it is a file system suited for very
large files, or applications such as postgres. The Linux crowd should lobby
against any such trend. It is ok for many moderately small files. ReiserFS
would be great for a cddb server, but poor for a database box.

XFS is a real big file system project, I'd bet that there are file properties
or management tools to tell it to leave directories and files alone. They
should have addressed that years ago.

One last mention..

Having better control over WHERE various files in a database are located can
make it easier to deal with these things.

Just a thought. ;-)

--
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com

From:	Thomas Swan <tswan(at)ics(dot)olemiss(dot)edu>
To:	mlw <markw(at)mohawksoft(dot)com>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-04 07:09:23
Message-ID:	3AF255A3.8080904@ics.olemiss.edu

mlw wrote:

>Bruce Momjian wrote:
>
>>>Just put a note in the installation docs that the place where the database
>>>is initialised to should be on a non-Reiser, non-XFS mount...
>>>
>>Sure, we can do that now. What do we do when these are the default file
>>systems for Linux? We can tell them to create other types of file
>>systems, but that is a pretty big hurdle. I wonder if it would be
>>easier to get reiser/xfs to make some modifications.
>>
>
>
>I have looked at Reiser, and I don't think it is a file system suited for very
>large files, or applications such as postgres. The Linux crowd should lobby
>against any such trend. It is ok for many moderately small files. ReiserFS
>would be great for a cddb server, but poor for a database box.
>
>XFS is a real big file system project, I'd bet that there are file properties
>or management tools to tell it to leave directories and files alone. They
>should have addressed that years ago.
>
>One last mention..
>
>Having better control over WHERE various files in a database are located can
>make it easier to deal with these things.
>
I think it's worth noting that Oracle has been petitioning the kernel
developers for better raw device support: in other words, the ability to
write directly to the hard disk and bypass the filesystem
altogether.

If the db is going to assume the responsibility of disk write
verification it seems reasonable to assume you might want to investigate
the raw disk i/o options.

Telling your installers that a major performance gain is attainable by
doing so might be a start in the opposite direction. I've monitored a
lot of discussions and from what I can gather, postgresql does its own
set of journaling operations. I don't think that it's necessary for
writes to be double journalled anyway.

Again, just my two cents worth...

From:	Michael Samuel <michael(at)miknet(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-04 11:35:34
Message-ID:	20010504213534.A4596@miknet.net

On Thu, May 03, 2001 at 11:41:24AM -0400, Bruce Momjian wrote:
> ext2 has serious problems with corrupt file systems after a crash, so I
> understand the need to move to another file system type. I have been
> waiting for Linux to get a more modern file system. Unfortunately, the
> new ones seem to be worse for PostgreSQL.

If you fsync() a directory in Linux, all the metadata within that directory
will be written out to disk.
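
For reference, fsync() on a directory looks like this from user space. This is a minimal Python sketch with a hypothetical helper name, `create_durably`. It assumes a Linux-like system where a directory can be opened read-only and passed to fsync(). How much metadata the call actually flushes depends on the file system, which the sketch does not try to show.

```python
import os
import tempfile


def create_durably(directory, name, data):
    """Create a file, sync its contents, then sync the containing directory
    so the new directory entry is also pushed to disk."""
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)  # file data and the file's own metadata
    finally:
        os.close(fd)

    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)  # metadata held in the directory itself
    finally:
        os.close(dir_fd)
    return path
```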

As for filesystem corruption, I can say the e2fsck is among the best fsck
programs out there, and I've only ever had 1 occasion where I've lost any
data on an ext2 filesystem, and that was due to bad sectors causing me to
lose the root directory. (Well, apart from human errors, but that doesn't
count)

> OK, we have considered this, but frankly, the new, modern file systems
> like FFS/softupdates have i/o rates near raw speed, with all the
> advantages a file system gives us. I believe most commercial dbs are
> moving away from raw devices and toward file systems. In the old days
> the SysV file system was pretty bad at i/o & fragmentation, so they used
> raw devices.

And Solaris' 1/01 media has better support for O_DIRECT (?), which they claim
gives you 93% of the speed of a raw device. (Or something like that; I read
this in marketing material a couple of months ago)

Raw devices are designed to have filesystems on them. The only excuses for
userland tools accessing them are fs-specific tools (e.g. dump, fsck, etc.),
or for non-unix filesystem tools, where the unix VFS doesn't handle things
properly (hfstools).

> > The ability to put indexes on a separate volume from data.
> > The ability to put different tables on different volumes.
> > And so on.
>
> We certainly need that, but raw devices would not make this any easier,
> I think.

It would be cool if either at compile time or at database creation time, we
could specify a printf-like format for placing tables, indexes, etc.
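
One way to picture the printf-like placement format suggested above is a
small template expander. This is only a sketch: the placeholder letters
(%d for database, %k for object kind, %o for OID) and the function name
expand_placement are invented here and are not an existing PostgreSQL
feature.

```python
import re


def expand_placement(fmt: str, database: str, kind: str, oid: int) -> str:
    """Expand hypothetical placeholders in a file-placement template.

    %d -> database name, %k -> object kind ("table", "index", ...),
    %o -> object OID, %% -> a literal percent sign.
    """
    values = {"d": database, "k": kind, "o": str(oid), "%": "%"}

    def substitute(match: "re.Match[str]") -> str:
        key = match.group(1)
        if key not in values:
            raise ValueError(f"unknown placeholder %{key}")
        return values[key]

    return re.sub(r"%(.)", substitute, fmt)
```

With the template "/vol_%k/%d/%o", an index with OID 125575 in a database
named sales would land at /vol_index/sales/125575, so tables and indexes
could be steered to different volumes by mounting /vol_table and /vol_index
separately.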

> It could become a serious problem as people start using reiser/xfs for
> their file systems and don't understand the performance problems. Even
> more likely is that they will turn off fsync, thinking reiser doesn't
> need it, when in fact, I think it does.

ReiserFS only supports metadata logging. The performance slowdown must be
due to logging things like mtime or atime, because otherwise ReiserFS is a
very high performance FS. (Although, I admittedly haven't used it since it
was early in its development)

--
Michael Samuel <michael(at)miknet(dot)net>

From:	mlw <markw(at)mohawksoft(dot)com>
To:	Michael Samuel <michael(at)miknet(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-04 12:02:17
Message-ID:	3AF29A49.102B57E4@mohawksoft.com
Lists:	pgsql-hackers

Michael Samuel wrote:
>
> ReiserFS only supports metadata logging. The performance slowdown must be
> due to logging things like mtime or atime, because otherwise ReiserFS is a
> very high performance FS. (Although, I admittedly haven't used it since it
> was early in it's development)

The way I understand it is that ReiserFS does not attempt to separate files at
the block level. Multiple files can live in the same disk block. This is cool
if you have many small files, but the extra overhead for large files such as
those used by a database, is a bit much.

I read some stuff about a year ago, and my impressions forced me to conclude
that ReiserFS was geared toward applications. Which is a pretty good thing for
applications, but not for databases.

I really think a simple low down dirty file system is just what the doctor
ordered for postgres.

Remember, general purpose file systems must do for files what Postgres is
already doing for records. You will always have extra work. I am seriously
thinking of trying a FAT32 as pg_xlog. I wonder if it will improve performance,
or if there is just something fundamentally stupid about FAT32 that will make
it worse?

--
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com

From:	teg(at)redhat(dot)com (Trond Eivind Glomsrød)
To:	mlw <markw(at)mohawksoft(dot)com>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-04 13:33:07
Message-ID:	xuyk83xxmr0.fsf@halden.devel.redhat.com
Lists:	pgsql-hackers

mlw <markw(at)mohawksoft(dot)com> writes:

> I have looked at Reiser, and I don't think it is a file system suited for very
> large files, or applications such as postgres.

What's the problem with big files? ReiserFS v2 doesn't seem to support
them, while v3 (of the on-disk format) seems just fine.

That said, I'm certainly looking forward to xfs - I believe it will be
the most widely used of the current batch of journaling file systems
(reiserfs, jfs, XFS and ext3, the latter mainly focusing on an easy
migration path for existing systems).

--
Trond Eivind Glomsrød
Red Hat, Inc.

From:	Michael Samuel <michael(at)miknet(dot)net>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-04 13:50:22
Message-ID:	20010504235022.B4596@miknet.net
Lists:	pgsql-hackers

On Fri, May 04, 2001 at 08:02:17AM -0400, mlw wrote:
> The way I understand it is that ReiserFS does not attempt to separate files at
> the block level. Multiple files can live in the same disk block. This is cool
> if you have many small files, but the extra overhead for large files such as
> those used by a database, is a bit much.

It should be at least as fast as other filesystems for large files. I suspect
that it would be faster in fact. The only catch is that the performance of
reiserfs sucks when it gets past 85% or so full. (ext2 has similar problems)

You can read about all this stuff at http://www.namesys.com/

> I really think a simple low down dirty file system is just what the doctor
> ordered for postgres.

Traditional BSD FFS or Solaris UFS is probably the best bet for postgres.

> Remember, general purpose file systems must do for files what Postgres is
> already doing for records. You will always have extra work. I am seriously
> thinking of trying a FAT32 as pg_xlog. I wonder if it will improve performance,
> or if there is just something fundamentally stupid about FAT32 that will make
> it worse?

Well, for starters, file permissions...

Ext2 would kick arse over FAT32 for performance.

--
Michael Samuel <michael(at)miknet(dot)net>

From:	Roland Roberts <roland(at)astrofoto(dot)org>
To:	Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-04 14:24:53
Message-ID:	m21yq5b39m.fsf@tycho.rlent.pnet
Lists:	pgsql-hackers

>>>>> "Bruce" == Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:

>> Well, arguably if you're setting up a database server then a
>> reasonable DBA should think about such things...

Bruce> Yes, but people have trouble installing PostgreSQL. I
Bruce> can't imagine walking them through a newfs.

In most of linux-land, the DBA is probably also the sysadmin. In
bigger shops, and those which currently run, say Oracle or Sybase, the
two roles are separate. When they are separate, you don't have to
walk the DBA through it; he just walks over to the sysadmin and says
"I need X megabytes of space on a new Y filesystem."

roland
--
PGP Key ID: 66 BC 3B CD
Roland B. Roberts, PhD RL Enterprises
roland(at)rlenter(dot)com 76-15 113th Street, Apt 3B
rbroberts(at)acm(dot)org Forest Hills, NY 11375

From:	teg(at)redhat(dot)com (Trond Eivind Glomsrød)
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, sct(at)redhat(dot)com
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-04 15:04:30
Message-ID:	xuyhez1p341.fsf@halden.devel.redhat.com
Lists:	pgsql-hackers

I got some information from Stephen Tweedie on this - please keep him
"Cc:" as he's not on this list

************************************************************************
Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:

> I was talking to a Linux user yesterday, and he said that performance
> using the xfs file system is pretty bad. He believes it has to do with
> the fact that fsync() on log-based file systems requires more writes.

Performance doing what? XFS has known performance problems doing
unlinks and truncates, but not synchronous IO. The user should be
using fdatasync() for databases, btw, not fsync().

First, XFS, ext3 and reiserfs are *NOT* log-based filesystems. They
are journaling filesystems. They have a log, but they are not
log-based because they do not store data permanently in a log
structure. Berkeley LFS, Sprite and Spiralog are log-based
filesystems.

> With a standard BSD/ext2 file system, WAL writes can stay on the same
> cylinder to perform fsync. Is that true of log-based file systems?

Not true on ext2 or BSD. Write-aheads are _usually_ close to the
inode, but not always. For true log-based filesystems, writes are
always completely sequential, so the issue just goes away. For
journaling filesystems, depending on the setup there may be a seek to
the journal involved, but some journaling filesystems can use a
separate disk for the journal so no seek is required.

> I know xfs and reiser are both log based. Do we need to be concerned
> about PostgreSQL performance on these file systems? I use BSD FFS with
> soft updates here, so it doesn't affect me.

A database normally preallocates its data files and then performs most
of its writes using update-in-place. In such cases, fsync() is almost
always the wrong thing to be doing --- the data writes have changed
nothing in the inode except for the timestamps, and there's no need to
flush the timestamps to disk for every write. fdatasync() is
designed for this --- if the only inode change is timestamps,
fdatasync() will skip the seek to the inode and will only update the
data. If any significant inode fields have been changed, then a full
flush is done.

Using fdatasync, most filesystems will incur no seeks for data flush,
regardless of whether the filesystem is journaling or not.

Cheers,
Stephen
************************************************************************
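
The pattern Stephen describes (preallocate once, then overwrite in place
and flush only the data) can be sketched as follows. The 8192-byte block
size and the helper names are assumptions for illustration; os.fdatasync is
used where Python exposes it, with os.fsync as the fallback.

```python
import os

BLOCK = 8192  # assumed block size for this illustration

# fdatasync() is not available on every platform; fall back to fsync().
_sync = getattr(os, "fdatasync", os.fsync)


def preallocate(path: str, blocks: int) -> None:
    """Write zero-filled blocks so later writes never extend the file."""
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(b"\0" * BLOCK)
        f.flush()
        os.fsync(f.fileno())  # the size changed, so do one full fsync()


def update_in_place(path: str, block_no: int, data: bytes) -> None:
    """Overwrite one existing block and flush only the file data."""
    if len(data) != BLOCK:
        raise ValueError("data must be exactly one block")
    fd = os.open(path, os.O_WRONLY)
    try:
        os.pwrite(fd, data, block_no * BLOCK)
        _sync(fd)  # only timestamps changed in the inode
    finally:
        os.close(fd)
```

Because update_in_place never changes the file size, the only inode change
is the timestamps, which is the case where fdatasync() is allowed to skip
the inode write.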

--
Trond Eivind Glomsrød
Red Hat, Inc.

From:	Kaare Rasmussen <kar(at)webline(dot)dk>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc:	mlw <markw(at)mohawksoft(dot)com>, Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-04 15:59:18
Message-ID:	01050417591803.23259@bering
Lists:	pgsql-hackers

> Sure, we can do that now. What do we do when these are the default file
> systems for Linux? We can tell them to create other types of file

What is a 'default file system'? I know that until now, everybody has been using
ext2. But that's only because there hasn't been anything comparable. Now we
see ReiserFS, and my SuSE installation offers the choice. In the future, I
believe that people can choose from ext2, ReiserFS, xfs, ext3 and maybe more.

> systems, but that is a pretty big hurdle. I wonder if it would be
> easier to get reiser/xfs to make some modifications.

No, I don't think it's a big hurdle. If you just want to play with
PostgreSQL, you won't care. If you're serious, you'll repartition.

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Michael Samuel <michael(at)miknet(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-04 16:48:58
Message-ID:	200105041648.f44GmwB24795@candle.pha.pa.us
Lists:	pgsql-hackers

> On Fri, May 04, 2001 at 08:02:17AM -0400, mlw wrote:
> > The way I understand it is that ReiserFS does not attempt to separate files at
> > the block level. Multiple files can live in the same disk block. This is cool
> > if you have many small files, but the extra overhead for large files such as
> > those used by a database, is a bit much.
>
> It should be at least as fast as other filesystems for large files. I suspect
> that it would be faster in fact. The only catch is that the performance of
> reiserfs sucks when it gets past 85% or so full. (ext2 has similar problems)

That is pretty standard for most modern file systems. They need that
free space to optimize.

>
> You can read about all this stuff at http://www.namesys.com/
>
> > I really think a simple low down dirty file system is just what the doctor
> > ordered for postgres.
>
> Traditional BSD FFS or Solaris UFS is probably the best bet for postgres.

That is my opinion. BSD FFS seems to be general enough to give good
performance for a wide range of application needs. It is not as fast
as XFS for streaming large files (media), and it doesn't optimize small
files below the 1k size (fragments), and it does require fsck on reboot.

However, looking at all those for PostgreSQL, the costs of the new Linux
file systems seem pretty high, especially considering our need for
fsync().

What I am really concerned about is when xfs/reiser become the default
file systems for Linux, and people complain about PostgreSQL
performance. And if we require special file systems, we lose some of
our ability to easily grow. Because of ext2's problems with crash
recovery, who is going to want to put other data on that file system
when they have xfs/reiser available? And boots are going to have to
fsck that ext2 file system.

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Kaare Rasmussen <kar(at)webline(dot)dk>
Cc:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>, mlw <markw(at)mohawksoft(dot)com>, Hackers List <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-04 16:50:39
Message-ID:	200105041650.f44Godm24867@candle.pha.pa.us
Lists:	pgsql-hackers

> > Sure, we can do that now. What do we do when these are the default file
> > systems for Linux? We can tell them to create other types of file
>
> What is a 'default file system' ? I know that untill now, everybody is using
> ext2. But that's only because there hasn't been anything comparable. Now we
> se ReiserFS, and my SuSE installation offers the choice. In the future, I
> believe that people can choose from ext2, ReiserFS,xfs, ext3 and maybe more.

But some day the default will be a log-based file system, and people
will have to hunt around to create a non-log based one.

> > systems, but that is a pretty big hurdle. I wonder if it would be
> > easier to get reiser/xfs to make some modifications.
>
> No, I don't think it's a big hurdle. If you just want to play with
> PostgreSQL, you wont care. If you're serious, you'll repartition.

Yes, but we could get a reputation for slowness on these log-based file
systems.

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Trond Eivind Glomsrød <teg(at)redhat(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, sct(at)redhat(dot)com
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-04 17:49:54
Message-ID:	200105041749.f44HnsJ29002@candle.pha.pa.us
Lists:	pgsql-hackers

> I got some information from Stephen Tweedie on this - please keep him
> "Cc:" as he's not on this list
>
> ************************************************************************
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>
> > I was talking to a Linux user yesterday, and he said that performance
> > using the xfs file system is pretty bad. He believes it has to do with
> > the fact that fsync() on log-based file systems requires more writes.
>
>
> Performance doing what? XFS has known performance problems doing
> unlinks and truncates, but not synchronous IO. The user should be
> using fdatasync() for databases, btw, not fsync().

This is hugely helpful. In PostgreSQL 7.1, we do use fdatasync() by
default if it is available on a platform.

> First, XFS, ext3 and reiserfs are *NOT* log-based filesystems. They
> are journaling filesystems. They have a log, but they are not
> log-based because they do not store data permanently in a log
> structure. Berkeley LFS, Sprite and Spiralog are log-based
> filesystems.

Sorry, I get those mixed up.

> > With a standard BSD/ext2 file system, WAL writes can stay on the same
> > cylinder to perform fsync. Is that true of log-based file systems?
>
> Not true on ext2 or BSD. Write-aheads are _usually_ close to the
> inode, but not always. For true log-based filesystems, writes are
> always completely sequential, so the issue just goes away. For
> journaling filesystems, depending on the setup there may be a seek to
> the journal involved, but some journaling filesystems can use a
> separate disk for the journal so no seek is required.
>
> > I know xfs and reiser are both log based. Do we need to be concerned
> > about PostgreSQL performance on these file systems? I use BSD FFS with
> > soft updates here, so it doesn't affect me.
>
> A database normally preallocates its data files and then performs most
> of its writes using update-in-place. In such cases, fsync() is almost
> always the wrong thing to be doing --- the data writes have changed
> nothing in the inode except for the timestamps, and there's no need to
> flush the timestamps to disk for every write. fdatasync() is
> designed for this --- if the only inode change is timestamps,
> fdatasync() will skip the seek to the inode and will only update the
> data. If any significant inode fields have been changed, then a full
> flush is done.

We do pre-allocate our log file space in chunks to avoid inode/block
index writes.

> Using fdatasync, most filesystems will incur no seeks for data flush,
> regardless of whether the filesystem is journaling or not.

Thanks. That is a big help. I wonder if people reporting performance
problems were using 7.0.3. We only added fdatasync() in 7.1.

From:	mlw <markw(at)mohawksoft(dot)com>
To:	Michael Samuel <michael(at)miknet(dot)net>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-04 17:54:26
Message-ID:	3AF2ECD2.7F7EA430@mohawksoft.com
Lists:	pgsql-hackers

Michael Samuel wrote:

>
> > Remember, general purpose file systems must do for files what Postgres is
> > already doing for records. You will always have extra work. I am seriously
> > thinking of trying a FAT32 as pg_xlog. I wonder if it will improve performance,
> > or if there is just something fundamentally stupid about FAT32 that will make
> > it worse?
>
> Well, for a starters, file permissions...
>
> Ext2 would kick arse over FAT32 for performance.

OK, I'll bite.

In a database environment where file creation is not such an issue, why would ext2
be faster?

The FAT file system has, AFAIK, very little overhead for file writes. It simply
writes the two FAT tables on file extension, and data. Depending on cluster size,
there is probably even less happening there.

I don't think that anyone is saying that FAT is the answer in a production
environment, but maybe we can do a comparison of various file systems and see if any
performance issues show up.

I mentioned FAT only because I was thinking about how postgres would perform on a
very simple file system, one which bypasses most of the normal stuff a "good"
general purpose file system would do. While I was thinking this, it occurred to me
that FAT was about the cheesiest simple file system one could find, short of a ram
disk, and maybe we could use it to test the assumptions about performance impact of
the file system on postgres.

Just a thought. If you know of some reason why ext2 would perform better in the
postgres environment, I would love to hear why, I'm very curious.

From:	"Stephen C(dot) Tweedie" <sct(at)redhat(dot)com>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Trond Eivind Glomsrød <teg(at)redhat(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, sct(at)redhat(dot)com
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-04 18:03:05
Message-ID:	20010504190305.O4077@redhat.com
Lists:	pgsql-hackers

Hi,

On Fri, May 04, 2001 at 01:49:54PM -0400, Bruce Momjian wrote:
> >
> > Performance doing what? XFS has known performance problems doing
> > unlinks and truncates, but not synchronous IO. The user should be
> > using fdatasync() for databases, btw, not fsync().
>
> This is hugely helpful. In PostgreSQL 7.1, we do use fdatasync() by
> default it is available on a platform.

Good --- fdatasync is defined in SingleUnix, so it's probably safe to
probe for it and use it by default if it is there.

The 2.2 Linux kernel does not have fdatasync implemented, but glibc
will fall back to fsync if that's all that the kernel supports. 2.4
implements both with the required semantics.

--Stephen

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	"Stephen C(dot) Tweedie" <sct(at)redhat(dot)com>
Cc:	"Trond Eivind Glomsrød" <teg(at)redhat(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-04 18:33:24
Message-ID:	200105041833.f44IXOT02371@candle.pha.pa.us
Lists:	pgsql-hackers

> Hi,
>
> On Fri, May 04, 2001 at 01:49:54PM -0400, Bruce Momjian wrote:
> > >
> > > Performance doing what? XFS has known performance problems doing
> > > unlinks and truncates, but not synchronous IO. The user should be
> > > using fdatasync() for databases, btw, not fsync().
> >
> > This is hugely helpful. In PostgreSQL 7.1, we do use fdatasync() by
> > default it is available on a platform.
>
> Good --- fdatasync is defined in SingleUnix, so it's probably safe to
> probe for it and use it by default if it is there.
>
> The 2.2 Linux kernel does not have fdatasync implemented, but glibc
> will fall back to fsync if that's all that the kernel supports. 2.4
> implements both with the required semantics.

OK, that is something we found too, that fdatasync() was there on some
platforms, but was really just an fsync(). I believe some HPUX
platforms had that.

OK, so they need a 2.4 kernel to properly test performance of Reiser/xfs
with fdatasync().
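
A rough way to check whether fdatasync() actually behaves differently from
fsync() on a given kernel and filesystem is to time both on the same
workload. The sketch below is an illustration only, not the pgbench runs
mentioned in this thread; the function names, block size, and round count
are assumptions, and the directory argument should point at the filesystem
under test.

```python
import os
import tempfile
import time


def time_sync_calls(sync_fn, directory: str, rounds: int = 20,
                    block: int = 8192) -> float:
    """Return average seconds per overwrite-and-sync of one block."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"\0" * block)  # preallocate so writes never extend the file
        os.fsync(fd)
        start = time.perf_counter()
        for i in range(rounds):
            os.pwrite(fd, bytes([i % 256]) * block, 0)  # update in place
            sync_fn(fd)
        return (time.perf_counter() - start) / rounds
    finally:
        os.close(fd)
        os.unlink(path)


def compare_sync_methods(directory: str, rounds: int = 20) -> dict:
    """Time fsync(), and fdatasync() where the platform provides it."""
    results = {"fsync": time_sync_calls(os.fsync, directory, rounds)}
    if hasattr(os, "fdatasync"):
        results["fdatasync"] = time_sync_calls(os.fdatasync, directory, rounds)
    return results
```

If the two averages come out nearly identical, that is consistent with a
platform where fdatasync() is really just an fsync(), though caching in the
drive or controller can hide the difference as well.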

From:	Lincoln Yeoh <lyeoh(at)pop(dot)jaring(dot)my>
To:	Thomas Swan <tswan(at)ics(dot)olemiss(dot)edu>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-05 17:07:51
Message-ID:	3.0.5.32.20010506010751.011ce210@192.228.128.13
Lists:	pgsql-hackers

At 02:09 AM 5/4/01 -0500, Thomas Swan wrote:
> I think it's worth noting that Oracle has been petitioning the
> kernel developers for better raw device support: in other words,
> the ability to write directly to the hard disk and bypassing the
> filesystem all together.

But there could be other reasons why Oracle would want to do raw stuff.

1) They have more things to sell - management modules/software. More
training courses. Certified blahblahblah. More features in brochure.
2) It just helps make things more proprietary. Think lock in.

All that for maybe 10% performance increase?

I think it's more advantageous for Postgresql to keep the filesystem layer
of abstraction, than to do away with it, and later reinvent certain parts
of it along with new bugs.

What would be useful is if one can specify where the tables, indexes, WAL
and other files go. That feature would probably help improve performance
far more.

For example: you could then stick the WAL on a battery backed up RAM disk.
How much total space does a WAL log need?

A battery backed RAM disk might even be cheaper than Brand X RDBMS
Proprietary Feature #5.

Cheerio,
Link.

From:	mlw <markw(at)mohawksoft(dot)com>
To:	Lincoln Yeoh <lyeoh(at)pop(dot)jaring(dot)my>
Cc:	Thomas Swan <tswan(at)ics(dot)olemiss(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-05 17:16:43
Message-ID:	3AF4357B.ACB054CE@mohawksoft.com
Lists:	pgsql-hackers

Lincoln Yeoh wrote:
>
> At 02:09 AM 5/4/01 -0500, Thomas Swan wrote:
> > I think it's worth noting that Oracle has been petitioning the
> > kernel developers for better raw device support: in other words,
> > the ability to write directly to the hard disk and bypassing the
> > filesystem all together.
>
> But there could be other reasons why Oracle would want to do raw stuff.
>
> 1) They have more things to sell - management modules/software. More
> training courses. Certified blahblahblah. More features in brochure.
> 2) It just helps make things more proprietary. Think lock in.
>
> All that for maybe 10% performance increase?
>
> I think it's more advantageous for Postgresql to keep the filesystem layer
> of abstraction, than to do away with it, and later reinvent certain parts
> of it along with new bugs.

I just did a test of putting pg_xlog on a FAT file system, and my first rough
tests (pgbench) show an approximate 20% performance increase over ext2 with
fsync enabled.

--
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com

From:	thomas graichen <list-pgsql(dot)hackers(at)spoiled(dot)org>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-05 19:41:25
Message-ID:	news2mail-20010505194125.617DA38E.NOFFLE@gray.example.com
Lists:	pgsql-hackers

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> wrote:
>> > Yes, this double-writing is a problem. Suppose you have your WAL on a
>> > separate drive. You can fsync() WAL with zero head movement. With a
>> > log based file system, you need two head movements, so you have gone
>> > from zero movements to two.
>>
>> It may be worse depending on how the filesystem actually does
>> journalling. I wonder if an fsync() may cause ALL pending
>> meta-data to be updated (even metadata not related to the
>> postgresql files).
>>
>> Do you know if reiser or xfs have this problem?

> I don't know, but the Linux user reported xfs was really slow.

i think this should be tested in more detail: i once tried this
lightly (running pgbench against postgresql 7.1beta4) with
different filesystems: ext2, reiserfs and XFS and reproducibly
i got about 15% better results running on XFS ... ok - it's
not a very big test, but i think it might be worth really
doing an a/b test before seeing it as a fact that postgresql is
slow on XFS (and maybe reiserfs too ... but reiserfs has had
performance problems in certain situations anyway)

XFS is a journaling fs, but it does all its work in a very
clever way (delayed allocation etc.) - so usually you should
under normal conditions get decent performance out of it -
otherwise it might be worth sending a mail to the XFS
mailing list (reiserfs maybe ditto)

--
thomas graichen <tgr(at)spoiled(dot)org> ... perfection is reached, not
when there is no longer anything to add, but when there is no
longer anything to take away. --- antoine de saint-exupery

From:	Lincoln Yeoh <lyeoh(at)pop(dot)jaring(dot)my>
To:	mlw <markw(at)mohawksoft(dot)com>
Cc:	Thomas Swan <tswan(at)ics(dot)olemiss(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-06 11:24:43
Message-ID:	3.0.5.32.20010506192443.011512e0@192.228.128.13
Lists:	pgsql-hackers

At 01:16 PM 5/5/01 -0400, mlw wrote:
>Lincoln Yeoh wrote:
>>
>> All that for maybe 10% performance increase?
>>
>> I think it's more advantageous for Postgresql to keep the filesystem layer
>> of abstraction, than to do away with it, and later reinvent certain parts
>> of it along with new bugs.
>
>I just did a test of putting pg_xlog on a FAT file system, and my first rough
>tests (pgbench) show an approximate 20% performance increase over ext2 with
>fsync enabled.

OK. I slouch corrected :). It's more than 10%.

However in the same message I did also say:
>What would be useful is if one can specify where the tables, indexes, WAL
>and other files go. That feature would probably help improve performance
>far more.
>
>For example: you could then stick the WAL on a battery backed up RAM disk.
>How much total space does a WAL log need?
>
>A battery backed RAM disk might even be cheaper than Brand X RDBMS
>Proprietary Feature #5.

And your experiments do help show that it is useful to be able to specify
where things go, that putting just the WAL somewhere else makes things 20%
faster. So you don't have to put everything on a pgfs. Just the WAL on some
other FS (even FAT32, ick ;) ).

---
OK we can do that with symlinks, but is there a PGSQL Recommended or
Standard way to do it, so as to reduce administrative errors, and at least
help improve consistency with multiadmin pgsql installations?
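
The symlink approach mentioned above can be sketched as a small helper that
moves one subdirectory (for example pg_xlog) to another volume and leaves a
symlink in its place. This is an illustration, not a PostgreSQL-recommended
procedure: the function name and paths are hypothetical, and it assumes the
server is stopped while the directory is moved.

```python
import os
import shutil


def relocate_with_symlink(data_dir: str, subdir: str, new_parent: str) -> str:
    """Move data_dir/subdir under new_parent and leave a symlink behind."""
    old_path = os.path.join(data_dir, subdir)
    new_path = os.path.join(os.path.abspath(new_parent), subdir)
    if os.path.islink(old_path):
        raise RuntimeError(f"{old_path} is already a symlink")
    if os.path.exists(new_path):
        raise RuntimeError(f"{new_path} already exists")
    shutil.move(old_path, new_path)  # copies across filesystems if needed
    os.symlink(new_path, old_path)   # the old location now points at the new one
    return new_path
```

Refusing to run when the source is already a symlink or the target exists
is one way to cut down on the administrative errors the question is
worried about.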

The WAL and DBs are in separate directories, so this makes things easy. But
the object names are now all numbers so that makes things a bit harder -
and what to do with temp tables?

Would it be good to have tables in one directory and indexes in another? Or
do most people optimize on a specific table/index basis? Where does PGSQL do
the on-disk sorts?

How about naming the DB objects <object ID>.<object name>?
e.g.

121575.testtable
125575.testtableindex

(or the other way round - name.OID - harder for DB, easier for admin?)

They'll still be unique, but now they're admin readable. Slower? e.g. at
that code point, pgsql no longer knows the object's name, and wants to
refer to everything by just numbers?

I apologize if there was already a long discussion on this. I seem to
recall Bruce saying that the developers agonized over this.

Cheerio,
Link.

From:	Hannu Krosing <hannu(at)tm(dot)ee>
To:	Lincoln Yeoh <lyeoh(at)pop(dot)jaring(dot)my>
Cc:	mlw <markw(at)mohawksoft(dot)com>, Thomas Swan <tswan(at)ics(dot)olemiss(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-06 12:04:28
Message-ID:	3AF53DCC.A1E3646@tm.ee
Lists:	pgsql-hackers

Lincoln Yeoh wrote:
>
> At 01:16 PM 5/5/01 -0400, mlw wrote:
> >Lincoln Yeoh wrote:
> >>
> >> All that for maybe 10% performance increase?
> >>
> >> I think it's more advantageous for Postgresql to keep the filesystem layer
> >> of abstraction, than to do away with it, and later reinvent certain parts
> >> of it along with new bugs.
> >
> >I just did a test of putting pg_xlog on a FAT file system, and my first rough
> >tests (pgbench) show an approximate 20% performance increase over ext2 with
> >fsync enabled.
>
> OK. I slouch corrected :). It's more than 10%.
>
> However in the same message I did also say:
> >What would be useful is if one can specify where the tables, indexes, WAL
> >and other files go. That feature would probably help improve performance
> >far more.
> >
> >For example: you could then stick the WAL on a battery backed up RAM disk.
> >How much total space does a WAL log need?
> >
> >A battery backed RAM disk might even be cheaper than Brand X RDBMS
> >Proprietary Feature #5.
>
> And your experiments do help show that it is useful to be able to specify
> where things go, that putting just the WAL somewhere else makes things 20%
> faster. So you don't have to put everything on a pgfs. Just the WAL on some
> other FS (even FAT32, ick ;) ).

So you propose pgwalfs ? ;)

It may be much easier to implement than a full fs.

How hard would it be to let the WAL reside on a (raw) device?

If we already pre-allocate a required number of fixed-size files, would
it be too hard to replace them with plain (raw) devices and test for
possible performance gains?
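
A rough way to run such a test is sketched below in Python. It
pre-allocates one fixed-size segment, then overwrites it in place with
an fsync() after every block and reports the elapsed time. The 8K block
size, the segment size and the function names are assumptions made for
illustration; this is not PostgreSQL's WAL code. Pointing the timing
function at a raw device node instead of a regular file is likewise an
untested assumption.

```python
import os
import tempfile
import time

BLOCK = 8192  # assumed block size, for illustration only


def preallocate(path, size):
    """Create a zero-filled file of exactly `size` bytes and sync it."""
    zeros = bytes(BLOCK)
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(BLOCK, remaining)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())


def timed_sync_writes(path, count):
    """Overwrite `count` blocks in place, fsync after each, return seconds."""
    fd = os.open(path, os.O_WRONLY)  # no O_TRUNC: the file keeps its size
    try:
        payload = b"x" * BLOCK
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            os.fsync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def demo():
    """Run the test in a temporary directory; return (file size, seconds)."""
    with tempfile.TemporaryDirectory() as d:
        seg = os.path.join(d, "segment")
        preallocate(seg, 16 * BLOCK)
        elapsed = timed_sync_writes(seg, 16)
        return os.path.getsize(seg), elapsed
```

Running the same two functions against files on different file systems
gives a crude comparison of per-fsync cost; it says nothing about a full
pgbench workload.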

>
> How about naming the DB objects <object ID>.<object name>?
> e.g
>
> 121575.testtable
> 125575.testtableindex
>

This sure seems to be an elegant solution for the problem that seems to
be impossible to solve with symlinks and such. Even the IMHO hardest to
solve problem - RENAME - can probably be done in a transaction-safe
manner by doing a link(oid.<newname>) in the beginning and selective
unlink(oid.<newname/oldname>) at commit time.
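
A minimal Python sketch of that idea, assuming files are named
<oid>.<relname> in a single directory. The function names are
hypothetical and nothing here is taken from the PostgreSQL sources; it
only illustrates the link-first, unlink-at-commit ordering.

```python
import os
import tempfile


def begin_rename(directory, oid, old_name, new_name):
    """Add the new name as a second hard link; the old name stays valid."""
    old_path = os.path.join(directory, f"{oid}.{old_name}")
    new_path = os.path.join(directory, f"{oid}.{new_name}")
    os.link(old_path, new_path)  # raises FileExistsError if new_path exists
    return old_path, new_path


def finish_rename(old_path, new_path, committed):
    """At commit drop the old name; at abort drop the new one."""
    os.unlink(old_path if committed else new_path)


def demo():
    """Rename 121575.testtable to 121575.renamed and commit."""
    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, "121575.testtable"), "w").close()
        old, new = begin_rename(d, 121575, "testtable", "renamed")
        finish_rename(old, new, committed=True)
        return sorted(os.listdir(d))
```

The sketch handles one rename per transaction only; several renames of
the same table inside one transaction are not covered.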

--------------------
Hannu

From:	mlw <markw(at)mohawksoft(dot)com>
To:	Hannu Krosing <hannu(at)tm(dot)ee>
Cc:	Lincoln Yeoh <lyeoh(at)pop(dot)jaring(dot)my>, Thomas Swan <tswan(at)ics(dot)olemiss(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject:	Re: New Linux xfs/reiser file systems
Date:	2001-05-06 12:53:56
Message-ID:	3AF54963.E78D1F1B@mohawksoft.com
Lists:	pgsql-hackers

Hannu Krosing wrote:
>
> Lincoln Yeoh wrote:
> >
> > At 01:16 PM 5/5/01 -0400, mlw wrote:
> > >Lincoln Yeoh wrote:
> > >>
> > >> All that for maybe 10% performance increase?
> > >>
> > >> I think it's more advantageous for Postgresql to keep the filesystem layer
> > >> of abstraction, than to do away with it, and later reinvent certain parts
> > >> of it along with new bugs.
> > >
> > >I just did a test of putting pg_xlog on a FAT file system, and my first rough
> > >tests (pgbench) show an approximate 20% performance increase over ext2 with
> > >fsync enabled.
> >
> > OK. I slouch corrected :). It's more than 10%.
> >
> > However in the same message I did also say:
> > >What would be useful is if one can specify where the tables, indexes, WAL
> > >and other files go. That feature would probably help improve performance
> > >far more.
> > >
> > >For example: you could then stick the WAL on a battery backed up RAM disk.
> > >How much total space does a WAL log need?
> > >
> > >A battery backed RAM disk might even be cheaper than Brand X RDBMS
> > >Proprietary Feature #5.
> >
> > And your experiments do help show that it is useful to be able to specify
> > where things go, that putting just the WAL somewhere else makes things 20%
> > faster. So you don't have to put everything on a pgfs. Just the WAL on some
> > other FS (even FAT32, ick ;) ).
>
> So you propose pgwalfs ? ;)

I don't know about a "pgwalfs"; too much work. I have had some time to grapple
with my feelings about FAT, and you know what? I don't hate the idea. I would,
of course, like to look through the driver code and see if there are any
technical reasons why it should be excluded.

FAT is almost perfect for WAL, and if I can figure out how to get the "base"
directory to get the same performance, I'd think about putting it there as
well.

The ReiserFS issues touched on some vague suspicions I had about fsync. Maybe
I'm overreacting, but there are reasons why the Oracles manage their own table
spaces.

Back to FAT. FAT is probably the simplest file system I can think of. As
long as it writes to disk when it gets synced, and doesn't lose things, it's
perfect. Postgres handles most of the coherency issues itself, there is no real
problem with permissions because it will be owned by the postgres super user,
etc. I would never suggest FAT as a general purpose file system, but, geez, as
a special purpose, single user (postgres) file system it seems an ideal answer
to what will be an increasingly hard problem of advanced file systems.

Aside from a general, and well deserved, disdain for FAT, what are the
technical "cons" of such a proposal? If we can get the Linux kernel (and other
Unices) to accept ioctls to direct space allocation, and/or write up a white
paper on how to use this for postgres, why wouldn't it be a reasonable
strategy?

--
I'm not offering myself as an example; every life evolves by its own laws.
------------------------
http://www.mohawksoft.com

From:	Lincoln Yeoh <lyeoh(at)pop(dot)jaring(dot)my>
To:	Hannu Krosing <hannu(at)tm(dot)ee>
Cc:	mlw <markw(at)mohawksoft(dot)com>, Thomas Swan <tswan(at)ics(dot)olemiss(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-06 16:02:38
Message-ID:	3.0.5.32.20010507000238.009f8230@192.228.128.13
Lists:	pgsql-hackers

>Lincoln Yeoh wrote:
>>
>> >Lincoln Yeoh wrote:
>> >For example: you could then stick the WAL on a battery backed up RAM disk.
>> >How much total space does a WAL log need?
>> >
>> >A battery backed RAM disk might even be cheaper than Brand X RDBMS
>> >Proprietary Feature #5.
>>
>> And your experiments do help show that it is useful to be able to specify
>> where things go, that putting just the WAL somewhere else makes things 20%
>> faster. So you don't have to put everything on a pgfs. Just the WAL on some
>> other FS (even FAT32, ick ;) ).

At 02:04 PM 5/6/01 +0200, Hannu Krosing wrote:
>So you propose pgwalfs ? ;)

Nah. I'm proposing the opposite in fact.

I'm saying so far there appears to be no real need to come up with a
special filesystem. Stick to using existing/future filesystems. Just make
it easy and safe enough for DBAs to put the objects on whatever filesystem
they choose. So long as the O/S kernel/driver people support the hardware
or filesystem, postgresql will take advantage of it with little if any
extra work.

In fact as mlw's experiments show, you can put the WAL on FAT (FAT16?) for
a 20% performance increase. How much better would a raw device be? Would it
really be worth all that hassle? For instance if you need to resize the FAT
partition, you could probably use fips, Partition Magic or some other cost
effective solution - no need for pgsql developers or anybody to reinvent
anything.

My proposed but untested idea is that you could get a significant
performance increase by putting the WAL on popular filesystems running on
battery backed RAM drives (or other special hardware). 128MB RAM should be
enough for small setups?

Don't know how much these things cost, but I believe that when you need the
speed, they'll be more worthwhile than a special proprietary filesystem.

Ok, just found:
http://www.expressdata.com.au/Products/ProductsList.asp?SUPPLIER_NAME=PLATYPUS+TECHNOLOGY&SUBCATEGORY_NAME=QikDrive2#PRODUCTTITLE

AUD$1,624.70 = USD843.06. Not cheap but not way out of reach. Haven't found
other competing products yet. Must be somewhere.

Cheerio,
Link.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Hannu Krosing <hannu(at)tm(dot)ee>
Cc:	Lincoln Yeoh <lyeoh(at)pop(dot)jaring(dot)my>, mlw <markw(at)mohawksoft(dot)com>, Thomas Swan <tswan(at)ics(dot)olemiss(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-06 16:03:41
Message-ID:	28452.989165021@sss.pgh.pa.us
Lists:	pgsql-hackers

Hannu Krosing <hannu(at)tm(dot)ee> writes:
> Even the IMHO hardest to solve problem - RENAME - can probably be done
> in a transaction-safe manner by doing a link(oid.<newname>) in the
> beginning and selective unlink(oid.<newname/oldname>) at commit time.

Nope. Consider

begin;
rename a to b;
rename b to a;
end;

And don't tell me you'll solve this by ignoring failures from link().
That's a recipe for losing your data...
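
To make the failure concrete, here is a small Python sketch of the
proposed scheme, assuming files named <oid>.<relname> and every unlink()
deferred to commit. The helper names are hypothetical. The first rename
links 1.b; the second tries to link 1.a, which still exists because
nothing has been unlinked yet, so link() fails with EEXIST.

```python
import os
import tempfile


def rename_in_txn(directory, oid, renames):
    """Apply each rename with link() only; unlink() is deferred to commit."""
    for old_name, new_name in renames:
        src = os.path.join(directory, f"{oid}.{old_name}")
        dst = os.path.join(directory, f"{oid}.{new_name}")
        os.link(src, dst)  # EEXIST surfaces as FileExistsError


def demo():
    """Replay: begin; rename a to b; rename b to a; end;"""
    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, "1.a"), "w").close()
        try:
            rename_in_txn(d, 1, [("a", "b"), ("b", "a")])
        except FileExistsError:
            return "second link() failed: 1.a already exists"
        return "ok"
```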

I would ask people who think they have a solution to please go back and
reread the very long discussions we have had on this point in the past.
Nobody particularly likes numeric filenames, but there really isn't any
other workable answer.

regards, tom lane

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Lincoln Yeoh <lyeoh(at)pop(dot)jaring(dot)my>
Cc:	mlw <markw(at)mohawksoft(dot)com>, Thomas Swan <tswan(at)ics(dot)olemiss(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-06 16:05:25
Message-ID:	28481.989165125@sss.pgh.pa.us
Lists:	pgsql-hackers

Lincoln Yeoh <lyeoh(at)pop(dot)jaring(dot)my> writes:
> OK we can do that with symlinks, but is there a PGSQL Recommended or
> Standard way to do it, so as to reduce administrative errors, and at least
> help improve consistency with multiadmin pgsql installations?

Not yet. There should be support for this. See
doc/TODO.detail/tablespaces.

regards, tom lane

From:	Lincoln Yeoh <lyeoh(at)pop(dot)jaring(dot)my>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Hannu Krosing <hannu(at)tm(dot)ee>
Cc:	mlw <markw(at)mohawksoft(dot)com>, Thomas Swan <tswan(at)ics(dot)olemiss(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject:	Re: Re: New Linux xfs/reiser file systems
Date:	2001-05-06 17:56:18
Message-ID:	3.0.5.32.20010507015618.0081ee20@192.228.128.13
Lists:	pgsql-hackers

At 12:03 PM 5/6/01 -0400, Tom Lane wrote:
>Hannu Krosing <hannu(at)tm(dot)ee> writes:
>> Even the IMHO hardest to solve problem - RENAME - can probably be done
>> in a transaction-safe manner by doing a link(oid.<newname>) in the
>> beginning and selective unlink(oid.<newname/oldname>) at commit time.
>
>Nope. Consider
>
> begin;
> rename a to b;
> rename b to a;
> end;
>
>And don't tell me you'll solve this by ignoring failures from link().
>That's a recipe for losing your data...
>
>I would ask people who think they have a solution to please go back and
>reread the very long discussions we have had on this point in the past.
>Nobody particularly likes numeric filenames, but there really isn't any
>other workable answer.

OK. Found one of the discussions at:
http://postgresql.readysetnet.com/mhonarc/pgsql-hackers/2000-03/threads.html#00088

Conclusion: calling stuff oid.relname doesn't really work. Sorry to have
brought it up again.

Another idea that's probably more messy than it's worth:

Main object still called <oid> with a symlink called <oid.originalrelname>.
DB really just uses <oid>.

Rename adds a symlink called <oid.newrelname> and doesn't remove the old
symlinks (the symlinks are more for show!).

Committed drop table does what 7.1 does with the main oid entry.

Vacuum cleans up the symlinks leaving just a single valid one or zaps all
if the table has been dropped.

For Windows, create empty files named oid.relname instead of symlinks.
Windows will definitely like .verylongrelname extensions ;).
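
A minimal Python sketch of the symlink variant, assuming a POSIX system
and a single directory holding <oid> plus its <oid>.<relname> symlinks.
The function names and the way the current name is passed in are
assumptions for illustration; the empty-file variant for Windows is not
covered.

```python
import os
import tempfile


def add_name(directory, oid, relname):
    """Create <oid>.<relname> as a symlink to <oid>; old links are kept."""
    link = os.path.join(directory, f"{oid}.{relname}")
    if not os.path.lexists(link):
        os.symlink(str(oid), link)  # relative target inside the directory


def vacuum_names(directory, oid, current_relname):
    """Keep only the symlink for the current name; drop all if <oid> is gone."""
    prefix = f"{oid}."
    table_exists = os.path.exists(os.path.join(directory, str(oid)))
    for entry in os.listdir(directory):
        path = os.path.join(directory, entry)
        if not (entry.startswith(prefix) and os.path.islink(path)):
            continue
        if not table_exists or entry != prefix + current_relname:
            os.unlink(path)


def demo():
    """Create, rename, vacuum, then drop the table and vacuum again."""
    with tempfile.TemporaryDirectory() as d:
        open(os.path.join(d, "121575"), "w").close()
        add_name(d, 121575, "testtable")
        add_name(d, 121575, "renamed")
        vacuum_names(d, 121575, "renamed")
        after_rename = sorted(os.listdir(d))
        os.unlink(os.path.join(d, "121575"))  # stands in for a committed drop
        vacuum_names(d, 121575, "renamed")
        return after_rename, sorted(os.listdir(d))
```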

Kinda messy and kludgy. Throw in the performance reduction and Ick!

I probably have to think harder :), maybe there's just no good way :(.

Ah well,
Link.

From:	Hannu Krosing <hannu(at)tm(dot)ee>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Lincoln Yeoh <lyeoh(at)pop(dot)jaring(dot)my>, mlw <markw(at)mohawksoft(dot)com>, Thomas Swan <tswan(at)ics(dot)olemiss(dot)edu>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Subject:	Re: TABLE RENAME/NUMERIC FILENAMES (Was: New Linux xfs/reiser file systems)
Date:	2001-05-07 08:12:32
Message-ID:	3AF658F0.7077D447@tm.ee
Lists:	pgsql-hackers

Tom Lane wrote:
>
> Hannu Krosing <hannu(at)tm(dot)ee> writes:
> > Even the IMHO hardest to solve problem - RENAME - can probably be done
> > in a transaction-safe manner by doing a link(oid.<newname>) in the
> > beginning and selective unlink(oid.<newname/oldname>) at commit time.
>
> Nope. Consider
>
> begin;
> rename a to b;
> rename b to a;
> end;
>
> And don't tell me you'll solve this by ignoring failures from link().
> That's a recipe for losing your data...

I guess link() failures can be safely ignored _as long as_ we check that
we have the right link after doing it. I can't see how it will lose
data.
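
A Python sketch of what "check that we have the right link" could mean:
tolerate EEXIST from link() only when the existing name already refers
to the same file (same device and inode), and fail otherwise. The
function name is hypothetical, and the sketch does not settle whether
the commit-time unlink logic built on top of it is safe.

```python
import os
import tempfile


def link_or_verify(src, dst):
    """link(src, dst); accept EEXIST only if dst is already the same file."""
    try:
        os.link(src, dst)
    except FileExistsError:
        if not os.path.samefile(src, dst):
            raise  # some other file owns the name: do not ignore this


def demo():
    """Return True if a foreign file squatting on the name is rejected."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "1.a")
        b = os.path.join(d, "1.b")
        c = os.path.join(d, "1.c")
        open(a, "w").close()
        link_or_verify(a, b)   # rename a to b
        link_or_verify(b, a)   # rename b to a: EEXIST, same inode, tolerated
        open(c, "w").close()   # unrelated file with a clashing name
        try:
            link_or_verify(a, c)
        except FileExistsError:
            return True
        return False
```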

> I would ask people who think they have a solution to please go back and
> reread the very long discussions we have had on this point in the past.

I think I have now (no way to guarantee I have read _everything_ about
it, but I did hit about 10 messages on the oid_relname naming scheme).

The most serious objection seemed to be that we need to remember the
postgres tablename while it would be much easier to use only oids.

I guess we could hit some system limits here (running out of directory
entries or reaching the maximum number of links to a file), but at least
on Linux I was able to make >10000 links to one file with no problems.

Now that I think of it, I have one concern: it would require extra work
to use tablenames like "/etc/passwd" or others that use characters that
are reserved in filenames, which are OK to use in 7.1.

hannu=# create table "/etc/passwd"(
hannu(#     login text,
hannu(#     uid int,
hannu(#     gid int
hannu(# );
CREATE
hannu=# \dt
       List of relations
    Name     | Type  | Owner
-------------+-------+-------
 /etc/passwd | table | hannu

So if people start using names like these, it will not be easy to go
back ;)
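
One way the "extra work" could look, sketched in Python: percent-encode
the relation name before using it as part of a filename, so that "/"
and other reserved characters never reach the file system. The
<oid>.<escaped name> layout and the function names are assumptions for
illustration, and filename length limits are not handled.

```python
from urllib.parse import quote, unquote


def relname_to_filename(oid, relname):
    """Build <oid>.<escaped relname>; '/' and friends become %XX escapes."""
    return f"{oid}.{quote(relname, safe='')}"


def filename_to_relname(filename):
    """Recover (oid, relname); the oid is numeric, so split at the first dot."""
    oid, _, escaped = filename.partition(".")
    return int(oid), unquote(escaped)
```

With these helpers the table above would be stored as
121575.%2Fetc%2Fpasswd (the oid is made up), and the original name can
be recovered from the filename.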

> Nobody particularly likes numeric filenames, but there really isn't any
> other workable answer.

At least we could put links on system relations, so it would be
easier to find them.

I guess one is not supposed to rename/drop system tables?

---------------------
Hannu