Quick Links

Re: [HACKERS] Tablespaces

Lists:	pgsql-hackerspgsql-hackers-win32

From:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Tablespaces
Date:	2004-02-26 10:07:49
Message-ID:	Pine.LNX.4.58.0402261942440.16702@linuxworld.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Hi all,

I've been looking at implementing table spaces for 7.5. Some notes and
implementation details follow.

------

Type of table space:

There are many different table space implementations in relational
database management systems. In my implementation, a table space in
PostgreSQL will be the location of a directory on the file system in
which files backing database objects can be stored. Global tables and
non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA.
$PGDATA/base will be the default table space.

A given table space will be identified by a unique table space name. I
haven't decided if 'unique' should mean database-wide unique or
cross-database unique. It seems to me that we might run into problems
with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the
uniqueness of table spaces is limited to the database level.

A table space parameter will be added to DDL commands which create
physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
CREATE SCHEMA. The associated routines, as well as the corresponding DROP
commands will need to be updated. Adding the ability to ALTER <object>
TABLESPACE <name> seems a little painful. Would people use it? Comments?

When an object is created the system will resolve the table space the
object is stored in as follows: if the table space paramater is passed to
the DDL command, then the object is stored in that table space (given
validation of the table space, etc). If it is not passed, the object
inherits its "parent's" table space where the parent/child hierarchy is as
follows: database > schema > table > [index|sequence]. So, if you issued:

create table foo.bar (...);

We would first not that there is no TABLESPACE <name>, then cascade to
the table space for the schema 'foo' (and possibly cascade to the table
space for the database). A database which wasn't created with an explicit
table space will be created under the default table space. This ensures
backward compatibility.

Creating a table space:

A table space is a directory structure. The directory structure is as
follows:

[swm(at)dev /path/to/tblspc]$ ls
OID1/ OID2/

OID1 and OID2 are the OIDs of databases which have created a table space
against this file system location. In this respect, a table space
resembles $PGDATA/base. I thought it useful to keep this kind of
namespace mechanism in place so that administrators do not need to create
hierarchies of names on different partitions if they want multiple
databases to use the same partition.

The actual creation of the table space will be done with:

CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;

Before creating the table space we must:

1) Check if the directory exists. If it does, create a sub directory as
the OID of the current database.

2) Alternatively, if the directory doesn't exist, attempt to create it,
then the sub directory.

I wonder if a file, such as PG_TBLSPC, should be added to the table space
directory so that, in the case of an existing non-empty directory, we can
attempt to test if the directory is being used for something else and
error out. Seems like:

CREATE TABLESPACE tbl1 LOCATION '/var/'

which will result in something like '/var/123443' is a bad idea. Then
again, the user should know better. Comments?

If everything goes well, we add an entry to pg_tablespace with the table
space location and name (and and OID).

Tying it all together:

The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
field. This will be the OID of the table space the object resides in, or 0
(default table space). Since we can then resolve relid/relname, schema and
database to a tablespace, there aren't too many cases when extra logic
needs to be added to the IO framework. In fact, most of it is taken care
of because of the abstraction of relpath().

The creation of table spaces will need to be recorded in xlog in the same
way that files are in heap_create() with the corresponding delete logic
incase of ABORT.

Postmaster startup:

Ideally, the postmaster at startup should go into each tblspc/databaseoid
directory and check for a postmaster.pid file to see if some other
instance is touching the files we're interested in. This will require a
control file listing tblspc/databaseoid paths and it will need to plug
into WAL in case we die during CREATE TABLESPACE. Comments?

Creating a database

I think that createdb() is going to have to be reworked if pg_tablespace
isn't shared (ie, tablespaces are only database unique). The reason being
that if we create a database which has a table space, pg_tablespace in the
new database will have to be updated and that cannot be done atomically
with the `cp` based mechanism we currently use.

I think I'm going to have to get my hands dirty before I can tell the
extent to which createdb() will need reworking.

pg_dump

Obviously pg_dump will need to be able to dump table spaces. pg_dump
running against <7.5 will DDL commands without a table space parameter and
as such the database's physical layout, when loaded into 7.5, will be the
same as for <7.5.

---

Comments? Questions? Suggestions?

Thanks,

Gavin

From:	Shridhar Daithankar <shridhar(at)frodo(dot)hserus(dot)net>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 10:38:50
Message-ID:	200402261608.50626.shridhar@frodo.hserus.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Thursday 26 February 2004 15:37, Gavin Sherry wrote:
> Tying it all together:
>
> The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
> field. This will be the OID of the table space the object resides in, or 0
> (default table space). Since we can then resolve relid/relname, schema and
> database to a tablespace, there aren't too many cases when extra logic
> needs to be added to the IO framework. In fact, most of it is taken care
> of because of the abstraction of relpath().
>
> The creation of table spaces will need to be recorded in xlog in the same
> way that files are in heap_create() with the corresponding delete logic
> incase of ABORT.

Is tablespace some sort of copyrighted? Last I remembered, the discussion was
about location/storage

Just a thought..

Shridhar

From:	Richard Huxton <dev(at)archonet(dot)com>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 11:25:23
Message-ID:	200402261125.23730.dev@archonet.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Thursday 26 February 2004 10:07, Gavin Sherry wrote:
>
> CREATE TABLESPACE tbl1 LOCATION '/var/'
>
> which will result in something like '/var/123443' is a bad idea. Then
> again, the user should know better. Comments?

The LOCATION should have the same owner and permissions as $PGDATA - that
should catch mistyping.

Unless you're running as root, of course. In which case you clearly know
better than everyone else, so off you go!

> Comments? Questions? Suggestions?

Presumably I'm using this to deal with performance/space issues, so there
clearly needs to be something in the pg_stat_xxx system to show figures based
on tablespace - not sure what you'd measure though - disk I/O, number of
nodes?

Perhaps something in contrib/ too - tablespace_disk_usage.pl or some such.
--
Richard Huxton
Archonet Ltd

From:	Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 11:25:24
Message-ID:	403DD7A4.6020804@cybertec.at
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Is it possible to put WALs and CLOGs into different tablespaces? (maybe
different RAID systems). Some companies want that ...

> A given table space will be identified by a unique table space name. I
> haven't decided if 'unique' should mean database-wide unique or
> cross-database unique. It seems to me that we might run into problems
> with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the
> uniqueness of table spaces is limited to the database level.

I strongly vote for database cluster wide unique names because somebody
could have a tablespace "webusers" or something like that. To me this
makes far more sense.

> A table space parameter will be added to DDL commands which create
> physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
> CREATE SCHEMA. The associated routines, as well as the corresponding DROP
> commands will need to be updated. Adding the ability to ALTER <object>
> TABLESPACE <name> seems a little painful. Would people use it? Comments?

I think people won't need it in first place because this seems to be
really painful.
What really matters is that the number of tablespaces and file /
tablespace is unlimited. SAP DB has limited the number of devspaces to
32 (I think). This is real bull.... because if your database grows
unexpectedly you are in deep trouble (expert database design by SAP,
MySQL and 100000....0000 others).

> When an object is created the system will resolve the table space the
> object is stored in as follows: if the table space paramater is passed to
> the DDL command, then the object is stored in that table space (given
> validation of the table space, etc). If it is not passed, the object
> inherits its "parent's" table space where the parent/child hierarchy is as
> follows: database > schema > table > [index|sequence]. So, if you issued:
>
> create table foo.bar (...);
>
> We would first not that there is no TABLESPACE <name>, then cascade to
> the table space for the schema 'foo' (and possibly cascade to the table
> space for the database). A database which wasn't created with an explicit
> table space will be created under the default table space. This ensures
> backward compatibility.

Will users automatically be assigned to a certain table space? How is
this going to work?

> Creating a table space:
>
> A table space is a directory structure. The directory structure is as
> follows:
>
> [swm(at)dev /path/to/tblspc]$ ls
> OID1/ OID2/
>
> OID1 and OID2 are the OIDs of databases which have created a table space
> against this file system location. In this respect, a table space
> resembles $PGDATA/base. I thought it useful to keep this kind of
> namespace mechanism in place so that administrators do not need to create
> hierarchies of names on different partitions if they want multiple
> databases to use the same partition.
>
> The actual creation of the table space will be done with:
>
> CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;
>
> Before creating the table space we must:
>
> 1) Check if the directory exists. If it does, create a sub directory as
> the OID of the current database.
>
> 2) Alternatively, if the directory doesn't exist, attempt to create it,
> then the sub directory.
>
> I wonder if a file, such as PG_TBLSPC, should be added to the table space
> directory so that, in the case of an existing non-empty directory, we can
> attempt to test if the directory is being used for something else and
> error out. Seems like:
>
> CREATE TABLESPACE tbl1 LOCATION '/var/'
>
> which will result in something like '/var/123443' is a bad idea. Then
> again, the user should know better. Comments?
>
> If everything goes well, we add an entry to pg_tablespace with the table
> space location and name (and and OID).
>
>
> Tying it all together:
>
> The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
> field. This will be the OID of the table space the object resides in, or 0
> (default table space). Since we can then resolve relid/relname, schema and
> database to a tablespace, there aren't too many cases when extra logic
> needs to be added to the IO framework. In fact, most of it is taken care
> of because of the abstraction of relpath().
>
> The creation of table spaces will need to be recorded in xlog in the same
> way that files are in heap_create() with the corresponding delete logic
> incase of ABORT.
>
>
> Postmaster startup:
>
> Ideally, the postmaster at startup should go into each tblspc/databaseoid
> directory and check for a postmaster.pid file to see if some other
> instance is touching the files we're interested in. This will require a
> control file listing tblspc/databaseoid paths and it will need to plug
> into WAL in case we die during CREATE TABLESPACE. Comments?
>
>
> Creating a database
>
> I think that createdb() is going to have to be reworked if pg_tablespace
> isn't shared (ie, tablespaces are only database unique). The reason being
> that if we create a database which has a table space, pg_tablespace in the
> new database will have to be updated and that cannot be done atomically
> with the `cp` based mechanism we currently use.
>
> I think I'm going to have to get my hands dirty before I can tell the
> extent to which createdb() will need reworking.
>
>
> pg_dump
>
> Obviously pg_dump will need to be able to dump table spaces. pg_dump
> running against <7.5 will DDL commands without a table space parameter and
> as such the database's physical layout, when loaded into 7.5, will be the
> same as for <7.5.
>
> ---
>
> Comments? Questions? Suggestions?
>
> Thanks,
>
> Gavin

Do you plan support for limiting the size of a tablespace? ISPs will
vote for that because they can limit the size of a database on the
database level rather than on the operating system level. Of course
this can and (should???) be done on the operation system level but
people will definitely ask for that.
If sizing is not supported we should definitely provide minor
documentation which tells people how to do that on the operating system
level (at least poting to some useful information).

Best regards,

Hans

--
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/2952/30706 or +43/664/233 90 75
www.cybertec.at, www.postgresql.at, kernel.cybertec.at

From:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To:	Hans-Jrgen_Schnig <postgres(at)cybertec(dot)at>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 11:58:25
Message-ID:	Pine.LNX.4.58.0402262255140.17527@linuxworld.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

> Is it possible to put WALs and CLOGs into different tablespaces? (maybe
> different RAID systems). Some companies want that ...

I wasn't going to look at that just yet.

There is of course the temporary hack of symlinking WAL else where.

I'd be interested to see the performance difference between WAL and data
on the same RAID/controller and WAL and data on different RAID/controller
with Jan's improvements to the buffer management.

Gavin

From:	Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 12:12:41
Message-ID:	Pine.LNX.4.44.0402261306110.10106-100000@zigo.dhs.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Thu, 26 Feb 2004, Gavin Sherry wrote:

> Comments? Questions? Suggestions?

Is that plan that in the future one can split a single table into
different table spaces? Like storing all rows with year < 1999 in one
tablespace and the rest in another?

With the rule system and two underlying tables one could make it work by
hand I think.

I've never used tablespaces in oracle so I don't know what it can offer. I
though it could do things like the above. True? What is the syntax and
for example, how does it effect indexes (not at all maybe).

If you don't want to discuss this now, I understand. It's not part of the
design as it is now. I'm just curious at what direction we are moving and
what is possible to do.

--
/Dennis Björklund

From:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To:	Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 12:22:28
Message-ID:	Pine.LNX.4.58.0402262314590.17602@linuxworld.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Thu, 26 Feb 2004, Dennis Bjorklund wrote:

> On Thu, 26 Feb 2004, Gavin Sherry wrote:
>
> > Comments? Questions? Suggestions?
>
> Is that plan that in the future one can split a single table into
> different table spaces? Like storing all rows with year < 1999 in one
> tablespace and the rest in another?

These are called partitions in oracle. You can approximate this with table
spaces by using a partial index and putting it in a different table space.
The problem, of course, is seq scans.

>
> With the rule system and two underlying tables one could make it work by
> hand I think.
>
> I've never used tablespaces in oracle so I don't know what it can offer. I

Certainly, table spaces are used in many ways in oracle, db2, etc. You can
mirror data across them, have different buffer sizes for example.
In some implementations, they can be raw disk partitions (no file system).
I don't intend going this far, however.

> If you don't want to discuss this now, I understand. It's not part of the
> design as it is now. I'm just curious at what direction we are moving and
> what is possible to do.

Well, partitions are something else entirely. Mirroring would be
interesting, but RAID designers are better at parallelisation of IO than
(some) database developers. Might be better to keep the problem seperate.

Gavin

From:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
To:	Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
Cc:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Tablespaces
Date:	2004-02-26 12:25:28
Message-ID:	Pine.LNX.4.44.0402260424530.14032-100000@hosting.commandprompt.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

> Is it possible to put WALs and CLOGs into different tablespaces? (maybe
> different RAID systems). Some companies want that ...

You can do this now, but it would be nice to be able to have it actually
configurable versus the hacked linked method.

--
Co-Founder
Command Prompt, Inc.
The wheel's spinning but the hamster's dead

From:	Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 12:33:54
Message-ID:	403DE7B2.1030600@cybertec.at
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Gavin Sherry wrote:
>>Is it possible to put WALs and CLOGs into different tablespaces? (maybe
>>different RAID systems). Some companies want that ...
>
>
> I wasn't going to look at that just yet.
>
> There is of course the temporary hack of symlinking WAL else where.

that's what we do now.
we symlink databases and wals ...

> I'd be interested to see the performance difference between WAL and data
> on the same RAID/controller and WAL and data on different RAID/controller
> with Jan's improvements to the buffer management.
>
> Gavin

yes, that's what i am looking for. i should do some testing.

in case of enough i/o power additional cpus scale almost linearily
(depending on the application of course; i have done some testing on a
customer's aix box ...).
it would be interesting to see what jan's buffer strategy does (and bg
writer) ...

--
Cybertec Geschwinde u Schoenig
Schoengrabern 134, A-2020 Hollabrunn, Austria
Tel: +43/2952/30706 or +43/664/233 90 75
www.cybertec.at, www.postgresql.at, kernel.cybertec.at

From:	"Alex J(dot) Avriette" <alex(at)posixnap(dot)net>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 16:14:37
Message-ID:	20040226161437.GU8646@posixnap.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote:

> Certainly, table spaces are used in many ways in oracle, db2, etc. You can
> mirror data across them, have different buffer sizes for example.
> In some implementations, they can be raw disk partitions (no file system).
> I don't intend going this far, however.

Perhaps now would be a good time to bring up my directio on Solaris question
from a year or so back? Is there any interest in the ability to use raw
disk?

Alex (who is overjoyed to hear discussion of tablespaces again)

--
alex(at)posixnap(dot)net
Alex J. Avriette, Unix Systems Gladiator
Sep 25 12:52:39 buggle /bsd: wsdisplay0 at vga1: removing /dev/radio/*

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 16:46:13
Message-ID:	12624.1077813973@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> A table space is a directory structure. The directory structure is as
> follows:
> [swm(at)dev /path/to/tblspc]$ ls
> OID1/ OID2/
> OID1 and OID2 are the OIDs of databases which have created a table space
> against this file system location. In this respect, a table space
> resembles $PGDATA/base. I thought it useful to keep this kind of
> namespace mechanism in place ...

Actually, this is *necessary* AFAICT. The case that forces it is DROP
DATABASE. Since you have to execute that from another database, there's
no reasonable way to look into the target database's catalogs. That
means that the OID of the database has to be sufficient information to
get rid of all its files. You can do this fairly easily if in each
tablespace (whose locations you know from the shared pg_tablespace
table) you can look for a subdirectory matching the target database's
OID. If we tried to put the database's files just "loose" in each
tablespace directory then we'd be in trouble.

I think this is also an implementation reason for favoring cluster-wide
tablespaces over database-local ones. I'm not sure how you drop a
database from outside if you can't see where its tablespaces are.

I believe that it will be necessary to expand RelFileNode to three OIDs
(tablespace, database, relation). I had once hoped that it could be
kept at two (tablespace, relation) but with a physical layout like this
you more or less have to have three.

One issue that needs to be agreed on early is how the low-level file
access code finds a tablespace. What I would personally like is for
$PGDATA to contain symlinks to the tablespace top directories. The
actual access path for any relation could then be built trivially from
its RelFileNode:
$PGDATA/tablespaces/TBOID/DBOID/RELFILENODE
-------------------------
The underlined part references a symlink that leads to the directory
containing the per-database subdirectories.

I am expecting to hear some bleating about this from people whose
preferred platforms don't support symlinks ;-). However, if we don't
rely on symlinks for this then the tablespace-OID-to-physical-path
mapping has to be explicitly known at very low levels of the system
(md.c in particular). We can't expect md.c to get that information by
reading pg_tablespace. It would have to rely on some backdoor path,
such as a flat text file it could read at backend startup. I think
this approach will leave us fighting a lot of problems with locking
and out-of-date information.

Speaking of locking, can we do anything to prevent people from shooting
themselves in the foot by changing active tablespaces? Are we even
going to have a DROP TABLESPACE command, and if so what would it do?

regards, tom lane

From:	Barry Lind <blind(at)xythos(dot)com>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 16:53:24
Message-ID:	403E2484.8090009@xythos.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Gavin,

After creating a tablespace what (if any) changes can be done to it.
Can you DROP a tablespace, or once created will it always exist? Can
you RENAME a tablespace? Can you change the location of a tablespace
(i.e you did a disk reorg and move the contents to a different location
and now want to point to the new location)? What are the permissions
necessary to create a tablespace (can any use connected to the database
create a tablespace, or only superuser, or ...)?

Overall this will be a great addition to postgres. I am looking forward
to this feature.

thanks,
--Barry

Gavin Sherry wrote:
> Hi all,
>
> I've been looking at implementing table spaces for 7.5. Some notes and
> implementation details follow.
>
> ------
>
> Type of table space:
>
> There are many different table space implementations in relational
> database management systems. In my implementation, a table space in
> PostgreSQL will be the location of a directory on the file system in
> which files backing database objects can be stored. Global tables and
> non 'data' objects (WAL, CLOG, config files) will all remain in $PGDATA.
> $PGDATA/base will be the default table space.
>
> A given table space will be identified by a unique table space name. I
> haven't decided if 'unique' should mean database-wide unique or
> cross-database unique. It seems to me that we might run into problems
> with CREATE DATABASE ... TEMPLATE = <database with table spaces> if the
> uniqueness of table spaces is limited to the database level.
>
> A table space parameter will be added to DDL commands which create
> physical database objects (CREATE DATABASE/INDEX/TABLE/SEQUENCE) and to
> CREATE SCHEMA. The associated routines, as well as the corresponding DROP
> commands will need to be updated. Adding the ability to ALTER <object>
> TABLESPACE <name> seems a little painful. Would people use it? Comments?
>
> When an object is created the system will resolve the table space the
> object is stored in as follows: if the table space paramater is passed to
> the DDL command, then the object is stored in that table space (given
> validation of the table space, etc). If it is not passed, the object
> inherits its "parent's" table space where the parent/child hierarchy is as
> follows: database > schema > table > [index|sequence]. So, if you issued:
>
> create table foo.bar (...);
>
> We would first not that there is no TABLESPACE <name>, then cascade to
> the table space for the schema 'foo' (and possibly cascade to the table
> space for the database). A database which wasn't created with an explicit
> table space will be created under the default table space. This ensures
> backward compatibility.
>
>
> Creating a table space:
>
> A table space is a directory structure. The directory structure is as
> follows:
>
> [swm(at)dev /path/to/tblspc]$ ls
> OID1/ OID2/
>
> OID1 and OID2 are the OIDs of databases which have created a table space
> against this file system location. In this respect, a table space
> resembles $PGDATA/base. I thought it useful to keep this kind of
> namespace mechanism in place so that administrators do not need to create
> hierarchies of names on different partitions if they want multiple
> databases to use the same partition.
>
> The actual creation of the table space will be done with:
>
> CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;
>
> Before creating the table space we must:
>
> 1) Check if the directory exists. If it does, create a sub directory as
> the OID of the current database.
>
> 2) Alternatively, if the directory doesn't exist, attempt to create it,
> then the sub directory.
>
> I wonder if a file, such as PG_TBLSPC, should be added to the table space
> directory so that, in the case of an existing non-empty directory, we can
> attempt to test if the directory is being used for something else and
> error out. Seems like:
>
> CREATE TABLESPACE tbl1 LOCATION '/var/'
>
> which will result in something like '/var/123443' is a bad idea. Then
> again, the user should know better. Comments?
>
> If everything goes well, we add an entry to pg_tablespace with the table
> space location and name (and and OID).
>
>
> Tying it all together:
>
> The catalogs pg_database, pg_namespace, and pg_class will have a 'tblspc'
> field. This will be the OID of the table space the object resides in, or 0
> (default table space). Since we can then resolve relid/relname, schema and
> database to a tablespace, there aren't too many cases when extra logic
> needs to be added to the IO framework. In fact, most of it is taken care
> of because of the abstraction of relpath().
>
> The creation of table spaces will need to be recorded in xlog in the same
> way that files are in heap_create() with the corresponding delete logic
> incase of ABORT.
>
>
> Postmaster startup:
>
> Ideally, the postmaster at startup should go into each tblspc/databaseoid
> directory and check for a postmaster.pid file to see if some other
> instance is touching the files we're interested in. This will require a
> control file listing tblspc/databaseoid paths and it will need to plug
> into WAL in case we die during CREATE TABLESPACE. Comments?
>
>
> Creating a database
>
> I think that createdb() is going to have to be reworked if pg_tablespace
> isn't shared (ie, tablespaces are only database unique). The reason being
> that if we create a database which has a table space, pg_tablespace in the
> new database will have to be updated and that cannot be done atomically
> with the `cp` based mechanism we currently use.
>
> I think I'm going to have to get my hands dirty before I can tell the
> extent to which createdb() will need reworking.
>
>
> pg_dump
>
> Obviously pg_dump will need to be able to dump table spaces. pg_dump
> running against <7.5 will DDL commands without a table space parameter and
> as such the database's physical layout, when loaded into 7.5, will be the
> same as for <7.5.
>
> ---
>
> Comments? Questions? Suggestions?
>
> Thanks,
>
> Gavin
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings

From:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To:	"Alex J(dot) Avriette" <alex(at)posixnap(dot)net>
Cc:	Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 21:22:25
Message-ID:	Pine.LNX.4.58.0402270817040.20164@linuxworld.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Thu, 26 Feb 2004, Alex J. Avriette wrote:

> On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote:
>
> > Certainly, table spaces are used in many ways in oracle, db2, etc. You can
> > mirror data across them, have different buffer sizes for example.
> > In some implementations, they can be raw disk partitions (no file system).
> > I don't intend going this far, however.
>
> Perhaps now would be a good time to bring up my directio on Solaris question
> from a year or so back? Is there any interest in the ability to use raw
> disk?

I do not intend to undertake raw disk tablespaces for 7.5. I'd be
interested if anyone could provide some real world benchmarking of file
system vs. raw disk. Postgres benefits a lot from kernel file system cache
at the moment. Also, I believe that database designers have traditionally
made bad file system designers. Raw database partitions often lack the
tools essential to a scalable environment. For example, the ability to
resize partitions.

Gavin

From:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 21:30:28
Message-ID:	Pine.LNX.4.58.0402270823180.20164@linuxworld.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Thu, 26 Feb 2004, Tom Lane wrote:

> Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> > A table space is a directory structure. The directory structure is as
> > follows:
> > [swm(at)dev /path/to/tblspc]$ ls
> > OID1/ OID2/
> > OID1 and OID2 are the OIDs of databases which have created a table space
> > against this file system location. In this respect, a table space
> > resembles $PGDATA/base. I thought it useful to keep this kind of
> > namespace mechanism in place ...
>
> Actually, this is *necessary* AFAICT. The case that forces it is DROP
> DATABASE. Since you have to execute that from another database, there's
> no reasonable way to look into the target database's catalogs. That
> means that the OID of the database has to be sufficient information to
> get rid of all its files. You can do this fairly easily if in each
> tablespace (whose locations you know from the shared pg_tablespace
> table) you can look for a subdirectory matching the target database's
> OID. If we tried to put the database's files just "loose" in each
> tablespace directory then we'd be in trouble.

Ahhh. Yes.

>
> I think this is also an implementation reason for favoring cluster-wide
> tablespaces over database-local ones. I'm not sure how you drop a
> database from outside if you can't see where its tablespaces are.

Naturally.

>
> I believe that it will be necessary to expand RelFileNode to three OIDs
> (tablespace, database, relation). I had once hoped that it could be
> kept at two (tablespace, relation) but with a physical layout like this
> you more or less have to have three.

Yes. I agree.

>
> One issue that needs to be agreed on early is how the low-level file
> access code finds a tablespace. What I would personally like is for
> $PGDATA to contain symlinks to the tablespace top directories. The
> actual access path for any relation could then be built trivially from
> its RelFileNode:
> $PGDATA/tablespaces/TBOID/DBOID/RELFILENODE
> -------------------------
> The underlined part references a symlink that leads to the directory
> containing the per-database subdirectories.
>
> I am expecting to hear some bleating about this from people whose
> preferred platforms don't support symlinks ;-). However, if we don't

Actually, I think that's a pretty good idea :-). I'd solves a bunch of
issues in the backend (postmaster start up can recurse through
$PGDATA/tablespaces looking for postmaster.pid files) and will also assist
admins with complex configurations (perhaps).

> Speaking of locking, can we do anything to prevent people from shooting
> themselves in the foot by changing active tablespaces? Are we even
> going to have a DROP TABLESPACE command, and if so what would it do?

Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
drop a table space until the directory is empty. We will need a shared
invalidation message so that backends do not attempt to create an object
just after we drop the table space.

Thanks,

Gavin

From:	James Rogers <jrogers(at)neopolitan(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 21:41:34
Message-ID:	1077831694.20980.12.camel@localhost.localdomain
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Thu, 2004-02-26 at 13:22, Gavin Sherry wrote:
> Postgres benefits a lot from kernel file system cache
> at the moment.

With the implementation of much smarter and more adaptive cache
replacement algorithm i.e. ARC, I would expect the benefit of using the
kernel file system cache to diminish significantly. It appears to me,
and I could be wrong, that the reason Postgres has depended on the
kernel file system cache isn't that this is obviously better in some
absolute sense (though it might be depending on the deployment
scenario), but that the original cache replacement algorithm in Postgres
was sufficiently poor that the better cache replacement algorithms in
the kernel cache more than offset any sub-optimality that might result
from doing so.

I would expect that with ARC and the redesign of some of the buffer
management bits for more scalability, you might very well get better
performance by allocating most of the memory to the buffer cache rather
than leaving it to the kernel file cache.

I'm actually fairly curious to see what the new buffer management scheme
will mean in terms of real world performance and parameter tuning.

-James Rogers
jrogers(at)neopolitan(dot)com

From:	"Simon Riggs" <simon(at)2ndquadrant(dot)com>
To:	"'Gavin Sherry'" <swm(at)linuxworld(dot)com(dot)au>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Tablespaces
Date:	2004-02-26 21:56:49
Message-ID:	002801c3fcb3$6fc28370$0200000a@LaptopDellXP
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

>Gavin Sherry
> The creation of table spaces will need to be recorded in xlog in the
same
> way that files are in heap_create() with the corresponding delete
logic
> incase of ABORT.

Overall, sounds very cool.

Please could we record the OID of the tablespace in the WAL logs, not
the path to the tablespace? That way, we run no risks of having the WAL
logs not work correctly should things change slightly...

.. need to record drop tablespaces in the WAL logs also.

I'm sure you meant both of those, just checking.

Can drop tablespace require a specific privelege too? It's too easy to
drop parts of a database without thinking...

Best Regards, Simon Riggs

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 22:05:13
Message-ID:	14714.1077833113@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
>> Speaking of locking, can we do anything to prevent people from shooting
>> themselves in the foot by changing active tablespaces? Are we even
>> going to have a DROP TABLESPACE command, and if so what would it do?

> Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
> drop a table space until the directory is empty.

How would it get to be empty? Are you thinking of some sort of "connect
database to tablespace" and "disconnect database from tablespace"
commands that would respectively create and delete the per-database
subdirectory? That seems moderately reasonable to me. We could then
invent a locking protocol that requires backends to lock a tablespace
before they can execute either of these operations (or delete the
tablespace of course).

regards, tom lane

From:	"Alex J(dot) Avriette" <alex(at)posixnap(dot)net>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 22:28:41
Message-ID:	20040226222841.GV8646@posixnap.net
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Fri, Feb 27, 2004 at 08:22:25AM +1100, Gavin Sherry wrote:

> interested if anyone could provide some real world benchmarking of file
> system vs. raw disk. Postgres benefits a lot from kernel file system cache
> at the moment. Also, I believe that database designers have traditionally
> made bad file system designers. Raw database partitions often lack the
> tools essential to a scalable environment. For example, the ability to
> resize partitions.

The only reason I mentioned it to begin with was the recommendation of
directio for databases in the Sun Blueprint, _Tuning Databases on the
Solaris Platform_ (and being a Solaris geek, I asked, but apparently
nobody else is worried enough about performance or not using Solaris
enough to care).

It's not critical, of course. I think, however, that many of us would
like to see some of the features of Oracle and DB2 available to users
of postgres. Some of these features are raw disk, tablespaces, and
replication. We're getting there, and making really terrific progress
(I fully expect replication to be ready for primetime in the 8-12 mos
timeframe), but we're not quite there yet.

As I said, I'm very glad to hear tablespaces mentioned again and see
what looks like work being done on it.

Thanks!
Alex

--
alex(at)posixnap(dot)net
Alex J. Avriette, Solaris Artillery Officer
"Among the many misdeeds of the British rule in India, history will look upon the act of depriving a whole nation of arms, as the blackest." - Mahatma Gandhi

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	James Rogers <jrogers(at)neopolitan(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-26 22:42:47
Message-ID:	15041.1077835367@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

James Rogers <jrogers(at)neopolitan(dot)com> writes:
> With the implementation of much smarter and more adaptive cache
> replacement algorithm i.e. ARC, I would expect the benefit of using the
> kernel file system cache to diminish significantly. It appears to me,
> and I could be wrong, that the reason Postgres has depended on the
> kernel file system cache isn't that this is obviously better in some
> absolute sense (though it might be depending on the deployment
> scenario), but that the original cache replacement algorithm in Postgres
> was sufficiently poor that the better cache replacement algorithms in
> the kernel cache more than offset any sub-optimality that might result
> from doing so.

The question of optimality of replacement algorithm is only one of the
arguments for using a small buffer cache. IMHO a considerably stronger
argument is that the kernel's memory management is more flexible: it can
use that memory for either disk cache or program workspace, and it can
change the allocation on-the-fly as load demands. If you dedicate most
of RAM to Postgres buffers then you are likely to be wasting RAM or
swapping heavily. Possibly both :-(

Another gotcha is that unless the OS allows you to lock shared memory
into RAM, the shared buffers themselves could get swapped out, which is
a no-win scenario by any measure. Keeping the shared buffer arena small
helps prevent that by ensuring all the buffers are "hot".

Of course, this is all speculation until we get some real-world
experience with ARC. But I don't expect it to be a magic bullet.

regards, tom lane

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-27 05:26:06
Message-ID:	87ad35gkk1.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

> > I am expecting to hear some bleating about this from people whose
> > preferred platforms don't support symlinks ;-). However, if we don't

Well, one option would be to have the low level filesystem storage (md.c?)
routines implement a kind of symlink themselves. Just a file with a special
magic number followed by a path.

I'm normally against reimplementing OS services but symlinks are really a very
simple concept and simple to implement. Especially if you can make a few
simplifying assumptions: they only ever need to appear as the final path
element not as parent directories and tablespaces don't need symlinks pointing
to symlinks. Ideally postgres also doesn't need to implement relative links
either.

--
greg

From:	"scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	"Alex J(dot) Avriette" <alex(at)posixnap(dot)net>, Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Tablespaces
Date:	2004-02-27 15:28:31
Message-ID:	Pine.LNX.4.33.0402270827300.13785-100000@css120.ihs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Fri, 27 Feb 2004, Gavin Sherry wrote:

> On Thu, 26 Feb 2004, Alex J. Avriette wrote:
>
> > On Thu, Feb 26, 2004 at 11:22:28PM +1100, Gavin Sherry wrote:
> >
> > > Certainly, table spaces are used in many ways in oracle, db2, etc. You can
> > > mirror data across them, have different buffer sizes for example.
> > > In some implementations, they can be raw disk partitions (no file system).
> > > I don't intend going this far, however.
> >
> > Perhaps now would be a good time to bring up my directio on Solaris question
> > from a year or so back? Is there any interest in the ability to use raw
> > disk?
>
> I do not intend to undertake raw disk tablespaces for 7.5. I'd be
> interested if anyone could provide some real world benchmarking of file
> system vs. raw disk. Postgres benefits a lot from kernel file system cache
> at the moment. Also, I believe that database designers have traditionally
> made bad file system designers. Raw database partitions often lack the
> tools essential to a scalable environment. For example, the ability to
> resize partitions.

Is possible / reasonable / smart and or dumb to look at implementing the
tablespaces as riding atop the initlocation handled stuff. I.e.
postgresql can only create tablespaces in areas that are created by
initlocation, thus keeping it in its box, so to speak?

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>
Cc:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, "Alex J(dot) Avriette" <alex(at)posixnap(dot)net>, Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-27 17:42:15
Message-ID:	20517.1077903735@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

"scott.marlowe" <scott(dot)marlowe(at)ihs(dot)com> writes:
> Is possible / reasonable / smart and or dumb to look at implementing the
> tablespaces as riding atop the initlocation handled stuff.

In my mind, one of the main benefits of this work will be that we'll be
able to get *rid* of the initlocation stuff. It's a crock.

regards, tom lane

From:	"scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, "Alex J(dot) Avriette" <alex(at)posixnap(dot)net>, Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Tablespaces
Date:	2004-02-27 17:56:32
Message-ID:	Pine.LNX.4.33.0402271054530.14753-100000@css120.ihs.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Fri, 27 Feb 2004, Tom Lane wrote:

> "scott.marlowe" <scott(dot)marlowe(at)ihs(dot)com> writes:
> > Is possible / reasonable / smart and or dumb to look at implementing the
> > tablespaces as riding atop the initlocation handled stuff.
>
> In my mind, one of the main benefits of this work will be that we'll be
> able to get *rid* of the initlocation stuff. It's a crock.

OK, that's fine, but I keep thinking that a superuser should have to
create the tablespace itself, and then tables can be assigned by users
based on the rights assigned by the dba / superuser. Is that how we're
looking at doing it, or will any user be able to create a tablespace
anywhere postgresql has write permission, or will only dbas be able to
create AND use table spaces. I'm just not sure how that's gonna be
handled, and haven't seen it addressed.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	"scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>
Cc:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, "Alex J(dot) Avriette" <alex(at)posixnap(dot)net>, Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-02-27 18:24:26
Message-ID:	20852.1077906266@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

"scott.marlowe" <scott(dot)marlowe(at)ihs(dot)com> writes:
> On Fri, 27 Feb 2004, Tom Lane wrote:
>> In my mind, one of the main benefits of this work will be that we'll be
>> able to get *rid* of the initlocation stuff. It's a crock.

> OK, that's fine, but I keep thinking that a superuser should have to
> create the tablespace itself, and then tables can be assigned by users
> based on the rights assigned by the dba / superuser.

Yeah, we haven't yet gotten to the issue of permissions, but certainly
creating or deleting a tablespace has to be a superuser-only operation,
if only because you probably have also got some manual filesystem work
to do to set up the associated directory; and that has to be done as
root or postgres.

It might be a good idea to restrict connect/disconnect (if we use those
operations) to superusers as well.

regards, tom lane

From:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-01 14:31:32
Message-ID:	20040301143132.GB8345@phlogiston.dyndns.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Thu, Feb 26, 2004 at 05:28:41PM -0500, Alex J. Avriette wrote:

> The only reason I mentioned it to begin with was the recommendation of
> directio for databases in the Sun Blueprint, _Tuning Databases on the
> Solaris Platform_ (and being a Solaris geek, I asked, but apparently
> nobody else is worried enough about performance or not using Solaris
> enough to care).

That recommendation itself is a few years old. While it may still be
true that directio is still fastest for Oracle on Solaris, I'd sure
like to see some recent evidence. I've a funny feeling that this is
an old rule of thumb which is now true in the sense that everyone
believes it, but maybe not in the sense that a test would reveal it
to be a sensible rule.

> like to see some of the features of Oracle and DB2 available to users
> of postgres. Some of these features are raw disk, tablespaces, and
> replication. We're getting there, and making really terrific progress

I don't think we want features for their own sake, though, and I'm
not convinced that raw filesystems are actually useful. Course, it's
not my itch, and PostgreSQL _is_ free software.

--
Andrew Sullivan

From:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 03:33:50
Message-ID:	4045521E.8060106@familyhealth.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

How about allowing the specification on schemas and databases of
different default tablespaces for TEMP, TABLE and INDEX?? Is there any
point to that?

Chris

From:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 03:35:28
Message-ID:	40455280.4010305@familyhealth.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

> I've been looking at implementing table spaces for 7.5. Some notes and
> implementation details follow.

Ah sorry, other things you might need to consider:

Privileges on tablespaces:

GRANT USAGE ON TABLESPACE tbsp TO ...;

Different disk settings for different tablespaces (since they will
likely be on different disks):

ALTER TABLESPACE tbsp SET random_page_cost TO 2.5;

Chris

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au>
Cc:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 03:52:49
Message-ID:	7928.1078285969@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Christopher Kings-Lynne <chriskl(at)familyhealth(dot)com(dot)au> writes:
> How about allowing the specification on schemas and databases of
> different default tablespaces for TEMP, TABLE and INDEX?? Is there any
> point to that?

TEMP tables are not local to any particular schema, so it wouldn't make
sense to have a schema-level default for their placement.

The other five combinations are at least theoretically sensible, but
do we need 'em all? It seems to me that a reasonable compromise is to
offer database-level default tablespaces for TEMP, TABLE, and INDEX,
ignoring the schema level. This is simple and understandable, and if
you don't like it, you're probably the kind of guy who will want to
override it per-table anyway ...

BTW, another dimension to think about is where TOAST tables and their
indexes will get placed.

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 04:23:50
Message-ID:	200403030423.i234NpR25817@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Gavin Sherry wrote:
> The actual creation of the table space will be done with:
>
> CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;

Seems you should use CREATE TABLESPACE (no space) so it is more distinct
from CREATE TABLE.

--
Bruce Momjian | http://candle.pha.pa.us
pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup. | Newtown Square, Pennsylvania 19073

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Richard Huxton <dev(at)archonet(dot)com>
Cc:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 04:24:23
Message-ID:	200403030424.i234ONA25927@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Richard Huxton wrote:
> On Thursday 26 February 2004 10:07, Gavin Sherry wrote:
> >
> > CREATE TABLESPACE tbl1 LOCATION '/var/'
> >
> > which will result in something like '/var/123443' is a bad idea. Then
> > again, the user should know better. Comments?
>
> The LOCATION should have the same owner and permissions as $PGDATA - that
> should catch mistyping.
>
> Unless you're running as root, of course. In which case you clearly know
> better than everyone else, so off you go!

FYI, you can't run the postmaster as root.

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Cc:	Hans-Jürgen Schönig <postgres(at)cybertec(dot)at>, Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 04:25:04
Message-ID:	200403030425.i234P4U25973@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Joshua D. Drake wrote:
> > Is it possible to put WALs and CLOGs into different tablespaces? (maybe
> > different RAID systems). Some companies want that ...
>
> You can do this now, but it would be nice to be able to have it actually
> configurable versus the hacked linked method.

Agreed, but because the system has to be down to move pg_xlog, I think
we should write a command-line utility to assist with this, perhaps. It
could check permissions and stuff.

From:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 04:26:21
Message-ID:	Pine.LNX.4.58.0403031526060.29789@linuxworld.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Tue, 2 Mar 2004, Bruce Momjian wrote:

> Gavin Sherry wrote:
> > The actual creation of the table space will be done with:
> >
> > CREATE TABLE SPACE <name> LOCATION </path/to/tblspc>;
>
> Seems you should use CREATE TABLESPACE (no space) so it is more distinct
> from CREATE TABLE.

Oops. Typo.

>
> --
> Bruce Momjian | http://candle.pha.pa.us
> pgman(at)candle(dot)pha(dot)pa(dot)us | (610) 359-1001
> + If your life is a hard drive, | 13 Roberts Road
> + Christ can be your backup. | Newtown Square, Pennsylvania 19073
>
>
> !DSPAM:40455de0297537578347468!
>
>

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 04:26:21
Message-ID:	200403030426.i234QMi26168@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Tom Lane wrote:
> Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> > A table space is a directory structure. The directory structure is as
> > follows:
> > [swm(at)dev /path/to/tblspc]$ ls
> > OID1/ OID2/
> > OID1 and OID2 are the OIDs of databases which have created a table space
> > against this file system location. In this respect, a table space
> > resembles $PGDATA/base. I thought it useful to keep this kind of
> > namespace mechanism in place ...
>
> Actually, this is *necessary* AFAICT. The case that forces it is DROP
> DATABASE. Since you have to execute that from another database, there's
> no reasonable way to look into the target database's catalogs. That
> means that the OID of the database has to be sufficient information to
> get rid of all its files. You can do this fairly easily if in each
> tablespace (whose locations you know from the shared pg_tablespace
> table) you can look for a subdirectory matching the target database's
> OID. If we tried to put the database's files just "loose" in each
> tablespace directory then we'd be in trouble.
>

Gavin, let us know if you want us to create the global pg_tablespace for
you. Some of us have done a lot of system catalog work.

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 04:27:46
Message-ID:	200403030427.i234Rku26455@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Tom Lane wrote:
> Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> >> Speaking of locking, can we do anything to prevent people from shooting
> >> themselves in the foot by changing active tablespaces? Are we even
> >> going to have a DROP TABLESPACE command, and if so what would it do?
>
> > Ahh. I forgot to detail my ideas on this. It seems to me that we cannot
> > drop a table space until the directory is empty.
>
> How would it get to be empty? Are you thinking of some sort of "connect
> database to tablespace" and "disconnect database from tablespace"
> commands that would respectively create and delete the per-database
> subdirectory? That seems moderately reasonable to me. We could then
> invent a locking protocol that requires backends to lock a tablespace
> before they can execute either of these operations (or delete the
> tablespace of course).

One crude solution would be to remove the tablespace oid directory only
when the database is dropped, and require an empty tablespace directory
to drop the tablespace. This allows a lock only on tablespace creation,
and not a lock on object creation in each tablespace.

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	pgsql-hackers(at)postgresql(dot)org, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Tablespaces
Date:	2004-03-03 04:33:58
Message-ID:	200403030433.i234Xwg27505@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Greg Stark wrote:
>
> > > I am expecting to hear some bleating about this from people whose
> > > preferred platforms don't support symlinks ;-). However, if we don't
>
> Well, one option would be to have the low level filesystem storage (md.c?)
> routines implement a kind of symlink themselves. Just a file with a special
> magic number followed by a path.
>
> I'm normally against reimplementing OS services but symlinks are really a very
> simple concept and simple to implement. Especially if you can make a few
> simplifying assumptions: they only ever need to appear as the final path
> element not as parent directories and tablespaces don't need symlinks pointing
> to symlinks. Ideally postgres also doesn't need to implement relative links
> either.

I just checked from the MinGW console and I see:

# touch a
# ln -s a b
# echo test >a
# cat b
# l ?
-rw-r--r-- 1 Bruce Mo Administ 5 Mar 2 23:30 a
-rw-r--r-- 1 Bruce Mo Administ 0 Mar 2 23:30 b
# cat a
test
# cat b
#

It accepts ln -s, but does nothing with it.

For tablespaces on OS's that don't support it, I think we will have to
store the path name in the file and read it via the backend. Somehow we
should cache those lookups.

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	"scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com>, Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, "Alex J(dot) Avriette" <alex(at)posixnap(dot)net>, Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 04:34:55
Message-ID:	200403030434.i234YtK27793@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Tom Lane wrote:
> "scott.marlowe" <scott(dot)marlowe(at)ihs(dot)com> writes:
> > Is possible / reasonable / smart and or dumb to look at implementing the
> > tablespaces as riding atop the initlocation handled stuff.
>
> In my mind, one of the main benefits of this work will be that we'll be
> able to get *rid* of the initlocation stuff. It's a crock.

Agreed. It should be ripped out once we have tablespaces, and if we keep
it for one extra release, there will be confusion over which to use.

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 04:39:53
Message-ID:	200403030439.i234drK28623@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Gavin Sherry wrote:
> Actually, I think that's a pretty good idea :-). I'd solves a bunch of
> issues in the backend (postmaster start up can recurse through
> $PGDATA/tablespaces looking for postmaster.pid files) and will also assist
> admins with complex configurations (perhaps).

Why are you asking about postmaster.pid files. That file goes in the
top level /data directory, no?

From:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 04:46:55
Message-ID:	Pine.LNX.4.58.0403031541280.29986@linuxworld.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Tue, 2 Mar 2004, Bruce Momjian wrote:

> Gavin Sherry wrote:
> > Actually, I think that's a pretty good idea :-). I'd solves a bunch of
> > issues in the backend (postmaster start up can recurse through
> > $PGDATA/tablespaces looking for postmaster.pid files) and will also assist
> > admins with complex configurations (perhaps).
>
> Why are you asking about postmaster.pid files. That file goes in the
> top level /data directory, no?

I was trying to be paranoid about users who have multiple postmasters on
the same machine and want to share a table space while both systems are
live. There'd be no mechanism to test for that situation if we didn't have
something like a postmaster.pid file. Is this being a little too paranoid?

Gavin

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 04:59:55
Message-ID:	8520.1078289995@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> writes:
> I was trying to be paranoid about users who have multiple postmasters on
> the same machine and want to share a table space while both systems are
> live. There'd be no mechanism to test for that situation if we didn't have
> something like a postmaster.pid file. Is this being a little too paranoid?

Hm. AFAICS there is no safe situation in which a tablespace directory
could be shared by two different installations (== toplevel $PGDATA
directories). I don't think we need a dynamic postmaster.pid-type lock
to protect them. What might make sense is some sort of marker file in a
tablespace directory that links back to the owning $PGDATA directory.
CREATE TABLESPACE should create this, or reject if it already exists.

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 05:00:40
Message-ID:	200403030500.i2350eO02175@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Gavin Sherry wrote:
> On Tue, 2 Mar 2004, Bruce Momjian wrote:
>
> > Gavin Sherry wrote:
> > > Actually, I think that's a pretty good idea :-). I'd solves a bunch of
> > > issues in the backend (postmaster start up can recurse through
> > > $PGDATA/tablespaces looking for postmaster.pid files) and will also assist
> > > admins with complex configurations (perhaps).
> >
> > Why are you asking about postmaster.pid files. That file goes in the
> > top level /data directory, no?
>
> I was trying to be paranoid about users who have multiple postmasters on
> the same machine and want to share a table space while both systems are
> live. There'd be no mechanism to test for that situation if we didn't have
> something like a postmaster.pid file. Is this being a little too paranoid?

Oh, yikes, I see. Right now we have the interlock on the /data
directory, but once you start moving stuff out from under /data using
tablespaces, we do perhaps loose the interlock. However, I assume the
CREATE TABLESPACE is going to create the tablespace directory, so I
don't see how two postmasters could both create the directory.

For example, if you say

CREATE TABLESPACE tb IN '/var/tb1'

I assume you have to create:

/var/tb1/pgsql_tablespace

and then

/var/tb1/pgsql_tablespace/oid1
/var/tb1/pgsql_tablespace/oid2

or something like that, and set the proper permissions on
pgsql_tablespace. We will have write permission on the directory they
pass to us, but we might not have permissions to change the mode of the
directory they pass, so we have to create a subdirectory anyway, and
that is our interlock.

For example:

# run as root
$ chmod a+w .
$ ls -ld .
drwxrwxrwx 2 root wheel 512 Mar 2 23:51 .

# run as the postmaster
$ mkdir new
$ ls -ld new
drwxr-xr-x 2 postgres wheel 512 Mar 2 23:52 new
$ chmod 700 new
$ chmod 700 .
chmod: .: Operation not permitted
chmod: .: Operation not permitted

As you can see, I have permission to create the /new directory, but no
ability to set its mode, so we have to create a directory that matches
the permissions of /data:

drwx------ 6 postgres postgres 512 Mar 2 12:48 /u/pg/data/

We could require the admin to create a directory that we own instead of
just one that we have write permission in, but why bother when we can
use the new directory as an interlock from multiple postmasters anyway.

Right now we do require the directory used as /data be one where we can
create a /data subdirectory, so this seems similar. We don't put the
data directly in the passed directory, but in /data under that. In
fact, we could just call it /var/tb1/data instead of
/var/tb1/pgsql_tablespace.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Tablespaces
Date:	2004-03-03 05:18:39
Message-ID:	8679.1078291119@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> For tablespaces on OS's that don't support it, I think we will have to
> store the path name in the file and read it via the backend. Somehow we
> should cache those lookups.

My feeling is that we need not support tablespaces on OS's without
symlinks.

regards, tom lane

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Tablespaces
Date:	2004-03-03 05:20:06
Message-ID:	200403030520.i235K6c05325@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > For tablespaces on OS's that don't support it, I think we will have to
> > store the path name in the file and read it via the backend. Somehow we
> > should cache those lookups.
>
> My feeling is that we need not support tablespaces on OS's without
> symlinks.

Agreed, but are we going to support non-tablespace installs? I wasn't
sure that was an option.

From:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Tablespaces
Date:	2004-03-03 05:25:12
Message-ID:	Pine.LNX.4.58.0403031623210.30445@linuxworld.com.au
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Wed, 3 Mar 2004, Tom Lane wrote:

> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > For tablespaces on OS's that don't support it, I think we will have to
> > store the path name in the file and read it via the backend. Somehow we
> > should cache those lookups.
>
> My feeling is that we need not support tablespaces on OS's without
> symlinks.

I'm going to focus on implementing this on the system(s) I'm used to
developing on (ie, those which support symlinks). Once that is done, I'll
talk with the Win32 guys about what, if anything, we can do about getting
this to work on Win32 (and possibly other non-symlink supporting OSs).

Thanks,

Gavin

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Tablespaces
Date:	2004-03-03 05:36:38
Message-ID:	8850.1078292198@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> Tom Lane wrote:
>> My feeling is that we need not support tablespaces on OS's without
>> symlinks.

> Agreed, but are we going to support non-tablespace installs? I wasn't
> sure that was an option.

A setup containing only the default tablespace cannot use any symlinks.
That doesn't seem hard though.

regards, tom lane

From:	Oliver Elphick <olly(at)lfix(dot)co(dot)uk>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 11:58:51
Message-ID:	1078315131.23531.115.camel@cerberus.lfix.co.uk
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Wed, 2004-03-03 at 04:59, Tom Lane wrote:
> What might make sense is some sort of marker file in a
> tablespace directory that links back to the owning $PGDATA directory.
> CREATE TABLESPACE should create this, or reject if it already exists.

It will not be enough for the marker to list the path of the parent
$PGDATA, since that path might get changed by system administration
action. The marker should contain some sort of unique string which
would match the same string somewhere in $PGDATA. Then, if either
tablespace or $PGDATA were moved, it would be possible to tie the two
back together. It wouldn't be an issue on most normal systems, but
might be of crucial importance for an ISP running numerous separate
clusters.
--
Oliver Elphick <olly(at)lfix(dot)co(dot)uk>
LFIX Ltd

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Greg Stark <gsstark(at)mit(dot)edu>, pgsql-hackers(at)postgresql(dot)org, PostgreSQL Win32 port list <pgsql-hackers-win32(at)postgresql(dot)org>
Subject:	Re: [HACKERS] Tablespaces
Date:	2004-03-03 13:31:07
Message-ID:	200403031331.i23DV7i14780@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Tom Lane wrote:
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
> > Tom Lane wrote:
> >> My feeling is that we need not support tablespaces on OS's without
> >> symlinks.
>
> > Agreed, but are we going to support non-tablespace installs? I wasn't
> > sure that was an option.
>
> A setup containing only the default tablespace cannot use any symlinks.
> That doesn't seem hard though.

Yea, I think you are right. We just disable CREATE TABLESPACE and the
rest should just work. Basically, pg_tablespace will only have one
entry on those platforms. The initdb directory structure will have a
single tablespace, but that doesn't use symlinks.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	olly(at)lfix(dot)co(dot)uk
Cc:	Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 14:36:53
Message-ID:	12449.1078324613@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Oliver Elphick <olly(at)lfix(dot)co(dot)uk> writes:
> It will not be enough for the marker to list the path of the parent
> $PGDATA, since that path might get changed by system administration
> action. The marker should contain some sort of unique string which
> would match the same string somewhere in $PGDATA.

We have already added a notion of a "unique installation identifier"
for PITR purposes (look in pg_control). So we could use that for this
purpose if we wanted to.

But I'm not sure how important it really is. AFAICS the behavior of
CREATE TABLESPACE will be "create marker file, if it already exists
then abort". It has no need to actually look in the file and so there's
no need for the contents to be unique.

regards, tom lane

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 16:35:40
Message-ID:	874qt5c2hv.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:

> Greg Stark wrote:
> >
> > > > I am expecting to hear some bleating about this from people whose
> > > > preferred platforms don't support symlinks ;-). However, if we don't
> >
> > Well, one option would be to have the low level filesystem storage (md.c?)
> > routines implement a kind of symlink themselves. Just a file with a special
> > magic number followed by a path.

On further contemplation it doesn't seem like using symlinks really ought to
be necessary. It should be possible to drive everything off the catalog tables
while avoidin having the low level filesystem code know anything about them.

Instead of having the low level code fetch the pg_* records themselves, some
piece of higher level code would do the query and call down to storage layer
to inform it of the locations for everything. It would have to do this on
database initialization and on any subsequent object creation.

Basically maintain an in-memory hash table of oid -> path, and call down to
the low level code whenever that hash changes. (Or more likely oid->ts_id and
a separate list of ts_id -> path.)

--
greg

From:	Barry Lind <blind(at)xythos(dot)com>
To:	olly(at)lfix(dot)co(dot)uk
Cc:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 17:05:41
Message-ID:	40461065.90605@xythos.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Oliver Elphick wrote:
> On Wed, 2004-03-03 at 04:59, Tom Lane wrote:
>
>> What might make sense is some sort of marker file in a
>>tablespace directory that links back to the owning $PGDATA directory.
>>CREATE TABLESPACE should create this, or reject if it already exists.
>
>
> It will not be enough for the marker to list the path of the parent
> $PGDATA, since that path might get changed by system administration
> action. The marker should contain some sort of unique string which
> would match the same string somewhere in $PGDATA. Then, if either
> tablespace or $PGDATA were moved, it would be possible to tie the two
> back together. It wouldn't be an issue on most normal systems, but
> might be of crucial importance for an ISP running numerous separate
> clusters.

Taking this one step further would be to do something like Oracle does.
Every datafile in Oracle (because the Oracle storage manager stores
multiple objects inside datafiles, one could say there is some
similarity between Oracle datafiles and the proposed pg tablespaces),
has meta info that tells it which database instance it belongs to and
the last checkpoint that occured (It might actually be more granular
than checkpoint, such that on a clean shutdown you can tell that all
datafiles are consistent with each other and form a consistent database
instance). So Oracle on every checkpoint updates all datafiles with an
identifier. Now you might ask why is this useful. Well in normal day
to day operation it isn't, but it can be usefull in disaster recovery.
If you loose a disk and need to restore the entire database from backups
it can be difficult to make sure you have done it all correctly (do I
have all the necessary files/directories? did I get the right ones from
the right tapes?) Especially if you have directories spread across
various different disks that might be backed up to different tapes. So
by having additional information stored in each datafile Oracle can
provide additional checks that the set of files that are being used when
the database starts up are consistent and all belong together. Oracle
also ensures that all the datafiles that are suposed to exist actually
do as well.

So what might this mean for postgres and tablespaces? It could mean
that on startup the database checks to verify that all the tablespaces
that are registered actually exist. And that the data in each
tablespace is consistent with the current WAL status. (i.e. someone
didn't restore a tablespace from backup while the database was down that
is old and needs recovery.

A lot of what I am talking about here become PITR issues. But since
PITR and tablespaces are both potential features for 7.5, how they
interact probably should be thought about in the designs for each.

thanks,
--Barry

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Greg Stark <gsstark(at)mit(dot)edu>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-03 17:52:40
Message-ID:	200403031752.i23HqeB03122@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Greg Stark wrote:
>
> Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>
> > Greg Stark wrote:
> > >
> > > > > I am expecting to hear some bleating about this from people whose
> > > > > preferred platforms don't support symlinks ;-). However, if we don't
> > >
> > > Well, one option would be to have the low level filesystem storage (md.c?)
> > > routines implement a kind of symlink themselves. Just a file with a special
> > > magic number followed by a path.
>
> On further contemplation it doesn't seem like using symlinks really ought to
> be necessary. It should be possible to drive everything off the catalog tables
> while avoidin having the low level filesystem code know anything about them.
>
> Instead of having the low level code fetch the pg_* records themselves, some
> piece of higher level code would do the query and call down to storage layer
> to inform it of the locations for everything. It would have to do this on
> database initialization and on any subsequent object creation.
>
> Basically maintain an in-memory hash table of oid -> path, and call down to
> the low level code whenever that hash changes. (Or more likely oid->ts_id and
> a separate list of ts_id -> path.)

The advantage of symlinks is that an administrator could see how things
are laid out from the command line.

From:	Thomas Swan <tswan(at)idigx(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-04 02:31:16
Message-ID:	404694F4.9030006@idigx.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Bruce Momjian wrote:

>Greg Stark wrote:
>
>
>>Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> writes:
>>
>>
>>
>>>Greg Stark wrote:
>>>
>>>
>>>>>>I am expecting to hear some bleating about this from people whose
>>>>>>preferred platforms don't support symlinks ;-). However, if we don't
>>>>>>
>>>>>>
>>>>Well, one option would be to have the low level filesystem storage (md.c?)
>>>>routines implement a kind of symlink themselves. Just a file with a special
>>>>magic number followed by a path.
>>>>
>>>>
>>On further contemplation it doesn't seem like using symlinks really ought to
>>be necessary. It should be possible to drive everything off the catalog tables
>>while avoidin having the low level filesystem code know anything about them.
>>
>>Instead of having the low level code fetch the pg_* records themselves, some
>>piece of higher level code would do the query and call down to storage layer
>>to inform it of the locations for everything. It would have to do this on
>>database initialization and on any subsequent object creation.
>>
>>Basically maintain an in-memory hash table of oid -> path, and call down to
>>the low level code whenever that hash changes. (Or more likely oid->ts_id and
>>a separate list of ts_id -> path.)
>>
>>
>
>The advantage of symlinks is that an administrator could see how things
>are laid out from the command line.
>
>
>
That's a poor reason to require symlinks. The administrator can just as
easily open up psql and query pg_tablespace to see that same
information. Besides, the postgres doesn't know where to look on the
filesystem for the /path/to/oid without a system catalog lookup. There
doesn't seem to be any sensible reason to force a filesystem requirement
when the core operations are diffferent to begin with.

If a more global view of all databases is necessary, perhaps there ought
to be a system wide view which could display all of that information at
once: dbname, relation name, and physical location.

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Thomas Swan <tswan(at)idigx(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-04 02:41:17
Message-ID:	200403040241.i242fHI05299@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Thomas Swan wrote:
> >The advantage of symlinks is that an administrator could see how things
> >are laid out from the command line.
> >
> >
> >
> That's a poor reason to require symlinks. The administrator can just as
> easily open up psql and query pg_tablespace to see that same
> information. Besides, the postgres doesn't know where to look on the
> filesystem for the /path/to/oid without a system catalog lookup. There
> doesn't seem to be any sensible reason to force a filesystem requirement
> when the core operations are diffferent to begin with.
>
> If a more global view of all databases is necessary, perhaps there ought
> to be a system wide view which could display all of that information at
> once: dbname, relation name, and physical location.

Who doesn't have symlinks these days, and is going to be using
tablespaces? Even Win32 has them.

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Thomas Swan <tswan(at)idigx(dot)com>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-04 04:14:51
Message-ID:	21160.1078373691@sss.pgh.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Thomas Swan <tswan(at)idigx(dot)com> writes:
> Bruce Momjian wrote:
>> The advantage of symlinks is that an administrator could see how things
>> are laid out from the command line.
>>
> That's a poor reason to require symlinks. The administrator can just as
> easily open up psql and query pg_tablespace to see that same
> information.

Something to keep in mind here is that one of the times you would most
likely need that information is when the database is broken and you
*can't* simply "open up psql" and inspect system catalogs. I like the
fact that a symlink implementation can be inspected without depending on
a working database.

If we were going to build a non-symlink implementation, I'd want the
highlevel-to-lowlevel data transfer to take the form of a flat ASCII
file that could be inspected by hand, rather than some hidden in-memory
datastructure. But given the previous discussion in this thread,
I cannot see any strong reason not to rely on symlinks for the purpose.
We are not in the business of building replacements for OS features.

regards, tom lane

From:	Thomas Swan <tswan(at)idigx(dot)com>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-04 06:39:29
Message-ID:	4046CF21.1020907@idigx.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Tom Lane wrote:

>Thomas Swan <tswan(at)idigx(dot)com> writes:
>
>
>>Bruce Momjian wrote:
>>
>>
>>>The advantage of symlinks is that an administrator could see how things
>>>are laid out from the command line.
>>>
>>>
>>>
>>That's a poor reason to require symlinks. The administrator can just as
>>easily open up psql and query pg_tablespace to see that same
>>information.
>>
>>
>
>Something to keep in mind here is that one of the times you would most
>likely need that information is when the database is broken and you
>*can't* simply "open up psql" and inspect system catalogs. I like the
>fact that a symlink implementation can be inspected without depending on
>a working database.
>
>
>
That's a sufficient argument, to allow for it. Recoverability would be
one reason.

>If we were going to build a non-symlink implementation, I'd want the
>highlevel-to-lowlevel data transfer to take the form of a flat ASCII
>file that could be inspected by hand, rather than some hidden in-memory
>datastructure. But given the previous discussion in this thread,
>I cannot see any strong reason not to rely on symlinks for the purpose.
>We are not in the business of building replacements for OS features.
>
>
>
I do like the flat file output at least for a record of what went
where. Regardless of whether or not symlinks are used, the admin would
need to know what directories/files/filesystems are to be backed up.

I am concerned as to what extent different filesystems do when you back
the directories up. Would NTFS containing symlinks be able to be
backed up with a tar/zip command, or is something more elaborate needed?
In the past, before upgrading, I have had to tar the pgdata directory
with the postmaster shutdown to insure a quick restoration of the
database in case an upgrade didn't proceed uneventfully. Also, in the
event of a major version upgrade the restored information may or may not
proceed uneventfully. I just wanted to point out something I thought
might be an issue further down the road.

Perhaps the system catalog / flat file approach would be a more solid
approach, both of which would not involve replacing or duplicating OS
features.

From:	Marko Karppinen <marko(at)karppinen(dot)fi>
To:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-08 00:07:35
Message-ID:	9A79CD2E-7094-11D8-8995-000A95C56374@karppinen.fi
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On 3 March 2004, at 19:52, Bruce Momjian wrote:
> The advantage of symlinks is that an administrator could see how things
> are laid out from the command line.

One thing to keep in mind is that system administrators don't see
symlinks as being informational -- they see them as the actual UI
for the redirection in question. So their expectation is that they'll
be able to move the actual directory around at will (as long as they
update the symlink to match).

If symlinks are used, the rule of least surprise would mean that
no information whatsoever about the physical location of a tablespace
should be stored in the system catalogs. Otherwise their relationship
with the information stored in the symlink is ambiguous.

From:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-08 13:32:49
Message-ID:	20040308133249.GA24861@phlogiston.dyndns.org
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

On Mon, Mar 08, 2004 at 02:07:35AM +0200, Marko Karppinen wrote:
> One thing to keep in mind is that system administrators don't see
> symlinks as being informational -- they see them as the actual UI
> for the redirection in question. So their expectation is that they'll
> be able to move the actual directory around at will (as long as they
> update the symlink to match).

This is a good point. It's worth keeping in mind, too, that in large
shops, the DBAs and the sysadmins often are in separate departments
with separate management, precisely because the database system has
traditionally been somewhat divorced from the OS (as an aside, I
suspect that this sort of separation is part of the reason for the
popularity of raw filesystems among DBAs. Even if they didn't
provide better speed, it's just preferable not to have to involve
another department). System administrators in such places have been
known to decide to "reorganise the disks", assuming that the database
just has its own home. For such a sysadmin, a pile of symlinks would
be fair game for reorganisation.

--
Andrew Sullivan | ajs(at)crankycanuck(dot)ca

From:	Andreas Pflug <pgadmin(at)pse-consulting(dot)de>
To:
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-08 13:56:46
Message-ID:	404C7B9E.80303@pse-consulting.de
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Andrew Sullivan wrote:

>eorganise the disks", assuming that the database
>just has its own home. For such a sysadmin, a pile of symlinks would
>be fair game for reorganisation.
>
>

Please take into consideration that symlinks might be every day work for
*nix admins, but for win admins it's very uncommon. Additionally, win
admins are accustomed to gui tools, and many of them will stumble if
forced to use a command line. For worse, junctions are not
distinguishable in explorer; only the shell's dir command knows about
junctions. This obfuscation makes junctions quite a bad choice for admin
purposes.

IMHO there are only two viable options:
- no tablespaces for win32, i.e. recommend *ix for bigger installations
- a different tablespace storage approach., e.g. simple desktop links
(or alike) redirecting to a directory.

Regards,
Andreas

From:	Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us>
To:	Andrew Sullivan <ajs(at)crankycanuck(dot)ca>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-03-08 17:30:41
Message-ID:	200403081730.i28HUfa20453@candle.pha.pa.us
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Andrew Sullivan wrote:
> On Mon, Mar 08, 2004 at 02:07:35AM +0200, Marko Karppinen wrote:
> > One thing to keep in mind is that system administrators don't see
> > symlinks as being informational -- they see them as the actual UI
> > for the redirection in question. So their expectation is that they'll
> > be able to move the actual directory around at will (as long as they
> > update the symlink to match).
>
> This is a good point. It's worth keeping in mind, too, that in large
> shops, the DBAs and the sysadmins often are in separate departments
> with separate management, precisely because the database system has
> traditionally been somewhat divorced from the OS (as an aside, I
> suspect that this sort of separation is part of the reason for the
> popularity of raw filesystems among DBAs. Even if they didn't
> provide better speed, it's just preferable not to have to involve
> another department). System administrators in such places have been
> known to decide to "reorganise the disks", assuming that the database
> just has its own home. For such a sysadmin, a pile of symlinks would
> be fair game for reorganisation.

Agreed. I think the idea is to use lstat to query the symlink, rather
than storing that information in the database. My idea was to create an
lstat server-side function that could be used by pg_dump and friends.

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-06-09 15:39:10
Message-ID:	87y8mwvizl.fsf@stark.xeocode.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org> writes:

That's a separate orthogonal feature called "partitioned tables". There's some
amount of resistance to the idea amongst postgres people, and there's nobody
who has spoken up interested in implementing it, but there's also lots of
interest from users. A good patch would probably go a long way to convincing
people :)

Table spaces are being able to store different tables in different physical
locations on disk. A first version of this has actually been implemented for
7.5 using symlinks.

Partitioned tables and tablespaces do indeed have a certain amount of synergy.
But even in a single tablespace your example makes sense.

> With the rule system and two underlying tables one could make it work by
> hand I think.

The rule system could be used to do this, but there was some discussion of
using inherited tables to handle it. However neither handles the really hard
part of detecting queries that use only a part of the table and taking that
into account in generating the plan.

--
greg

From:	pgsql(at)mohawksoft(dot)com
To:	"Andrew Sullivan" <ajs(at)crankycanuck(dot)ca>
Cc:	pgsql-hackers(at)postgresql(dot)org
Subject:	Re: Tablespaces
Date:	2004-06-11 16:28:29
Message-ID:	16928.24.91.171.78.1086971309.squirrel@mail.mohawksoft.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Lists:	pgsql-hackers pgsql-hackers-win32

> I don't think we want features for their own sake, though, and I'm
> not convinced that raw filesystems are actually useful. Course, it's
> not my itch, and PostgreSQL _is_ free software.
>

I agree that raw file systems are seldom useful with one caveat, more
advanced file systems are sometimes detrimental to database access.

Conceptually, a file system and a database are redundant, both are doing
their best to preserve data integrity. This is especially true with
journalling file systems. Not to mention technologies like reiserfs which
attempts to do sub-block allocation.

What I think would go a long way to improving database performance on
non-raw partitions would be a simplified file system -- SFS anyone? The
simplified file system would not track access time. It would not overly
try to manage disk space. The target applications are going to allocate
disk space on a block level, rather than quibble about 4K here or 8K here,
have a user defined standard allocation unit of 64K, 128K, or so on.
Reduction on allocation overhead also reduces meta-data updating I/O. I
can almost imagine 32BIT FAT with large clusers, only with real inodes.
The idea would be that a database, like PostgreSQL, would be managing the
data not the file system. The file systems job would only to be the most
minimalist interface to the OS.

The benefts would be awesome, near-raw partition access and standard OS
tools for maintainence.