Re: [GENERAL] One database vs. hundreds?

From: "Kynn Jones" <kynnjo(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: One database vs. hundreds?
Date: 2007-08-28 12:08:36
Message-ID: c2350ba40708280508g138ced4bwb4135043be3d474c@mail.gmail.com
Lists: pgsql-general pgsql-hackers

I'm hoping to get some advice on a design question I'm grappling with.
I have a database now that in many respects may be regarded as a
collection of a few hundred much smaller "parallel databases", all
having the same schema. What I mean by this is that, as far as the
intended use of this particular system is concerned, there are no meaningful queries
whose results would include information from more than one of these
parallel component databases. Furthermore, one could delete all the
records of any one of these parallel components without affecting the
referential integrity of the rest of the database.

Therefore, both for performance and maintenance reasons, the idea of
splitting this database into its components looks very attractive.
This would result in a system with hundreds of small databases (and in
the future possibly reaching into the low thousands). I don't have
experience with such a situation, and I'm wondering if there are
issues I should be concerned about.

Alternatively, maybe there are techniques to achieve the benefits of
this split without actually carrying it out. The two benefits I see
are in the areas of performance and maintenance. As for performance,
I assume (naively, I'm sure) that searches will be faster in the
individual component databases, simply because it's a search among
fewer pieces of information. And for maintenance, I think the split
would make the system more robust during database updates, because
only a small component would be updated at a time, and the rest of the
system would be completely insulated from this.

I'd very much appreciate your thoughts on these issues.

I imagine I'm not the first person to confront this kind of design
choice. Does it have a standard name that I could use in a Google
search?

TIA!

kj


From: Kamil Srot <kamil(dot)srot(at)nlogy(dot)com>
To: Kynn Jones <kynnjo(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: One database vs. hundreds?
Date: 2007-08-28 12:23:00
Message-ID: 46D413A4.2060100@nlogy.com
Lists: pgsql-general pgsql-hackers


Kynn Jones wrote:
> I'm hoping to get some advice on a design question I'm grappling with.
> I have a database now that in many respects may be regarded as an
> collection of a few hundred much smaller "parallel databases", all
> having the same schema. What I mean by this is that, as far as the
> intended use of this particular system there are no meaningful queries
> whose results would include information from more than one of these
> parallel component databases. Furthermore, one could delete all the
> records of any one of these parallel components without affecting the
> referential integrity of the rest of the database.
>
> Therefore, both for performance and maintenance reasons, the idea of
> splitting this database into its components looks very attractive.
> This would result in a system with hundreds of small databases (and in
> the future possibly reaching into the low thousands). I don't have
> experience with such a situation, and I'm wondering if there are
> issues I should be concerned about.
>
> Alternatively, maybe there are techniques to achieve the benefits of
> this split without actually carrying it out. The two benefits I see
> are in the areas of performance and maintenance. As for performance,
> I assume (naively, I'm sure) that searches will be faster in the
> individual component databases, simply because it's a search among
> fewer pieces of information. And for maintenance, I think the split
> would make the system more robust during database updates, because
> only a small component would be updated at a time, and the rest of the
> system would completely insulated from this.
>
> I'd very much appreciate your thoughts on these issues.
>
> I imagine I'm not the first person to confront this kind of design
> choice. Does it have a standard name that I could use in a Google
> search?
>
>
>
Hi Kynn.

I don't have experience with this type of application, but we use pgsql
partitioning for other reasons, and it has some of the features you want
(data separation, query performance, ...).
It may be worth reading:
http://www.postgresql.org/docs/8.2/interactive/ddl-partitioning.html
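For what it's worth, a minimal sketch of what that looks like in 8.2
(inheritance plus CHECK constraints; all the names below are just placeholders):

  -- parent table that queries are written against
  CREATE TABLE item (
      id      serial,
      set_id  integer NOT NULL,
      payload text
  );

  -- one child table per "component", constrained to its own set_id
  CREATE TABLE item_set1 (CHECK (set_id = 1)) INHERITS (item);
  CREATE TABLE item_set2 (CHECK (set_id = 2)) INHERITS (item);

  -- with constraint exclusion enabled, a query that filters on set_id
  -- only touches the matching child table
  SET constraint_exclusion = on;
  SELECT * FROM item WHERE set_id = 1;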

Regards,
--
Kamil


From: "A(dot) Kretschmer" <andreas(dot)kretschmer(at)schollglas(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: One database vs. hundreds?
Date: 2007-08-28 12:32:32
Message-ID: 20070828123232.GE10490@a-kretschmer.de
Lists: pgsql-general pgsql-hackers

On Tue, 28.08.2007, at 14:23:00 +0200, Kamil Srot wrote the following:
>
> Kynn Jones wrote:
> >I'm hoping to get some advice on a design question I'm grappling with.
> > I have a database now that in many respects may be regarded as an
> >collection of a few hundred much smaller "parallel databases", all
> >having the same schema. What I mean by this is that, as far as the
> >intended use of this particular system there are no meaningful queries
> >whose results would include information from more than one of these
>
> I don't have experience in this type of application, but we use pgsql
> partitioning for other reasons
> and it has some of the features you want (data separation, query
> performance, ...).
> It can be worth reading:
> http://www.postgresql.org/docs/8.2/interactive/ddl-partitioning.html

He doesn't need table partitioning; that is a different thing.

Andreas
--
Andreas Kretschmer
Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net


From: "A(dot) Kretschmer" <andreas(dot)kretschmer(at)schollglas(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: One database vs. hundreds?
Date: 2007-08-28 12:37:32
Message-ID: 20070828123732.GF10490@a-kretschmer.de
Lists: pgsql-general pgsql-hackers

On Tue, 28.08.2007, at 8:08:36 -0400, Kynn Jones wrote the following:
> I'm hoping to get some advice on a design question I'm grappling with.
> I have a database now that in many respects may be regarded as an
> collection of a few hundred much smaller "parallel databases", all
> having the same schema. What I mean by this is that, as far as the
> intended use of this particular system there are no meaningful queries
> whose results would include information from more than one of these
> parallel component databases. Furthermore, one could delete all the

Maybe different schemas, one schema for each of the "parallel databases",
can help you, together with different rights for the users (a minimal,
untested sketch follows below).

Why one database with many schemas?

I suppose you have objects to share with all users, for instance:
- procedural languages
- stored procedures
- maybe shared data
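Something like this (the schema, table and role names below are only
placeholders):

  -- one schema per "parallel database", identical table definitions in each
  CREATE SCHEMA component_a;
  CREATE TABLE component_a.element (
      id   serial PRIMARY KEY,
      name text NOT NULL
  );

  CREATE SCHEMA component_b;
  CREATE TABLE component_b.element (
      id   serial PRIMARY KEY,
      name text NOT NULL
  );

  -- shared objects (languages, procedures, lookup data) stay in public

  -- per session, point a user at one component:
  SET search_path TO component_a, public;
  SELECT * FROM element;          -- resolves to component_a.element

  -- and per-component rights (assuming a role some_user exists):
  GRANT USAGE ON SCHEMA component_a TO some_user;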

Andreas
--
Andreas Kretschmer
Kontakt: Heynitz: 035242/47150, D1: 0160/7141639 (mehr: -> Header)
GnuPG-ID: 0x3FFF606C, privat 0x7F4584DA http://wwwkeys.de.pgp.net


From: Kevin Kempter <kevin(at)kevinkempterllc(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: One database vs. hundreds?
Date: 2007-08-28 12:47:29
Message-ID: 200708280647.29125.kevin@kevinkempterllc.com
Lists: pgsql-general pgsql-hackers

On Tuesday 28 August 2007 06:32:32 A. Kretschmer wrote:
> On Tue, 28.08.2007, at 14:23:00 +0200, Kamil Srot wrote the following:
> > Kynn Jones wrote:
> > >I'm hoping to get some advice on a design question I'm grappling with.
> > > I have a database now that in many respects may be regarded as an
> > >collection of a few hundred much smaller "parallel databases", all
> > >having the same schema. What I mean by this is that, as far as the
> > >intended use of this particular system there are no meaningful queries
> > >whose results would include information from more than one of these
> >
> > I don't have experience in this type of application, but we use pgsql
> > partitioning for other reasons
> > and it has some of the features you want (data separation, query
> > performance, ...).
> > It can be worth reading:
> > http://www.postgresql.org/docs/8.2/interactive/ddl-partitioning.html
>
> He don't need table partitioning, this is a different thing.
>
>
> Andreas

I find that creating multiple schemas is often better than creating multiple
databases, since in the remote chance you might want to query across the
schemas it's a no-brainer, whereas queries across databases are a bit more
difficult.
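For example (the schema and table names here are made up):

  -- across schemas, a cross-component query is plain SQL:
  SELECT 'a' AS component, count(*) FROM component_a.element
  UNION ALL
  SELECT 'b', count(*) FROM component_b.element;

  -- across databases you would need something like contrib/dblink instead.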

That said, AFAIK databases vs. schemas, outside of the query issue I mentioned
above, are merely an organizational mechanism. Unless you're talking about
multiple clusters it's kind of the old six vs. half-a-dozen argument - meaning
it doesn't really matter for performance or functionality from a strictly
Postgres perspective. The decision then becomes an application architecture
question, such as how important it is to isolate these databases/schemas from
each other and whether there is a foreseeable need to query across them in the
future, etc.

Hope this helps. My vote would be for schemas unless there's a hard
requirement for isolation - it leaves your options open and more flexible
for the future.

/Kevin


From: btober(at)ct(dot)metrocast(dot)net
To: Kamil Srot <kamil(dot)srot(at)nlogy(dot)com>
Cc: Kynn Jones <kynnjo(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)postgreSQL(dot)org
Subject: Re: [GENERAL] One database vs. hundreds?
Date: 2007-08-28 12:57:11
Message-ID: 46D41BA7.1070308@ct.metrocast.net
Lists: pgsql-general pgsql-hackers

(On the GENERAL list) Kamil Srot wrote:
>
> Kynn Jones wrote:
>> I'm hoping to get some advice on a design question ...
>
>
> ...we use pgsql partitioning for other reasons
> and it has some of the features you want (data separation, query
> performance, ...).
> It can be worth reading:
> http://www.postgresql.org/docs/8.2/interactive/ddl-partitioning.html

I'm not sure the tip is really what the OP is going to need -- I think
he has a different problem in mind than the one that addresses -- but after
a quick R of TFM, all I can say is HOLY CR** THAT IS COOL! PostgreSQL was
amazing when I first started using it at 7.2, and I'm continually
astounded to learn of new features and techniques made available by the
continuing work of the development team.


From: "Kynn Jones" <kynnjo(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: One database vs. hundreds?
Date: 2007-08-28 14:36:22
Message-ID: c2350ba40708280736tde37dd2qb40c412c3a216a94@mail.gmail.com
Lists: pgsql-general pgsql-hackers

Thank you very much for your replies. Given the differences in the
opinions expressed, I thought I would describe the database briefly.

The purpose of the database is basically translation of terms.
Imagine a collection of disjoint sets A, B, C, ...

Now imagine that for each element of a set multiple *names* exist,
although even within each set, the number of names varies from one
element to the next. To complicate matters further, imagine that
there are several *authorities* X, Y, Z... on the subject of naming
these elements, and they don't always agree.

So currently, the relevant portion of the schema is something like

SET ... one record per available set
AUTHORITY ... one record per naming authority
ELEMENT ... which refers to SET and has one record per element in any
of the sets
NAME ... which refers to ELEMENT and to AUTHORITY and whose records
represent the various names assigned to each element by the various
authorities.
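In rough DDL terms (the column names here are just placeholders, and SET is
renamed slightly to sidestep the keyword):

  CREATE TABLE set_ (
      set_id serial PRIMARY KEY,
      label  text NOT NULL
  );

  CREATE TABLE authority (
      authority_id serial PRIMARY KEY,
      label        text NOT NULL
  );

  CREATE TABLE element (
      element_id serial PRIMARY KEY,
      set_id     integer NOT NULL REFERENCES set_
  );

  CREATE TABLE name (
      name_id      serial PRIMARY KEY,
      element_id   integer NOT NULL REFERENCES element,
      authority_id integer NOT NULL REFERENCES authority,
      spelling     text NOT NULL
  );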

The largest tables, by far, are ELEMENT and NAME. It is clear from
this description that ELEMENT can be partitioned according to the set
each element belongs to, since these sets are disjoint (for the moment
I'm using the word "partitioned" in its general sense, not in the
sense of Pg's table partitioning that Kamil proposed). Similarly NAME
can be partitioned according to the authorities associated with each
naming (even when two authorities agree on a name for a given element,
NAME includes separate records to represent this fact). Furthermore,
each one of these authority-based partitions can be in turn
partitioned according to the set that the referred-to element belongs
to.

Therefore instead of having a single database we could have many
databases, one for each combination of set and authority, with the
simplified schema

ELEMENT
NAME ... refers to ELEMENT

From a semantic point of view, all the kinds of queries that the
system is designed to support would be satisfied by this design.

Given this it seems to me that Pg's table partitioning does indeed fit
the bill after all. I like the idea of being able to bulk-update
entire chunks of the database by creating a temporary partition,
testing it, deleting the old one, and renaming the temporary partition
to replace it.
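With 8.2-style inheritance partitioning, and assuming (say) one child table
of NAME per authority, that swap might look roughly like this (the table
names and the check condition are placeholders):

  BEGIN;
  -- build and validate the replacement partition off to the side
  CREATE TABLE name_auth42_new (LIKE name_auth42);
  ALTER TABLE name_auth42_new ADD CHECK (authority_id = 42);
  -- ... bulk-load and test name_auth42_new here ...

  -- then swap it in
  ALTER TABLE name_auth42 NO INHERIT name;
  ALTER TABLE name_auth42_new INHERIT name;
  DROP TABLE name_auth42;
  ALTER TABLE name_auth42_new RENAME TO name_auth42;
  COMMIT;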

Furthermore, it seems like this arrangement, contrary to my original
multiple-DB idea, does not preclude the occasional global query
spanning multiple authorities or sets (though these queries would be
useful mostly for administrative purposes).

I join btober in my amazement and admiration at the overall coolness
of PostgreSQL. I'm sending massive props to the developers right
now...

Many thanks!!!

kj


From: Shane Ambler <pgsql(at)Sheeky(dot)Biz>
To: Kynn Jones <kynnjo(at)gmail(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: One database vs. hundreds?
Date: 2007-08-28 15:44:14
Message-ID: 46D442CE.7020102@Sheeky.Biz
Lists: pgsql-general pgsql-hackers

Kynn Jones wrote:
> I'm hoping to get some advice on a design question I'm grappling with.
> I have a database now that in many respects may be regarded as an
> collection of a few hundred much smaller "parallel databases", all
> having the same schema. What I mean by this is that, as far as the
> intended use of this particular system there are no meaningful queries
> whose results would include information from more than one of these
> parallel component databases. Furthermore, one could delete all the
> records of any one of these parallel components without affecting the
> referential integrity of the rest of the database.
>
> Therefore, both for performance and maintenance reasons, the idea of
> splitting this database into its components looks very attractive.
> This would result in a system with hundreds of small databases (and in
> the future possibly reaching into the low thousands). I don't have
> experience with such a situation, and I'm wondering if there are
> issues I should be concerned about.
>
> Alternatively, maybe there are techniques to achieve the benefits of
> this split without actually carrying it out. The two benefits I see
> are in the areas of performance and maintenance. As for performance,
> I assume (naively, I'm sure) that searches will be faster in the
> individual component databases, simply because it's a search among
> fewer pieces of information. And for maintenance, I think the split
> would make the system more robust during database updates, because
> only a small component would be updated at a time, and the rest of the
> system would completely insulated from this.
>
> I'd very much appreciate your thoughts on these issues.
>
> I imagine I'm not the first person to confront this kind of design
> choice. Does it have a standard name that I could use in a Google
> search?
>

I see four ways you can go with this, and the choice will be the one that
makes more sense for the way you see yourself accessing the data; only one
of them affects performance in the way you mention.

1. is to have each set of tables and data in a separate database.
2. is to have each set of tables duplicated in a different schema name.
3. is to have it all in one set of tables, with a column to identify
which set each row belongs to.
4. is to use option 3 with ddl-partitioning to achieve data separation,
removing the performance cost of having it all in one table.

The first leaves you with complete data separation; you log in
differently to access each data set.
If you're using psql then you would use a \c (connect) command to look at
different data. Other clients may mean you need to disconnect and reconnect.

The second allows you to use the same login, and you can either use schema
qualification on the table names to access the data set you want, or change
the default schema to switch between data sets. The choice will depend on
your client. You can use schema-qualified names, with your client inserting
the selected schema when generating the SQL. Or you may use
SET search_path TO myschema; and then SELECT without schema-qualified names,
leaving your SQL as it is.

The last two are similar, with the fourth using the DDL setup to separate
the data into separate tables for you, so you will still have the same
(or very close) performance as if each set were separated by schema or db.
Here you would have to include datasetidcolumn=x within each SQL statement
(see the example below). Again this may be inserted by your client as it
generates the SQL.
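For instance (the column and table names are only illustrative):

  -- options 3/4: every statement carries the data-set filter
  SELECT * FROM element WHERE datasetid = 42;

  -- under option 4, with constraint exclusion on, that filter means only
  -- the matching child table gets scanned.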

Whether the tables are separated by database name, schema name or
ddl-partitioning, each will be a separate table using its own indexes etc.

As far as backup/restore goes, you won't find much difference with any
option - 3 will mean they are all together in one backup file, the others
mean adding a line to your backup script to dump each data set to its own
file. If you just want to pg_dumpall to one file then it won't matter
either way.
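For example (the database and schema names are placeholders):

  # schema-per-component: dump one component's schema from the shared db
  pg_dump -n component_a mydb > component_a.sql

  # database-per-component: dump just that database
  pg_dump component_a_db > component_a.sql

  # or everything in one file
  pg_dumpall > everything.sql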

Apart from the first the other options allow you to share data between
data sets - a postcode list, state names, country names...

The choice may also be affected by your knowledge of SQL. The
ddl-partitioning may make your table structure more complex and harder
for you to reason about (mainly when defining a new data set). Yet adding
a column will mean adding it in one place (the parent table) rather than
repeating it for each db or schema.

As you mention increasing to a few thousand databases - what sort of server
load are you expecting? Would moving a few hundred of the databases to
another server be easier than setting up load balancing/replication when
things get too busy for your current server?

Of course, if you ever want to report on stats comparing all the data sets,
then the last two will be the easiest.
You may change your mind there later - even if each data set is a
client's financial records - what if you want to ask which clients spend
too much money on office supplies? Which clients will be affected by the
new tax changes? Which clients have the largest profit margins? Or list the
top ten clients by turnover.
If they all relate to different websites, which is the least
productive and not worth continuing?

I guess I rattled on a bit there - but that should cover most of the
points you will need to consider to make the choice. Even though you say
you don't need the data all together, think forward to what may happen in
the future, and how much work it will be if you do change your mind.

--

Shane Ambler
pgSQL(at)Sheeky(dot)Biz

Get Sheeky @ http://Sheeky.Biz