From: Joshua Daniel Franklin <joshuadfranklin(at)yahoo(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: Postgres performance slowly gets worse over a month
Date: 2002-07-26 13:35:36
Message-ID: 20020726133536.63502.qmail@web20009.mail.yahoo.com
Lists: pgsql-admin

> I played with this tonight writing a small insert/update routine and
> frequent vacuums. Here is what I came up with ( (PostgreSQL) 7.2.1 )
>
This is some great info, thanks.

> In addition, max_fsm_pages has an impact on how many pages will be
> available to be marked as re-usable. If you have a huge table and
> changes are impacting more than the default 10,000 pages this is set to,
> you will want to bump this number up. My problem was I saw my UnUsed
> tuples always growing and not being re-used until I bumped this value
> up. As I watched the vacuum verbose output each run, I noticed more
> than 10k pages were in fact changing between vacuums.
>
This has made me think about something we've been doing. We've got one
db that is used basically read-only; every day ~15000 records are added,
but very rarely are any deleted. What we've been doing is just letting it
sit until it gets close to too big for the filesystem, then lopping off
the earliest 6 months worth of records. The question is, is it best
to do this then set the max_fsm_pages to a huge number and vacuum full?
Or should I change it so scripts remove the oldest day and vacuum before
adding the next days?

Or just rebuild the db every time. :)



From: "Michael G(dot) Martin" <michael(at)vpmonline(dot)com>
To: Joshua Daniel Franklin <joshuadfranklin(at)yahoo(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Postgres performance slowly gets worse over a month
Date: 2002-07-26 13:50:28
Message-ID: 3D4153A4.3080604@vpmonline.com
Lists: pgsql-admin

I believe the more frequently you vacuum, the faster it will go, so that
may be the driving factor in deciding. Personally, each day, I'd add
the new tuples then remove the no-longer needed tuples, make sure
max_fsm_pages is large enough to handle all the pages removed in the
largest table, then run a vacuum analyze on the table or entire
database. Run it each night and it will be nice and fast and you
shouldn't ever need to worry about locking the entire table with a
vacuum full or spend time to re-create the table and indices.

That's what I do, which I think is the most automated, maintenance-free
solution. I currently run a lazy vacuum analyze each night after making
my large changes. My tables don't change enough during the day to
require mid-day vacuums.
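
For what it's worth, the whole nightly routine can be as small as a cron
entry like this (database and table names here are just placeholders):

    # lazy (non-FULL) VACUUM ANALYZE every night at 03:00,
    # after the day's inserts and deletes are done
    0 3 * * *  psql -d mydb -c 'VACUUM ANALYZE big_table'

A bare "VACUUM ANALYZE" with no table name covers the whole database instead.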

--Michael

Joshua Daniel Franklin wrote:

>>I played with this tonight writing a small insert/update routine and
>>frequent vacuums. Here is what I came up with ( (PostgreSQL) 7.2.1 )
>>
>>
>>
>This is some great info, thanks.
>
>
>
>>In addition, max_fsm_pages has an impact on how many pages will be
>>available to be marked as re-usable. If you have a huge table and
>>changes are impacting more than the default 10,000 pages this is set to,
>>you will want to bump this number up. My problem was I saw my UnUsed
>>tuples always growing and not being re-used until I bumped this value
>>up. As I watched the vacuum verbose output each run, I noticed more
>>than 10k pages were in fact changing between vacuums.
>>
>>
>>
>This has made me think about something we've been doing. We've got one
>db that is used basically read-only; every day ~15000 records are added,
>but very rarely are any deleted. What we've been doing is just letting it
>sit until it gets close to too big for the filesystem, then lopping off
>the earliest 6 months worth of records. The question is, is it best
>to do this then set the max_fsm_pages to a huge number and vacuum full?
>Or should I change it so scripts remove the oldest day and vacuum before
>adding the next days?
>
>Or just rebuild the db every time. :)
>


From: Joshua Daniel Franklin <joshuadfranklin(at)yahoo(dot)com>
To: "Michael G(dot) Martin" <michael(at)vpmonline(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Postgres performance slowly gets worse over a month
Date: 2002-07-26 14:45:37
Message-ID: 20020726144537.92735.qmail@web20002.mail.yahoo.com
Lists: pgsql-admin

Perhaps I wasn't clear. There really aren't any (daily) "no-longer needed
tuples", just added ones. I am under the impression that vacuum is just for
freeing up tuples to be re-used, so the only time it needs to be run is after
the 6-monthly tuple massacre, at which time I would also need to set
max_fsm_pages to a huge number.

--- "Michael G. Martin" <michael(at)vpmonline(dot)com> wrote:
> I believe the more frequently you vacuum, the faster it will go, so that
> may be the driving factor in deciding. Personally, each day, I'd add
> the new tuples then remove the no-longer needed tuples, make sure
> max_fsm_pages is large enough to handle all the pages removed in the
> largest table, then run a vacuum analyze on the table or entire
> database. Run it each night and it will be nice and fast and you
> shouldn't ever need to worry about locking the entire table with a
> vacuum full or spend time to re-create the table and indices.
>
> That's what I do, which I think is the most automated, maintenance-free
> solution. I currently run a lazy vacuum analyze each night after making
> my large changes. My tables don't change enough during the day to
> require mid-day vacuums.
>
> --Michael
>
> Joshua Daniel Franklin wrote:
> >
> >>In addition, max_fsm_pages has an impact on how many pages will be
> >>available to be marked as re-usable. If you have a huge table and
> >>changes are impacting more than the default 10,000 pages this is set to,
> >>you will want to bump this number up. My problem was I saw my UnUsed
> >>tuples always growing and not being re-used until I bumped this value
> >>up. As I watched the vacuum verbose output each run, I noticed more
> >>than 10k pages were in fact changing between vacuums.
> >>
> >This has made me think about something we've been doing. We've got one
> >db that is used basically read-only; every day ~15000 records are added,
> >but very rarely are any deleted. What we've been doing is just letting it
> >sit until it gets close to too big for the filesystem, then lopping off
> >the earliest 6 months worth of records. The question is, is it best
> >to do this then set the max_fsm_pages to a huge number and vacuum full?
> >Or should I change it so scripts remove the oldest day and vacuum before
> >adding the next days?
> >
> >Or just rebuild the db every time. :)
> >



From: "Michael G(dot) Martin" <michael(at)vpmonline(dot)com>
To: Joshua Daniel Franklin <joshuadfranklin(at)yahoo(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Postgres performance slowly gets worse over a month
Date: 2002-07-26 15:10:55
Message-ID: 3D41667F.40208@vpmonline.com
Lists: pgsql-admin

Yea, you're correct. I think you'll be able to avoid the vacuum full
and re-use the tuples by making sure max_fsm_pages is large enough to
handle the number of pages changed by the 6-month massacre. After your
vacuum, note the unused tuples and page size of the table. Then, as you
incrementally add new stuff over the next 6 months, you should see the
unused tuples decrease while the page size remains fairly fixed. The
only other thing you may want to do more frequently is analyze if the
new tuples might change some statistics during the 6-month interval.
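
A rough sketch of that check, with a made-up table name -- the numbers to
watch are the page and UnUsed-tuple figures VACUUM VERBOSE prints:

    # run the lazy vacuum verbosely and pull out the page / tuple lines
    psql -d mydb -c 'VACUUM VERBOSE ANALYZE big_table' 2>&1 | grep -i -E 'pages|tup'

    # if more pages change between vacuums than max_fsm_pages covers,
    # raise it in postgresql.conf and restart the postmaster, e.g.
    #   max_fsm_pages = 100000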

--Michael

Joshua Daniel Franklin wrote:

>Perhaps I wasn't clear. There really aren't any (daily) "no-longer needed
>tuples", just added ones. I am under the impression that vacuum is just for
>freeing up tuples to be re-used, so the only time it needs to be run is after
>the 6-monthly tuple massacre, at which time I would also need to set
>max_fsm_pages to a huge number.
>
>--- "Michael G. Martin" <michael(at)vpmonline(dot)com> wrote:
>
>
>>I believe the more frequently you vacuum, the faster it will go, so that
>>may be the driving factor in deciding. Personally, each day, I'd add
>>the new tuples then remove the no-longer needed tuples, make sure
>>max_fsm_pages is large enough to handle all the pages removed in the
>>largest table, then run a vacuum analyze on the table or entire
>>database. Run it each night and it will be nice and fast and you
>>shouldn't ever need to worry about locking the entire table with a
>>vacuum full or spend time to re-create the table and indices.
>>
>>That's what I do, which I think is the most automated, maintenance-free
>>solution. I currently run a lazy vacuum analyze each night after making
>>my large changes. My tables don't change enough during the day to
>>require mid-day vacuums.
>>
>>--Michael
>>
>>Joshua Daniel Franklin wrote:
>>
>>
>>>>In addition, max_fsm_pages has an impact on how many pages will be
>>>>available to be marked as re-usable. If you have a huge table and
>>>>changes are impacting more than the default 10,000 pages this is set to,
>>>>you will want to bump this number up. My problem was I saw my UnUsed
>>>>tuples always growing and not being re-used until I bumped this value
>>>>up. As I watched the vacuum verbose output each run, I noticed more
>>>>than 10k pages were in fact changing between vacuums.
>>>>
>>>>
>>>>
>>>This has made me think about something we've been doing. We've got one
>>>db that is used basically read-only; every day ~15000 records are added,
>>>but very rarely are any deleted. What we've been doing is just letting it
>>>sit until it gets close to too big for the filesystem, then lopping off
>>>the earliest 6 months worth of records. The question is, is it best
>>>to do this then set the max_fsm_pages to a huge number and vacuum full?
>>>Or should I change it so scripts remove the oldest day and vacuum before
>>>adding the next days?
>>>
>>>Or just rebuild the db every time. :)
>>>
>>>
>>>
>
>


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Joshua Daniel Franklin <joshuadfranklin(at)yahoo(dot)com>
Cc: "Michael G(dot) Martin" <michael(at)vpmonline(dot)com>, pgsql-admin(at)postgresql(dot)org
Subject: Re: Postgres performance slowly gets worse over a month
Date: 2002-07-26 15:35:38
Message-ID: 25769.1027697738@sss.pgh.pa.us
Lists: pgsql-admin

Joshua Daniel Franklin <joshuadfranklin(at)yahoo(dot)com> writes:
> Perhaps I wasn't clear. There really aren't any (daily) "no-longer needed
> tuples", just added ones. I am under the impression that vacuum is just for
> freeing up tuples to be re-used, so the only time it needs to be run is after
> the 6-monthly tuple massacre, at which time I would also need to set
> max_fsm_pages to a huge number.

If you do VACUUM FULL after each "tuple massacre" (which you'd better,
since the point AFAICT is to cut the total disk space used by the file)
then there's not really any need for bumping up max_fsm_pages. The
post-vacuum-full state of the table isn't going to have a whole lot
of embedded free space ...
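
(For concreteness, a sketch of that sequence -- table and column names are
invented for illustration:

    # the six-monthly massacre, then the space-reclaiming full vacuum
    psql -d mydb -c "DELETE FROM big_table WHERE logdate < '2002-02-01'"
    psql -d mydb -c 'VACUUM FULL ANALYZE big_table'

The VACUUM FULL holds an exclusive lock on the table while it runs.)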

regards, tom lane


From: "Michael G(dot) Martin" <michael(at)vpmonline(dot)com>
To: Joshua Daniel Franklin <joshuadfranklin(at)yahoo(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Postgres performance slowly gets worse over a month
Date: 2002-07-26 15:38:41
Message-ID: 3D416D01.7000909@vpmonline.com
Lists: pgsql-admin

I look at it like this:

Your database takes up space X after a full vacuum and is ready for the
next 6 months of inserts. Then, over the next 6 months it grows by
space Y, so it now occupies X+Y space.

You then remove a bunch of old tuples. Space is still X+Y. You now
have 2 basic options:

1. Run a vacuum full -- this locks the entire table, and de-fragments
all unused space, so space is now back to X. Table will grow incrementally
by Y over the next 6 months again.
2. Run a lazy vacuum-- no lock, no de-fragment, space is still X+Y.
Assuming max_fsm_pages was large enough to hold all the changed pages,
over the next 6 months, the space remains fixed at about X+Y. You are
now re-using the unused table space.

Either solution will work. If you really want to cut disk space, choose
1. If you want to keep the space at about its optimal size and avoid any
downtime, choose 2.
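
One way to watch which regime you are in is the table's page count in
pg_class, which vacuum/analyze keeps current (table name is just an example):

    # relpages * 8k is roughly the on-disk size of the table itself
    # (indexes have their own pg_class rows)
    psql -d mydb -c "SELECT relname, relpages FROM pg_class WHERE relname = 'big_table'"

Under option 2 that number should climb to about X+Y and then stay flat.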

--Michael

Michael G. Martin wrote:

> Yea, you're correct. I think you'll be able to avoid the vacuum full
> and re-use the tuples by making sure max_fsm_pages is large enough to
> handle the number of pages changed by the 6-month massacre. After your
> vacuum, note the unused tuples and page size of the table. Then, as
> you incrementally add new stuff over the next 6 months, you should see
> the unused tuples decrease while the page size remains fairly fixed.
> The only other thing you may want to do more frequently is analyze if
> the new tuples might change some statistics during the 6-month interval.
>
> --Michael
>
> Joshua Daniel Franklin wrote:
>
>>Perhaps I wasn't clear. There really aren't any (daily) "no-longer needed
>>tuples", just added ones. I am under the impression that vacuum is just for
>>freeing up tuples to be re-used, so the only time it needs to be run is after
>>the 6-monthly tuple massacre, at which time I would also need to set
>>max_fsm_pages to a huge number.
>>
>>--- "Michael G. Martin" <michael(at)vpmonline(dot)com> wrote:
>>
>>
>>>I believe the more frequently you vacuum, the faster it will go, so that
>>>may be the driving factor in deciding. Personally, each day, I'd add
>>>the new tuples then remove the no-longer needed tuples, make sure
>>>max_fsm_pages is large enough to handle all the pages removed in the
>>>largest table, then run a vacuum analyze on the table or entire
>>>database. Run it each night and it will be nice and fast and you
>>>shouldn't ever need to worry about locking the entire table with a
>>>vacuum full or spend time to re-create the table and indices.
>>>
>>>That's what I do, which I think is the most automated, maintenance-free
>>>solution. I currently run a lazy vacuum analyze each night after making
>>>my large changes. My tables don't change enough during the day to
>>>require mid-day vacuums.
>>>
>>>--Michael
>>>
>>>Joshua Daniel Franklin wrote:
>>>
>>>
>>>>>In addition, max_fsm_pages has an impact on how many pages will be
>>>>>available to be marked as re-usable. If you have a huge table and
>>>>>changes are impacting more than the default 10,000 pages this is set to,
>>>>>you will want to bump this number up. My problem was I saw my UnUsed
>>>>>tuples always growing and not being re-used until I bumped this value
>>>>>up. As I watched the vacuum verbose output each run, I noticed more
>>>>>than 10k pages were in fact changing between vacuums.
>>>>>
>>>>>
>>>>>
>>>>This has made me think about something we've been doing. We've got one
>>>>db that is used basically read-only; every day ~15000 records are added,
>>>>but very rarely are any deleted. What we've been doing is just letting it
>>>>sit until it gets close to too big for the filesystem, then lopping off
>>>>the earliest 6 months worth of records. The question is, is it best
>>>>to do this then set the max_fsm_pages to a huge number and vacuum full?
>>>>Or should I change it so scripts remove the oldest day and vacuum before
>>>>adding the next days?
>>>>
>>>>Or just rebuild the db every time. :)
>>>>
>>>>
>>>>
>>
>>
>


From: Joshua Daniel Franklin <joshuadfranklin(at)yahoo(dot)com>
To: "Michael G(dot) Martin" <michael(at)vpmonline(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Postgres performance slowly gets worse over a month
Date: 2002-07-26 16:11:25
Message-ID: 20020726161125.95391.qmail@web20005.mail.yahoo.com
Lists: pgsql-admin

Thanks, this is exactly what I was thinking.

--- "Michael G. Martin" <michael(at)vpmonline(dot)com> wrote:
> You then remove a bunch of old tuples. Space is still X+Y. You now
> have 2 basic options:
>
> 1. Run a vacuum full -- this locks the entire table, and de-fragments
> all unused space, so space is now back to X. Table will grow incrementally
> by Y over the next 6 months again.
> 2. Run a lazy vacuum-- no lock, no de-fragment, space is still X+Y.
> Assuming max_fsm_pages was large enough to hold all the changed pages,
> over the next 6 months, the space remains fixed at about X+Y. You are
> now re-using the unused table space.
>
> Either solution will work. If you really want to cut disk space, choose
> 1. If you want to keep the space at about its optimal size and avoid any
> downtime, choose 2.
>
> --Michael



From: Marc Spitzer <marc(at)oscar(dot)eng(dot)cv(dot)net>
To: pgsql-admin(at)postgresql(dot)org
Subject: Re: Postgres performance slowly gets worse over a month
Date: 2002-07-26 17:34:16
Message-ID: 20020726133416.B16887@oscar.eng.cv.net
Lists: pgsql-admin

On Fri, Jul 26, 2002 at 09:11:25AM -0700, Joshua Daniel Franklin wrote:
> Thanks, this is exactly what I was thinking.
>
> --- "Michael G. Martin" <michael(at)vpmonline(dot)com> wrote:
> > You then remove a bunch of old tuples. Space is still X+Y. You now
> > have 2 basic options:
> >
> > 1. Run a vacuum full -- this locks the entire table, and de-fragments
> > all unused space, so space is now back to X. Table will grow incrementally
> > by Y over the next 6 months again.
> > 2. Run a lazy vacuum-- no lock, no de-fragment, space is still X+Y.
> > Assuming max_fsm_pages was large enough to hold all the changed pages,
> > over the next 6 months, the space remains fixed at about X+Y. You are
> > now re-using the unused table space.
> >
> > Either solution will work. If you really want to cut disk space, choose
> > 1. If you want to keep the space at about its optimal size and avoid any
> > downtime, choose 2.
> >
> > --Michael
>

Do not forget to reindex the db after the delete; indexes do not
manage themselves (if I remember correctly). The index will continue
to grow until it eats your file system, as it did with me. Also, if
you do not reindex regularly it can take a looong time to do, much like
vacuum. And bigger indexes mean slower queries.
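
Something along these lines after the big delete (index and table names are
made up):

    # rebuild all of a table's indexes in one go
    psql -d mydb -c 'REINDEX TABLE big_table'

    # or rebuild a single index by hand
    psql -d mydb -c 'DROP INDEX big_table_logdate_idx'
    psql -d mydb -c 'CREATE INDEX big_table_logdate_idx ON big_table (logdate)'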

marc



From: "Jan Hartmann" <jhart(at)frw(dot)uva(dot)nl>
To: <pgsql-admin(at)postgresql(dot)org>
Subject: Multiple Postmasters on Beowulf cluster
Date: 2002-07-27 15:24:03
Message-ID: DIEALLGCLLCNIHBDCMAEKEFJCDAA.jhart@frw.uva.nl
Lists: pgsql-admin

Hello,

I am using PostgreSQL with PostGIS and the Minnesota MapServer on a Beowulf
cluster for web-mapping applications. It runs fine on one node, producing
very fast interactive maps for an Apache/PHP web server. However, the
cluster consists of 45 nodes, all using a shared user file system. Is it
possible to start up a postmaster on every node, using the same database?
The backend processes themselves would be completely autonomous, but they
would have to share their data from the same source. To simplify things,
only read-access is necessary. Would this be possible, and if so, how can
the different postmasters be made to use a different postmaster.pid file
(which is located in the shared data directory)?

It would be an interesting way of using the cluster, as the individual map
layers can be independently constructed on different nodes, and only finally
have to be put together in a complete map. Essentially, MapServer constructs
a map from layers, where each layer originates from an individual PostgreSQL
connection, even when using only one database. In a cluster solution
therefore, no communication between the nodes would be required. Even the
data could be distributed over the nodes and put into different databases,
but this would inevitably lead to much duplication and a set of databases
that would be very difficult to administer. In the archives, I saw some
mention of work in progress on distributed databases, but for this I don't
need much in the way of distributed facilities, just reading shared data.

Any help would be very much appreciated. It certainly would be a great
example of PostgreSQL's advanced geometrical capabilities!

Jan Hartmann
Department of Geography
University of Amsterdam
jhart(at)frw(dot)uva(dot)nl


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Jan Hartmann" <jhart(at)frw(dot)uva(dot)nl>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Multiple Postmasters on Beowulf cluster
Date: 2002-07-27 15:34:11
Message-ID: 4400.1027784051@sss.pgh.pa.us
Lists: pgsql-admin

"Jan Hartmann" <jhart(at)frw(dot)uva(dot)nl> writes:
> cluster consists of 45 nodes, all using a shared user file system. Is it
> possible to start up a postmaster on every node, using the same database?

No.

> The backend processes themselves would be completely autonomous, but they
> would have to share their data from the same source. To simplify things,
> only read-access is necessary.

In that case, make 45 copies of the database ...
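
Per node, that just means a cold copy of the data directory started on its
own port; a sketch, with made-up paths:

    # with the source postmaster shut down, copy and start a private instance
    rsync -a /shared/pgdata/ /local/pgdata/
    PGDATA=/local/pgdata pg_ctl start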

regards, tom lane


From: Paul Ramsey <pramsey(at)refractions(dot)net>
To: pgsql-admin(at)postgresql(dot)org
Cc: jhart(at)frw(dot)uva(dot)nl, tgl(at)sss(dot)pgh(dot)pa(dot)us
Subject: Re: Multiple Postmasters on Beowulf cluster
Date: 2002-07-28 20:01:46
Message-ID: 3D444DAA.712FEBFE@refractions.net
Lists: pgsql-admin

Completely aside from the database issues, mapserver would need some
heavy work to make all the layer processes "independent" and amenable to
Beowulf-style parallelization. It is worth noting that layers are not
really all that independent from a display point of view. If you are
drawing lake polygons on top of jurisdiction polygons, for example, you
are first drawing all the jurisdictions into the end image, then drawing
all the lakes into that image. The process of creating the final visual
product is the result of sequential application of layers. Individually
rendering then serially combining the layers may or may not be more
efficient than simply serially rendering them, depending on the render
complexity of the layers.

A little work on multi-threading mapserver on the data-access side would
not hurt though, since reads from the source data files are completely
independent actions.

As far as parallelizing database reads, Tom can correct me, but
multi-processor database systems would help (one postmaster, many
postgres backends). I also wonder if a Mosix system would work as a way of
using a cluster for database work? Having the database process working
on a network mounted datastore just can't be good mojo though...

P.

Tom Lane wrote:
>
> "Jan Hartmann" <jhart(at)frw(dot)uva(dot)nl> writes:
> > cluster consists of 45 nodes, all using a shared user file system. Is it
> > possible to start up a postmaster on every node, using the same database?
>
> No.
>
> > The backend processes themselves would be completely autonomous, but they
> > would have to share their data from the same source. To simplify things,
> > only read-access is necessary.
>
> In that case, make 45 copies of the database ...
>
> regards, tom lane


From: "Robert M(dot) Meyer" <rmeyer(at)installs(dot)com>
To: Jan Hartmann <jhart(at)frw(dot)uva(dot)nl>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Multiple Postmasters on Beowulf cluster
Date: 2002-07-29 13:30:38
Message-ID: 1027949439.20988.8.camel@skymaster
Lists: pgsql-admin

Aside from the problems associated with making multiple PostgreSQL
processes access a single data store, wouldn't you be dealing with a
severe performance penalty? Since disk is typically the slowest part of
any system, I would imagine that 45 nodes, all beating on one network
file system (or a multiport filesystem for that matter) would tend to
slow things down dramatically. I would think that it would be better to
make 45 separate copies of the database and then if there are updates,
make some kind of process to pass all of the transactions to each
instantiation of the DB. Granted, the disk space would increase to 45X
the original estimate. How much updating/changing goes on in the DB?

Cheers!

Bob

On Sat, 2002-07-27 at 11:24, Jan Hartmann wrote:
> Hello,
>
> I am using PostgreSQL with PostGIS and the Minnesota MapServer on a Beowulf
> cluster for web-mapping applications. It runs fine on one node, producing
> very fast interactive maps for an Apache/PHP web server. However, the
> cluster consists of 45 nodes, all using a shared user file system. Is it
> possible to start up a postmaster on every node, using the same database?
> The backend processes themselves would be completely autonomous, but they
> would have to share their data from the same source. To simplify things,
> only read-access is necessary. Would this be possible, and if so, how can
> the different postmasters be made to use a different postmaster.pid file
> (which is located in the shared data directory)?
>
> It would be an interesting way of using the cluster, as the individual map
> layers can be independently constructed on different nodes, and only finally
> have to be put together in a complete map. Essentially, MapServer constructs
> a map from layers, where each layer originates from an individual PostgreSQL
> connection, even when using only one database. In a cluster solution
> therefore, no communication between the nodes would be required. Even the
> data could be distributed over the nodes and put into different databases,
> but this would inevitably lead to much duplication and a set of databases
> that would be very difficult to administer. In the archives, I saw some
> mention of work in progress on distributed databases, but for this I don't
> need much in the way of distributed facilities, just reading shared data.
>
> Any help would be very much appreciated. It certainly would be a great
> example of PostgreSQL's advanced geometrical capabilities!
>
>
> Jan Hartmann
> Department of Geography
> University of Amsterdam
> jhart(at)frw(dot)uva(dot)nl
--
Robert M. Meyer
Sr. Network Administrator
DigiVision Satellite Services
14 Lafayette Sq, Ste 410
Buffalo, NY 14203-1904
(716)332-1451


From: "Jan Hartmann" <jhart(at)frw(dot)uva(dot)nl>
To: <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Multiple Postmasters on Beowulf cluster
Date: 2002-07-29 16:43:26
Message-ID: DIEALLGCLLCNIHBDCMAEOEHFCDAA.jhart@frw.uva.nl
Lists: pgsql-admin

Thanks a lot for the reactions. I experimented a bit further with the
answers in mind and got the following result:

(Tom Lane)
> In that case, make 45 copies of the database ...

Without expecting much, I created 3 data directories and made symbolic links
to everything in the original data directory except postmaster.pid. Next
I started PostgreSQL on 3 nodes with PGDATA set to a different directory and
PGPORT to a different port. Surprisingly it worked! First startup gave a
message on each node about not having had a proper shutdown, but afterwards
everything ran ok from all servers, even restarting PostgreSQL. MapServer
didn't have a problem at all in producing a map from layers requested from
different nodes, although without any time gain (see next point). Probably
this gives PostgreSQL wizards the creeps and I wouldn't advise it to anyone,
but just for curiosity's sake, what dangers am I running (given that only
read access is needed)?
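
Roughly what I did per extra node, for anyone who wants to reproduce it
(paths simplified; and to repeat, this is unsupported and read-only):

    # second data directory made of symlinks, minus the pid file
    mkdir /shared/pgdata2
    cd /shared/pgdata2
    ln -s /shared/pgdata/* .
    rm -f postmaster.pid

    # start a second postmaster against it on its own port
    PGDATA=/shared/pgdata2 PGPORT=5433 pg_ctl start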

(Paul Ramsey)
> It is worth noting that layers are not really all that independent from a
> display point of view. (...) The process of creating the final visual
> product is the result of sequential application of layers.

Yes, I didn't realise that MapServer waits until a layer has been returned
from PostgreSQL before starting with the next, essentially doing nothing in
the meantime. I thought it worked like a web browser retrieving images,
which is done asynchronously. It should work however when asking for
complete maps from different browser frames, or retrieving a map in one
frame and getting statistics for it in another, using separate PHP scripts
targeted at different nodes. This would already help me enormously.

(Bob Meyer)
> Since disk is typically the slowest part of
> any system, I would imagine that 45 nodes, all beating on one network
> file system (or a multiport filesystem for that matter) would tend to
> slow things down dramatically. I would think that it would be better to
> make 45 separate copies of the database and then if there are updates,
> make some kind of process to pass all of the transactions to each
> instantiation of the DB. Granted, the disk space would increase to 45X
> the original estimate. How much updating/changing goes on in the Db?

I am trying this out for population statistics in Dutch municipalities
within specified distances (1, 2, 5, 10, 25 km etc ) from the railway
network. Number of railway lines: 419 (each having numerous line segments),
#municipalities: 633, size of municipality map about 10 MB. It takes about
30 seconds wall time to produce a map (good compared to desktop GIS systems; I have no
experience with Oracle Spatial). Next step would be using the roads network,
(much larger of course, but still in the range of tens of megabytes, perhaps
a hundred), and data from very diverse sources and years, including raster
bitmaps, all not excessively large. Lots of different buffers have to be put
around all kinds of selections (type roads, geographical selections) and
compared with each other. Last step is animating the results in Flash
movies: MapServer will be supporting Flash in the very near future, and I
already got some preliminary results. This will require even more
computations of intermediate results, to get flowing movies. So the problem
is not data access, it is computing power and administration of a very
disparate bunch of data. I certainly have enough computing power, and
probably also enough disk space for a 45-fold data duplication, but I know
from experience how error-prone this is, even with duplicating scripts. Even
so, unless I am very much mistaken, the MapServer-PostgreSQL-Beowulf
combination should offer some very interesting prospects in GIS.
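
For anyone curious, a toy version of the kind of query involved -- table,
column and database names are invented, the ST_-prefixed function name is an
assumption about the PostGIS spelling, and the 5000 assumes a metre-based
projection:

    # total population of municipalities within 5 km of any railway line
    psql -d giskb -c "
      SELECT sum(m.population)
      FROM municipalities m
      WHERE EXISTS (SELECT 1 FROM railways r
                    WHERE ST_DWithin(m.geom, r.geom, 5000));"

ST_DWithin here stands in for the buffer-and-intersect step described above.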

Thanks for the answers

Jan

Jan Hartmann
Department of Geography
University of Amsterdam
jhart(at)frw(dot)uva(dot)nl