Re: Some ideas about Vacuum

From: "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To: "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Some ideas about Vacuum
Date: 2008-01-09 10:39:31
Message-ID: 9362e74e0801090239y4fde9d93tac044907df842c39@mail.gmail.com
Lists: pgsql-hackers

Hi,
Maybe I am reposting something which has already been discussed to its conclusion in this forum. I searched the archives and couldn't find anything immediately.
With my relatively small experience in performance testing and tuning, one of the rules of thumb for getting performance is "Don't do it, if you don't need to do it". When we look at clearing the older versions of tuples from our tables in PostgreSQL, we can't stop thinking about how it is done in other databases. When we compare the Oracle undo log approach with the PostgreSQL Vacuum approach, the pattern looks very similar to C++ memory de-allocation versus Java garbage collection.
So, as you may all know, the thing which worries us about Vacuum is that it goes to places where it need not go. That's where the Dead Space Map comes in. The Dead Space Map, if implemented correctly, would guide Vacuum to look only at places where there was some Delete/Update/Insert activity after the last Vacuum. This is accomplished at the cost of some very small overhead on Inserts/Updates/Deletes.
The Dead Space Map is like an undo log, if we think of its role as getting rid of the older versions of data. Instead of moving the tuples to a separate location, it guides the Vacuum process in its cleanup task. Maybe we can even think of something like a Dead Space Log, which need not be a bitmap. In this log, transactions would enter their transaction ids and ctids, which can then be scanned by the Vacuum process. While this might take more space, it has less contention compared to the Dead Space Map. To me, as far as I can think of, the only advantage of the Dead Space Map over the Dead Space Log is the disk space.
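
For concreteness, here is a minimal sketch of the kind of per-relation Dead Space Map described above: one bit per heap block, set from the insert/update/delete path and scanned (then cleared) by Vacuum. This is plain, self-contained C with made-up names, nothing from the PostgreSQL tree.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct DeadSpaceMap
    {
        uint32_t nblocks;           /* heap blocks covered by the map */
        uint8_t *bits;              /* one bit per heap block */
    } DeadSpaceMap;

    static DeadSpaceMap *dsm_create(uint32_t nblocks)
    {
        DeadSpaceMap *map = malloc(sizeof(DeadSpaceMap));

        map->nblocks = nblocks;
        map->bits = calloc((nblocks + 7) / 8, 1);
        return map;
    }

    /* Called from the DML path: O(1), a single bit is set. */
    static void dsm_mark_block(DeadSpaceMap *map, uint32_t blkno)
    {
        map->bits[blkno / 8] |= (uint8_t) (1 << (blkno % 8));
    }

    /* Called by vacuum: visit only the marked blocks, clearing them as we go. */
    static void dsm_scan_and_clear(DeadSpaceMap *map)
    {
        for (uint32_t b = 0; b < map->nblocks; b++)
            if (map->bits[b / 8] & (1 << (b % 8)))
            {
                printf("vacuum visits block %u\n", b);
                map->bits[b / 8] &= (uint8_t) ~(1 << (b % 8));
            }
    }

    int main(void)
    {
        DeadSpaceMap *map = dsm_create(1000);

        dsm_mark_block(map, 3);     /* e.g. an UPDATE touched block 3 */
        dsm_mark_block(map, 712);   /* and a DELETE touched block 712 */
        dsm_scan_and_clear(map);    /* vacuum skips the other 998 blocks */
        return 0;
    }
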
It just strikes me that the WAL is already doing just that. I think you can follow my thought-line. If we can ask the Vacuum process to scan the WAL, it can get all the relevant details on where it needs to go. One optimization that can be placed here is to somehow make the archiver do a double job of helping Vacuum while doing the archiving. For people who have switched off archiving, this might not be a benefit.
One main restriction it places on the WAL is that a WAL segment can be archived only after all the transactions in it complete. In other words, the WAL needs to be given enough space to survive the longest transaction of the database. It is possible to avoid this situation by asking the Vacuum process to take the necessary information out of the WAL, store it somewhere, and wait for the long-running transaction to complete.
The information of interest in the WAL is only the table inserts/updates/deletes. If everyone accepts that this is a good idea up to this point, then there is a point in reading further.
Ultimately, what has been achieved up to now is that we have turned the sequential scans made by the Vacuum process on each table into a few random I/Os. Of course there are optimizations possible to group the random I/Os and find some sequential I/O among them. But we still need to do a full index scan for all the indexes out there. HOT might have saved some work there, but I am pessimistic here and wondering how it could be improved further. So it just strikes me that we can do the same thing which we did with the tables: convert a full sequential scan into a random scan of a few blocks. We can read the necessary tuple information from the heap tuples, group them, and hit the index in just those blocks to clean it up.
I can already hear people saying that it is not always possible to go back from the table to the index. There is this culprit called unstable function-based indexes, whose structure stops us from going back from the table to the index. So for now we should restrict the above approach to only normal indexes (not the function-based ones). I hope it would still give a good benefit.
Of course Vacuum can convert the few random scans back into a sequential scan, if required, by referring to table statistics.

Thoughts about the idea?

Thanks,
Gokul.

P.S.: Let the objections/opposing views have a subtle reduction in their harshness.


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-09 12:21:29
Message-ID: 4784BC49.5090701@bluegap.ch
Lists: pgsql-hackers

Hi,

Gokulakannan Somasundaram wrote:
> If we can ask the Vacuum process to scan
> the WAL log, it can get all the relevant details on where it needs to
> go.

You seem to be assuming that only a few tuples have changed between
vacuums, so that the WAL could quickly guide the VACUUM process to the
areas where cleaning is necessary.

Let's drop that assumption, because by default, autovacuum_scale_factor
is 20%, so a VACUUM process normally kicks in after 20% of tuples
changed (disk space is cheap, I/O isn't). Additionally, there's a
default nap time of one minute - and VACUUM is forced to take at least
that much of a nap.
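
For reference, the trigger condition behind that 20% figure follows the usual autovacuum formula, sketched below in plain C with the 8.3-era defaults (autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2). This is a simplification for illustration, not the actual autovacuum.c logic.

    #include <stdbool.h>
    #include <stdio.h>

    static bool needs_vacuum(double reltuples, double dead_tuples)
    {
        const double threshold = 50;        /* autovacuum_vacuum_threshold */
        const double scale_factor = 0.2;    /* autovacuum_vacuum_scale_factor */

        return dead_tuples > threshold + scale_factor * reltuples;
    }

    int main(void)
    {
        /* A million-row table is left alone until about 200,050 tuples are dead. */
        printf("%s\n", needs_vacuum(1e6, 150000) ? "vacuum" : "not yet");   /* not yet */
        printf("%s\n", needs_vacuum(1e6, 250000) ? "vacuum" : "not yet");   /* vacuum  */
        return 0;
    }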

So it's easily possible to have more dead tuples than live ones. In such
cases, scanning the WAL can easily take *longer* than scanning the
table, because the amount of WAL to read would be bigger.

> One main restriction it places on the WAL Logs is that the WAL Log
> needs to be archived only after all the transactions in it completes. In
> other words, WAL logs need to be given enough space, to survive the
> longest transaction of the database. It is possible to avoid this
> situation by asking the Vacuum process to take the necessary information
> out of WAL log and store it somewhere and wait for the long running
> transaction to complete.

That would result in even more I/O...

> The information of interest in WAL is only the table
> inserts/updates/deletes. So if everyone accepts that this is a good
> idea, till this point, there is a point in reading further.

Well, that's the information of interest; the question is where to store
that information. Maintaining a dead space map looks a lot cheaper to
me than relying on the WAL to store that information.

> Ultimately, what has been achieved till now is that we have made the
> sequential scans made by the Vacuum process on each table into a few
> random i/os. Of course there are optimizations possible to group the
> random i/os and find some sequential i/o out of it. But still we need to
> do a full index scan for all those indexes out there. HOT might have
> saved some work over there. But i am pessimistic here and wondering how
> it could have been improved. So it just strikes me, we can do the same
> thing which we did just with the tables. Convert a seq scan of the
> entire table into a random scan of few blocks. We can read the necessary
> tuple information from the tuples, group them and hit at the index in
> just those blocks and clean it up.

Sorry, I don't quite get what you are talking about here. What do
indexes have to do with dead space? Why not just keep acting on the
block level?

> I can already hear people, saying that it is not always possible to
> go back to index from table. There is this culprit called unstable
> function based indexes.

No, there's no such thing. Citing [1]: "All functions and operators used
in an index definition must be "immutable", that is, their results must
depend only on their arguments and never on any outside influence".

Of course, you can mark any function IMMUTABLE and get unstable function
based indexes, but that turns into a giant foot gun very quickly.

> P.S.: Let the objections/opposing views have a subtle reduction in its
> harshness.

I'm just pointing at things that are in conflict with my knowledge,
assumptions and beliefs, all of which might be erroneous, plain wrong or
completely mad. ;-)

Regards

Markus

[1]: the Very Fine Postgres Manual on CREATE INDEX:
http://www.postgresql.org/docs/8.3/static/sql-createindex.html


From: "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To: "Markus Schiltknecht" <markus(at)bluegap(dot)ch>
Cc: "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-09 14:49:38
Message-ID: 9362e74e0801090649j37f56e3cg1a153738925c280@mail.gmail.com
Lists: pgsql-hackers

> So it's easily possible having more dead tuples, than live ones. In such
> cases, scanning the WAL can easily takes *longer* than scanning the
> table, because the amount of WAL to read would be bigger.

Yes... I made a wrong assumption there... so the idea is totally
useless.

Thanks,
Gokul.


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Markus Schiltknecht" <markus(at)bluegap(dot)ch>
Cc: "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-09 15:10:34
Message-ID: 87hchnm3v9.fsf@oxford.xeocode.com
Lists: pgsql-hackers


"Markus Schiltknecht" <markus(at)bluegap(dot)ch> writes:

> Hi,
>
> Gokulakannan Somasundaram wrote:
>> If we can ask the Vacuum process to scan the WAL log, it can get all the
>> relevant details on where it needs to go.

That's an interesting thought. I think your caveats are right but with some
more work it might be possible to work it out. For example if a background
process processed the WAL and accumulated an array of possibly-dead tuples to
process in batch. It would wait whenever it sees an xid which isn't yet past
globalxmin, and keep accumulating until it has enough to make it worthwhile
doing a pass.
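
A rough, self-contained sketch of that worker, with invented types and a simulated WAL stream (nothing here resembles the real WAL record format): collect (relation, block) references, stall at the first xid that is not yet older than the global xmin, and hand off a batch once it is large enough to be worth a pass.

    #include <stdint.h>
    #include <stdio.h>

    typedef struct WalRecord { uint32_t xid; uint32_t relid; uint32_t blkno; } WalRecord;
    typedef struct DeadRef   { uint32_t relid; uint32_t blkno; } DeadRef;

    #define BATCH_WORTHWHILE 4              /* kept tiny just for this demo */

    /* Stand-in for a real vacuum pass over the listed heap blocks. */
    static void vacuum_batch(const DeadRef *refs, size_t n)
    {
        (void) refs;
        printf("vacuum pass over %zu possibly-dead tuple reference(s)\n", n);
    }

    int main(void)
    {
        /* A pretend WAL stream (xid, relation, block) and a pretend global xmin. */
        const WalRecord wal[] = {
            {100, 1, 10}, {101, 1, 11}, {102, 2, 3}, {103, 1, 10}, {104, 2, 7}, {200, 1, 12},
        };
        const uint32_t global_xmin = 150;

        DeadRef batch[BATCH_WORTHWHILE];
        size_t n = 0;

        for (size_t i = 0; i < sizeof(wal) / sizeof(wal[0]); i++)
        {
            /* Stall as soon as an xid is not yet older than every open snapshot. */
            if (wal[i].xid >= global_xmin)
                break;

            batch[n].relid = wal[i].relid;
            batch[n].blkno = wal[i].blkno;
            if (++n == BATCH_WORTHWHILE)    /* enough accumulated to be worth a pass */
            {
                vacuum_batch(batch, n);
                n = 0;
            }
        }
        if (n > 0)
            vacuum_batch(batch, n);
        return 0;
    }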

I think a bigger issue with this approach is that it ties all your tables
together. You can't process one table frequently while some other table has
some long-lived deleted tuples.

I'm also not sure it really buys us anything over having a second
dead-space-map data structure. The WAL is much larger and serves other
purposes which would limit what we can do with it.

> You seem to be assuming that only few tuples have changed between vacuums, so
> that WAL could quickly guide the VACUUM processes to the areas where cleaning
> is necessary.
>
> Let's drop that assumption, because by default, autovacuum_scale_factor is 20%,
> so a VACUUM process normally kicks in after 20% of tuples changed (disk space
> is cheap, I/O isn't). Additionally, there's a default nap time of one minute -
> and VACUUM is forced to take at least that much of a nap.

I think this is exactly backwards. The goal should be to improve vacuum, then
adjust the autovacuum_scale_factor as low as we can. As vacuum gets cheaper
the scale factor can go lower and lower. We shouldn't allow the existing
autovacuum behaviour to control the way vacuum works.

As a side point, "disk is cheap, I/O isn't" is a weird statement. The more
disk you use, the more I/O you'll have to do to work with the data. I still
maintain the default autovacuum_scale_factor is *far* too liberal. If I had my
druthers it would be 5%. But that's mostly informed by TPCC experience; in
real life the actual value will vary depending on the width of your records
and the relative length of your transactions versus transaction rate. The TPCC
experience is with ~400 byte records and many short transactions.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Markus Schiltknecht <markus(at)bluegap(dot)ch>, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-09 16:24:50
Message-ID: 1199895890.4266.371.camel@ebony.site
Lists: pgsql-hackers

On Wed, 2008-01-09 at 15:10 +0000, Gregory Stark wrote:

> The goal should be to improve vacuum, then
> adjust the autovacuum_scale_factor as low as we can. As vacuum gets
> cheaper the scale factor can go lower and lower. We shouldn't allow
> the existing autovacuum behaviour to control the way vacuum works.

Very much agreed.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-09 16:40:05
Message-ID: 4784F8E5.4020403@bluegap.ch
Lists: pgsql-hackers

Hi,

Gregory Stark wrote:
> That's an interesting thought. I think your caveats are right but with some
> more work it might be possible to work it out. For example if a background
> process processed the WAL and accumulated an array of possibly-dead tuples to
> process in batch. It would wait whenever it sees an xid which isn't yet past
> globalxmin, and keep accumulating until it has enough to make it worthwhile
> doing a pass.

I don't understand why one would want to go via the WAL; that only
creates needless I/O. Better to accumulate the data right away, during the
inserts, updates and deletes. Spilling the accumulated data to disk, if
absolutely required, would presumably still result in less I/O.

> I think a bigger issue with this approach is that it ties all your tables
> together. You can't process one table frequently while some other table has
> some long-lived deleted tuples.

Don't use the WAL as the source of that information and that issue's gone.

> I'm also not sure it really buys us anything over having a second
> dead-space-map data structure. The WAL is much larger and serves other
> purposes which would limit what we can do with it.

Exactly.

>> You seem to be assuming that only few tuples have changed between vacuums, so
>> that WAL could quickly guide the VACUUM processes to the areas where cleaning
>> is necessary.
>>
>> Let's drop that assumption, because by default, autovacuum_scale_factor is 20%,
>> so a VACUUM process normally kicks in after 20% of tuples changed (disk space
>> is cheap, I/O isn't). Additionally, there's a default nap time of one minute -
>> and VACUUM is forced to take at least that much of a nap.
>
> I think this is exactly backwards. The goal should be to improve vacuum, then
> adjust the autovacuum_scale_factor as low as we can. As vacuum gets cheaper
> the scale factor can go lower and lower.

But you can't lower it endlessly; it's still a compromise, because it
also means reducing the number of tuples being cleaned per scan, which
is against the goal of minimizing the overall I/O cost of vacuuming.

> We shouldn't allow the existing
> autovacuum behaviour to control the way vacuum works.

That's a point.

> As a side point, "disk is cheap, I/O isn't" is a weird statement. The more
> disk you use the more I/O you'll have to do to work with the data.

That's only true as long as you need *all* your data to work with it.

> I still
> maintain the default autovacuum_scale_factor is *far* to liberal. If I had my
> druthers it would be 5%. But that's mostly informed by TPCC experience, in
> real life the actual value will vary depending on the width of your records
> and the relative length of your transactions versus transaction rate. The TPCC
> experience is with ~ 400 byte records and many short transactions.

Hm.. 5% vs 20% would mean 4x as many vacuum scans, but only about 15% more
table size (105% vs 120%), right? Granted, that 15% is also taken from
memory and caches, resulting in additional I/O... Still, these numbers
surprise me. Or am I missing something?

Regards

Markus


From: "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To: "Markus Schiltknecht" <markus(at)bluegap(dot)ch>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-10 09:16:31
Message-ID: 9362e74e0801100116s368a30der4d6b2a29d588b45c@mail.gmail.com
Lists: pgsql-hackers

Markus,
I was re-thinking about what you said. I feel that if we read the WAL through the archiver (where the archiver is switched on), which anyway reads the entire WAL, it might save some CPU cycles off updates, inserts and deletes.
The question is about reducing I/O, and I have no doubt about that. But if we keep the WAL on a separate disk and make Vacuum scan through it (in case the archiver is absent), it would take that I/O off the disk containing the data. Essentially the I/O effects are separated. We might end up doing more I/O, but it would not affect the OLTP transactions.
I would also like to clarify one more thing. I am not asking to drop the DSM approach. I am just thinking of creating the DSM by reading through the WAL, instead of asking the inserts, updates and deletes to do the DSM maintenance.
Of course, if a person places both the WAL and the data files on the same disk drives, this would reduce performance. But can we take that hit?
I think what Gregory is getting at is, "if we schedule the Vacuum after 20% of table changes, then we essentially say we need 120% of the disk space, and hence our select operations might end up doing more I/O."
Please put forward your suggestions.

Hi All,

Essentially, concluding:
a) If there is an archiver running, we are putting slightly more CPU cycles on the archiver to help form the DSM.
b) If there is no archiver, but the DBA places the WAL on a separate disk, Vacuum will do more I/O on that disk to form the DSM.
c) In case someone has neither scheduled the archiver nor is ready to spare a disk for the WAL, this approach reduces the performance of that setup.
Are my conclusions right?
If they are right, what percentage of installations fall into the third category? (Field experts out there!!)
If that percentage is large, we should stop this line of thinking.

Thanks,
Gokul.


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-10 10:13:35
Message-ID: 4785EFCF.7090108@bluegap.ch
Lists: pgsql-hackers

Hi,

Gokulakannan Somasundaram wrote:
> But i am just thinking of creating the DSM
> by reading through the WAL Logs, instead of asking the Inserts, updates
> and deletes to do the DSM creation.

What's the advantage of that? What's wrong with collecting the
information for DSM at transaction processing time? The overhead is
certainly smaller than the overhead for doing it later on.

> I think what Gregory is coming at is, "if we schedule the Vacuum
> after 20% of table changes, then we essentially say we need 120% of the
> disk space and hence our select operations might end up doing more I/Os."

Well, full sequential scans end up doing more I/O, but not index scans
typical for OLTP. So if autovacuum is the only thing doing full
sequential scans, you'd better reduce the number of full scans, instead
of saving only some percentage per scan, no?

Of course, depending on how much of your table fits in ram, you also
need to consider the space savings in RAM... However, I'm assuming a
reasonably low ratio of RAM size vs table size.

Regards

Markus


From: "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To: "Markus Schiltknecht" <markus(at)bluegap(dot)ch>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-10 10:47:15
Message-ID: 9362e74e0801100247w165f01dck982bcd01214e3485@mail.gmail.com
Lists: pgsql-hackers

On Jan 10, 2008 3:43 PM, Markus Schiltknecht <markus(at)bluegap(dot)ch> wrote:

> Hi,
>
> Gokulakannan Somasundaram wrote:
> > But i am just thinking of creating the DSM
> > by reading through the WAL Logs, instead of asking the Inserts, updates
> > and deletes to do the DSM creation.
>
> What's the advantage of that? What's wrong with collecting the
> information for DSM at transaction processing time? The overhead is
> certainly smaller than the overhead for doing it later on.

The overhead is because of the contention. Am I missing something
here? While Vacuum is reading the DSM, other operations may not be able to
update the bits. We need to put the DSM in shared memory if all the processes
are going to update it, whereas if Vacuum is going to build the DSM, it might
as well be in process-local memory. I can think of things like false sharing
which might be avoided, but I think the main issue is contention.

>
>
> > I think what Gregory is coming at is, "if we schedule the Vacuum
> > after 20% of table changes, then we essentially say we need 120% of the
> > disk space and hence our select operations might end up doing more
> I/Os."
>
> Well, full sequential scans end up doing more I/O, but not index scans
> typical for OLTP. So if autovacuum is the only thing doing full
> sequential scans, you'd better reduce the number of full scans, instead
> of saving only some percentage per scan, no?

Even with indexes, we might end up reading dead tuples. We would mark them
with LP_DEAD, so the overhead is less, but it's there. Of course it's natural
to have some background jobs running during OLTP, and they will be affected.

>
>
> Of course, depending on how much of your table fits in ram, you also
> need to consider the space savings in RAM... However, I'm assuming a
> reasonably low ratio of RAM size vs table size.

That's another one.

Thanks,
Gokul.


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-10 11:01:36
Message-ID: 4785FB10.8030800@bluegap.ch
Lists: pgsql-hackers

Hi,

Gokulakannan Somasundaram wrote:
> because of the contention. Am i missing something
> here? While Vacuum is reading the DSM, operations may not be able to
> update the bits. We need to put the DSM in shared memory, if all the
> processes are going to update it, whereas if Vacuum is going to form the
> DSM, then it might well be in the process local memory. I can think of
> things like False sharing which might be avoided. But i think the main
> stuff is contention.

Ah, I begin to understand where you are coming from now, yes. However,
(ab-)using the WAL and archiver still doesn't look like a good idea to me.

> Even in indexes, we might end up reading dead tuples. We would mark it
> with LP_DEAD. So the overhead is less, but its there.

That's a good point, yes.

> Ofcourse its
> natural to think of some background jobs during OLTP, and they will be
> affected

Agreed.

Regards

Markus


From: "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To: "Gregory Stark" <stark(at)enterprisedb(dot)com>
Cc: "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-11 21:10:17
Message-ID: 9362e74e0801111310k6266093au9fc53f2b1256dda5@mail.gmail.com
Lists: pgsql-hackers

Sorry Greg, I missed reading this part before.

On Jan 9, 2008 8:40 PM, Gregory Stark <stark(at)enterprisedb(dot)com> wrote:

>
> "Markus Schiltknecht" <markus(at)bluegap(dot)ch> writes:
>
> > Hi,
> >
> > Gokulakannan Somasundaram wrote:
> >> If we can ask the Vacuum process to scan the WAL log, it can get all
> the
> >> relevant details on where it needs to go.
>
> That's an interesting thought. I think your caveats are right but with
> some
> more work it might be possible to work it out. For example if a background
> process processed the WAL and accumulated an array of possibly-dead tuples
> to
> process in batch. It would wait whenever it sees an xid which isn't yet
> past
> globalxmin, and keep accumulating until it has enough to make it
> worthwhile
> doing a pass.
>
> I think a bigger issue with this approach is that it ties all your tables
> together. You can't process one table frequently while some other table
> has
> some long-lived deleted tuples.
>

I am not able to clearly understand what you are saying here. It ties all the tables together, yes. There are two options here:
a) Do we really need to do Vacuum table by table? Say we read 'n' WAL segments and accumulate the data. We could sort the result by relation name and block number and then go ahead with the Vacuum (a small sketch of this follows below). In this way, Vacuum would only work at the database level. Why do we need to process one table frequently?
b) We can create a DSM for each table separately and Vacuum will use the WAL information to update it. In this way, we can Vacuum table-wise.
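
A small sketch of option (a), with made-up types: accumulate (relation, block) pairs while reading WAL segments, then sort them so the vacuum pass visits each relation's blocks in order rather than jumping around randomly. Plain C, just for illustration.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct DeadRef { uint32_t relid; uint32_t blkno; } DeadRef;

    static int deadref_cmp(const void *a, const void *b)
    {
        const DeadRef *x = a, *y = b;

        if (x->relid != y->relid)
            return (x->relid < y->relid) ? -1 : 1;
        if (x->blkno != y->blkno)
            return (x->blkno < y->blkno) ? -1 : 1;
        return 0;
    }

    int main(void)
    {
        /* Pairs in the order they were found while reading the WAL segments. */
        DeadRef refs[] = { {2, 90}, {1, 7}, {2, 3}, {1, 40}, {1, 7}, {2, 4} };
        size_t n = sizeof(refs) / sizeof(refs[0]);

        qsort(refs, n, sizeof(DeadRef), deadref_cmp);

        /* Vacuum now visits rel 1 blocks 7,7,40 and then rel 2 blocks 3,4,90. */
        for (size_t i = 0; i < n; i++)
            printf("rel %u block %u\n", refs[i].relid, refs[i].blkno);
        return 0;
    }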

>
>
> I'm also not sure it really buys us anything over having a second
> dead-space-map data structure. The WAL is much larger and serves other
> purposes which would limit what we can do with it.

Ok. One obvious advantage is that it saves the contention over the DSM between
the DML operations and the Vacuum process. Since the Vacuum process is going to
have much more information on what has happened in the database, it becomes
possible to build some new structures. For example, I have been thinking of
changing our current index structure in such a way that it won't hold duplicate
tuples for different versions of the data. Whenever there is an update, only the
indexes relevant to the changed columns would get updated. Vacuum would have to
play the role of changing the tid that the index tuple points to, whenever it
vacuums an older version.
It would be possible to create such structures, which can be synched
asynchronously. Another example would be asynchronous materialized views.

But pushing those future plans aside, don't you think this would reduce the
contention which would otherwise be faced by the DML operations?

Thanks,
Gokul.


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-12 09:46:07
Message-ID: 47888C5F.9070405@bluegap.ch
Lists: pgsql-hackers

Hi,

Gokulakannan Somasundaram wrote:
> I'm also not sure it really buys us anything over having a second
> dead-space-map data structure. The WAL is much larger and serves other
> purposes which would limit what we can do with it.
>
> Ok. One obvious advantage is that it saves the contention over DSM for
> the DML operations and Vacuum process.

Do you have evidence of that contention being so much worse that it
justifies the additional WAL reading from disk? (Assuming no WAL archiving.)

IMO we can get about any granularity we want for DSM update locking,
depending on how we arrange the DSM bits.
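
One way to picture that: stripe the DSM bitmap into chunks, each guarded by its own lock, so backends marking blocks in different parts of the heap do not contend. A plain pthreads sketch with invented names, not PostgreSQL's LWLock machinery.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CHUNK_BLOCKS 8192                 /* heap blocks covered per lock stripe */

    typedef struct DsmChunk
    {
        pthread_mutex_t lock;
        uint8_t bits[CHUNK_BLOCKS / 8];
    } DsmChunk;

    typedef struct StripedDsm
    {
        uint32_t nchunks;
        DsmChunk *chunks;
    } StripedDsm;

    static StripedDsm *dsm_striped_create(uint32_t heap_blocks)
    {
        StripedDsm *map = malloc(sizeof(StripedDsm));

        map->nchunks = (heap_blocks + CHUNK_BLOCKS - 1) / CHUNK_BLOCKS;
        map->chunks = calloc(map->nchunks, sizeof(DsmChunk));
        for (uint32_t i = 0; i < map->nchunks; i++)
            pthread_mutex_init(&map->chunks[i].lock, NULL);
        return map;
    }

    /* Only the stripe covering this block is locked, never the whole map. */
    static void dsm_striped_mark(StripedDsm *map, uint32_t blkno)
    {
        DsmChunk *c = &map->chunks[blkno / CHUNK_BLOCKS];
        uint32_t off = blkno % CHUNK_BLOCKS;

        pthread_mutex_lock(&c->lock);
        c->bits[off / 8] |= (uint8_t) (1 << (off % 8));
        pthread_mutex_unlock(&c->lock);
    }

    int main(void)
    {
        StripedDsm *map = dsm_striped_create(100000);

        dsm_striped_mark(map, 5);        /* falls in stripe 0 */
        dsm_striped_mark(map, 90000);    /* falls in stripe 10: no contention with stripe 0 */
        printf("%u stripes\n", map->nchunks);
        return 0;
    }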

> Since Vacuum process is going to
> have much more information on what has happened in the database,

Why should that be? IMO, collecting the information at transaction time
can give you exactly the same information, if not more or better
information.

> it is
> possible for some new structures. For example i have been thinking of
> changing our current index structure in such a way, it won't hold any
> duplicate tuples for different versions of data. Whenever there is a
> update, only the indexes relevant to the columns changed will get
> updated. The Vacuum has to play the role of changing the tid, the index
> tuple points to, whenever it vacuums a older version.

Huh? The index would then point to the old tuple only, until a VACUUM
comes by, right? How are following transactions expected to find the new
tuple before that VACUUMing?

Regards

Markus


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Markus Schiltknecht <markus(at)bluegap(dot)ch>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-12 17:02:41
Message-ID: 7697.1200157361@sss.pgh.pa.us
Lists: pgsql-hackers

Markus Schiltknecht <markus(at)bluegap(dot)ch> writes:
>> Since Vacuum process is going to
>> have much more information on what has happened in the database,

> Why should that be? IMO, collecting the information at transaction time
> can give you exactly the same information, if not more or better
> information.

Well, one of the principal arguments for having VACUUM at all is that it
off-loads required maintenance effort from foreground transaction code
paths. I'm not really going to be in favor of solutions that put more
work into the transaction code paths (HOT already did more of that than
I would like :-(). OTOH, I agree that scanning the WAL log doesn't
really sound like something well-matched to this problem either.

regards, tom lane


From: Markus Schiltknecht <markus(at)bluegap(dot)ch>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-12 17:51:56
Message-ID: 4788FE3C.2080500@bluegap.ch
Lists: pgsql-hackers

Hi,

Tom Lane wrote:
> Well, one of the principal arguments for having VACUUM at all is that it
> off-loads required maintenance effort from foreground transaction code
> paths.

Off-loading doesn't mean we don't have to do the work, so it obviously is
a compromise.

AFAICT, having to write some DSM blocks from foreground transaction code
paths may well be worth it overall, if it saves VACUUM from doing much
more I/O.

Especially if the bgwriter can defer the I/O to after commit time (which
I'm thinking of as another form of off-loading work from foreground
transaction code).

Regards

Markus


From: "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To: "Markus Schiltknecht" <markus(at)bluegap(dot)ch>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 08:25:36
Message-ID: 9362e74e0801160025s5415caeeq9599d6fbaa7563f8@mail.gmail.com
Lists: pgsql-hackers

Hi,

Please find my answers inline

>
> Do you have evidence of that contention being so worse, that it
> justifies the additional WAL reading from disk? (Assuming no WAL
> archiving).

In a broader sense, the DSM is a bitmap index with some optimizations placed to
make the updates more effective. As you may know, the design of bitmap indexes
doesn't scale very well with concurrency. If you put more information into a
little space, then I feel it might affect concurrency. Let us discuss it in detail.
The DSM, I believe, plans to achieve the following objectives:
a) To find out the blocks which are to be vacuumed
b) To find out the blocks where freezing is required
c) To find out the blocks which are visible to everyone.

The DSM might get split into multiple maps, like visibility maps (already
proposed by Heikki), vacuum maps and freeze maps. When inserts happen, the map
has to get extended, and the block has to be locked to extend the map. Say a
DSM block corresponds to some 60K data blocks; then any updates/deletes
happening over those blocks have to wait for that time. This is just an example
which I can think of off-hand. Maybe the people who are implementing it can
throw more light on the synchronization points.

>
> IMO we can get about any granularity we want for DSM update locking,
> depending on how we arrange the DSM bits.

I can't understand this exactly.

> > Since Vacuum process is going to
> > have much more information on what has happened in the database,
>
> Why should that be? IMO, collecting the information at transaction time
> can give you exactly the same information, if not more or better
> information.

My argument is: if we have already collected that information in the WAL, why
should we collect it again and again?

> > it is
> > possible for some new structures. For example i have been thinking of
> > changing our current index structure in such a way, it won't hold any
> > duplicate tuples for different versions of data. Whenever there is a
> > update, only the indexes relevant to the columns changed will get
> > updated. The Vacuum has to play the role of changing the tid, the index
> > tuple points to, whenever it vacuums a older version.
>
> Huh? The index would then point to the old tuple only, until a VACUUM
> comes by, right. How are following transactions expected to find the new
> tuple before that VACUUMing?

You are right. We have already discussed this. In the Vacuum approach,
we travel forward in time: we catch the oldest transaction and go to the newer
transactions by following the ctid in the old tuple. In the undo log approach,
it is the reverse: we go to the latest transaction and travel back in time.
It's interesting to see how the theory of relativity has got applied in
database science, right?

So say we have 'n' versions of the same data in the index. Right now we have 'n'
index tuples which point to 'n' blocks in the heap; we would read all the 'n'
index tuples and go to all the versions of the data in the table. If this
changes, there will be one index tuple, which would point to the oldest heap
tuple, and from there we would navigate to all the newer tuples. The advantage
is obvious: the index is going to be smaller, and updates will not touch the
indexes unless the data in the indexed columns has changed.
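
A toy model of that proposal (invented structs, nothing like PostgreSQL's actual heap format): the index keeps a single entry pointing at the oldest version, and a reader follows the forward ctid chain until it reaches the version visible to its snapshot.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct ToyTuple
    {
        uint32_t xmin;              /* inserting transaction                */
        uint32_t xmax;              /* deleting/updating xact, 0 = none     */
        int      next;              /* index of the newer version, -1 = end */
        int      value;
    } ToyTuple;

    /* Visible if inserted before the snapshot and not yet deleted as of it. */
    static bool visible(const ToyTuple *t, uint32_t snapshot_xid)
    {
        return t->xmin <= snapshot_xid && (t->xmax == 0 || t->xmax > snapshot_xid);
    }

    int main(void)
    {
        /* Three versions of one row; the single index entry points at version 0. */
        ToyTuple heap[] = {
            {100, 200,  1, 10},     /* oldest version */
            {200, 300,  2, 20},     /* middle version */
            {300,   0, -1, 30},     /* newest version */
        };
        int index_entry = 0;        /* one index tuple -> oldest heap tuple */
        uint32_t snapshot_xid = 250;

        for (int i = index_entry; i != -1; i = heap[i].next)
            if (visible(&heap[i], snapshot_xid))
            {
                printf("snapshot %u sees value %d\n", snapshot_xid, heap[i].value);
                break;
            }
        return 0;
    }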

Hope I was clear. Please revert back in case I am not clear.

Thanks,
Gokul.


From: "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To: "Markus Schiltknecht" <markus(at)bluegap(dot)ch>
Cc: "Gregory Stark" <stark(at)enterprisedb(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 08:27:34
Message-ID: 9362e74e0801160027s7e5e8f3du5a1786e2ed70a11@mail.gmail.com
Lists: pgsql-hackers

One more application of the same is asynchronous materialized views. I hope
you agree that asynchronous materialized views have to get updated only
through the WAL. If the WAL can be used for that purpose, why can't we multiplex it?

Thanks,
Gokul.


From: "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 08:34:26
Message-ID: 9362e74e0801160034w2524c0bs74754365cab5d9f@mail.gmail.com
Lists: pgsql-hackers

>
>
> Well, one of the principal arguments for having VACUUM at all is that it
> off-loads required maintenance effort from foreground transaction code
> paths. I'm not really going to be in favor of solutions that put more
> work into the transaction code paths (HOT already did more of that than
> I would like :-(). OTOH, I agree that scanning the WAL log doesn't
> really sound like something well-matched to this problem either.
>

Tom, don't you like the idea of building some more structures around the WAL,
like asynchronous materialized views? Indexes, if implemented as stated,
would remove the HOT code from the path of the transaction (as you may know).
I am also slightly doubtful of the argument that doing full-table scans and
full index scans for Vacuum is efficient. Can you please advise me on why we
should not use a read-only operation on the WAL?

Thanks,
Gokul.


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 11:44:48
Message-ID: 478DEE30.808@enterprisedb.com
Lists: pgsql-hackers

Gokulakannan Somasundaram wrote:
>>
>> Well, one of the principal arguments for having VACUUM at all is that it
>> off-loads required maintenance effort from foreground transaction code
>> paths. I'm not really going to be in favor of solutions that put more
>> work into the transaction code paths (HOT already did more of that than
>> I would like :-(). OTOH, I agree that scanning the WAL log doesn't
>> really sound like something well-matched to this problem either.
>>
>
> Tom, Don't you like the idea of building some more structures around WAL,
> like Asynchronous Materialized views. Indexes, if implemented as stated,
> would remove the HOT code in the path of the transaction(as you may know).
> I am also slightly doubtful of the argument, that doing full-table scans and
> full index scans for Vacuum is efficient. Can you please advise me on why we
> should not use a read only operation on WAL log ?

I haven't been paying close attention to this thread, but there are a
couple of general issues with using the WAL for this kind of thing. First
of all, one extremely cool feature of PostgreSQL is that transaction
size is not limited by WAL space, unlike on many other DBMSs. I think
many of the proposed ideas of reading WAL would require us to keep all
WAL available back to the beginning of the oldest running transaction.

Another issue is that reading WAL is inherently not very scalable.
There's only one WAL for the whole cluster, and it needs to be read
sequentially, so it can easily become a bottleneck on large systems.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 13:49:55
Message-ID: 9362e74e0801160549k226ae2b8u57f449c2efc68800@mail.gmail.com
Lists: pgsql-hackers

>
>
> I haven't been paying close attention to this thread, but there is a
> couple general issues with using the WAL for this kind of things. First
> of all, one extremely cool feature of PostgreSQL is that transaction
> size is not limited by WAL space, unlike on many other DBMSs. I think
> many of the proposed ideas of reading WAL would require us to keep all
> WAL available back to the beginning of the oldest running transaction.

Initially I thought this might be required. But the current idea is that Vacuum
is going to maintain a DSM per relation and update it once a WAL segment is
switched. So if the WAL logging is happening in segment 2, then the first
segment will be scanned to update the DSM.

>
> Another issue is that reading WAL is inherently not very scalable.
> There's only one WAL for the whole cluster, and it needs to be read
> sequentially, so it can easily become a bottleneck on large systems.

Let me try to understand what would become a problem here. We are going to
have only one process, which would open the WAL (one segment at a time) and
update the DSMs. The limitation would be that we should have completed reading
a segment before the WAL wraps around to it. What else do you think would be a
problem?

Thanks,
Gokul.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 13:56:33
Message-ID: 20080116135633.GE5106@alvh.no-ip.org
Lists: pgsql-hackers

Heikki Linnakangas escribió:

> Another issue is that reading WAL is inherently not very scalable. There's
> only one WAL for the whole cluster, and it needs to be read sequentially,
> so it can easily become a bottleneck on large systems.

I have wondered why we do it this way. Is there a problem with
having one WAL per database, and another for general operations? This
last WAL would have changes to shared tables, as well as global stuff
like "create database" or "create tablespace".

Of course, it means a lot more files, and a PITR setup is a bit more
complex.

One obvious problem is that it is no longer true that you have a "no
seek" disk head. But is there much use of that these days? People
have either a big RAID on which the WAL resides along with all the data; or,
on low-cost systems, the whole thing is on a single disk or a small RAID.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 16:22:57
Message-ID: 24423.1200500577@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Heikki Linnakangas escribió:
>> Another issue is that reading WAL is inherently not very scalable. There's
>> only one WAL for the whole cluster, and it needs to be read sequentially,
>> so it can easily become a bottleneck on large systems.

> I have wondered why do we do it this way. Is there a problem with
> having one WAL per database, and another for general operations? This
> last WAL would have changes to shared tables, as well as global stuff
> like "create database" or "create tablespace".

It would only be useful to have one per spindle-dedicated-to-WAL, so
tying the division to databases doesn't seem like it'd be a good idea.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 16:39:43
Message-ID: 20080116163943.GC5076@alvh.no-ip.org
Lists: pgsql-hackers

Tom Lane escribió:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > Heikki Linnakangas escribió:
> >> Another issue is that reading WAL is inherently not very scalable. There's
> >> only one WAL for the whole cluster, and it needs to be read sequentially,
> >> so it can easily become a bottleneck on large systems.
>
> > I have wondered why do we do it this way. Is there a problem with
> > having one WAL per database, and another for general operations? This
> > last WAL would have changes to shared tables, as well as global stuff
> > like "create database" or "create tablespace".
>
> It would only be useful to have one per spindle-dedicated-to-WAL, so
> tying the division to databases doesn't seem like it'd be a good idea.

Keep in mind that there are claims that a write-cache-enabled
battery-backed RAID controller negates the effect of a separate spindle.
Also, these days people are recommending keeping the WAL on a mirrored disk,
so you would have to have a mirrored pair for every WAL stream, which
starts to sound unworkable.

My point, rather, is that with this sort of setup it would be easier to
do per-database PITR shipping, and one database's WAL activity would not
affect another's (thus hosting providers are happier -- high-rate
customer A need not affect low-budget customer B).

A totally separate consideration is that of LWLock contention.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 16:52:53
Message-ID: 24876.1200502373@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Tom Lane escribió:
>> It would only be useful to have one per spindle-dedicated-to-WAL, so
>> tying the division to databases doesn't seem like it'd be a good idea.

> Keep in mind that there are claims that a write-cache-enabled
> battery-backed RAID controller negates the effect of a separate spindle.

Possibly true, but if that's the underlying hardware then there's no
performance benefit in breaking WAL up at all, no?

> My point, rather, is that with this sort of setup it would be easier to
> do per-database PITR shipping, and one database's WAL activity would not
> affect another's (thus hosting providers are happier -- high-rate
> customer A need not affect low-budget customer B).

You won't get far with that because of the shared catalogs. In
particular, most DDL operations these days touch pg_shdepend ...

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 17:12:34
Message-ID: 20080116171234.GF5076@alvh.no-ip.org
Lists: pgsql-hackers

Tom Lane escribió:
> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> > Tom Lane escribió:
> >> It would only be useful to have one per spindle-dedicated-to-WAL, so
> >> tying the division to databases doesn't seem like it'd be a good idea.
>
> > Keep in mind that there are claims that a write-cache-enabled
> > battery-backed RAID controller negates the effect of a separate spindle.
>
> Possibly true, but if that's the underlying hardware then there's no
> performance benefit in breaking WAL up at all, no?

Selective PITR shipping.

> > My point, rather, is that with this sort of setup it would be easier to
> > do per-database PITR shipping, and one database's WAL activity would not
> > affect another's (thus hosting providers are happier -- high-rate
> > customer A need not affect low-budget customer B).
>
> You won't get far with that because of the shared catalogs. In
> particular, most DDL operations these days touch pg_shdepend ...

That's why you log shared activity to another WAL stream, and ship that
to everyone, while the other databases' WAL streams are shipped only to
the interested slaves.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>, "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 17:17:47
Message-ID: 87ir1titac.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
>> Heikki Linnakangas escribió:
>>> Another issue is that reading WAL is inherently not very scalable. There's
>>> only one WAL for the whole cluster, and it needs to be read sequentially,
>>> so it can easily become a bottleneck on large systems.
>
>> I have wondered why do we do it this way. Is there a problem with
>> having one WAL per database, and another for general operations? This
>> last WAL would have changes to shared tables, as well as global stuff
>> like "create database" or "create tablespace".
>
> It would only be useful to have one per spindle-dedicated-to-WAL, so
> tying the division to databases doesn't seem like it'd be a good idea.

I think one-per-database would help if you had a very particular type of
application which had a lot of equally busy databases. In general to eliminate
the bottleneck I think you would need to be able to break them up by process.
So two processes writing to the same table would be able to write to different
WAL logs.

That sounds hard but I'm not sure. It may not be as bad as it sounds.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!


From: "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>, "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 17:32:37
Message-ID: 1d4e0c10801160932g576556b7hc145fa4cf7580ff@mail.gmail.com
Lists: pgsql-hackers

On Jan 16, 2008 6:12 PM, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
> Tom Lane escribió:
> > Possibly true, but if that's the underlying hardware then there's no
> > performance benefit in breaking WAL up at all, no?
>
> Selective PITR shipping.

If it was possible to launch a PITR only on a given database, that
could be a great feature too. We have at least one customer who runs
every database in a separate cluster to be able to do PITR on only one
database if needed (for example if someone executed a DROP TABLE by
mistake).

--
Guillaume


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 17:40:23
Message-ID: Pine.GSO.4.64.0801161233260.20980@westnet.com
Lists: pgsql-hackers

On Wed, 16 Jan 2008, Alvaro Herrera wrote:

> Keep in mind that there are claims that a write-cache-enabled
> battery-backed RAID controller negates the effect of a separate spindle.

"Negates" is a bit strong; there's still some performance advantage on
systems that write a serious amount of data. It's certainly true that a
BCC controller greatly reduces the need for a separate spindle.

It can be handy to keep it seperate anyway because it makes it trivial to
track WAL I/O vs. database I/O.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD


From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: "Greg Smith" <gsmith(at)gregsmith(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 18:01:30
Message-ID: 478DF219.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers

>>> On Wed, Jan 16, 2008 at 11:40 AM, in message
<Pine(dot)GSO(dot)4(dot)64(dot)0801161233260(dot)20980(at)westnet(dot)com>, Greg Smith
<gsmith(at)gregsmith(dot)com> wrote:
> On Wed, 16 Jan 2008, Alvaro Herrera wrote:
>
>> Keep in mind that there are claims that a write-cache-enabled
>> battery-backed RAID controller negates the effect of a separate spindle.
>
> "Negates" is a bit strong; there's still some performance advantage on
> systems that write a serious amount of data. It's certainly true that a
> BCC controller greatly reduces the need for a separate spindle.

I haven't seen any benchmarks on the list or in our environment
where the separate spindles gave more than a 1% increase in
performance when using a good-quality BBC controller. Do you have
results that show more of a difference? Can you share them?

-Kevin


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>, "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 18:41:45
Message-ID: 478E4FE9.40205@enterprisedb.com
Lists: pgsql-hackers

Guillaume Smet wrote:
> On Jan 16, 2008 6:12 PM, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
>> Tom Lane escribió:
>>> Possibly true, but if that's the underlying hardware then there's no
>>> performance benefit in breaking WAL up at all, no?
>> Selective PITR shipping.
>
> If it was possible to launch a PITR only on a given database, that
> could be a great feature too. We have at least one customer who runs
> every database in a separate cluster to be able to do PITR on only one
> database if needed (for example if someone executed a DROP TABLE by
> mistake).

Yeah, it sure would be nice.

I don't think it's going to work too well, though, not without major
changes at least. What would happen when you restore a PITR backup of
just one database? Would the other databases still be there in the
restored cluster? What state would they be in? After restoring one
database, and doing some stuff on it, could you ever "merge" those
changes with the rest of the cluster?

Mind you, there's more things shared between databases than the shared
catalogs. clog for example.

It might be useful for creating read-only copies of a master database,
but I don't see it being very useful/possible in general.

For more usefulness, we'd need to keep databases more separate from each
other than we do now. Databases would need to have their own transaction
counters, for example. Shared relations would obviously need major
changes for that to work. If we ultimately could separate databases so
that you could take a filesystem copy of a single database, and restore
it to another cluster, then per-database WAL and PITR would work.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 19:10:40
Message-ID: 9362e74e0801161110g79215e72hf982e78e4e655991@mail.gmail.com
Lists: pgsql-hackers

>
>
> For more usefulness, we'd need to keep databases more separate from each
> other than we do now. Databases would need to have their own transaction
> counters, for example. Shared relations would obviously need major
> changes for that to work. If we ultimately could separate databases so
> that you could take a filesystem copy of a single database, and restore
> it to another cluster, then per-database WAL and PITR would work.
>
I agree that we can't have a separate WAL per database. It looks
like it makes more sense to create a separate database cluster, rather than
adding one more database, if we want to make better use of the available
horsepower and we don't have cross-database queries.

Thanks,
Gokul.


From: "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>, "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 19:26:19
Message-ID: 1d4e0c10801161126w6450fbbco97e4f4ce8ead9767@mail.gmail.com
Lists: pgsql-hackers

On Jan 16, 2008 7:41 PM, Heikki Linnakangas <heikki(at)enterprisedb(dot)com> wrote:
> I don't think it's going to work too well, though, not without major
> changes at least.

Well, I know it's really not doable with the current behaviour of WAL.
I just wanted to mention this feature request because we have had it a few
times, having one cluster per database is not really smart, and it wasn't
too far from the subject.

> What would happen when you restore a PITR backup of
> just one database? Would the other databases still be there in the
> restored cluster?

In a perfect world, you should be able to trigger PITR for only
one database of the cluster while the other databases keep running.
It's especially interesting on a shared database server.
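
For context, this is roughly what drives a point-in-time recovery today
(the path and the timestamp are only illustrations):

   restore_command = 'cp /mnt/archive/%f %p'
   recovery_target_time = '2008-01-16 17:55:00'

Both the archived WAL and the recovery target apply to the cluster as a
whole, so stopping just before an accidental DROP TABLE rewinds every
database in the cluster to that point. That's exactly why that customer
ended up with one cluster per database.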

--
Guillaume


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Guillaume Smet <guillaume(dot)smet(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gokulakannan Somasundaram <gokul007(at)gmail(dot)com>, Markus Schiltknecht <markus(at)bluegap(dot)ch>, Gregory Stark <stark(at)enterprisedb(dot)com>, pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 20:52:56
Message-ID: 20080116205256.GL5076@alvh.no-ip.org
Lists: pgsql-hackers

Heikki Linnakangas escribió:

> I don't think it's going to work too well, though, not without major
> changes at least. What would happen when you restore a PITR backup of just
> one database? Would the other databases still be there in the restored
> cluster? What state would they be in? After restoring one database, and
> doing some stuff on it, could you ever "merge" those changes with the rest
> of the cluster?

Well, a PITR slave, once you have modified it, cannot be brought back in
sync with the master. This is no different.

If you replicate a single database's stream, the other databases should
not be there. My idea is that a slave could request multiple databases'
streams. The ability to do it is needed anyway, to follow both the
basic database stream and the shared stream.

> Mind you, there's more things shared between databases than the shared
> catalogs. clog for example.

Sure --- my original proposal mentioned using the shared WAL stream
for global objects (I didn't mention pg_clog, but surely it had
better be there).

> For more usefulness, we'd need to keep databases more separate from each
> other than we do now. Databases would need to have their own transaction
> counters, for example.

Hmm, why? Perhaps you are right but I don't see the reason.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>
Cc: "Guillaume Smet" <guillaume(dot)smet(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gokulakannan Somasundaram" <gokul007(at)gmail(dot)com>, "Markus Schiltknecht" <markus(at)bluegap(dot)ch>, "Gregory Stark" <stark(at)enterprisedb(dot)com>, "pgsql-hackers list" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 21:12:39
Message-ID: 478E7347.90909@enterprisedb.com
Lists: pgsql-hackers

Alvaro Herrera wrote:
> Heikki Linnakangas escribió:
>> For more usefulness, we'd need to keep databases more separate from each
>> other than we do now. Databases would need to have their own transaction
>> counters, for example.
>
> Hmm, why? Perhaps you are right but I don't see the reason.

If each database were stand-alone, you would need only one base backup
and one WAL stream per database to restore, instead of the base backup and
WAL of the database plus the base backup and WAL of the shared stuff. You
could back up one database in a cluster, restore it somewhere else, and
later copy it back to the original cluster. You could back up one database
at a time, and restore the whole cluster from the N per-database backups.
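
As a rough sketch of why that doesn't work today, a base backup is taken
for the whole cluster at once (the label and comments are only illustrative):

   SELECT pg_start_backup('nightly');
   -- tar/copy the entire $PGDATA: base/ for every database, plus
   -- global/, pg_clog/, pg_xlog/ and friends
   SELECT pg_stop_backup();

There's no way to carve out a self-contained archive of just
base/<database-oid>, because the commit status in pg_clog and the shared
catalogs in global/ that it depends on are cluster-wide.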

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: pgsql-hackers list <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Some ideas about Vacuum
Date: 2008-01-16 21:21:44
Message-ID: Pine.GSO.4.64.0801161431080.28184@westnet.com
Lists: pgsql-hackers

On Wed, 16 Jan 2008, Kevin Grittner wrote:

> I haven't seen any benchmarks on the list or in our environment
> where the separate spindles gave more than a 1% increase in
> performance when using a good-quality BBC controller.

Well, even 1% isn't nothing, which is the main point I was making--it
doesn't completely remove the gain, just reduces it a lot. If you wanted
to see a bigger difference, you could simulate a workload with lots of
clients doing short transactions, say with a pgbench run like the one
sketched below.
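
Something along these lines (scale factor and client count are just
placeholders) would generate that kind of commit-heavy load:

   pgbench -i -s 100 bench
   pgbench -c 50 -t 10000 bench

With many concurrent small commits, it's the constant WAL fsyncs that
dominate, which is where a dedicated spindle (or the BBC soaking up the
writes) shows up most clearly.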

The biggest gain from having a separate WAL isn't as obvious in gross
performance measurements. It's what happens to worst-case performance for
the transactions just after a checkpoint, when there is a burst of
full-page writes (they normally settle down as the most popular pages get
written). That's the spot where you're most likely to run into a WAL
bottleneck that just having a BBC doesn't completely eliminate.

> Do you have results that show more of a difference? Can you share them?

I wasn't trying to quantify this particular number, and it would take a bit
just to figure out what I could and couldn't share. Expanding on the
above, though, if you look at some of the recent public benchmarks like
http://www.kaltenbrunner.cc/blog/index.php?/archives/21-8.3-vs.-8.2-a-simple-benchmark.html
you'll see Stefan was able to hit around 4000 TPS on that test system.
Now, if you had a typical 256MB BBC (it's 512MB there) and full-page
writes are dumping 8K each, that means you can fit 32768 of them before
you blow your cache and the disks really have to keep up--and the WAL
doesn't get the whole cache to itself. The first 10-20 seconds after a
checkpoint on such a system are kind of interesting to zoom in on. If the
WAL has to fight for seek time with database reads during that period (DB
writes will still be mostly cached by the OS just after a checkpoint) it
can be messy compared to what you get with a dedicated WAL. But that will
average out to a minimal effect on TPS over the course of the test.
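
Spelling out that arithmetic, with the pessimistic assumption that every
transaction logs a fresh full-page image:

   256 MB / 8 KB per full-page image = 32768 images
   32768 images / ~4000 TPS          ~= 8 seconds to fill the cache

In practice only the first touch of each page after the checkpoint
generates a full-page image, but the WAL also has to share the cache with
everything else, so the real window is only in that general ballpark.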

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD