vacuum, performance, and MVCC

From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: vacuum, performance, and MVCC
Date: 2006-06-21 18:10:31
Message-ID: 18613.24.91.171.78.1150913431.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

We all know that PostgreSQL suffers performance problems when rows are
updated frequently prior to a vacuum. The most serious example can be seen
by using PostgreSQL as a session handler for a busy web site. You may have
thousands or millions of active sessions, each being updated per page hit.

Each time the record is updated, a new version is created, thus
lengthening the "correct" version search each time the row is accessed, until,
of course, the next vacuum comes along and corrects the index to point to
the latest version of the record.

Is that a fair explanation?

If my assertion is fundamentally true, then PostgreSQL will always suffer
performance penalties under a heavy modification load. Of course, tables
with many inserts are not an issue, it is mainly updates. The problem is
that there are classes of problems where updates are the primary
operation.

I was thinking, just as a hypothetical, what if we reversed the problem,
and always referenced the newest version of a row and scanned backwards
across the versions to the first that has a lower transaction number?

One possible implementation: PostgreSQL could keep an indirection array of
index to table ref for use by all the indexes on a table. The various
indexes return offsets into the array, not direct table refs. Because the
table refs are separate from the index, they can be updated each time a
transaction is committed.

This way, the newest version of a row is always the first row found. Also,
on a heavily updated site, the most used rows would always be at the end
of the table, reducing the amount of disk reads or cache memory required to
find the correct row version for each query.
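
To make the idea concrete, here is a minimal C sketch of such an indirection
array. Every name in it (RowSlot, SlotArray, slot_lookup, slot_commit) is
invented for illustration; this is the shape of the proposal, not PostgreSQL
code:

#include <stdint.h>
#include <stddef.h>

/* One slot per logical row; every index on the table stores a slot
 * number instead of a heap tuple reference. */
typedef struct {
    uint32_t block;     /* heap block of the newest committed version */
    uint16_t offset;    /* line pointer within that block */
} RowSlot;

typedef struct {
    RowSlot *slots;     /* shared by all indexes on one table */
    size_t   nslots;
} SlotArray;

/* An index probe returns a slot number, so every lookup pays one extra
 * hop through the array... */
static RowSlot slot_lookup(const SlotArray *a, size_t slot)
{
    return a->slots[slot];
}

/* ...and a committing transaction redirects the slot to the new row
 * version, leaving every index entry untouched. */
static void slot_commit(SlotArray *a, size_t slot,
                        uint32_t blk, uint16_t off)
{
    a->slots[slot].block  = blk;
    a->slots[slot].offset = off;
}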


From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 01:36:39
Message-ID: 871wtic6oo.fsf@wolfe.cbbrowne.com
Lists: pgsql-hackers

Clinging to sanity, pgsql(at)mohawksoft(dot)com ("Mark Woodward") mumbled into her beard:
> We all know that PostgreSQL suffers performance problems when rows are
> updated frequently prior to a vacuum. The most serious example can be seen
> by using PostgreSQL as a session handler for a busy web site. You may have
> thousands or millions of active sessions, each being updated per page hit.
>
> Each time the record is updated, a new version is created, thus
> lengthening the "correct" version search each time the row is accessed, until,
> of course, the next vacuum comes along and corrects the index to point to
> the latest version of the record.
>
> Is that a fair explanation?

No, it's not.

1. The index points to all the versions, until they get vacuumed out.

2. There may simultaneously be multiple "correct" versions. The
notion that there is one version that is The Correct One is wrong, and
you need to get rid of that thought.

> If my assertion is fundamentally true, then PostgreSQL will always suffer
> performance penalties under a heavy modification load. Of course, tables
> with many inserts are not an issue, it is mainly updates. The problem is
> that there are classes of problems where updates are the primary
> operation.

The trouble with your assertion is that it is true for *all* database
systems except for those whose only transaction mode is READ
UNCOMMITTED, where the only row visible is the "Latest" version.

> I was thinking, just as a hypothetical, what if we reversed the
> problem, and always referenced the newest version of a row and
> scanned backwards across the versions to the first that has a lower
> transaction number?

That would require an index on transaction number, which is an
additional data structure not in place now. That would presumably
worsen things.

> One possible implementation: PostgreSQL could keep an indirection array of
> index to table ref for use by all the indexes on a table. The various
> indexes return offsets into the array, not direct table refs. Because the
> table refs are separate from the index, they can be updated each time a
> transaction is committed.

You mean, this index would be "VACUUMed" as a part of each transaction
COMMIT? I can't see that turning out well...

> This way, the newest version of a row is always the first row
> found. Also, on a heavily updated site, the most used rows would
> always be at the end of the table, reducing the amount of disk reads or
> cache memory required to find the correct row version for each
> query.

I can't see how it follows that most-used rows would migrate to the
end of the table. That would only be true in a database that is never
VACUUMed; as soon as a VACUUM is done, free space opens up in the
interior, so that new tuples may be placed in the "interior."
--
If this was helpful, <http://svcs.affero.net/rm.php?r=cbbrowne> rate me
http://linuxdatabases.info/info/lisp.html
"On a normal ascii line, the only safe condition to detect is a
'BREAK' - everything else having been assigned functions by Gnu
EMACS." -- Tarl Neustaedter


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Christopher Browne" <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 10:52:43
Message-ID: 18435.24.91.171.78.1150973563.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> Clinging to sanity, pgsql(at)mohawksoft(dot)com ("Mark Woodward") mumbled into
> her beard:
>> We all know that PostgreSQL suffers performance problems when rows are
>> updated frequently prior to a vacuum. The most serious example can be
>> seen
>> by using PostgreSQL as a session handler for a busy web site. You may
>> have
>> thousands or millions of active sessions, each being updated per page
>> hit.
>>
>> Each time the record is updated, a new version is created, thus
>> lengthening the "correct" version search each time the row is accessed,
>> until,
>> of course, the next vacuum comes along and corrects the index to point
>> to
>> the latest version of the record.
>>
>> Is that a fair explanation?
>
> No, it's not.
>
> 1. The index points to all the versions, until they get vacuumed out.

It can't point to "all" versions; it points to the last "current" version
as updated by vacuum, or the first version of the row.

>
> 2. There may simultaneously be multiple "correct" versions. The
> notion that there is one version that is The Correct One is wrong, and
> you need to get rid of that thought.

Sorry, this is a misunderstanding. By "correct version search" I meant
"for this transaction." Later I mention finding the first row with a
transaction number lower than the current one.

>
>> If my assertion is fundamentally true, then PostgreSQL will always
>> suffer
>> performance penalties under a heavy modification load. Of course, tables
>> with many inserts are not an issue, it is mainly updates. The problem is
>> that there are classes of problems where updates are the primary
>> operation.
>
> The trouble with your assertion is that it is true for *all* database
> systems except for those whose only transaction mode is READ
> UNCOMMITTED, where the only row visible is the "Latest" version.

Not true. Oracle does not seem to exhibit this problem.

>
>> I was thinking, just as a hypothetical, what if we reversed the
>> problem, and always referenced the newest version of a row and
>> scanned backwards across the versions to the first that has a lower
>> transaction number?
>
> That would require an index on transaction number, which is an
> additional data structure not in place now. That would presumably
> worsen things.

All things being equal, perhaps not. It would probably be a loser if you
have a static database, but in a database that undergoes modification, it
would be the same or less work if the average row has two versions.
(assuming nothing else changes)
>
>> One possible implementation: PostgreSQL could keep an indirection array
>> of
>> index to table ref for use by all the indexes on a table. The various
>> indexes return offsets into the array, not direct table refs. Because
>> the
>> table refs are separate from the index, they can be updated each time a
>> transaction is committed.
>
> You mean, this index would be "VACUUMed" as a part of each transaction
> COMMIT? I can't see that turning out well...

No, it would not be vacuumed!!!

Right now, the indexes point to the lowest row version. When an index
returns the row ID, it is checked for newer versions; if there are any,
the newer versions are searched until the last one is found or one
exceeds the current TID.

>
>> This way, the newest version of a row is always the first row
>> found. Also, on a heavily updated site, the most used rows would
>> always be at the end of the table, reducing the amount of disk reads or
>> cache memory required to find the correct row version for each
>> query.
>
> I can't see how it follows that most-used rows would migrate to the
> end of the table.

Sorry, OK, as an assumption it ignores the FSM, but the idea is that there is
only one lookup.

> That would only be true in a database that is never
> VACUUMed; as soon as a VACUUM is done, free space opens up in the
> interior, so that new tuples may be placed in the "interior."

Regardless, the point is that you have to search the [N] versions of a row
to find the latest correct version of the row for your transaction. This
is done, AFAICT, from first to last version, meaning that the work
required to find a row increases with every update prior to vacuum.

PostgreSQL fails miserably as a web session handler because of this
behavior: it requires too-frequent vacuums and delivers inconsistent
performance.

OK, forget the version array; it was just an off-the-top-of-my-head idea.
How about this:

Currently a row does this:

row_TID[0] -> row_TID[1] -> row_TID[2] ... row_TID[LAST-1] -> row_TID[LAST]

with each version pointing to the next. What if it did this instead:

row_TID[0] -> row_TID[LAST] -> row_TID[LAST-1] ... -> row_TID[2] ->
row_TID[1]

The base tuple of a version chain gets updated to point to the latest
committed row. It should have fairly low impact on performance on a static
database, but it would REALLY speed up PostgreSQL on a heavily modified
database, provide more consistent performance between vacuums, and require
fewer vacuums to maintain performance.
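
A minimal C sketch of that reversed chain, with invented names and a
deliberately oversimplified layout:

#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;

typedef struct Version {
    Xid             xmin;   /* transaction that created this version */
    struct Version *prev;   /* next-older version; NULL at the oldest */
} Version;

/* The base tuple no longer leads to the oldest version; it always
 * points at the newest committed one. */
typedef struct {
    Version *newest;
} BaseTuple;

/* COMMIT of an update pushes the new version onto the front of the
 * chain and repoints the base tuple, so a reader scanning newest to
 * oldest usually stops after a single hop. */
static void commit_update(BaseTuple *base, Version *v)
{
    v->prev = base->newest;
    base->newest = v;
}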


From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 12:23:48
Message-ID: 87veqtbcq3.fsf@wolfe.cbbrowne.com
Lists: pgsql-hackers

After a long battle with technology, pgsql(at)mohawksoft(dot)com ("Mark Woodward"), an earthling, wrote:
>> Clinging to sanity, pgsql(at)mohawksoft(dot)com ("Mark Woodward") mumbled into
>> her beard:
>>> We all know that PostgreSQL suffers performance problems when rows are
>>> updated frequently prior to a vacuum. The most serious example can be
>>> seen
>>> by using PostgreSQL as a session handler for a busy web site. You may
>>> have
>>> thousands or millions of active sessions, each being updated per page
>>> hit.
>>>
>>> Each time the record is updated, a new version is created, thus
>>> lengthening the "correct" version search each time the row is accessed,
>>> until,
>>> of course, the next vacuum comes along and corrects the index to point
>>> to
>>> the latest version of the record.
>>>
>>> Is that a fair explanation?
>>
>> No, it's not.
>>
>> 1. The index points to all the versions, until they get vacuumed out.
>
> It can't point to "all" versions; it points to the last "current" version
> as updated by vacuum, or the first version of the row.

No, it points to *all* the versions.

Suppose I take a table with two rows:

INFO: analyzing "public.test"
INFO: "test": 1 pages, 2 rows sampled, 2 estimated total rows
VACUUM

Then, over and over, I remove and insert one entry with the same PK:

sample=# delete from test where id = 2;insert into test (id) values (2);
DELETE 1
INSERT 4842550 1
sample=# delete from test where id = 2;insert into test (id) values (2);
DELETE 1
INSERT 4842551 1
sample=# delete from test where id = 2;insert into test (id) values (2);
DELETE 1
INSERT 4842552 1
sample=# delete from test where id = 2;insert into test (id) values (2);
DELETE 1
INSERT 4842553 1
sample=# delete from test where id = 2;insert into test (id) values (2);
DELETE 1
INSERT 4842554 1
sample=# delete from test where id = 2;insert into test (id) values (2);
DELETE 1
INSERT 4842555 1
sample=# delete from test where id = 2;insert into test (id) values (2);
DELETE 1
INSERT 4842556 1
sample=# delete from test where id = 2;insert into test (id) values (2);
DELETE 1
INSERT 4842557 1
sample=# delete from test where id = 2;insert into test (id) values (2);
DELETE 1
INSERT 4842558 1
sample=# delete from test where id = 2;insert into test (id) values (2);
DELETE 1
INSERT 4842559 1

Now, I vacuum it.

sample=# vacuum verbose analyze test;
INFO: vacuuming "public.test"
INFO: index "test_id_key" now contains 2 row versions in 2 pages
DETAIL: 10 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "test": removed 10 row versions in 1 pages
DETAIL: CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: "test": found 10 removable, 2 nonremovable row versions in 1 pages
DETAIL: 0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO: analyzing "public.test"
INFO: "test": 1 pages, 2 rows sampled, 2 estimated total rows
VACUUM

Notice that the index contained 10 versions of that one row.

It pointed to *ALL* the versions.

>> 2. There may simultaneously be multiple "correct" versions. The
>> notion that there is one version that is The Correct One is wrong, and
>> you need to get rid of that thought.
>
> Sorry, this is a misunderstanding. By "correct version search" I meant
> "for this transaction." Later I mention finding the first row with
> a transaction number lower than the current one.

Ah. Then you need for each transaction to spawn an index for each
table that excludes non-current values.

>>> If my assertion is fundamentally true, then PostgreSQL will always
>>> suffer performance penalties under a heavy modification load. Of
>>> course, tables with many inserts are not an issue, it is mainly
>>> updates. The problem is that there are classes of problems where
>>> updates are the primary operation.
>>
>> The trouble with your assertion is that it is true for *all* database
>> systems except for those whose only transaction mode is READ
>> UNCOMMITTED, where the only row visible is the "Latest" version.
>
> Not true. Oracle does not seem to exhibit this problem.

Oracle suffers a problem in this regard that PostgreSQL doesn't; in
Oracle, rollbacks are quite expensive, as "recovery" requires doing
extra work that PostgreSQL doesn't do.
--
output = ("cbbrowne" "@" "gmail.com")
http://cbbrowne.com/info/
Marriage means commitment. Of course, so does insanity.


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 13:59:06
Message-ID: 20060622135906.GI29883@surnet.cl
Lists: pgsql-hackers

Mark Woodward wrote:

> Hmm, OK, then the problem is more serious than I suspected.
> This means that every index on a row has to be updated on every
> transaction that modifies that row. Is that correct?

Add an index entry, yes.

> I am attaching some code that shows the problem with regard to
> applications such as web server session management; when run, each second
> the system can handle fewer and fewer connections. Here is a brief output:
> [...]
> There has to be a more linear way of handling this scenario.

So vacuum the table often.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Christopher Browne" <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 13:59:42
Message-ID: 18188.24.91.171.78.1150984782.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> After a long battle with technology, pgsql(at)mohawksoft(dot)com ("Mark
> Woodward"), an earthling, wrote:
>>> Clinging to sanity, pgsql(at)mohawksoft(dot)com ("Mark Woodward") mumbled into
>>> her beard:
[snip]
>>>
>>> 1. The index points to all the versions, until they get vacuumed out.
>>
>> It can't point to "all" versions; it points to the last "current"
>> version
>> as updated by vacuum, or the first version of the row.
>
> No, it points to *all* the versions.
>
> Suppose I take a table with two rows:
>
> INFO: analyzing "public.test"
> INFO: "test": 1 pages, 2 rows sampled, 2 estimated total rows
> VACUUM
>
> Then, over and over, I remove and insert one entry with the same PK:
>
> sample=# delete from test where id = 2;insert into test (id) values (2);
> DELETE 1

[snip]
>
> Now, I vacuum it.
>
> sample=# vacuum verbose analyze test;
> INFO: vacuuming "public.test"
> INFO: index "test_id_key" now contains 2 row versions in 2 pages
> DETAIL: 10 index row versions were removed.
> 0 index pages have been deleted, 0 are currently reusable.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO: "test": removed 10 row versions in 1 pages
> DETAIL: CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO: "test": found 10 removable, 2 nonremovable row versions in 1 pages
> DETAIL: 0 dead row versions cannot be removed yet.
> There were 0 unused item pointers.
> 0 pages are entirely empty.
> CPU 0.00s/0.00u sec elapsed 0.00 sec.
> INFO: analyzing "public.test"
> INFO: "test": 1 pages, 2 rows sampled, 2 estimated total rows
> VACUUM
>
> Notice that the index contained 10 versions of that one row.
>
> It pointed to *ALL* the versions.

Hmm, OK, then the problem is more serious than I suspected.
This means that every index on a row has to be updated on every
transaction that modifies that row. Is that correct?

I am attaching some code that shows the problem with regard to
applications such as web server session management; when run, each second
the system can handle fewer and fewer connections. Here is a brief output:

markw(at)ent:~/pgfoo$ ./footest
1307 sessions per second, elapsed: 1
1292 sessions per second, elapsed: 2
1287 sessions per second, elapsed: 3
....
1216 sessions per second, elapsed: 25
1213 sessions per second, elapsed: 26
1208 sessions per second, elapsed: 27
....
1192 sessions per second, elapsed: 36
1184 sessions per second, elapsed: 37
1183 sessions per second, elapsed: 38
....
1164 sessions per second, elapsed: 58
1170 sessions per second, elapsed: 59
1168 sessions per second, elapsed: 60

As you can see, in about a minute at high load, this very simple table
lost about 10% of its performance, and I've seen worse based on update
frequency. Before you say this is an obscure problem, I can tell you it
isn't. I have worked with more than a few projects that had to switch away
from PostgreSQL because of this behavior.

Obviously this is not a problem with small sites, but this is a real
problem with an enterprise level web site with millions of visitors and
actions a day. Quite frankly it is a classic example of something that
does not scale. The more updates there are, the higher the load
becomes. You can see it on "top" as the footest program runs.

There has to be a more linear way of handling this scenario.

Attachment Content-Type Size
footest.c text/x-csrc 1.9 KB
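
footest.c itself survives only as an attachment. As a rough guess at the
shape of such a benchmark (the sessions table, its columns, and the
connection string are all assumptions), it presumably does something like:

#include <stdio.h>
#include <time.h>
#include <libpq-fe.h>      /* link with -lpq */

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=sample");
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    time_t start = time(NULL), last = start;
    long   hits = 0;

    while (time(NULL) - start < 60) {
        /* each simulated page hit rewrites one session row, leaving a
         * dead tuple behind for vacuum to reap later */
        PGresult *res = PQexec(conn,
            "UPDATE sessions SET last_seen = now() WHERE session_id = 42");
        PQclear(res);
        hits++;

        time_t now = time(NULL);
        if (now != last) {
            printf("%ld sessions per second, elapsed: %ld\n",
                   hits, (long) (now - start));
            hits = 0;
            last = now;
        }
    }

    PQfinish(conn);
    return 0;
}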

From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Mark Woodward <pgsql(at)mohawksoft(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 14:09:19
Message-ID: 1150985358.3309.90.camel@coppola.muc.ecircle.de
Lists: pgsql-hackers

> > [...]
> > There has to be a more linear way of handling this scenario.
>
> So vacuum the table often.

Good advice, except if the table is huge :-)

Here we have for example some tables which are frequently updated but
contain >100 million rows. Vacuuming that takes hours. And the dead row
candidates are the ones which are updated again and again and looked up
frequently...

A good solution would be a new type of vacuum which does not need to do
a full table scan but can clean the pending dead rows without one... I
guess then I could vacuum those tables really frequently.

Cheers,
Csaba.


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 14:20:28
Message-ID: 36e682920606220720k132009ap81c06e41b28a8217@mail.gmail.com
Lists: pgsql-hackers

On 6/22/06, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
> > Hmm, OK, then the problem is more serious than I suspected.
> > This means that every index on a row has to be updated on every
> > transaction that modifies that row. Is that correct?
>
> Add an index entry, yes.

Again, this is a case for update-in-place. No need to write an extra
index entry and incur the WAL associated with it. Imagine a table
with 3 indexes on it... I would estimate that we perform at least 3 to
6 times more overhead than any commercial database on such an update.

> > There has to be a more linear way of handling this scenario.
>
> So vacuum the table often.

It's easy to say VACUUM often... but I'd bet that vacuuming is going
to lessen the throughput in his tests even more, no matter how it's
tuned.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/


From: Mario Weilguni <mweilguni(at)sime(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 14:44:04
Message-ID: 200606221644.04457.mweilguni@sime.com
Lists: pgsql-hackers

On Thursday, 22 June 2006 16:09, Csaba Nagy wrote:
> > > [...]
> > > There has to be a more linear way of handling this scenario.
> >
> > So vacuum the table often.
>
> Good advice, except if the table is huge :-)
>
> Here we have for example some tables which are frequently updated but
> contain >100 million rows. Vacuuming that takes hours. And the dead row
> candidates are the ones which are updated again and again and looked up
> frequently...
>
> A good solution would be a new type of vacuum which does not need to do
> > a full table scan but can clean the pending dead rows without one... I
> > guess then I could vacuum those tables really frequently.

Now that there is autovacuum, why not think of something like continuous
vacuum? A background process that gets info about potentially changed
tuples and vacuums them (only those tuples), possibly while honouring the
I/O needs of busy backends (not stealing I/O from them).

For sure not as easy as autovacuum. I'm pretty sure I've read something
about partial vacuum lately; is somebody working on something like this?

Regards,
Mario


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 15:17:54
Message-ID: 1150989474.4370.17.camel@localhost.localdomain
Lists: pgsql-hackers

On one fine day, Thu, 2006-06-22 at 09:59, Mark Woodward wrote:
> > After a long battle with technology, pgsql(at)mohawksoft(dot)com ("Mark
> > Woodward"), an earthling, wrote:
> >>> Clinging to sanity, pgsql(at)mohawksoft(dot)com ("Mark Woodward") mumbled into
> > It pointed to *ALL* the versions.
>
> Hmm, OK, then the problem is more serious than I suspected.
> This means that every index on a row has to be updated on every
> transaction that modifies that row. Is that correct?

Yes.

> I am attaching some code that shows the problem with regard to
> applications such as web server session management; when run, each second
> the system can handle fewer and fewer connections. Here is a brief output:
>
> markw(at)ent:~/pgfoo$ ./footest
> 1307 sessions per second, elapsed: 1
> 1292 sessions per second, elapsed: 2
> 1287 sessions per second, elapsed: 3
> ....
> 1216 sessions per second, elapsed: 25
> 1213 sessions per second, elapsed: 26
> 1208 sessions per second, elapsed: 27
> ....
> 1192 sessions per second, elapsed: 36
> 1184 sessions per second, elapsed: 37
> 1183 sessions per second, elapsed: 38
> ....
> 1164 sessions per second, elapsed: 58
> 1170 sessions per second, elapsed: 59
> 1168 sessions per second, elapsed: 60
>
> As you can see, in about a minute at high load, this very simple table
> lost about 10% of its performance, and I've seen worse based on update
> frequency. Before you say this is an obscure problem, I can tell you it
> isn't. I have worked with more than a few projects that had to switch away
> from PostgreSQL because of this behavior.

You mean systems that are designed so tightly that they can't take a 10%
performance change?

Or just that they did not vacuum for so long that performance was less
than needed in the end?

btw, what did they switch to?

> Obviously this is not a problem with small sites, but this is a real
> problem with an enterprise level web site with millions of visitors and
> actions a day.

On such a site you should design so that db load stays below 50% and run
vacuum "often"; that may even mean that you run vacuum continuously with
no wait between runs, if you run vacuum with the right settings.

> Quite frankly it is a classic example of something that
> does not scale. The more updates there are, the higher the load
> becomes. You can see it on "top" as the footest program runs.

Yes, you understood correctly - the more updates, the higher the load :)

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: Hannu Krosing <hannu(at)skype(dot)net>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: Mark Woodward <pgsql(at)mohawksoft(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 15:38:19
Message-ID: 1150990699.4370.33.camel@localhost.localdomain
Lists: pgsql-hackers

On one fine day, Thu, 2006-06-22 at 10:20, Jonah H. Harris wrote:
> On 6/22/06, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
> > > Hmm, OK, then the problem is more serious than I suspected.
> > > This means that every index on a row has to be updated on every
> > > transaction that modifies that row. Is that correct?
> >
> > Add an index entry, yes.
>
> Again, this is a case for update-in-place. No need to write an extra
> index entry and incur the WAL associated with it.

I guess that MySQL on its original storage does that, but they allow
only one concurrent update per table and no transactions.

> Imagine a table
> with 3 indexes on it... I would estimate that we perform at least 3 to
> 6 times more overhead than any commercial database on such an update.

One way to describe what "commercial databases" do to keep constant
update rates is to say that they either do vacuuming as part of the
update, or they just use locks and force some transactions to wait or
fail/retry.

Depending on exact details and optimisations done, this can be either
slower or faster than PostgreSQL's way, but they still need to do
something to get transactional visibility rules implemented.

> > > There has to be a more linear way of handling this scenario.
> >
> > So vacuum the table often.
>
> It's easy to say VACUUM often... but I'd bet that vacuuming is going
> to lessen the throughput in his tests even more, no matter how it's
> tuned.

Running VACUUM often/continuously will likely keep his update rate
fluctuations within a corridor of maybe 5-10%, at the cost of 1-2% extra
load. At least if vacuum is configured right and the server is not
already running at 100% IO saturation, in which case it will be worse.

The max throughput figure is not something you actually need very often
in production. What is interesting is setting up the server so that you
can service your loads comfortably. Running the server at 100% load is
not something you want to do on a production server. There will be things
you need to do anyway and you need some headroom for that.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: Chris Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 16:08:53
Message-ID: 604pydxje2.fsf@dba2.int.libertyrms.com
Lists: pgsql-hackers

nagy(at)ecircle-ag(dot)com (Csaba Nagy) writes:

>> > [...]
>> > There has to be a more linear way of handling this scenario.
>>
>> So vacuum the table often.
>
> Good advice, except if the table is huge :-)

... Then the table shouldn't be designed to be huge. That represents
a design error.

> Here we have for example some tables which are frequently updated but
> contain >100 million rows. Vacuuming that takes hours. And the dead row
> candidates are the ones which are updated again and again and looked up
> frequently...

This demonstrates that "archival" material and "active" data should be
kept separately.

They have different access patterns; kludging them into the same table
turns out badly.

> A good solution would be a new type of vacuum which does not need to
> do a full table scan but can clean the pending dead rows without
> one... I guess then I could vacuum those tables really frequently.

That's yet another feature that's on the ToDo list: the "Vacuum Space
Map."

The notion is to have lists of recently modified pages, and to
restrict VACUUM to those pages. (Probably a special version of
VACUUM...)
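
A toy C sketch of that notion, with invented names and no claim to match
any actual design: a per-relation bitmap of recently dirtied pages that a
partial VACUUM consults instead of scanning the whole heap.

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint8_t *bits;      /* one bit per heap page */
    size_t   npages;
} VacuumSpaceMap;

/* UPDATE/DELETE would mark the page it dirtied... */
static void vsm_mark_dirty(VacuumSpaceMap *m, size_t page)
{
    m->bits[page / 8] |= (uint8_t) (1 << (page % 8));
}

/* ...and a partial vacuum visits only marked pages, clearing each mark,
 * instead of scanning all m->npages of the relation. */
static void vsm_partial_vacuum(VacuumSpaceMap *m,
                               void (*vacuum_page)(size_t))
{
    for (size_t p = 0; p < m->npages; p++)
        if (m->bits[p / 8] & (1 << (p % 8))) {
            vacuum_page(p);
            m->bits[p / 8] &= (uint8_t) ~(1 << (p % 8));
        }
}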
--
output = reverse("moc.enworbbc" "@" "enworbbc")
http://www.ntlug.org/~cbbrowne/lisp.html
"As I've gained more experience with Perl it strikes me that it
resembles Lisp in many ways, albeit Lisp as channeled by an awk script
on acid." -- Tim Moore (on comp.lang.lisp)


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Hannu Krosing" <hannu(at)skype(dot)net>
Cc: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 16:21:55
Message-ID: 36e682920606220921r55a47422t8e1d90bf30a2476b@mail.gmail.com
Lists: pgsql-hackers

On 6/22/06, Hannu Krosing <hannu(at)skype(dot)net> wrote:
> I guess that MySQL on its original storage does that, but they allow
> only one concurrent update per table and no transactions.

More like practically every commercial database. As ~97% of
transactions commit (yes, some can argue that number), most commercial
systems prefer optimistic storage managers, whereas PostgreSQL opts
for the ~3% of pessimistic cases.

Let's see, if I had a 97% chance of winning the lottery... I'd
probably play a lot more than if I only had a 3% chance.

> One way to describe what "commercial databases" do to keep constant
> update rates is to say that they either do vacuuming as part of the
> update, or they just use locks and force some transactions to wait or
> fail/retry.

Not exactly... there are several ways to handle UNDO without locking.
Certainly the other systems have to perform background cleanup, but
I'd hardly compare that to vacuuming.

> Depending on exact details and optimisations done, this can be either
> slower or faster than PostgreSQL's way, but they still need to do
> something to get transactional visibility rules implemented.

No argument there... but I have yet to find proof that
straight-out-of-CVS PostgreSQL (including tuning) is faster than any
commercial system on almost any large workload. Without a doubt, our
MVCC is, in most cases, much nicer than that of other OSS databases. However,
it hasn't yet proven itself to be better than the concepts employed by
the database vendors with billions of dollars. The reasoning behind
PostgreSQL's storage and MVCC architecture made sense for its design,
but this had nothing to do with creating a super-highly-concurrent
database.

> Running the server at 100% load is not something you want to
> do on a production server. There will be things you need to
> do anyway and you need some headroom for that.

No argument there.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Hannu Krosing" <hannu(at)skype(dot)net>
Cc: "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 16:33:52
Message-ID: 18263.24.91.171.78.1150994032.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> On one fine day, Thu, 2006-06-22 at 09:59, Mark Woodward wrote:
>> > After a long battle with technology, pgsql(at)mohawksoft(dot)com ("Mark
>> > Woodward"), an earthling, wrote:
>> >>> Clinging to sanity, pgsql(at)mohawksoft(dot)com ("Mark Woodward") mumbled
>> into
>> > It pointed to *ALL* the versions.
>>
>> Hmm, OK, then the problem is more serious than I suspected.
>> This means that every index on a row has to be updated on every
>> transaction that modifies that row. Is that correct?
>
> Yes.
>
>> I am attaching some code that shows the problem with regard to
>> applications such as web server session management; when run, each
>> second
>> the system can handle fewer and fewer connections. Here is a brief
>> output:
>>
>> markw(at)ent:~/pgfoo$ ./footest
>> 1307 sessions per second, elapsed: 1
>> 1292 sessions per second, elapsed: 2
>> 1287 sessions per second, elapsed: 3
>> ....
>> 1216 sessions per second, elapsed: 25
>> 1213 sessions per second, elapsed: 26
>> 1208 sessions per second, elapsed: 27
>> ....
>> 1192 sessions per second, elapsed: 36
>> 1184 sessions per second, elapsed: 37
>> 1183 sessions per second, elapsed: 38
>> ....
>> 1164 sessions per second, elapsed: 58
>> 1170 sessions per second, elapsed: 59
>> 1168 sessions per second, elapsed: 60
>>
>> As you can see, in about a minute at high load, this very simple table
>> lost about 10% of its performance, and I've seen worse based on update
>> frequency. Before you say this is an obscure problem, I can tell you it
>> isn't. I have worked with more than a few projects that had to switch
>> away
>> from PostgreSQL because of this behavior.
>
> You mean systems that are designed so tightly that they can't take a 10%
> performance change?

No, that's not really the point: performance degrades over time; in one
minute it degraded 10%.

The update-to-session ratio has a HUGE impact on PostgreSQL. If you have a
thousand active sessions, it may take a minute to degrade 10%, depending
on the ratio of active users to operations per session per action.

If an active user causes a session update once a second, that is not too
bad, but if an active user updates a session more often, then it is worse.

Generally speaking, sessions aren't updated when they change; they are
usually updated per HTTP request. The data in a session may not change,
but the session handling code doesn't know this and simply updates anyway.

In a heavily AJAX site, you may have many smaller HTTP requests returning
items in a page. So, a single page may consist of multiple HTTP requests.
Worse yet, as a user drags an image around, there are lots of background
requests being made. Each request typically means a session lookup and a
session update. This is compounded by the number of active users. Since
the object of a site is to have many active users, this is always a
problem. It is less intrusive now that non-locking vacuum is there, but
that doesn't mean it isn't a problem.

>
> Or just that they did not vacuum for so long that performance was less
> than needed in the end?

In an active site or application, vacuuming often enough to prevent this
is, itself, a load on the system.

>
> btw, what did they switch to?

One switched to Oracle and one is using a session handler I wrote for PHP.
One company I did work for tried to maintain a table with a single row
that indicated state; this single row would sometimes take more than a
second to query. It was horrible. I'm not sure what they ended up using,
but I wrote a shared-memory-variable C function that got rid of that
specific problem. They were trying to use PostgreSQL as the database to
implement a HUGE redundant networked file system. My personal opinion was
that their biggest problem was that they decided to use Java as the
programming environment, but that's another issue.

>
>> Obviously this is not a problem with small sites, but this is a real
>> problem with an enterprise level web site with millions of visitors and
>> actions a day.
>
> On such a site you should design so that db load stays below 50% and run
> vacuum "often"; that may even mean that you run vacuum continuously with
> no wait between runs, if you run vacuum with the right settings.

Yea, but that, at least in my opinion, is a poor design.
>
>> Quite frankly it is a classic example of something that
>> does not scale. The more updates there are, the higher the load
>> becomes. You can see it on "top" as the footest program runs.
>
> Yes, you understood correctly - the more updates, the higher the load :)

Imagine this:

Each row in a table has a single entry that represents that row. Let's call
it the "key" entry. Whether or not the key entry maintains data is an
implementation detail.

When indexing a table, the index always points to the key entry for a row.

When a row is updated, in the spirit of MVCC, a new data row is created.
The key entry is then updated to point to the new version of the row. The
new row points to the previous version of the row, and the previous entry
continues to point to its previous entry, etc.

When a row is found by the index, the key entry is found first. Much more
often than not, the latest entry in the table for a row is the correct row
for a query.

Now this works in the simplest cases, and there are edge conditions where
index keys change, of course, but that's why it's called software
development and not typing.

However, this accomplishes a few things: updates become cheaper because
indexes do not need to be updated if the keys don't change, and
performance doesn't degrade based on the number of updates a row has had.
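
A sketch of that lookup path, in the same invented C notation as the
earlier sketches; the one-line visibility test stands in for the real
snapshot rules:

#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;

typedef struct Version {
    Xid             xmin;    /* creating transaction */
    struct Version *older;   /* descending chain of prior versions */
} Version;

typedef struct {
    Version *latest;         /* the one target every index points at */
} KeyEntry;

/* Index lookup: the index yields a key entry, and much more often than
 * not the first version on the chain is already the visible one. */
static Version *key_entry_fetch(const KeyEntry *k, Xid snapshot_xmax)
{
    for (Version *v = k->latest; v != NULL; v = v->older)
        if (v->xmin < snapshot_xmax)   /* grossly simplified visibility */
            return v;
    return NULL;
}

/* Update: link the new version in and swing the key entry; no index
 * needs touching as long as the indexed key columns are unchanged. */
static void key_entry_update(KeyEntry *k, Version *newv)
{
    newv->older = k->latest;
    k->latest = newv;
}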


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Hannu Krosing" <hannu(at)skype(dot)net>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 16:41:36
Message-ID: 18267.24.91.171.78.1150994496.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> On one fine day, Thu, 2006-06-22 at 10:20, Jonah H. Harris wrote:
>> On 6/22/06, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
>> > > Hmm, OK, then the problem is more serious than I suspected.
>> > > This means that every index on a row has to be updated on every
>> > > transaction that modifies that row. Is that correct?
>> >
>> > Add an index entry, yes.
>>
>> Again, this is a case for update-in-place. No need to write an extra
>> index entry and incur the WAL associated with it.
>
> I guess that MySQL on its original storage does that, but they allow
> only one concurrent update per table and no transactions.
>
>> Imagine a table
>> with 3 indexes on it... I would estimate that we perform at least 3 to
>> 6 times more overhead than any commercial database on such an update.
>
> One way to describe what "commercial databases" do to keep constant
> update rates is to say that they either do vacuuming as part of the
> update, or they just use locks and force some transactions to wait or
> fail/retry.
>
> Depending on exact details and optimisations done, this can be either
> slower or faster than PostgreSQL's way, but they still need to do
> something to get transactional visibility rules implemented.

I think they have a different strategy. I think they maintain the notion
of "current version" of a row, and hunt for previous versions when needed;
at least, that's how I suspect Oracle does it with redo logs.

>
>> > > There has to be a more linear way of handling this scenario.
>> >
>> > So vacuum the table often.
>>
>> It's easy to say VACUUM often... but I'd bet that vacuuming is going
>> to lessen the throughput in his tests even more, no matter how it's
>> tuned.
>
> Running VACUUM often/continuously will likely keep his update rate
> fluctuations within a corridor of maybe 5-10%, at the cost of 1-2% extra
> load. At least if vacuum is configured right and the server is not
> already running at 100% IO saturation, in which case it will be worse.

Assuming the table is a reasonable size, the I/O required for vacuum
doesn't kill everything else!
>
> The max throughput figure is not something you actually need very often
> in production.

No, but you need to have some degree of certainty and predictability in
the system you are developing.

> What is interesting is setting up the server so that you
> can service your loads comfortably. Running the server at 100% load is
> not something you want to do on a production server. There will be things
> you need to do anyway and you need some headroom for that.

Of course, you design it so peaks are easily managed, but unless you run
vacuum continuously, and that has its own set of problems, you run into
this problem, and it can get really really bad.

>
> --
> ----------------
> Hannu Krosing
> Database Architect
> Skype Technologies OÜ
> Akadeemia tee 21 F, Tallinn, 12618, Estonia
>
> Skype me: callto:hkrosing
> Get Skype for free: http://www.skype.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Christopher Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 16:50:02
Message-ID: 996.1150995002@sss.pgh.pa.us
Lists: pgsql-hackers

Christopher Browne <cbbrowne(at)acm(dot)org> writes:
> After a long battle with technology, pgsql(at)mohawksoft(dot)com ("Mark Woodward"), an earthling, wrote:
>> Not true. Oracle does not seem to exhibit this problem.

> Oracle suffers a problem in this regard that PostgreSQL doesn't; in
> Oracle, rollbacks are quite expensive, as "recovery" requires doing
> extra work that PostgreSQL doesn't do.

The Oracle design has got other drawbacks: if you need to access a row
version other than the very latest, you need to go searching in the
rollback segments for it. This is slow (no index help) and creates
significant amounts of contention (since lots of processes are competing
to touch the rollback segments). Plus there's the old bugaboo that
long-running transactions require indefinite amounts of rollback space,
and Oracle is apparently unable to enlarge that space on-the-fly.
(This last seems like a surmountable problem, but maybe there is some
non-obvious reason why it's hard.)

Basically there's no free lunch: if you want the benefits of MVCC it's
going to cost you somewhere. In the Postgres design you pay by having
to do VACUUM pretty often for heavily-updated tables. I don't think
that decision is fundamentally wrong --- the attractive thing about it
is that the overhead is pushed out of the foreground query-processing
code paths. We still have lots of work to do in making autovacuum
smarter, avoiding vacuuming parts of relations that have not changed,
and so on. But I have no desire to go over to an Oracle-style solution
instead. We can't beat them by trying to be like them, and we run no
small risk of falling foul of some of their patents if we do.

regards, tom lane


From: "Jochem van Dieten" <jochemd(at)gmail(dot)com>
To: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
Cc: "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 16:53:07
Message-ID: f96a9b830606220953i6369796hf18e3631a718068a@mail.gmail.com
Lists: pgsql-hackers

On 6/22/06, Mark Woodward wrote:
(..)
> thousand active sessions
(..)
> If an active user causes a session update once a second
(..)
> Generally speaking, sessions aren't updated when they change; they are
> usually updated per HTTP request. The data in a session may not change,
> but the session handling code doesn't know this and simply updates anyway.

So what you are saying is that you are doing hundreds of unnecessary
updates per second, and as a result of those unnecessary updates you
have a performance problem. Why not attack the root of the problem and
make the session handler smarter? And if you can't do that, put some
logic in the session table that turns an update without changes into a
no-op?
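
One way to sketch that last suggestion, loosely modelled on PostgreSQL's C
trigger API: a BEFORE UPDATE row trigger that returns NULL when the old and
new tuples are byte-for-byte identical, which suppresses the row update.
The function name is invented, and a production version would have to
compare headers and null bitmaps more carefully:

#include "postgres.h"
#include "fmgr.h"
#include "commands/trigger.h"

PG_FUNCTION_INFO_V1(skip_identical_update);

Datum
skip_identical_update(PG_FUNCTION_ARGS)
{
    TriggerData *td = (TriggerData *) fcinfo->context;
    HeapTuple    oldtup, newtup;

    if (!CALLED_AS_TRIGGER(fcinfo) ||
        !TRIGGER_FIRED_BY_UPDATE(td->tg_event) ||
        !TRIGGER_FIRED_BEFORE(td->tg_event))
        elog(ERROR, "skip_identical_update: must be a BEFORE UPDATE trigger");

    oldtup = td->tg_trigtuple;
    newtup = td->tg_newtuple;

    /* Same length, same header size, same data bytes: nothing changed,
     * so return NULL and the UPDATE of this row is skipped entirely. */
    if (oldtup->t_len == newtup->t_len &&
        oldtup->t_data->t_hoff == newtup->t_data->t_hoff &&
        memcmp((char *) oldtup->t_data + oldtup->t_data->t_hoff,
               (char *) newtup->t_data + newtup->t_data->t_hoff,
               oldtup->t_len - oldtup->t_data->t_hoff) == 0)
        return PointerGetDatum(NULL);

    return PointerGetDatum(newtup);
}

It would be attached with something like (table name assumed):
CREATE TRIGGER sessions_skip_noop BEFORE UPDATE ON sessions
FOR EACH ROW EXECUTE PROCEDURE skip_identical_update();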

Jochem


From: Lukas Smith <smith(at)pooteeweet(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 16:58:14
Message-ID: 449ACC26.7010206@pooteeweet.org
Lists: pgsql-hackers

Tom Lane wrote:

> Basically there's no free lunch: if you want the benefits of MVCC it's
> going to cost you somewhere. In the Postgres design you pay by having
> to do VACUUM pretty often for heavily-updated tables. I don't think
> that decision is fundamentally wrong --- the attractive thing about it
> is that the overhead is pushed out of the foreground query-processing
> code paths. We still have lots of work to do in making autovacuum
> smarter, avoiding vacuuming parts of relations that have not changed,
> and so on. But I have no desire to go over to an Oracle-style solution
> instead. We can't beat them by trying to be like them, and we run no
> small risk of falling foul of some of their patents if we do.

The question is just whether it makes sense to give people the option of
running some tables with a different approach where the drawbacks of the
current approach are significant. This would let them stick to
PostgreSQL as their one-stop solution.

The MySQL storage engine plugin architecture does have some merit in
general (even if you consider the rest of the RDBMS along with the
available storage engines to be inferior). Some problems simply require
different algorithms.

regards,
Lukas


From: Lukas Smith <smith(at)pooteeweet(dot)org>
To: Jochem van Dieten <jochemd(at)gmail(dot)com>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 17:01:38
Message-ID: 449ACCF2.3010603@pooteeweet.org
Lists: pgsql-hackers

Jochem van Dieten wrote:

> make the session handler smarter? And if you can't do that, put some
> logic in the session table that turns an update without changes into a
> no-op?

err, isn't that one the job of the database?

regards,
Lukas


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 17:07:02
Message-ID: 36e682920606221007t6653d69eyfe17155375c46787@mail.gmail.com
Lists: pgsql-hackers

On 6/22/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> The Oracle design has got other drawbacks: if you need to access a row
> version other than than the very latest, you need to go searching in the
> rollback segments for it.

There are ways to implement this functionality without implementing it
exactly as Oracle has.

> Plus there's the old bugaboo that long-running transactions
> require indefinite amounts of rollback space, and Oracle is
> apparently unable to enlarge that space on-the-fly.

This has actually gotten much better in recent versions.

> Basically there's no free lunch: if you want the benefits of MVCC it's
> going to cost you somewhere.

Surely. Our MVCC design is great for SELECT, INSERT, and for the most
part, DELETE. However, I'm confident that we can build a hybrid MVCC
model that takes some of the pain out of UPDATE without having to
overcomplicate VACUUM or violate patents.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/


From: "D'Arcy J(dot)M(dot) Cain" <darcy(at)druid(dot)net>
To: Lukas Smith <smith(at)pooteeweet(dot)org>
Cc: jochemd(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 17:08:05
Message-ID: 20060622130805.66da7de2.darcy@druid.net
Lists: pgsql-hackers

On Thu, 22 Jun 2006 19:01:38 +0200
Lukas Smith <smith(at)pooteeweet(dot)org> wrote:

> Jochem van Dieten wrote:
>
> > make the session handler smarter? And if you can't do that, put some
> > logic in the session table that turns an update without changes into a
> > no-op?
>
> err, isn't that one the job of the database?

Seems like a lot of work to give the database for special cases. Not
only would it have to compare all fields, it would also have to check
the behaviour of all rules and triggers and follow their chains. Doing this
for each and every update seems like a loss to me.

Some things just make more sense in the framework of the business rules.

--
D'Arcy J.M. Cain <darcy(at)druid(dot)net> | Democracy is three wolves
http://www.druid.net/darcy/ | and a sheep voting on
+1 416 425 1212 (DoD#0082) (eNTP) | what's for dinner.


From: Rod Taylor <pg(at)rbt(dot)ca>
To: Chris Browne <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 17:23:45
Message-ID: 1150997025.745.137.camel@home
Lists: pgsql-hackers

> > Here we have for example some tables which are frequently updated but
> > contain >100 million rows. Vacuuming that takes hours. And the dead row
> > candidates are the ones which are updated again and again and looked up
> > frequently...
>
> This demonstrates that "archival" material and "active" data should be
> kept separately.
>
> They have different access patterns; kludging them into the same table
> turns out badly.

Rightfully it should be up to the database engine to ensure that both of
these patterns work against the same structure. Splitting up the data
for their access patterns is the job of partitions (hidden from the end
user preferably).

Same table good; same partition and possibly same table space is bad.

--


From: Rod Taylor <pg(at)rbt(dot)ca>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 17:30:36
Message-ID: 1150997436.745.146.camel@home
Lists: pgsql-hackers

> > You mean systems that are designed so tightly that they can't take a 10%
> > performance change?
>
> No, that's not really the point: performance degrades over time; in one
> minute it degraded 10%.
>
> The update-to-session ratio has a HUGE impact on PostgreSQL. If you have a
> thousand active sessions, it may take a minute to degrade 10%, depending
> on the ratio of active users to operations per session per action.

So don't do an update. Multiple updates to the same row block each other
anyway, which is generally not something you want.

If you INSERT into multiple partitions (by time -- say one per minute)
and TRUNCATE periodically (30-minute-old partitions for 30-minute
expiry) it works much better. Expiring the sessions is quite fast as well,
since they'll go away with the truncate.

Index on sessionid and time and grab the row with the most recent time.
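
A rough sketch of the rotation job such a scheme implies, as a small libpq
program run once a minute; the table naming, the inheritance layout, and
the 30-minute expiry are all assumptions:

#include <stdio.h>
#include <time.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=sample");
    if (PQstatus(conn) != CONNECTION_OK)
        return 1;

    time_t now = time(NULL);
    char sql[128];

    /* create the partition that will receive the next minute's INSERTs;
     * errors (e.g. it already exists) are simply ignored in this sketch */
    snprintf(sql, sizeof(sql),
             "CREATE TABLE sessions_%ld () INHERITS (sessions)",
             (long) (now / 60 + 1));
    PQclear(PQexec(conn, sql));

    /* expiry is wholesale: truncating the 30-minute-old partition is far
     * cheaper than DELETE plus VACUUM on one big sessions table */
    snprintf(sql, sizeof(sql),
             "TRUNCATE sessions_%ld", (long) (now / 60 - 30));
    PQclear(PQexec(conn, sql));

    PQfinish(conn);
    return 0;
}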
--


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Rod Taylor" <pg(at)rbt(dot)ca>
Cc: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 17:42:47
Message-ID: 36e682920606221042v2a321189p9ffc79abea14c77a@mail.gmail.com
Lists: pgsql-hackers

On 6/22/06, Rod Taylor <pg(at)rbt(dot)ca> wrote:
> If you INSERT into multiple partitions (by time -- say one per minute)
> and TRUNCATE periodically (30 minute old partitions for 30 minute
> expiry) it works much better. Expiring the session is quite fast as well
> since they'll go away with the truncate.

Forgive me if this sounds rude because I'm not meaning it as such, but
this suggestion sounds like a MySQL-ish hack. No doubt it would work,
but should an application-writer have to incur the cost of writing
something different because the database can't handle it? I remember
having to write subselect code in PHP because MySQL couldn't do it...
not what I would call a *solution* by any means.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/


From: Rod Taylor <pg(at)rbt(dot)ca>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: Mark Woodward <pgsql(at)mohawksoft(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 17:48:06
Message-ID: 1150998486.745.150.camel@home
Lists: pgsql-hackers

On Thu, 2006-06-22 at 13:42 -0400, Jonah H. Harris wrote:
> On 6/22/06, Rod Taylor <pg(at)rbt(dot)ca> wrote:
> > If you INSERT into multiple partitions (by time -- say one per minute)
> > and TRUNCATE periodically (30-minute-old partitions for 30-minute
> > expiry) it works much better. Expiring the sessions is quite fast as well,
> > since they'll go away with the truncate.
>
> Forgive me if this sounds rude because I'm not meaning it as such, but
> this suggestion sounds like a MySQL-ish hack. No doubt it would work,
> but should an application-writer have to incur the cost of writing
> something different because the database can't handle it? I remember
> having to write subselect code in PHP because MySQL couldn't do it...
> not what I would call a *solution* by any mean.

The application writer doesn't need to do anything. That's what rules,
views, etc. are for.

The only snag is that PostgreSQL doesn't 'eliminate' a partition by
itself, so the truncate is a manual requirement.
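
For example, building on the partition sketch above (names invented;
the rule has to be repointed at the current slice by the same job that
runs the truncates):

CREATE VIEW sessions_v AS
    SELECT sessionid, updated_at, data FROM sessions;

CREATE RULE sessions_v_ins AS ON INSERT TO sessions_v
    DO INSTEAD
    INSERT INTO sessions_p01 (sessionid, updated_at, data)
    VALUES (NEW.sessionid, NEW.updated_at, NEW.data);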

--


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 17:49:36
Message-ID: 3936.216.145.49.15.1150998576.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> Christopher Browne <cbbrowne(at)acm(dot)org> writes:
>> After a long battle with technology, pgsql(at)mohawksoft(dot)com ("Mark
>> Woodward"), an earthling, wrote:
>>> Not true. Oracle does not seem to exhibit this problem.
>
>> Oracle suffers a problem in this regard that PostgreSQL doesn't; in
>> Oracle, rollbacks are quite expensive, as "recovery" requires doing
>> extra work that PostgreSQL doesn't do.
>
> The Oracle design has got other drawbacks: if you need to access a row
> version other than the very latest, you need to go searching in the
> rollback segments for it. This is slow (no index help) and creates
> significant amounts of contention (since lots of processes are competing
> to touch the rollback segments).

But, in all probability, in most cases -- the VAST majority -- older
versions aren't much needed outside the concurrency of active
transactions.

> Plus there's the old bugaboo that
> long-running transactions require indefinite amounts of rollback space,
> and Oracle is apparently unable to enlarge that space on-the-fly.
> (This last seems like a surmountable problem, but maybe there is some
> non-obvious reason why it's hard.)

Yea, Oracle has a million ways to die. And when you think you know all
one million, you find one million and one.

>
> Basically there's no free lunch: if you want the benefits of MVCC it's
> going to cost you somewhere. In the Postgres design you pay by having
> to do VACUUM pretty often for heavily-updated tables. I don't think
> that decision is fundamentally wrong --- the attractive thing about it
> is that the overhead is pushed out of the foreground query-processing
> code paths.

Under certain circumstances, it is a very poor design. Think of a single
row table that keeps a scoreboard or a session table that keeps a limited
number of rows that are updated very frequently.

> We still have lots of work to do in making autovacuum
> smarter, avoiding vacuuming parts of relations that have not changed,
> and so on. But I have no desire to go over to an Oracle-style solution
> instead. We can't beat them by trying to be like them, and we run no
> small risk of falling foul of some of their patents if we do.

I proposed having a "key row entry" for each logical row. The key row
entry points to the latest version of the row. From there, each row
entry is a linked list, in descending order, of previous row versions.
The vast majority of the time, the latest version will be the first
version. It is only when you have a previously started long-running or
concurrent transaction that you will ever look at previous versions.

I'm not saying it is an easy slam dunk, as I can think of a few
difficulties off the top of my head, but it would solve the steady
degradation of performance between vacuums and, to a possibly lesser
extent, the cost of updating a row in a heavily indexed table.


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: Rod Taylor <pg(at)rbt(dot)ca>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 17:50:36
Message-ID: Pine.OSF.4.61.0606222047100.359248@kosh.hut.fi
Lists: pgsql-hackers

On Thu, 22 Jun 2006, Jonah H. Harris wrote:

> On 6/22/06, Rod Taylor <pg(at)rbt(dot)ca> wrote:
>> If you INSERT into multiple partitions (by time -- say one per minute)
>> and TRUNCATE periodically (30 minute old partitions for 30 minute
>> expiry) it works much better. Expiring the session is quite fast as well
>> since they'll go away with the truncate.
>
> Forgive me if this sounds rude because I'm not meaning it as such, but
> this suggestion sounds like a MySQL-ish hack. No doubt it would work,
> but should an application-writer have to incur the cost of writing
> something different because the database can't handle it? I remember
> having to write subselect code in PHP because MySQL couldn't do it...
> not what I would call a *solution* by any mean.

AFAICS, you could hide it from the application with update rules and
views.

- Heikki


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 17:56:28
Message-ID: 8764it2hwz.fsf@stark.xeocode.com
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> The Oracle design has got other drawbacks: if you need to access a row
> version other than the very latest, you need to go searching in the
> rollback segments for it. This is slow (no index help)

Just for the record, if I understood correctly -- this was all a bit black
magicky -- Oracle found the data in the rollback segment by storing a pointer
to it in the block header where the updated data is. I.e., it could jump
straight to the right place. Of course, the record could have been updated
multiple times, which would necessitate following a linked list of rollback
segment entries. And those are all random-access reads, though you could hope
it was mostly cached.

> and creates significant amounts of contention (since lots of processes are
> competing to touch the rollback segments). Plus there's the old bugaboo that
> long-running transactions require indefinite amounts of rollback space, and
> Oracle is apparently unable to enlarge that space on-the-fly. (This last
> seems like a surmountable problem, but maybe there is some non-obvious
> reason why it's hard.)

That was one of the biggest bugaboos with Oracle, but I get the impression
they put a lot of work into improving it precisely because it was such a
high-profile failure. I think it's basically impossible to get a good old
'Snapshot Too Old' error now, but I'm unclear how they achieved that. There
must have been a tradeoff somewhere.

There are other solutions too. I never used DB2 but I was led to believe they
used their transaction log to retrieve old versions of the records. Someone
else here claimed DB2 didn't implement MVCC at all so perhaps that's wrong
though.

--
greg


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Rod Taylor" <pg(at)rbt(dot)ca>
Cc: "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 17:56:56
Message-ID: 43419.216.145.49.15.1150999016.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

>> > You mean systems that are designed so exactly, that they can't take
>> 10%
>> > performance change ?
>>
>> No, that's not really the point, performance degrades over time, in one
>> minute it degraded 10%.
>>
>> The update to session ratio has a HUGE impact on PostgreSQL. If you have
>> a
>> thousand active sessions, it may take a minute to degrade 10% assuming
>> some level of active vs operations per session per action.
>
> So don't do an update. Multiple updates to the same row block anyway
> which is generally not something you want anyway.

The example is a very active web site, the flow is this:

query for session information
process HTTP request
update session information

This happens for EVERY http request. Chances are that you won't have
concurrent requests for the same row, but you may have well over 100 HTTP
server processes/threads answering queries in your web server farm.
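
In SQL terms, each hit does something like this (the schema is invented
for illustration):

SELECT data FROM sessions WHERE sessionid = 'abc123';
-- ... process the HTTP request ...
UPDATE sessions SET data = '...', updated_at = now()
 WHERE sessionid = 'abc123';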

>
> If you INSERT into multiple partitions (by time -- say one per minute)
> and TRUNCATE periodically (30 minute old partitions for 30 minute
> expiry) it works much better. Expiring the session is quite fast as well
> since they'll go away with the truncate.
>
> Index on sessionid and time and grab the row with the most recent time.

I doubt that approach either (1) answers the problem or (2) would be
more efficient.
> --
>


From: Lukas Smith <smith(at)pooteeweet(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 18:03:13
Message-ID: e7em10$baa$1@news.hub.org
Lists: pgsql-hackers

Greg Stark wrote:

> There are other solutions too. I never used DB2 but I was led to believe they
> used their transaction log to retrieve old versions of the records. Someone
> else here claimed DB2 didn't implement MVCC at all so perhaps that's wrong
> though.

I would be surprised, given this paper:
ftp://ftp.software.ibm.com/software/data/pubs/papers/readconsistency.pdf


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Lukas Smith <smith(at)pooteeweet(dot)org>
Cc: Jochem van Dieten <jochemd(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 18:09:00
Message-ID: 1886.1150999740@sss.pgh.pa.us
Lists: pgsql-hackers

Lukas Smith <smith(at)pooteeweet(dot)org> writes:
> Jochem van Dieten wrote:
>> make the session handler smarter? And if you can't do that, put some
>> logic in the session table that turns an update without changes into a
>> no-op?

> err isnt that one the job of the database?

No. That idea has been suggested and rejected before. Detecting that
an UPDATE is a no-op would require a significant number of cycles, and
in most applications, most or all of the time those cycles would be
wasted effort. If you have a need for this behavior, you can attach a
BEFORE UPDATE trigger to a table that checks for all-fields-the-same and
suppresses the update. I don't think that should be automatic though.
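
A minimal sketch of such a trigger, assuming a sessions table with a
single data column (comparing the columns explicitly; see downthread
for the subtleties of comparing whole rows):

CREATE FUNCTION suppress_noop() RETURNS trigger AS $$
BEGIN
    IF NEW.data IS DISTINCT FROM OLD.data THEN
        RETURN NEW;   -- something changed; let the UPDATE proceed
    END IF;
    RETURN NULL;      -- no change; returning NULL suppresses the UPDATE
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sessions_noop BEFORE UPDATE ON sessions
    FOR EACH ROW EXECUTE PROCEDURE suppress_noop();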

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 18:19:15
Message-ID: 2001.1151000355@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>> The Oracle design has got other drawbacks: if you need to access a row
>> version other than the very latest, you need to go searching in the
>> rollback segments for it. This is slow (no index help)

> Just for the record, if i understood correctly -- this was all a bit black
> magicky -- Oracle found the data in the rollback segment by storing a pointer
> to it in the block header where the updated data is. Ie, it could jump
> straight to the right place.

Yeah, you should be able to find the older version easily enough, if you
arrived at the newer version and realized you needed to visit the older
version. But this fails in scenarios where you are searching on a
column that's been updated --- the index entry for the newer version
will not lead you to visit it at all. I've not seen any documentation
about what Oracle does with cases like that.

regards, tom lane


From: Rod Taylor <pg(at)rbt(dot)ca>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 18:23:15
Message-ID: 1151000595.745.169.camel@home
Lists: pgsql-hackers

On Thu, 2006-06-22 at 13:56 -0400, Mark Woodward wrote:
> >> > You mean systems that are designed so exactly, that they can't take
> >> 10%
> >> > performance change ?
> >>
> >> No, that's not really the point, performance degrades over time, in one
> >> minute it degraded 10%.
> >>
> >> The update to session ratio has a HUGE impact on PostgreSQL. If you have
> >> a
> >> thousand active sessions, it may take a minute to degrade 10% assuming
> >> some level of active vs operations per session per action.
> >
> > So don't do an update. Multiple updates to the same row block anyway
> > which is generally not something you want anyway.
>
> The example is a very active web site, the flow is this:
>
> query for session information
> process HTTP request
> update session information
>
> This happens for EVERY http request. Chances are that you won't have
> concurrent requests for the same row, but you may have well over 100 HTTP
> server processes/threads answering queries in your web server farm.
>
> >
> > If you INSERT into multiple partitions (by time -- say one per minute)
> > and TRUNCATE periodically (30 minute old partitions for 30 minute
> > expiry) it works much better. Expiring the session is quite fast as well
> > since they'll go away with the truncate.
> >
> > Index on sessionid and time and grab the row with the most recent time.
>
> I doubt that that approach (1) answers the problem or (2) would be more
> efficient.

Depends on what else you have going on in the background. If you can
vacuum constantly and have the garbage cleaned up, it may not be. If you
have a long-running transaction (pg_dump, vacuum?, report query, ...) in
the background, then it can make a pretty good difference, the last time
I measured.

If this is the only thing the database has going on and vacuum still
isn't able to keep up, who knows, you might find advantages to being
able to truncate periodically instead of running a constant vacuum.

--


From: PFC <lists(at)peufeu(dot)com>
To: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>, "Christopher Browne" <cbbrowne(at)acm(dot)org>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 18:28:22
Message-ID: op.tbj79ky2cigqcu@apollo13
Lists: pgsql-hackers


> As you can see, in about a minute at high load, this very simple table
> lost about 10% of its performance, and I've seen worse based on update
> frequency. Before you say this is an obscure problem, I can tell you it
> isn't. I have worked with more than a few projects that had to switch
> away
> from PostgreSQL because of this behavior.
>
> Obviously this is not a problem with small sites, but this is a real
> problem with an enterprise level web site with millions of visitors and
> actions a day. Quite frankly it is a classic example of something that
> does not scale. The more and more updates there are, the higher the load
> becomes. You can see it on "top" as the footest program runs.

I believe sessions should not be stored in a SQL database.
It makes no sense. Updates and Inserts to the database should only be
done where there is an interesting thing to record, when the user does an
action like posting to a forum, making a purchase, sending a message, etc.

I believe sessions should be stored in the memory of the application
server, as native objects of whatever language the application is
written in. This way, sessions incur no serializing overhead and can be
quite large and complex, which allows storage of interesting things, like
the result of a complicated search query which is to be paginated later,
for instance. It really makes sense to use native language objects too, as
these have a lot more power and versatility than a database row. Think
about rights management, for instance.

When the framework used lacks this power (most do, and this is sad), then
sessions incur serializing overhead; but they should be serialized to
filesystem files or, better, to memory using memcached, for instance. It
makes no sense to pay the performance penalty of a COMMIT (disk seek delay,
etc.) and the database overhead for sessions, which are by nature volatile
data changing all the time.

I don't think postgres should be tweaked to allow better handling of
this. It would only make a better foot-gun.


From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 18:37:19
Message-ID: 20060622183719.GB16267@surnet.cl
Lists: pgsql-hackers

Adding back pgsql-hackers.

Mark Woodward wrote:
> > Mark Woodward wrote:
> >
> >> Hmm, OK, then the problem is more serious than I suspected.
> >> This means that every index on a row has to be updated on every
> >> transaction that modifies that row. Is that correct?
> >
> > Add an index entry, yes.
> >
> >> I am attaching some code that shows the problem with regard to
> >> applications such as web server session management, when run, each
> >> second
> >> the system can handle fewer and fewer connections. Here is a brief
> >> output:
> >> [...]
> >> There has to be a more linear way of handling this scenario.
> >
> > So vacuum the table often.
>
> That fixes the symptom, not the problem. The problem is performance
> steadily degrades over time.

No, you got it backwards. The performance degradation is the symptom.
The problem is that there are too many dead tuples in the table. There
is one way to solve that problem -- remove them, which is done by
running vacuum.
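
On a toy scale (the table name is invented), the pattern and the fix
look like this:

CREATE TABLE sessions (sessionid text PRIMARY KEY, data text);
INSERT INTO sessions VALUES ('abc123', 'x');

-- every UPDATE leaves a dead version of the row behind:
UPDATE sessions SET data = 'y' WHERE sessionid = 'abc123';
UPDATE sessions SET data = 'z' WHERE sessionid = 'abc123';

-- reclaim the dead versions; on a hot table, run this every minute or
-- so (plain VACUUM does not block readers or writers):
VACUUM sessions;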

There are some problems with vacuum itself, that I agree with. For
example, it would be good if a long-running vacuum didn't affect a
vacuum running on another table because of the long-running-transaction
effect it has. It would be good if vacuum could be run partially over a
table. It would be good if there was a way to speed up vacuum by using
a dead space map or something.

I have a pending patch which adds the possibility of "non-transactional
catalogs". Maybe that could evolve into non-transactional user tables
eventually, which would have problems of their own but maybe could be
used in certain very limited scenarios, like the sessions table.

--
Alvaro Herrera Developer, http://www.PostgreSQL.org
"Oh, great altar of passive entertainment, bestow upon me thy discordant images
at such speed as to render linear thought impossible" (Calvin a la TV)


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 18:49:56
Message-ID: 87wtb910vf.fsf@stark.xeocode.com
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> Yeah, you should be able to find the older version easily enough, if you
> arrived at the newer version and realized you needed to visit the older
> version. But this fails in scenarios where you are searching on a
> column that's been updated --- the index entry for the newer version
> will not lead you to visit it at all. I've not seen any documentation
> about what Oracle does with cases like that.

That's a good question. I had the impression rollback was handled at a lower
level so index pages were stored in rollback segments just like table pages.
If you saw that the page of the index had been updated you went and fetched
the index page of the appropriate age from the rollback segment, giving you
pointers to the right records or at least the right places to start from.

Now that I think about it though I'm not too sure. It seems like the
performance of that would be pretty awful.

--
greg


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "PFC" <lists(at)peufeu(dot)com>
Cc: "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 19:00:46
Message-ID: 18382.24.91.171.78.1151002846.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

>
>> As you can see, in about a minute at high load, this very simple table
>> lost about 10% of its performance, and I've seen worse based on update
>> frequency. Before you say this is an obscure problem, I can tell you it
>> isn't. I have worked with more than a few projects that had to switch
>> away
>> from PostgreSQL because of this behavior.
>>
>> Obviously this is not a problem with small sites, but this is a real
>> problem with an enterprise level web site with millions of visitors and
>> actions a day. Quite frankly it is a classic example of something that
>> does not scale. The more and more updates there are, the higher the load
>> becomes. You can see it on "top" as the footest program runs.
>
>
> I believe sessions should not be stored in a SQL database.
Alas, this is a long debate, and while I fundamentally agree with this
position, there is the inconvenient truth that it is often necessary.

http://www.mohawksoft.org/?q=node/8

> It makes no sense. Updates and Inserts to the database should only be
> done where there is an interesting thing to record, when the user does an
> action like posting to a forum, making a purchase, sending a message, etc.

Again, preaching to the choir.

>
> I believe sessions should be stored in the memory of the application
> server, as native objects of the whatever language the application is
> written in. This way, sessions incur no serializing overhead and can be
> quite large and complex, which allows storage of interesting things, like
> the result of a complicated search query which is to be later paginated,
> for instance. It really makes sense to use native language objects too, as
> these have a lot more power and versatility than a database row. Think
> about rights management, for instance.

What you seem not to grasp at this point is a large web farm, about 10
or more servers running PHP, Java, ASP, or even Perl. The database is
usually the most convenient and, aside from the particular issue we are
talking about, best-suited option.

>
> When the framework used lacks this power (most do and this is sad), then
> sessions incur serializing overhead ; but they should be serialized to
> filesystem files, or better, to memory using memcached, for instance.

I actually have a good number of years of experience in this topic, and
memcached or file system files are NOT the best solutions for a server
farm.

> It
> makes no sense to pay the performance penalty of a COMMIT (disk seek delay
> etc) and the database overhead for sessions, which are by nature volatile
> data changing all the time.

Very true.

>
> I don't think postgres should be tweaked to allow better handling of
> this. It would only make a better foot-gun.

I think the debate is over; it may be a bad use of a database, but
there are few alternatives, and SQL databases have become the de facto
methodology for dealing with this type of problem.

>


From: David Fetter <david(at)fetter(dot)org>
To: Lukas Smith <smith(at)pooteeweet(dot)org>
Cc: Jochem van Dieten <jochemd(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 19:55:52
Message-ID: 20060622195552.GA19158@fetter.org
Lists: pgsql-hackers

On Thu, Jun 22, 2006 at 07:01:38PM +0200, Lukas Smith wrote:
> Jochem van Dieten wrote:
>
> >make the session handler smarter? And if you can't do that, put
> >some logic in the session table that turns an update without
> >changes into a no-op?
>
> err isnt that one the job of the database?

By no means. Even if there were zero overhead for this, which there
couldn't be, you'd have a real problem auditing attempted actions.
This would be a very big problem when trying to track down a rogue app
and/or a denial of service attack.

Cheers,
D
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter

Remember to vote!


From: PFC <lists(at)peufeu(dot)com>
To: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
Cc: "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 20:03:22
Message-ID: op.tbkcnwsicigqcu@apollo13
Lists: pgsql-hackers


> What you seem not to grasp at this point is a large web-farm, about 10 or
> more servers running PHP, Java, ASP, or even perl. The database is
> usually
> the most convenient and, aside from the particular issue we are talking
> about, best suited.

The answer is sticky sessions: each user is assigned to one and only
one webserver in the cluster, and his session is maintained locally, in
RAM. No locks, no need to manage distributed sessions...

> I actually have a good number of years of experience in this topic, and
> memcached or file system files are NOT the best solutions for a server
> farm.

If sessions are distributed, certainly, but if sessions are sticky to
their own server?


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: PFC <lists(at)peufeu(dot)com>
Cc: Mark Woodward <pgsql(at)mohawksoft(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 20:19:43
Message-ID: 449AFB5F.4020807@dunslane.net
Lists: pgsql-hackers

PFC wrote:

>
>> What you seem not to grasp at this point is a large web-farm, about
>> 10 or
>> more servers running PHP, Java, ASP, or even perl. The database is
>> usually
>> the most convenient and, aside from the particular issue we are talking
>> about, best suited.
>
>
> The answer is sticky sessions : each user is assigned to one and
> only one webserver in the cluster and his session is maintained
> locally, in RAM. No locks, no need to manage distributed session...

Sticky sessions can cause enormous problems. I have just worked on a
site whose problems largely come back to having to use a load balancer
in front of an app server farm that required sticky sessions.

They are not a solution, they are a disease.

cheers

andrew


From: Jim Nasby <jnasby(at)pervasive(dot)com>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 20:28:34
Message-ID: 877C71B2-7C18-481E-B67F-6483E1A29BAA@pervasive.com
Lists: pgsql-hackers

On Jun 22, 2006, at 12:56 PM, Greg Stark wrote:
> Just for the record, if i understood correctly -- this was all a
> bit black
> magicky -- Oracle found the data in the rollback segment by storing
> a pointer
> to it in the block header where the updated data is. Ie, it could jump
> straight to the right place. Of course the record could have been
> updated
> multiple times which would necessitate following a linked list of
> rollback
> segment entries. And those are all random access reads though you
> could hope
> it was mostly cached.

Correct. Mark is basically suggesting we do the same thing in the
heap, so that updates that don't touch any indexed fields could just
update the indexes to point at the new row version.

Question: do we currently create new index entries even if the index
key hasn't changed? If so, what's the purpose of storing the CTID of
the next version in the old version of the row?

> There are other solutions too. I never used DB2 but I was led to
> believe they
> used their transaction log to retrieve old versions of the records.
> Someone
> else here claimed DB2 didn't implement MVCC at all so perhaps
> that's wrong
> though.

DB2 switched to MVCC in version 8. MSSQL switched in MSSQL2005. At
this point, I don't know of any major databases that don't use some
form of MVCC (err, maybe Sybase?)
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Jim Nasby <jnasby(at)pervasive(dot)com>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: "PFC" <lists(at)peufeu(dot)com>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 20:31:13
Message-ID: 322E84DA-29E4-47FA-905F-19E7C16D58AC@pervasive.com
Lists: pgsql-hackers

On Jun 22, 2006, at 2:00 PM, Mark Woodward wrote:

> I actually have a good number of years of experience in this topic,
> and
> memcached or file system files are NOT the best solutions for a server
> farm.

What's wrong with memcached for session data?
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "PFC" <lists(at)peufeu(dot)com>
Cc: "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 20:36:34
Message-ID: 18436.24.91.171.78.1151008594.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

>
>> What you seem not to grasp at this point is a large web-farm, about 10
>> or
>> more servers running PHP, Java, ASP, or even perl. The database is
>> usually
>> the most convenient and, aside from the particular issue we are talking
>> about, best suited.
>
> The answer is sticky sessions : each user is assigned to one and only one
> webserver in the cluster and his session is maintained locally, in RAM. No
> locks, no need to manage distributed session...
>
>> I actually have a good number of years of experience in this topic, and
>> memcached or file system files are NOT the best solutions for a server
>> farm.
>
> If sessions are distributed, certainly, but if sessions are sticky to
> their own server ?

And what if a particular server goes down, or gets too high a
percentage of the load?


From: Jim Nasby <jnasby(at)pervasive(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Lukas Smith <smith(at)pooteeweet(dot)org>, Jochem van Dieten <jochemd(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 20:42:10
Message-ID: 2B83762A-CCDC-4B65-9458-1B3F8E14560F@pervasive.com
Lists: pgsql-hackers

On Jun 22, 2006, at 1:09 PM, Tom Lane wrote:
> Lukas Smith <smith(at)pooteeweet(dot)org> writes:
>> Jochem van Dieten wrote:
>>> make the session handler smarter? And if you can't do that, put some
>>> logic in the session table that turns an update without changes
>>> into a
>>> no-op?
>
>> err isnt that one the job of the database?
>
> No. That idea has been suggested and rejected before. Detecting that
> an UPDATE is a no-op would require a significant number of cycles, and
> in most applications, most or all of the time those cycles would be
> wasted effort. If you have a need for this behavior, you can attach a
> BEFORE UPDATE trigger to a table that checks for all-fields-the-
> same and
> suppresses the update. I don't think that should be automatic though.

What would be nice to add is the ability to perform that check more
easily. As of 8.1...

create function test() returns trigger as $$
begin
    if NEW = OLD then
        return null;
    else
        return NEW;
    end if;
end;
$$ language plpgsql;
create trigger test before update on test for each row execute procedure test();
update test set i=i;
ERROR: operator does not exist: test = test
HINT: No operator matches the given name and argument type(s). You
may need to add explicit type casts.
CONTEXT: SQL statement "SELECT $1 = $2 "
PL/pgSQL function "test" line 2 at if
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jim Nasby <jnasby(at)pervasive(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 21:06:14
Message-ID: 6992.1151010374@sss.pgh.pa.us
Lists: pgsql-hackers

Jim Nasby <jnasby(at)pervasive(dot)com> writes:
> Question: do we currently create new index entries even if the index
> key hasn't changed?

Yes.

> If so, what's the purpose of storing the CTID of
> the next version in the old version of the row?

So that UPDATE can always find the newest version of the row,
independently of what indexes exist or how it got to the row.
You may care to re-read the second paragraph here:
http://developer.postgresql.org/docs/postgres/transaction-iso.html#XACT-READ-COMMITTED
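
As a toy illustration (assuming a one-row table t with a column i),
the chain is visible through the system columns:

SELECT ctid, xmin, xmax, i FROM t;   -- say this shows ctid (0,1)
UPDATE t SET i = i + 1;
SELECT ctid, xmin, xmax, i FROM t;   -- now (0,2); the old (0,1) tuple's
                                     -- t_ctid header field points at (0,2)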

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jim Nasby <jnasby(at)pervasive(dot)com>
Cc: Lukas Smith <smith(at)pooteeweet(dot)org>, Jochem van Dieten <jochemd(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-22 21:31:16
Message-ID: 7243.1151011876@sss.pgh.pa.us
Lists: pgsql-hackers

Jim Nasby <jnasby(at)pervasive(dot)com> writes:
> What would be nice to add is the ability to perform that check more
> easily. As of 8.1...
> ...
> if NEW=OLD then
> ...
> ERROR: operator does not exist: test = test
> HINT: No operator matches the given name and argument type(s). You
> may need to add explicit type casts.

Hmm, there seems to be some asymmetry in the handling, because it works
for anonymous row constructors:

regression=# select row(a.*) = row(a.*) from int8_tbl a;
ERROR: operator does not exist: int8_tbl = int8_tbl
LINE 1: select row(a.*) = row(a.*) from int8_tbl a;
                        ^
HINT: No operator matches the given name and argument type(s). You may need to
add explicit type casts.

regression=# select row(a.q1,a.q2) = row(a.q1,a.q2) from int8_tbl a;
?column?
----------
t
t
t
t
t
(5 rows)

Someone oughta look into that...

regards, tom lane


From: Agent M <agentm(at)themactionfaction(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 01:56:12
Message-ID: f621d7f4c0aed572719bde8db7d8786f@themactionfaction.com
Lists: pgsql-hackers


On Jun 22, 2006, at 9:56 PM, Christopher Kings-Lynne wrote:

>> The example is a very active web site, the flow is this:
>> query for session information
>> process HTTP request
>> update session information
>> This happens for EVERY http request. Chances are that you won't have
>> concurrent requests for the same row, but you may have well over 100
>> HTTP
>> server processes/threads answering queries in your web server farm.
>
> You're crazy :) Use memcache, not the DB :)

Still, the database is the one central location that the apaches can
connect to -- postgres already has a lot of application platform
features: locking/synchronization, asynchronous notifications,
arbitrary pl code.

Personally, I think that a special non-MVCC table type could be
created -- the catalogs are similarly flat. What I envision is a table
type that can only be accessed "outside" transactions (like AutoCommit
mode) -- this is already possible to implement in plperl for a single
session. It would be more efficient to have something like a global
temp table hanging around...

Just some random ideas...

-M

AgentM
agentm(at)themactionfaction(dot)com


From: Christopher Kings-Lynne <chris(dot)kings-lynne(at)calorieking(dot)com>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Rod Taylor <pg(at)rbt(dot)ca>, Hannu Krosing <hannu(at)skype(dot)net>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 01:56:16
Message-ID: 449B4A40.80106@calorieking.com
Lists: pgsql-hackers

> The example is a very active web site, the flow is this:
>
> query for session information
> process HTTP request
> update session information
>
> This happens for EVERY http request. Chances are that you won't have
> concurrent requests for the same row, but you may have well over 100 HTTP
> server processes/threads answering queries in your web server farm.

You're crazy :) Use memcache, not the DB :)

Chris


From: Steve Atkins <steve(at)blighty(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 02:08:48
Message-ID: 3E63D93C-76BD-46A4-97C2-D2E90A9E370D@blighty.com
Lists: pgsql-hackers


On Jun 22, 2006, at 6:56 PM, Agent M wrote:

>
> On Jun 22, 2006, at 9:56 PM, Christopher Kings-Lynne wrote:
>
>>> The example is a very active web site, the flow is this:
>>> query for session information
>>> process HTTP request
>>> update session information
>>> This happens for EVERY http request. Chances are that you won't have
>>> concurrent requests for the same row, but you may have well over
>>> 100 HTTP
>>> server processes/threads answering queries in your web server farm.
>>
>> You're crazy :) Use memcache, not the DB :)
>
> Still, the database is the one central location that the apaches
> can connect too- postgres already has a lot of application platform
> features- locking synchronization, asynchronous notifications,
> arbitrary pl code.
>
> Personally, I think that a special non-MVCC table type could be
> created- the catalogs are similarly flat. What I envision is a
> table type that can only be accessed "outside" transactions (like
> AutoCommit mode)- this is already possible to implement in plperl
> for a single session. It would be more efficient to have something
> like a global temp table hanging around...
>

Have you seen pgmemcache? It allows you to access a memcached
instance from within postgresql - which makes many of the problems
with using a separate caching / storage layer go away, or at least
get far easier to deal with.
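
Roughly (assuming pgmemcache is installed; the memcache_* calls below
are the ones its documentation describes, so treat the exact names and
signatures as an assumption):

-- point the backend at a memcached instance:
SELECT memcache_server_add('127.0.0.1:11211');
-- session reads and writes then bypass the heap entirely:
SELECT memcache_set('session:abc123', 'userid=42');
SELECT memcache_get('session:abc123');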

Cheers,
Steve


From: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To: Agent M <agentm(at)themactionfaction(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 02:30:03
Message-ID: Pine.LNX.4.58.0606231205470.32171@linuxworld.com.au
Lists: pgsql-hackers

On Thu, 22 Jun 2006, Agent M wrote:

>
> On Jun 22, 2006, at 9:56 PM, Christopher Kings-Lynne wrote:
>
> >> The example is a very active web site, the flow is this:
> >> query for session information
> >> process HTTP request
> >> update session information
> >> This happens for EVERY http request. Chances are that you won't have
> >> concurrent requests for the same row, but you may have well over 100
> >> HTTP
> >> server processes/threads answering queries in your web server farm.
> >
> > You're crazy :) Use memcache, not the DB :)
>
> Still, the database is the one central location that the apaches can
> connect too- postgres already has a lot of application platform
> features- locking synchronization, asynchronous notifications,
> arbitrary pl code.
>
> Personally, I think that a special non-MVCC table type could be
> created- the catalogs are similarly flat. What I envision is a table
> type that can only be accessed "outside" transactions (like AutoCommit
> mode)- this is already possible to implement in plperl for a single
> session. It would be more efficient to have something like a global
> temp table hanging around...
>
> Just some random ideas...

Unfortunately, it's not so simple.

What if a user enters a transaction block, modifies a normal table,
modifies this 'special table'... then rolls back? This is the problem
MySQL has with InnoDB and MyISAM tables.
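
A sketch of the hazard (table names invented):

BEGIN;
UPDATE normal_table  SET a = 1;  -- MVCC: undone by the rollback
UPDATE special_table SET b = 2;  -- non-MVCC: takes effect immediately
ROLLBACK;
-- normal_table is unchanged, but special_table keeps b = 2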

Well... we could just document that. If only.

What happens if, as part of the update to the special table, we
encounter an error? MVCC currently guarantees that this modification
will be invisible. Without MVCC, we have no such capability.

There seems to be a bit of confusion about what MVCC is. PostgreSQL is not
the only MVCC database. InnoDB is MVCC. Oracle is MVCC. As far as I know,
PostgreSQL is the only MVCC database with a 'non-overwriting storage
manager'. The other MVCC databases maintain UNDO logs outside of the
table's data files. When an update occurs, the existing row version is
copied to the UNDO file, the new data replaces the old, and a backward
pointer from the table row to the UNDO log is created. Concurrent reads
must go into the UNDO log and find the version visible to them. This
implements the principle of MVCC and uses snapshot isolation, like we do,
to isolate read/write concurrency.

Overwriting MVCC comes with its own baggage. Ask any Oracle user about
error ORA-01555[1]. There's also the added cost of managing the UNDO logs,
the cost of jumping around between files to get row versions and so on.

Also, it leads to inefficiency with variable-size data types. The new
version of a row might be longer or shorter than the previous version,
and this has to be causing them headaches and performance penalties.

As for single-version databases -- they have a tonne of problems
themselves. They need to maintain UNDO functionality so as to deal with
the error handling I detailed above, but they also cannot implement the
MVCC rules for all cases: readers do not block writers, writers do not
block readers. Instead, they implement a large lock matrix, and certain
types of queries use certain types of granular locks.

Some potentially interesting reading material on these ideas:

J. Gray & A Reuter, Transaction Processing: Concepts and Techniques
US Patent Number 5,870,758 -- (one of?) Oracle's snapshot isolation patent
Tom Lane's MVCC talk:
http://www.postgresql.org/files/developer/transactions.pdf

Thanks,

Gavin

---
[1] Basically, the UNDO logs are a circular buffer and, remarkably,
Oracle doesn't seem to let the buffer expand if there is a long-running
transaction. (We do this for WAL.)

Basically, in this case Oracle finds itself in a compromising position
because if any more data is written to the UNDO log the query affected
will not be able to roll back and/or concurrent queries will not be able
to find the correct row version for their snapshot.

UNDO log size can be adjusted but this situation bites Oracle users
constantly.


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jim Nasby <jnasby(at)pervasive(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Row comparison for tables (was Re: vacuum, performance, and MVCC)
Date: 2006-06-23 02:30:17
Message-ID: 10199.1151029817@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> Jim Nasby <jnasby(at)pervasive(dot)com> writes:
>> What would be nice to add is the ability to perform that check more
>> easily. As of 8.1...
>> ...
>> if NEW=OLD then
>> ...
>> ERROR: operator does not exist: test = test
>> HINT: No operator matches the given name and argument type(s). You
>> may need to add explicit type casts.

> Hmm, there seems to be some asymmetry in the handling, because it works
> for anonymous row constructors:
> regression=# select row(a.q1,a.q2) = row(a.q1,a.q2) from int8_tbl a;

I poked into this a little, and it seems like the most flexible answer
might be for "foo.*" inside a ROW construct to behave the way that it
does at the top level of a SELECT list --- that is, implicitly expand to
a list of all the columns of foo. You'd have to write the mentioned
test as

if row(new.*) = row(old.*) then ...

but there is some additional flexibility because you could form rows
from combinations of things, eg

row(foo.*, bar, baz.*)

I can't find anything suggesting this syntax in the SQL99 spec, but
I don't think they expect equality to work on two unadorned table names
either.

Not sure if there are any backwards-compatibility issues. Right now the
system takes this syntax as creating a rowtype column within a rowtype,
which is possibly of some use but I kinda doubt people are doing much
with it. In any case, if you did want that behavior you could still get
it by leaving off the ".*".

Implementation would be pretty trivial, we'd just have to put logic into
transformRowExpr() comparable to what transformTargetList() does for
"foo.*" cases. With a little bit of refactoring, the code could be shared.

Comments?

regards, tom lane


From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 03:02:42
Message-ID: 87k678bmlp.fsf@wolfe.cbbrowne.com
Lists: pgsql-hackers

The world rejoiced as tgl(at)sss(dot)pgh(dot)pa(dot)us (Tom Lane) wrote:
> Lukas Smith <smith(at)pooteeweet(dot)org> writes:
>> Jochem van Dieten wrote:
>>> make the session handler smarter? And if you can't do that, put some
>>> logic in the session table that turns an update without changes into a
>>> no-op?
>
>> err isnt that one the job of the database?
>
> No. That idea has been suggested and rejected before. Detecting
> that an UPDATE is a no-op would require a significant number of
> cycles, and in most applications, most or all of the time those
> cycles would be wasted effort. If you have a need for this
> behavior, you can attach a BEFORE UPDATE trigger to a table that
> checks for all-fields-the-same and suppresses the update. I don't
> think that should be automatic though.

If this is handled via a stored procedure, the stored proc could hide
this detail nicely and completely...

select store_session(a,b,c,d,e);

store_session can be smart enough not to bother doing spurious
updates.
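
For example, a sketch (schema invented) where the proc itself skips
no-op writes:

CREATE FUNCTION store_session(p_id text, p_data text) RETURNS void AS $$
BEGIN
    -- only touch the row when something actually changed:
    UPDATE sessions SET data = p_data
     WHERE sessionid = p_id
       AND data IS DISTINCT FROM p_data;
END;
$$ LANGUAGE plpgsql;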
--
select 'cbbrowne' || '@' || 'linuxfinances.info';
http://linuxdatabases.info/info/x.html
Rules of the Evil Overlord #29. "I will dress in bright and cheery
colors, and so throw my enemies into confusion."
<http://www.eviloverlord.com/>


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Gavin Sherry" <swm(at)linuxworld(dot)com(dot)au>
Cc: "Agent M" <agentm(at)themactionfaction(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 03:20:24
Message-ID: 36e682920606222020v55974e31wd537c2bb68f17446@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/22/06, Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> wrote:
> There seems to be a bit of confusion about what MVCC is.

Yes, PostgreSQL implements MVTO-style concurrency, Oracle implements
MVRC, ... Let's not go into theory here, because there are plenty of
papers and books on the subject.

> The other MVCC databases maintain UNDO logs outside of the
> table's data files.

*most* other MVCC systems act this way, but not all.

> When an update occurs, the existing row version is
> copied to te UNDO file

Not in all systems. A few now perform in-memory UNDO and only write
it to disk if and when it is required.

> Overwriting MVCC comes with its own baggage. Ask any Oracle user about
> error ORA-01555[1]. There's also the added cost of managing the UNDO logs,
> the cost of jumping around between files to get row versions and so on.

This seems to be going in the direction of our common MySQL
discussions: basing our assumptions about the current version on old
failures and mistakes. Please stay apprised of current developments in
other systems.

As you probably haven't done consulting for Oracle recently, I can
tell you that the once-dreaded ORA-01555 is nearly a thing of the
past. Continuing to dispense misinformation doesn't help anyone, so
let's stop using Oracle's past problems (which are largely due to
improper tuning) as a reason for preferring PostgreSQL's methods over
others.

> Also, it leads to inefficiency with variable size data types. The new
> version of a row might be longer or shorter than the previous version and
> this has to be causing them a headaches and performance penalties.

I've seen many MVCC implementations using update-in-place (overwriting
storage manager) with plenty of variable-length attributes; not all
have these problems.

> J. Gray & A Reuter, Transaction Processing: Concepts and Techniques

Pretty much older than dirt, discusses locking, and barely touches on
MVCC. Still has some good concepts though.

> US Patent Number 5,870,758 -- (one of?) Oracle's snapshot isolation patent

Again, not the most optimal implementation of MVCC, but does certainly
have advantages.

> Tom Lane's MVCC talk:
> http://www.postgresql.org/files/developer/transactions.pdf

Always a good read.

I suggest reading Transactional Information Systems: Theory,
Algorithms, and the Practice of Concurrency Control by Weikum and
Vossen... it's basically the new Gray and Reuter. It covers multiple
MVCC techniques as well as differences between the different UNDO/REDO
database models.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/


From: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: Agent M <agentm(at)themactionfaction(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 03:34:38
Message-ID: Pine.LNX.4.58.0606231327100.353@linuxworld.com.au
Lists: pgsql-hackers

On Thu, 22 Jun 2006, Jonah H. Harris wrote:

> Not in all systems. A few now perform in-memory UNDO and only write
> it to disk if and when it is required.

Interesting...

>
> > Overwriting MVCC comes with its own baggage. Ask any Oracle user about
> > error ORA-01555[1]. There's also the added cost of managing the UNDO logs,
> > the cost of jumping around between files to get row versions and so on.
>
> This seems to be going in the direction of our common MySQL
> discussions; relying on old failures and mistakes to base our
> assumptions on the current version. Please stay apprised of current
> developments in other systems.

Erm. Perhaps a poor example, as I was not trying to put Oracle in a
poor light. Rather, I was trying to point out that each method has its
disadvantages. If the update-in-place method had no drawbacks, we
shouldn't be hanging on to our existing mechanism.

> > J. Gray & A Reuter, Transaction Processing: Concepts and Techniques
>
> Pretty much older than dirt, discusses locking, and barely touches on
> MVCC. Still has some good concepts though.

The really useful section of this book is the discussion of snapshot
isolation. That's the important thing here. Conceivably we could have a
higher performance storage system but, IMHO, it must implement snapshot
isolation.

Thanks,

Gavin


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>, Agent M <agentm(at)themactionfaction(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 03:37:44
Message-ID: 20060623033744.GI16383@surnet.cl
Lists: pgsql-hackers

Jonah H. Harris wrote:
> On 6/22/06, Gavin Sherry <swm(at)linuxworld(dot)com(dot)au> wrote:

> >When an update occurs, the existing row version is
> >copied to te UNDO file
>
> Not in all systems. A few now perform in-memory UNDO and only write
> it to disk if and when it is required.

How does that work? If the last transaction is not finished after it
wrote the tuple when the power goes out, and the UNDO is not written to
disk either, how do you reconstruct the tuple at all?

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Jim Nasby <jnasby(at)pervasive(dot)com>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 05:44:13
Message-ID: Pine.OSF.4.61.0606230842440.328888@kosh.hut.fi
Lists: pgsql-hackers

On Thu, 22 Jun 2006, Jim Nasby wrote:

> DB2 switched to MVCC in version 8. ...

Um, no it didn't.

- Heikki


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Chris Browne <cbbrowne(at)acm(dot)org>
Cc: postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 08:59:50
Message-ID: 1151053189.3309.98.camel@coppola.muc.ecircle.de
Lists: pgsql-hackers

> > Good advice, except if the table is huge :-)
>
> ... Then the table shouldn't be designed to be huge. That represents
> a design error.
[snip]
> This demonstrates that "archival" material and "active" data should be
> kept separately.
>
> They have different access patterns; kludging them into the same table
> turns out badly.

Well, then please help me find a better design, because I can't see
one... what we have here is a big "membership" table for email lists.
When there's a sendout, the memberships of the affected group are
heavily read/updated; otherwise they are idle. None of the memberships
is archive data; they are all active data... the only problem is that
there are so many of them. Is it so hard to believe that >100 million
rows is all active data, but only used in bursts once per week (that's
an example; some groups are more active, others less)?

Cheers,
Csaba.


From: PFC <lists(at)peufeu(dot)com>
To: "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Chris Browne" <cbbrowne(at)acm(dot)org>
Cc: "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 09:15:51
Message-ID: op.tbldcp04cigqcu@apollo13
Lists: pgsql-hackers


> Well, then please help me find a better design cause I can't see one...
> what we have here is a big "membership" table of email lists. When
> there's a sendout then the memberships of the affected group are heavily
> read/updated, otherwise they are idle. None of the memberships is
> archive data, they are all active data... the only problem is that they
> are so many. Is it so hard to believe that >100 million rows is all
> active data, but only used in bursts once per week (that's an example,
> some groups are more active, others less) ?

I suppose you have a table memberships (user_id, group_id) or something
like it; it should have as few columns as possible. Then try regularly
clustering on group_id (maybe once a week) so that all the records for a
particular group are close together. Getting the members of a group to
send them an email should then be faster (fewer random seeks).
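
Roughly (the index name is invented):

CREATE INDEX memberships_group_idx ON memberships (group_id);
CLUSTER memberships_group_idx ON memberships;  -- initial rewrite
-- later runs reuse the remembered index:
CLUSTER memberships;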

For tables with very few small fields (like a few integers), the
26-something-byte row overhead is significant; MySQL can be faster
because MyISAM tables have no transaction support and thus have very
little to store besides actual row data, and the table can then fit
in RAM...


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: PFC <lists(at)peufeu(dot)com>
Cc: Chris Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 09:25:36
Message-ID: 1151054736.3309.107.camel@coppola.muc.ecircle.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> I suppose you have a table memberships (user_id, group_id) or something
> like it; it should have as few columns as possible. Then try regularly
> clustering on group_id (maybe once a week) so that all the records for a
> particular group are close together. Getting the members of a group to
> send them an email should then be faster (fewer random seeks).

It is like this, plus some more bookkeeping data which must be there...
we could split the table into smaller records or into updatable/stable
fields, but at the end of the day it doesn't make much sense: usually
all the data is needed, and I wonder whether several big/narrow tables
instead of one big/wide one would make sense...

Regular clustering is out of the question, as it would render the system
unusable for hours. There's no "0 activity hour" we could use for such
stuff; there's always something happening, only the overall load is
smaller at night...

Thanks,
Csaba.


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Christopher Kings-Lynne" <chris(dot)kings-lynne(at)calorieking(dot)com>
Cc: "Rod Taylor" <pg(at)rbt(dot)ca>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 10:37:01
Message-ID: 18701.24.91.171.78.1151059021.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>> The example is a very active web site, the flow is this:
>>
>> query for session information
>> process HTTP request
>> update session information
>>
>> This happens for EVERY HTTP request. Chances are that you won't have
>> concurrent requests for the same row, but you may have well over 100
>> HTTP server processes/threads answering queries in your web server farm.
>
> You're crazy :) Use memcache, not the DB :)
>

I actually have what I consider a better and more complete session handler
system: MCache, formerly MSession (http://www.mohawksoft.org/?q=node/8). I
mean, it implements a LOT of nifty features: loadable function modules,
collision-safe counters and operators, and the ability to save session data
to a file or SQL database at varying levels of caching, but that doesn't
mean it gets used.

Technologies like memcached and my mcache are a separate data store; your
session data is not usable anywhere but in your web system. I have gone as
far as to write a session serializer for PHP that outputs XML, a
PostgreSQL plugin that can extract data from the XML session string, and a
set of functions for interfacing mcache with PostgreSQL, and I still have a
hard time convincing clients that this is the right way to go.

While we all know session data is, at best, ephemeral, people still want
some sort of persistence; thus, you need a database. For mcache I have a
couple of plugins with a wide range of options, from read/write at
startup and shutdown to a full write-through cache backed by a database.

In general, my clients don't want this; they want the database to store
their data. When you try to explain to them that a database may not be the
right place to store this data, they ask why; sadly, they have little hope
of understanding the nuances and remain unconvinced.


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: PFC <lists(at)peufeu(dot)com>, Chris Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 11:53:55
Message-ID: 1151063635.3309.121.camel@coppola.muc.ecircle.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Let me ask a question: you have this hundred-million-row table. OK, how
> much of that table is "read/write"? Would it be possible to divide the
> table into two (or more) tables, where one is basically static, with only
> infrequent inserts and deletes, and the other is highly updated?

Well, all of it is read/write... some of the data might be updated less
frequently, but there's no way for me to know which part of the data that
is. Logically it's all the same type of data... so unless I find a way
to continuously move the data back and forth between an archive table
and the live table, based on how active the groups are, I can't imagine
any other way of partitioning it. And that would also mean quite a
big load, given the pretty high dynamics of the groups.

> The "big" thing in performance is the amount of disk I/O, if you have a
> smaller active table with only a single index, then you may be able to cut
> your disk I/O time really down. The smaller the row size, the more rows
> fit into a block. The fewer blocks the less dissk I/O. The less disk I/O
> the bbetter the performance.

I agree, but that is quite hard to achieve when the data set is both
big AND the partitioning criteria are highly dynamic. Not to mention that
deleting from that table is also a PITA performance-wise, so I wonder
how well the continuous back-and-forth between the active and inactive
tables would do.

> Also, and anyone listening correct me if I'm wrong, you NEED to vacuum
> frequently because the indexes grow and vacuuming them doesn't remove
> everything; sometimes a REINDEX or a drop/recreate is the only way to get
> performance back. So if you wait too long between vacuums, your indexes
> grow and spread across more disk blocks than they should, and thus use
> more disk I/O to search and/or more shared memory to cache.

This is nice in theory, but it kills performance. I vacuum the big tables
only overnight; otherwise the server is sluggish.

Cheers,
Csaba.


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>
Cc: "PFC" <lists(at)peufeu(dot)com>, "Chris Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 11:57:38
Message-ID: 18739.24.91.171.78.1151063858.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>> I suppose you have a table memberships (user_id, group_id) or something
>> like it ; it should have as few columns as possible ; then try regularly
>> clustering on group_id (maybe once a week) so that all the records for a
>> particular group are close together. Getting the members of a group to
>> send them an email should be faster (less random seeks).
>
> It is like this, and some more bookkeeping data which must be there...
> we could split the table for smaller records or for updatable/stable
> fields, but at the end of the day it doesn't make much sense, usually
> all the data is needed and I wonder if more big/shallow tables instead
> of one big/wider makes sense...
>
> Regularly clustering is out of question as it would render the system
> unusable for hours. There's no "0 activity hour" we could use for such
> stuff. There's always something happening, only the overall load is
> smaller at night...
>

Let me ask a question: you have this hundred-million-row table. OK, how
much of that table is "read/write"? Would it be possible to divide the
table into two (or more) tables, where one is basically static, with only
infrequent inserts and deletes, and the other is highly updated?

The "big" thing in performance is the amount of disk I/O. If you have a
smaller active table with only a single index, then you may be able to
really cut down your disk I/O time. The smaller the row size, the more
rows fit into a block. The fewer the blocks, the less disk I/O. The less
disk I/O, the better the performance.

Also, and anyone listening correct me if I'm wrong, you NEED to vacuum
frequently because the indexes grow and vacuuming them doesn't remove
everything; sometimes a REINDEX or a drop/recreate is the only way to get
performance back. So if you wait too long between vacuums, your indexes
grow and spread across more disk blocks than they should, and thus use
more disk I/O to search and/or more shared memory to cache.
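
For illustration, the kind of maintenance being described, with a made-up
table name:

    -- routine cleanup; reclaims heap space but can leave index bloat behind
    VACUUM ANALYZE sessions;

    -- heavier medicine: rebuild every index on the table from scratch
    -- (takes exclusive locks, so schedule it carefully)
    REINDEX TABLE sessions;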


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>
Cc: "PFC" <lists(at)peufeu(dot)com>, "Chris Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 13:00:32
Message-ID: 18114.24.91.171.78.1151067632.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>> Let me ask a question: you have this hundred-million-row table. OK, how
>> much of that table is "read/write"? Would it be possible to divide the
>> table into two (or more) tables, where one is basically static, with only
>> infrequent inserts and deletes, and the other is highly updated?
>
> Well, all of it is read/write... some of the data might be updated less
> frequently, but there's no way for me to know which part of the data that
> is. Logically it's all the same type of data... so unless I find a way
> to continuously move the data back and forth between an archive table
> and the live table, based on how active the groups are, I can't imagine
> any other way of partitioning it. And that would also mean quite a
> big load, given the pretty high dynamics of the groups.
>
>> The "big" thing in performance is the amount of disk I/O. If you have a
>> smaller active table with only a single index, then you may be able to
>> really cut down your disk I/O time. The smaller the row size, the more
>> rows fit into a block. The fewer the blocks, the less disk I/O. The less
>> disk I/O, the better the performance.
>
> I agree, but that is quite hard to achieve when the data set is both
> big AND the partitioning criteria are highly dynamic. Not to mention that
> deleting from that table is also a PITA performance-wise, so I wonder
> how well the continuous back-and-forth between the active and inactive
> tables would do.
>
>> Also, and anyone listening correct me if I'm wrong, you NEED to vacuum
>> frequently because the indexes grow and vacuuming them doesn't remove
>> everything; sometimes a REINDEX or a drop/recreate is the only way to get
>> performance back. So if you wait too long between vacuums, your indexes
>> grow and spread across more disk blocks than they should, and thus use
>> more disk I/O to search and/or more shared memory to cache.
>
> This is nice in theory, but it kills performance. I vacuum the big tables
> only overnight; otherwise the server is sluggish.

Well, the only thing left is to cluster the database. There are a couple of
ways to do this: switch to a platform that supports clustering, or
create an API to wrap multiple databases. If your queries are simple and
limited, you could create an HTTP/XML service that wraps a number of
PostgreSQL databases, issues a query across all of them, merges the
multiple result sets, and returns one homogeneous stream.

Inserts would be handled by hashing to a machine, weighted by the number
of records on each machine.

Updates and deletes would have two keys: machine and ID.
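
As a rough sketch of the routing idea (the shard count and key are made
up, and the weighting is left out for simplicity), the application could
pick a shard with something like:

    -- map a session key onto one of four database machines;
    -- hashtext() is PostgreSQL's built-in string hash function
    SELECT 1 + mod(abs(hashtext('session-key-12345')), 4) AS shard_no;

A weighted variant would replace the plain modulo with a lookup against a
small table of per-machine record counts.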

It sounds like you have a "big" problem and you need a "big" solution.


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: PFC <lists(at)peufeu(dot)com>, Chris Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 13:30:45
Message-ID: 1151069445.3309.161.camel@coppola.muc.ecircle.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Well, the only thing left is to cluster the database. There are a couple of
> ways to do this: switch to a platform that supports clustering, or
> create an API to wrap multiple databases. If your queries are simple and
> limited, you could create an HTTP/XML service that wraps a number of
> PostgreSQL databases, issues a query across all of them, merges the
> multiple result sets, and returns one homogeneous stream.
>
Our queries are neither simple nor limited :-)

We have a big variety of users with a big variety of needs, and talented
user story writers with imaginative minds... so the same code must
work in quite a few ways, and some of the resulting queries are created
dynamically. It's a tough job to optimize all the queries we have.

> Inserts would be handled by hashing to a machine, weighted by the number
> of records on each machine.
>
> Updates and deletes would have two keys: machine and ID.
>
Such a setup might work for us, but I fear it would be a major PITA to
make it reliable and to maintain. It's not like I can say "let's
allot a month of work to trying out a clustering solution, though I'm not
sure it will work fine in the end". We still have enough features to
develop; the DB is something to solve part of the problem, not to keep
us busy... The Oracle systems were there first, and the application works
more or less fine on them (with occasional need to optimize here and
there). Supporting Postgres was a side project to see if it works, and
it works decently, so we deployed some systems on it. Both of the DBs
have their quirks, and I can voice here the ones I don't like in
Postgres... and then some developer might do something about it or not,
and I find that OK. If my mind didn't refuse so categorically to learn
C-style programming (I tried and gave up), I would probably scratch
my own itches. I did it for smaller-scale ones, like truncating
timestamp(0) instead of rounding so that it is consistent with what
Oracle does, but that was just a one-file modification... I simply don't
find it fun to browse C code, compared to how easy it is to understand
the Java code I work with here. So unless somebody ports Postgres to
Java, I'll have to keep voicing my itches here in the hope that they'll
be solved by others... sorry for the long rant.

> It sounds like you have a "big" problem and you need a "big" solution.

Well, Postgres does a decent job as it is. The problem is under peak
load: sometimes it gets bogged down, and the usual things like vacuum
will not help immediately. I think a few more features like the dead
space map for quick vacuum, and maybe even something like the original
post's proposition, would make postgres fly under heavy load too...

Cheers,
Csaba.


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 13:30:52
Message-ID: 1151069452.3828.20.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On one fine day, Thu, 2006-06-22 at 12:41, Mark Woodward wrote:

> > Depending on exact details and optimisations done, this can be either
> > slower or faster than postgresql's way, but they still need to do
> > something to get transactional visibility rules implemented.
>
> I think they have a different strategy. I think they maintain the notion
> of the "current version" of a row, and hunt for previous versions when
> needed; at least that's how I suspect Oracle does it with redo logs.

Not "current" but "last" :)

And one side effect of redo logs is that it is practically impossible to
do large deletes on production databases. So you design around that,
just as you have to design around the limitations of MVCC.

> >> > > There has to be a more linear way of handling this scenario.
> >> >
> >> > So vacuum the table often.
> >>
> >> It's easy to say VACUUM often... but I'd bet that vacuuming is going
> >> to lessen the throughput in his tests even more; no matter how it's
> >> tuned.
> >
> > Running VACUUM often/continuously will likely keep his update rate
> > fluctuations within a corridor of maybe 5-10%, at the cost of 1-2% extra
> > load. At least if vacuum is configured right and the server is not
> > already running at 100% IO saturation, in which case it will be worse.
>
> Assuming the table is a reasonable size, the I/O required for vacuum
> doesn't kill everything else!

I have solved the problem of unnecessary IO by keeping active and
finished rows in separate tables, with the finish() function moving the
row between tables.
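
A minimal sketch of that split, with illustrative table names and a
single integer key:

    -- "hot" rows live here and get vacuumed aggressively
    CREATE TABLE active_rows   (id integer PRIMARY KEY, payload text);
    -- completed rows move here and stay mostly untouched
    CREATE TABLE finished_rows (id integer PRIMARY KEY, payload text);

    CREATE OR REPLACE FUNCTION finish(p_id integer) RETURNS void AS $$
    BEGIN
        INSERT INTO finished_rows SELECT * FROM active_rows WHERE id = p_id;
        DELETE FROM active_rows WHERE id = p_id;
    END;
    $$ LANGUAGE plpgsql;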

When the number of actively modified rows is only in the tens or low
hundreds of thousands (i.e. the modified set fits in memory), the
continuous vacuum process shows up as just another backend, not taking
an order of magnitude more resources. It mainly generates WAL traffic,
as the modified pages are already in memory/cache and are mostly synced
by the background writer and/or checkpoints.

Of course you have to adjust vacuum_cost_* variables so as to not
saturate IO.
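
For illustration, the kind of throttling meant here (the values are made
up; tune them to your own IO capacity):

    -- sleep 10 ms each time the accumulated vacuum cost reaches 200
    SET vacuum_cost_delay = 10;
    SET vacuum_cost_limit = 200;
    VACUUM active_rows;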

> > The max throughput figure is not something you actually need very often
> > in production.
>
> No, but you need to have some degree of certainty and predictability in
> the system you are developing.

Yup. You have to design it so it does.

> > What is interesting is setting up the server so that you
> > can service your loads comfortably. Running the server at 100% load is
> > not anything you want to do on a production server. There will be things
> > you need to do anyway, and you need some headroom for that.
>
> Of course, you design it so peaks are easily managed, but unless you run
> vacuum continuously, and that has its own set of problems, you run into
> this problem, and it can get really really bad.

Usually it gets really bad if you *don't* run vacuum continuously, maybe
hoping to do it in slower times at night. For a high-update db you have
to run it continuously, maybe with some 5-15 sec pauses between runs.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
Cc: Mark Woodward <pgsql(at)mohawksoft(dot)com>, PFC <lists(at)peufeu(dot)com>, Chris Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 13:56:27
Message-ID: 20060623135627.GE8900@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jun 23, 2006 at 03:30:45PM +0200, Csaba Nagy wrote:
> > It sounds like you have a "big" problem and you need a "big" solution.
>
> Well, Postgres does a decent job as it is. The problem is under peak
> load: sometimes it gets bogged down, and the usual things like vacuum
> will not help immediately. I think a few more features like the dead
> space map for quick vacuum, and maybe even something like the original
> post's proposition, would make postgres fly under heavy load too...

I know there have been a number of suggestions in the past to deal
with this problem. Some that I don't remember being mentioned in this
thread are:

- Once a tuple has been determined to be invisible to everyone,
truncate it to just the header. This would probably work wonders for
frequently updated wide tables. However, this requires keeping track of
the oldest active xact; I'm not sure how that works at the moment.

- Have the bgwriter do cleanup work before writing out a block. It
could probably do the truncation bit above, but totally removing old
tuples requires cleaning out the indexes too, which AIUI is the hard
part of vacuuming.

One totally whacked out idea I just thought of: Instead of just
truncating tuples when they're invisible, mark them "dying" and make
the data section store an array of CTIDs pointing to the index tuples
pointing to it. Lookups that find the tuple via an index could store
the CTID of the index tuple before continuing. If the bgwriter sees it
has a full set, it can efficiently remove the tuple straight away.

There are of course drawbacks to this approach; you'd probably need
something like the half-dirty pages to avoid a large increase in write
load. That's if it's even beneficial at all, given the concurrency issues.

Still, VACUUM has gotten faster in CVS so the issues are slowly being
addressed...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Gavin Sherry" <swm(at)linuxworld(dot)com(dot)au>, "Agent M" <agentm(at)themactionfaction(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 14:04:16
Message-ID: 36e682920606230704x45c8108cvb8d4b5d41379a04d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/22/06, Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote:
> > Not in all systems. A few now perform in-memory UNDO and only write
> > it to disk if and when it is required.
>
> How does that work? If the last transaction is not finished after it
> wrote the tuple when the power goes out, and the UNDO is not written to
> disk either, how do you reconstruct the tuple at all?

It still follows normal transaction semantics... on commit, the REDO
record is written to disk. If the block was not written to disk and
the database crashes, the on-disk representation of the block remains
unchanged, and recovery picks up the REDO record from the log and
reapplies it.

However, say someone updates a record which moves the old tuple value
to an in-memory UNDO location and changes the in-memory block
representation... if a checkpoint occurs before the user commits, the
block must be written to disk in its changed form, which means that the
UNDO record must also be written to the log (before the block is
written to disk). In the event of a crash, recovery would see that
the transaction never committed and would apply the UNDO to the block.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 14:24:06
Message-ID: 20060623142406.GA8426@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Woodward wrote:

> > When the number of actively modified rows is only in the tens or low
> > hundreds of thousands (i.e. the modified set fits in memory), the
> > continuous vacuum process shows up as just another backend, not taking
> > an order of magnitude more resources. It mainly generates WAL traffic,
> > as the modified pages are already in memory/cache and are mostly synced
> > by the background writer and/or checkpoints.
> >
> > Of course you have to adjust vacuum_cost_* variables so as to not
> > saturate IO.
>
> These sorts of solutions, IMHO, don't show how good PostgreSQL is, but
> show where it is very lacking.

We all know Postgres is lacking; some of us try to improve it (some with
more success than others). People who know the current limitations but
like the capabilities try to find workarounds for the problems. What
surprises me is that, if you have such a low opinion of Postgres, you
still use it.

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: "A(dot)M(dot)" <agentm(at)themactionfaction(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 14:29:52
Message-ID: 54624.216.41.12.254.1151072992.squirrel@webmail.webopticon.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, June 23, 2006 9:56 am, Martijn van Oosterhout wrote:
> On Fri, Jun 23, 2006 at 03:30:45PM +0200, Csaba Nagy wrote:
>
>>> It sounds like you have a "big" problem and you need a "big"
>>> solution.
>>
>> Well, Postgres does a decent job as it is. The problem is under peak
>> load: sometimes it gets bogged down, and the usual things like vacuum
>> will not help immediately. I think a few more features like the dead
>> space map for quick vacuum, and maybe even something like the original
>> post's proposition, would make postgres fly under heavy load too...
>
> I know there have been a number of suggestions in the past to deal
> with this problem. Some that I don't remember being mentioned in this
> thread are:
>
>
> - Once a tuple has been determined to be invisible to everyone,
> truncate it to just the header. This would probably work wonders for
> frequently updated wide tables. However, this requires keeping track of
> the oldest active xact; I'm not sure how that works at the moment.
>
> - Have the bgwriter do cleanup work before writing out a block. It
> could probably do the truncation bit above, but totally removing old tuples
> requires cleaning out the indexes too, which AIUI is the hard part of
> vacuuming.
>
> One totally whacked out idea I just thought of: Instead of just
> truncating tuples when they're invisible, mark them "dying" and make the
> data section store an array of CTIDs pointing to the index tuples pointing
> to it. Lookups that find the tuple via an index could store the CTID of
> the index tuple before continuing. If the bgwriter sees it has a full set,
> it can efficiently remove the tuple straight away.
>
> There are of course drawbacks to this approach; you'd probably need
> something like the half-dirty pages to avoid a large increase in write
> load. That's if it's even beneficial at all, given the concurrency issues.

A lot of these recommendations sound like garbage collection ideas found
in modern programming languages. Perhaps it would be worth considering
allowing sessions to keep track of which pages they alter and spawning a
separate process per connection to sweep up slowly along the way.

Also, it's nice that vacuum now has slow-down settings, but why isn't
there an option to autovacuum during periods of idleness and pause when
busy?

-M


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Hannu Krosing" <hannu(at)skype(dot)net>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 14:30:15
Message-ID: 17943.24.91.171.78.1151073015.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On one fine day, Thu, 2006-06-22 at 12:41, Mark Woodward wrote:
>
>> > Depending on exact details and optimisations done, this can be either
>> > slower or faster than postgresql's way, but they still need to do
>> > something to get transactional visibility rules implemented.
>>
>> I think they have a different strategy. I think they maintain the notion
>> of the "current version" of a row, and hunt for previous versions when
>> needed; at least that's how I suspect Oracle does it with redo logs.
>
> Not "current" but "last" :)

True
>
> And one side effect of redo logs is that it is practically impossible to
> do large deletes on production databases. So you design around that,
> like you have to design around limitations of MVCC.

Think that's bad? Try doing an update in PostgreSQL on a table with 20
million rows and a few indexes. I had to write a script to chunk the
bulk update into segments and vacuum between each one.
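
Roughly the kind of script meant here (the table, column, and ranges are
invented for the example):

    -- update in slices, vacuuming between slices so the dead tuples are
    -- reclaimed before the next pass
    UPDATE big_table SET flag = true WHERE id BETWEEN 1 AND 1000000;
    VACUUM big_table;
    UPDATE big_table SET flag = true WHERE id BETWEEN 1000001 AND 2000000;
    VACUUM big_table;
    -- ...and so on for the remaining ranges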

>
>> >> > > There has to be a more linear way of handling this scenario.
>> >> >
>> >> > So vacuum the table often.
>> >>
>> >> It's easy to say VACUUM often... but I'd bet that vacuuming is going
>> >> to lessen the throughput in his tests even more; no matter how it's
>> >> tuned.
>> >
>> > Running VACUUM often/continuously will likely keep his update rate
>> > fluctuations within a corridor of maybe 5-10%, at the cost of 1-2% extra
>> > load. At least if vacuum is configured right and the server is not
>> > already running at 100% IO saturation, in which case it will be worse.
>>
>> Assuming the table is a reasonable size, the I/O required for vacuum
>> doesn't kill everything else!
>
> I have solved the problem of unnecessary IO by keeping active and
> finished rows in separate tables, with the finish() function moving the
> row between tables.

Sorry, but an RDBMS is a "relational database management system"; if you
are doing the "database management" yourself, it isn't a very good RDBMS.

>
> When the number of actively modified rows is only in the tens or low
> hundreds of thousands (i.e. the modified set fits in memory), the
> continuous vacuum process shows up as just another backend, not taking
> an order of magnitude more resources. It mainly generates WAL traffic,
> as the modified pages are already in memory/cache and are mostly synced
> by the background writer and/or checkpoints.
>
> Of course you have to adjust vacuum_cost_* variables so as to not
> saturate IO.

These sorts of solutions, IMHO, don't show how good PostgreSQL is, but show
where it is very lacking.

>
>> > The max throughput figure is not something you actually need very
>> > often in production.
>>
>> No, but you need to have some degree of certainty and predictability in
>> the system you are developing.
>
> Yup. You have to design it so it has.

I was referring to the system as a whole and to the individual components.
PostgreSQL's performance under some pathological conditions is not very
predictable or reliable.

>
>> > What is interesting is setting up the server so that you
>> > can service your loads comfortably. Running the server at 100% load is
>> > not anything you want to do on a production server. There will be things
>> > you need to do anyway, and you need some headroom for that.
>>
>> Of course, you design it so peaks are easily managed, but unless you run
>> vacuum continuously, and that has its own set of problems, you run into
>> this problem, and it can get really really bad.
>
> Usually it gets really bad if you *don't* run vacuum continuously, maybe
> hoping to do it in slower times at night. For a high-update db you have
> to run it continuously, maybe with some 5-15 sec pauses between runs.

And how much I/O does this take?


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Mark Woodward <pgsql(at)mohawksoft(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 14:33:42
Message-ID: 1151073222.3309.173.camel@coppola.muc.ecircle.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> > These sorts of solutions, IMHO, don't show how good PostgreSQL is, but
> > show where it is very lacking.
>
> We all know Postgres is lacking; some of us try to improve it (some with
> more success than others). People who know the current limitations but
> like the capabilities try to find workarounds for the problems. What
> surprises me is that, if you have such a low opinion of Postgres, you
> still use it.

Alvaro, I understand your reaction; you're on the developer side... but
please also try to understand us mortals who can't write good C code:
some of us like what postgres already offers, but we would also like not
to need those workarounds. And the only thing we can do is make a big
noise so that somebody from the other side (the source side) will notice
and at some point do something about it... the noise here by no means
means we have a low opinion of postgres. On the contrary, we appreciate
postgres and its developers enough to ask for what we would like to see,
and we are sure there are some gifted developers out there who can
program the features we scream about...

Cheers,
Csaba.


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 14:43:58
Message-ID: 1151073838.3828.40.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On one fine day, Thu, 2006-06-22 at 13:49, Mark Woodward wrote:
> > Christopher Browne <cbbrowne(at)acm(dot)org> writes:

> > Basically there's no free lunch: if you want the benefits of MVCC it's
> > going to cost you somewhere. In the Postgres design you pay by having
> > to do VACUUM pretty often for heavily-updated tables. I don't think
> > that decision is fundamentally wrong --- the attractive thing about it
> > is that the overhead is pushed out of the foreground query-processing
> > code paths.
>
> Under certain circumstances, it is a very poor design. Think of a single
> row table that keeps a scoreboard or a session table that keeps a limited
> number of rows that are updated very frequently.

A single row table that every session updates is a really bad design on
any database, as it is a sure point of lock contention and thus removes
any possibility of concurrency.

But locking problems aside, it will perform really well when you
vacuum often enough :)

> > We still have lots of work to do in making autovacuum
> > smarter, avoiding vacuuming parts of relations that have not changed,
> > and so on. But I have no desire to go over to an Oracle-style solution
> > instead. We can't beat them by trying to be like them, and we run no
> > small risk of falling foul of some of their patents if we do.
>
> I proposed having a "key row entry" for each logical row. The key row
> entry points to the latest version of the row. There, each row entry is a
> linked list, in descending order, of previous row versions.

Do I understand right that you are proposing a redesign of how indexing
works, by updating indexes in place?

How would older rows then be found by transactions needing to see
them?

Do you suggest reverting to a seqscan when we see _any_ newer
transactions?

Or, if you want the index to point to the "latest" row for each value in
the indexed field, how would you find the last time that value was used?

Don't tell me that you plan to trace the full update chain on each
update.

Or would this new indexing mechanism be used only for non-changing key
fields? How would you check for that?

> The vast majority of the time, the latest version will be the first version.

Not in a web scenario. In my experience, more complicated web pages tend
to produce lots of concurrent accesses.

> It is
> only when you have a previously started long-running or concurrent
> transaction that you will ever look at previous versions.
>
> I'm not saying it is an easy slam dunk, as I can think of a few
> difficulties off the top of my head, but it would solve the steady
> degradation of performance between vacuums and, to a possibly lesser
> extent, the cost of updating a row in a heavily indexed table.

VACUUMing often also solves the problem of "steady degradation of
performance between vacuums" :)

No need to be afraid of vacuum. Vacuum is your friend! Just learn to use
it right.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 14:50:05
Message-ID: 18346.24.91.171.78.1151074205.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Mark Woodward wrote:
>
>> > When the number of actively modified rows is only in the tens or low
>> > hundreds of thousands (i.e. the modified set fits in memory), the
>> > continuous vacuum process shows up as just another backend, not taking
>> > an order of magnitude more resources. It mainly generates WAL traffic,
>> > as the modified pages are already in memory/cache and are mostly synced
>> > by the background writer and/or checkpoints.
>> >
>> > Of course you have to adjust vacuum_cost_* variables so as to not
>> > saturate IO.
>>
>> These sorts of solutions, IMHO, don't show how good PostgreSQL is, but
>> show where it is very lacking.
>
> We all know Postgres is lacking; some of us try to improve it (some with
> more success than others). People who know the current limitations but
> like the capabilities try to find workarounds for the problems. What
> surprises me is that, if you have such a low opinion of Postgres, you
> still use it.

Actually I love PostgreSQL, I've been using it for about 10 years on a lot
of projects. There are some serious issues with it, however, and it is
important to expose them, discuss them, and resolve them. Work arounds are
great, but in the end, they are work arounds.


From: mark(at)mark(dot)mielke(dot)cc
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 15:25:02
Message-ID: 20060623152502.GA3538@mark.mielke.cc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jun 23, 2006 at 10:24:06AM -0400, Alvaro Herrera wrote:
> Mark Woodward wrote:
> > > When the number of actively modified rows is only in the tens or low
> > > hundreds of thousands (i.e. the modified set fits in memory), the
> > > continuous vacuum process shows up as just another backend, not taking
> > > an order of magnitude more resources. It mainly generates WAL traffic,
> > > as the modified pages are already in memory/cache and are mostly synced
> > > by the background writer and/or checkpoints.
> > > Of course you have to adjust vacuum_cost_* variables so as to not
> > > saturate IO.
> > These sorts of solutions, IMHO, don't show how good PostgreSQL is, but
> > show where it is very lacking.
> We all know Postgres is lacking; some of us try to improve it (some with
> more success than others). People who know the current limitations but
> like the capabilities try to find workarounds for the problems. What
> surprises me is that, if you have such a low opinion of Postgres, you
> still use it.

If everybody had good opinions, where would the development come from?

It's the parts that suck that need fixing the most... :-)

Cheers,
mark

--
mark(at)mielke(dot)cc / markm(at)ncf(dot)ca / markm(at)nortel(dot)com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 15:41:20
Message-ID: 1151077280.3828.48.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On one fine day, Fri, 2006-06-23 at 10:30, Mark Woodward wrote:

> >> > What is interesting is setting up the server so that you
> >> > can service your loads comfortably. Running the server at 100% load is
> >> > not anything you want to do on a production server. There will be things
> >> > you need to do anyway, and you need some headroom for that.
> >>
> >> Of course, you design it so peaks are easily managed, but unless you run
> >> vacuum continuously, and that has its own set of problems, you run into
> >> this problem, and it can get really really bad.
> >
> > Usually it gets really bad if you *don't* run vacuum continuously, maybe
> > hoping to do it in slower times at night. For a high-update db you have
> > to run it continuously, maybe with some 5-15 sec pauses between runs.
>
> And how much I/O does this take?

Surprisingly, it's mostly WAL traffic; the heap/index pages themselves are
often not yet synced to disk by the time of the vacuum, so there is no
additional traffic there. If you have made 5 updates per page and then
vacuum it, you effectively make 1 extra WAL write per page, meaning a 20%
increase in WAL traffic.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Hannu Krosing <hannu(at)skype(dot)net>
Cc: Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 16:05:33
Message-ID: 1151078733.3309.211.camel@coppola.muc.ecircle.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> > > Usually it gets really bad if you *don't* run vacuum continuously, maybe
> > > hoping to do it in slower times at night. For a high-update db you have
> > > to run it continuously, maybe with some 5-15 sec pauses between runs.
> >
> > And how much I/O does this take?
>
> Surprisingly, it's mostly WAL traffic; the heap/index pages themselves are
> often not yet synced to disk by the time of the vacuum, so there is no
> additional traffic there. If you have made 5 updates per page and then
> vacuum it, you effectively make 1 extra WAL write per page, meaning a 20%
> increase in WAL traffic.

Does this also hold for read traffic? I thought vacuum does a
full table scan... for big tables a full table scan always hurts
the performance of the box. If the full table scan could be
avoided, I wouldn't mind running vacuum in a loop...

In fact I think it would make sense to replace the whole current
vacuum machinery with a background thread which does this continuously,
using a dead space map. That could be a heap sorted by tuple deletion
time, always cleaned up to the oldest running transaction's start
time... there would be no need for any other autovacuum then.

Cheers,
Csaba.


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
Cc: Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 16:27:53
Message-ID: 1151080073.3828.66.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On one fine day, Fri, 2006-06-23 at 18:05, Csaba Nagy wrote:
> > > > Usually it gets really bad if you *don't* run vacuum continuously, maybe
> > > > hoping to do it in slower times at night. For a high-update db you have
> > > > to run it continuously, maybe with some 5-15 sec pauses between runs.
> > >
> > > And how much I/O does this take?
> >
> > Surprisingly, it's mostly WAL traffic; the heap/index pages themselves are
> > often not yet synced to disk by the time of the vacuum, so there is no
> > additional traffic there. If you have made 5 updates per page and then
> > vacuum it, you effectively make 1 extra WAL write per page, meaning a 20%
> > increase in WAL traffic.
>
> Does this also hold for read traffic? I thought vacuum does a
> full table scan... for big tables a full table scan always hurts
> the performance of the box. If the full table scan could be
> avoided, I wouldn't mind running vacuum in a loop...

I was referring to a design that keeps frequently updated tuples in a
separate table.

> In fact I think it would make sense to replace the whole current
> vacuum machinery with a background thread which does this continuously,
> using a dead space map. That could be a heap sorted by tuple deletion
> time, always cleaned up to the oldest running transaction's start
> time... there would be no need for any other autovacuum then.

This has been on the TODO list for some time already.

>
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 17:08:49
Message-ID: 16709.1151082529@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Csaba Nagy <nagy(at)ecircle-ag(dot)com> writes:
>> Surprisingly, it's mostly WAL traffic; the heap/index pages themselves are
>> often not yet synced to disk by the time of the vacuum, so there is no
>> additional traffic there. If you have made 5 updates per page and then
>> vacuum it, you effectively make 1 extra WAL write per page, meaning a 20%
>> increase in WAL traffic.

> Does this also hold for read traffic? I thought vacuum does a
> full table scan... for big tables a full table scan always hurts
> the performance of the box. If the full table scan could be
> avoided, I wouldn't mind running vacuum in a loop...

If you're doing heavy updates of a big table then it's likely to end up
visiting most of the table anyway, no? There is talk of keeping a map
of dirty pages, but I think it'd be a win for infrequently-updated
tables, not ones that need constant vacuuming.

I think a lot of our problems in this area could be solved with fairly
straightforward tuning efforts on the existing autovacuum
infrastructure. In particular, someone should be looking into
recommendable default vacuum-cost-delay settings so that a background
vacuum doesn't affect performance too much. Another problem with the
current autovac infrastructure is that it doesn't respond very well to
the case where there are individual tables that need constant attention
as well as many that don't. If you have N databases then you can visit
a particular table at most once every N*autovacuum_naptime seconds, and
*every* table in the entire cluster gets reconsidered at that same rate.
I'm not sure if we need the ability to have multiple autovac daemons
running at the same time, but we definitely could use something with a
more flexible table-visiting pattern. Perhaps it would be enough to
look through the per-table stats for each database before selecting the
database to autovacuum in each cycle, instead of going by "least
recently autovacuumed".
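
As a concrete illustration of "looking through the per-table stats", a
smarter scheduler could start from something like the following query
(the columns exist in the 8.1-era stats views; the LIMIT is arbitrary):

    -- find the tables accumulating the most update/delete churn
    SELECT relname, n_tup_upd, n_tup_del
      FROM pg_stat_user_tables
     ORDER BY n_tup_upd + n_tup_del DESC
     LIMIT 10;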

Bottom line: there's still lots of low-hanging fruit. Why are people
feeling that we need to abandon or massively complicate our basic
architecture to make progress?

regards, tom lane


From: David Fetter <david(at)fetter(dot)org>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 18:20:41
Message-ID: 20060623182040.GA24375@fetter.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jun 23, 2006 at 02:30:29PM -0400, Mark Woodward wrote:
> >
> > Bottom line: there's still lots of low-hanging fruit. Why are
> > people feeling that we need to abandon or massively complicate our
> > basic architecture to make progress?
> >
> > regards, tom lane
>
> I, for one, see a particularly nasty unscalable behavior in the
> implementation of MVCC with regard to updates.

You're not answering the question Tom asked. Why not?

> For each update to a row, additional work needs to be done to access
> that row. Surely a better strategy is possible, especially
> considering that the problem being solved is a brief one.
>
> The only reason you need previous versions of a row is for
> transactions that started before or during the transaction that
> seeks to modify the row. After that, the previous versions
> continue to affect performance and take up space even though they
> are of no value. (Caveats for rollback, etc., but the point is still
> valid.)

I wouldn't be so quick to dismiss those as parenthetical "caveats."

> This is a very pessimistic behavior and penalizes the more common
> and optimistic operations. Now, if a tool were to be created that
> could roll back an entire database to some arbitrary transaction ID
> between vacuums, then I can see the usefulness of the older versions.

There was one called time travel. Somebody might put it back in some
day :)

> I still think an in-place indirection to the current row could fix
> the problem and speed up the database; there are some sticky
> situations that need to be considered, but it shouldn't break much.

We're eagerly awaiting your patch.

Cheers,
D
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
phone: +1 415 235 3778 AIM: dfetter666
Skype: davidfetter

Remember to vote!


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 18:28:26
Message-ID: 36e682920606231128j76c1655do413c5edfbba522db@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/23/06, Mark Woodward <pgsql(at)mohawksoft(dot)com> wrote:
> I, for one, see a particularly nasty unscalable behavior in the
> implementation of MVCC with regard to updates.

I think this is fairly commonly accepted. The overhead required to
perform an UPDATE in PostgreSQL is pretty heavy. Actually, it's not
really specific to PostgreSQL's implementation, but to anything that
employs basic multi-version timestamp ordering (MVTO) style MVCC.
Basically, MVTO-style systems require additional work to be done in an
UPDATE so that queries can find the most current row more quickly.

> This is a very pessimistic behavior

Yes, and that's basically the point of MVTO in general. The nice
thing about MVTO-style MVCC is that it isn't super complicated. No
big UNDO strategy is needed because the old versions are always there
and just have to satisfy a snapshot.

> I still think an in-place indirection to the current row could fix the
> problem and speed up the database; there are some sticky situations that
> need to be considered, but it shouldn't break much.

I agree, but I should make clear that moving to in-place update isn't
a quick fix; it will require a good amount of design and planning.

What I find in these discussions is that we always talk about
overcomplicating vacuum in order to fix the poor behavior of MVCC. Fixing
autovacuum does not eliminate the overhead required to add index
entries and everything else associated with performing an UPDATE... it's
just cleaning up the mess after the fact. As I see it, fixing the
root problem by moving to update-in-place may add a little more
complication to the core, but it will eliminate a lot of the headaches we
have in overhead, performance, and manageability.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 18:30:29
Message-ID: 18057.24.91.171.78.1151087429.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
> Bottom line: there's still lots of low-hanging fruit. Why are people
> feeling that we need to abandon or massively complicate our basic
> architecture to make progress?
>
> regards, tom lane

I, for one, see a particularly nasty unscalable behavior in the
implementation of MVCC with regard to updates.

For each update to a row, additional work needs to be done to access that
row. Surely a better strategy is possible, especially considering that the
problem being solved is a brief one.

The only reason you need previous versions of a row is for
transactions that started before or during the transaction that seeks to
modify the row. After that, the previous versions continue to affect
performance and take up space even though they are of no value.
(Caveats for rollback, etc., but the point is still valid.)

This is a very pessimistic behavior and penalizes the more common and
optimistic operations. Now, if a tool were to be created that could roll
back an entire database to some arbitrary transaction ID between vacuums,
then I can see the usefulness of the older versions.

I still think an in-place indirection to the current row could fix the
problem and speed up the database; there are some sticky situations that
need to be considered, but it shouldn't break much.


From: "Jochem van Dieten" <jochemd(at)gmail(dot)com>
To: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
Cc: "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 18:31:42
Message-ID: f96a9b830606231131y459db4c2vcf9d0616a1d40c9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/23/06, Mark Woodward wrote:
>
> For each update to a row, additional work needs to be done to access that
> row. Surely a better strategy is possible, especially considering that the
> problem being solved is a brief one.
>
> The only reason you need previous versions of a row is for
> transactions that started before or during the transaction that seeks to
> modify the row. After that, the previous versions continue to affect
> performance and take up space even though they are of no value.
> (Caveats for rollback, etc., but the point is still valid.)
>
> This is a very pessimistic behavior and penalizes the more common and
> optimistic operations.

Are you sure about that? ISTM that in the most common cases the TID
returned by an indexscan is that of the last version, and only if
that version is too new is a second TID tried, etc.

Jochem


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 18:40:44
Message-ID: 200606231840.k5NIeix19797@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Tom Lane wrote:
> If you're doing heavy updates of a big table then it's likely to end up
> visiting most of the table anyway, no? There is talk of keeping a map
> of dirty pages, but I think it'd be a win for infrequently-updated
> tables, not ones that need constant vacuuming.
>
> I think a lot of our problems in this area could be solved with fairly
> straightforward tuning efforts on the existing autovacuum
> infrastructure. In particular, someone should be looking into
> recommendable default vacuum-cost-delay settings so that a background
> vacuum doesn't affect performance too much. Another problem with the
> current autovac infrastructure is that it doesn't respond very well to
> the case where there are individual tables that need constant attention
> as well as many that don't. If you have N databases then you can visit
> a particular table at most once every N*autovacuum_naptime seconds, and
> *every* table in the entire cluster gets reconsidered at that same rate.
> I'm not sure if we need the ability to have multiple autovac daemons
> running at the same time, but we definitely could use something with a
> more flexible table-visiting pattern. Perhaps it would be enough to
> look through the per-table stats for each database before selecting the
> database to autovacuum in each cycle, instead of going by "least
> recently autovacuumed".
>
> Bottom line: there's still lots of low-hanging fruit. Why are people
> feeling that we need to abandon or massively complicate our basic
> architecture to make progress?

I think at some point we have to admit that _polling_ the tables, which
is what autovacuum does, just isn't going to work well, no matter how
much it is tweaked, and another approach should be considered for
certain workload cases.

At some point, the autovacuum approach starts to look like a car with
fifty bumper stickers. The first few were fine, but at some point, the
tweaks (bumper stickers) start to overwhelm the car, and it is time to
look for a new car.

I think particularly for the UPDATE with no index key changes, a new
approach must be considered.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Florian Weimer <fw(at)deneb(dot)enyo(dot)de>
To: Gavin Sherry <swm(at)linuxworld(dot)com(dot)au>
Cc: Agent M <agentm(at)themactionfaction(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 18:49:25
Message-ID: 87odwjhfm2.fsf@mid.deneb.enyo.de
Lists: pgsql-hackers

* Gavin Sherry:

> Tom Lane's MVCC talk:
> http://www.postgresql.org/files/developer/transactions.pdf

Is this still up-to-date with regard to partial page writes?
I hope that has been fixed (like the fsync issue).


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 18:56:02
Message-ID: 17653.1151088962@sss.pgh.pa.us
Lists: pgsql-hackers

Bruce Momjian <bruce(at)momjian(dot)us> writes:
> I think at some point we have to admit that _polling_ the tables, which
> is what autovacuum does, just isn't going to work well, no matter how
> much it is tweaked, and another approach should be considered for
> certain workload cases.

Autovacuum polls in its current, first-generation implementation;
what I said upthread was it needs to be smarter than that. I am not
sure how you get from that to the conclusion that the very next step
is to abandon the vacuuming approach altogether.

What I see in this discussion is a huge amount of "the grass must be
greener on the other side" syndrome, and hardly any recognition that
every technique has its downsides and complications. Furthermore,
I do not believe that this project has the ability to support multiple
fundamental storage models, as a number of people seem to be blithely
suggesting. We're having a hard enough time debugging and optimizing
*one* storage model. I think the correct path forward is to stick with
the same basic storage model and vacuuming concept, and address the
known performance issues with better-optimized vacuuming. No, it will
never be perfect for every scenario, but we can certainly make it much
better than it is now, without risking killing the project by
introducing undebuggable, unmaintainable complexity.

regards, tom lane


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 19:08:34
Message-ID: 200606231908.k5NJ8Yx24587@momjian.us
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
> > I think at some point we have to admit that _polling_ the tables, which
> > is what autovacuum does, just isn't going to work well, no matter how
> > much it is tweaked, and another approach should be considered for
> > certain workload cases.
>
> Autovacuum polls in its current, first-generation implementation;
> what I said upthread was it needs to be smarter than that. I am not
> sure how you get from that to the conclusion that the very next step
> is to abandon the vacuuming approach altogether.

I am not ready to abandon autovacuum, but as I stated later the UPDATE
with no key change case is common enough that it could be handled
better without involving autovacuum and its limitations.

As I remember, most databases have problems with DELETE/INSERT cycles,
but we seem to be hit by UPDATE performance more than most, and more
than is wise.

> What I see in this discussion is a huge amount of "the grass must be
> greener on the other side" syndrome, and hardly any recognition that
> every technique has its downsides and complications. Furthermore,
> I do not believe that this project has the ability to support multiple
> fundamental storage models, as a number of people seem to be blithely
> suggesting. We're having a hard enough time debugging and optimizing
> *one* storage model. I think the correct path forward is to stick with
> the same basic storage model and vacuuming concept, and address the
> known performance issues with better-optimized vacuuming. No, it will
> never be perfect for every scenario, but we can certainly make it much
> better than it is now, without risking killing the project by
> introducing undebuggable, unmaintainable complexity.

Well, are you suggesting we just stop improving the database? I am sure
not. But, your suggestion is that we can't do better without incurring
more complexity (true), and that complexity will not be worth it. I
don't agree with that until I see some proposals, and shutting down
discussion because they will add complexity or are fruitless seems
unwise.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 19:10:39
Message-ID: 18415.24.91.171.78.1151089839.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> On 6/23/06, Mark Woodward <pgsql(at)mohawksoft(dot)com> wrote:
>> I, for one, see a particularly nasty unscalable behavior in the
>> implementation of MVCC with regards to updates.
>
> I think this is fairly commonly accepted. The overhead required to
> perform an UPDATE in PostgreSQL is pretty heavy. Actually, it's not
> really PostgreSQL's implementation, but anything that employs basic
> multi-version timestamp ordering (MVTO) style MVCC. Basically,
> MVTO-style systems require additional work to be done in an UPDATE so
> that queries can find the most current row more quickly.
>
>> This is a very pessimistic behavior
>
> Yes, and that's basically the point of MVTO in general. The nice
> thing about MVTO-style MVCC is that it isn't super complicated. No
> big UNDO strategy is needed because the old versions are always there
> and just have to satisfy a snapshot.
>
>> I still think an in-place indirection to the current row could fix the
>> problem and speed up the database. There are some sticky situations that
>> need to be considered, but it shouldn't break much.
>
> I agree, but should make clear that moving to an in-place update isn't
> a quick-fix; it will require a good amount of design and planning.

This is NOT an "in-place" update. The whole MVCC strategy of keeping old
versions around doesn't change. The only thing that does change is one
level of indirection. Rather than keep references to all versions of all
rows in indexes, keep only a reference to the first or "key" row of each
row, and have the first version of a row form the head of a linked list to
subsequent versions of each row. The list will be in descending order.

In the vast majority of cases, the overhead of this action will be
trivial. In an unmodified row, you're there. In a modified row, you have
one extra lookup. In extreme cases, you may have to go back a few
versions, but I don't see that as a common behavior.

On a heavily updated row, you are never more than one jump away, and the
indexes shouldn't grow much.
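
As a rough C sketch of the shape I mean (all names invented; this is the
idea, not a patch):

/* An index entry references only the stable "key" slot; the key slot
 * points at the newest version, and each version points back at the
 * one it replaced, giving the descending chain described above. */
typedef struct VersionedRow VersionedRow;

struct VersionedRow {
    unsigned int  xmin;     /* creating transaction */
    unsigned int  xmax;     /* invalidating transaction, 0 if live */
    VersionedRow *older;    /* previous version, NULL at the tail */
    /* ... user columns ... */
};

typedef struct {
    VersionedRow *newest;   /* swung in place on every UPDATE */
} KeySlot;                  /* the only thing an index entry stores */

A reader starts at newest and walks older until it finds a version
visible to its snapshot, so an unmodified row costs no extra hop and
even a hot row usually costs one.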

>
> What I find in these discussions is that we always talk about
> overcomplicating vacuum in order to fix the poor behavior in MVCC. Fixing
> autovacuum does not eliminate the overhead required to add index
> entries and everything associated with performing an UPDATE... it's
> just cleaning up the mess after the fact. As I see it, fixing the
> root problem by moving to update-in-place may add a little more
> complication to the core, but will eliminate a lot of the headaches we
> have in overhead, performance, and manageability.

Vacuum is a tool for removing old versions. I think there is an overly
eager tendency to have it fix other problems.


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 19:12:42
Message-ID: 36e682920606231212r1436ba97q47ab8abf18f5b6d2@mail.gmail.com
Lists: pgsql-hackers

On 6/23/06, Mark Woodward <pgsql(at)mohawksoft(dot)com> wrote:
> Rather than keep references to all versions of all
> rows in indexes, keep only a reference to the first or "key" row of each
> row, and have the first version of a row form the head of a linked list to
> subsequent versions of each row. The list will be in descending order.

By all means, please go ahead and try it because it's not quite that
easy. You're going to run into serious locking and contention issues
this way. In the end, it's not much better than running a sequential
scan to query a row that's been updated several thousand times on a
table that hasn't been vacuumed... follow that pointer :)

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 19:30:35
Message-ID: 18121.24.91.171.78.1151091035.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> Tom Lane wrote:
>> If you're doing heavy updates of a big table then it's likely to end up
>> visiting most of the table anyway, no? There is talk of keeping a map
>> of dirty pages, but I think it'd be a win for infrequently-updated
>> tables, not ones that need constant vacuuming.
>>
>> I think a lot of our problems in this area could be solved with fairly
>> straightforward tuning efforts on the existing autovacuum
>> infrastructure. In particular, someone should be looking into
>> recommendable default vacuum-cost-delay settings so that a background
>> vacuum doesn't affect performance too much. Another problem with the
>> current autovac infrastructure is that it doesn't respond very well to
>> the case where there are individual tables that need constant attention
>> as well as many that don't. If you have N databases then you can visit
>> a particular table at most once every N*autovacuum_naptime seconds, and
>> *every* table in the entire cluster gets reconsidered at that same rate.
>> I'm not sure if we need the ability to have multiple autovac daemons
>> running at the same time, but we definitely could use something with a
>> more flexible table-visiting pattern. Perhaps it would be enough to
>> look through the per-table stats for each database before selecting the
>> database to autovacuum in each cycle, instead of going by "least
>> recently autovacuumed".
>>
>> Bottom line: there's still lots of low-hanging fruit. Why are people
>> feeling that we need to abandon or massively complicate our basic
>> architecture to make progress?
>
> I think at some point we have to admit that _polling_ the tables, which
> is what autovacuum does, just isn't going to work well, no matter how
> much it is tweaked, and another approach should be considered for
> certain workload cases.

Thank you, that is *exactly* the analogy I have been unable to formulate.
It was on my mind, but I could not put my finger on it.

Vacuum is fundamentally inefficient: it does not know what has changed
and must go through the entire table to find the specific changes each
time. Going through the whole table each time is messed up and wasteful.

>
> At some point, the autovacuum approach starts to look like a car with
> fifty bumper stickers. The first few were fine, but at some point, the
> tweaks (bumper stickers) start to overwhelm the car, and it is time to
> look for a new car.
>
> I think particularly for the UPDATE with no index key changes, a new
> approach must be considered.
>

I have been ranting about a first-row strategy, one where the first
version of a row is the top of a linked list of versions.

(1) The indexes point to the first "key" row.
(2) When a row is updated, it is looked up in the various indexes; if the
key row currently exists in the index, no changes to the index are made. If
it is not found, the old version of the row is orphaned and behaves as
PostgreSQL always behaves.
(3) If the row is not orphaned, its "last version" reference is updated.

For the most part, this should only affect updates where the index entries
don't change. If an index value does change, PostgreSQL will behave
as it currently does. If the index values do not change, updates will be
faster and won't impact queries.
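
In toy C, the update path would be something like this (same invented
KeySlot/VersionedRow shapes as in my earlier sketch; an illustration,
not a patch):

#include <stdbool.h>

typedef struct VersionedRow VersionedRow;
struct VersionedRow {
    unsigned int  xmin, xmax;   /* creating / invalidating xids */
    VersionedRow *older;        /* previous version, NULL at the tail */
};
typedef struct { VersionedRow *newest; } KeySlot;

/* Step numbers refer to the list above. */
void update_row(KeySlot *key, VersionedRow *newver,
                bool index_cols_changed, unsigned int xid)
{
    VersionedRow *old = key->newest;

    old->xmax = xid;            /* close out the current version */
    newver->xmin = xid;
    newver->xmax = 0;
    newver->older = old;        /* new head of the descending chain */

    if (!index_cols_changed) {
        /* (2)+(3): the key row is already in the indexes, so no index
         * changes; just swing the "last version" reference. */
        key->newest = newver;
    } else {
        /* (2): an indexed value changed; orphan the old chain and fall
         * back to today's behavior of inserting new index entries. */
        /* insert_new_index_entries(newver);   -- hypothetical */
    }
}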


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
Cc: "Bruce Momjian" <bruce(at)momjian(dot)us>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 19:32:34
Message-ID: 36e682920606231232m3a3d0eb1o1b3555c79829e3b8@mail.gmail.com
Lists: pgsql-hackers

On 6/23/06, Mark Woodward <pgsql(at)mohawksoft(dot)com> wrote:
> Vacuum is fundamentally inefficient

The theory of database cleanup (VACUUM) is good, but has nothing to do
with the overhead we currently encounter in performing an update.

> I have been ranting about a first row strategy, one where the first
> version of a row is the top of a linked list of versions.

> (1) The indexes point to the first "key" row.

I'm with ya.

> (2) When a row is updated, [...] If it is not found [in the index], the
> old version of the row is orphaned and behaves as PostgreSQL
> always behaves.

aka current bad update behavior.

> (3) If the row is not orphaned, its "last version" reference is updated.

The *last version* reference? Where is this, another entry on the
tuple header (extra size overhead on every tuple) or updating the
index (additional locking+WAL)? Following something like the ctid?
How is that really any different than performing a sequential scan
from the index as a starting point? That could still encounter
multiple physical I/Os to get to the current row and is potentially
very slow. Sure, we can add extra free space to each block to limit
tuple version migration across blocks... combined with constant
vacuuming and FSM updates we'd get some great performance.

Of course, we could just avoid the overcomplication and locking issues
of having to tune block-level freespace, vacuum, and FSM settings for
each type of application, but where's the fun in that? (Yes, that was
a badly placed sarcastic joke.)

> For the most part, this should only affect updates where the index entries
> don't change. If an index value does change, PostgreSQL will behave
> as it currently does. If the index values do not change, updates will be
> faster and won't impact queries.

I like the idea of a new strategy when index entries do not change.
But, like Tom said, there is no free lunch; in your example you're
moving the high cost of an UPDATE to even more costly SELECTs.

This is something that requires a good amount of planning and design,
and I'm really glad to see some ideas being thrown into this
discussion.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/


From: Stefan Kaltenbrunner <stefan(at)kaltenbrunner(dot)cc>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 19:35:49
Message-ID: 449C4295.2090303@kaltenbrunner.cc
Lists: pgsql-hackers

Tom Lane wrote:
> Bruce Momjian <bruce(at)momjian(dot)us> writes:
>> I think at some point we have to admit that _polling_ the tables, which
>> is what autovacuum does, just isn't going to work well, no matter how
>> much it is tweaked, and another approach should be considered for
>> certain workload cases.
>
> Autovacuum polls in its current, first-generation implementation;
> what I said upthread was it needs to be smarter than that. I am not
> sure how you get from that to the conclusion that the very next step
> is to abandon the vacuuming approach altogether.

yeah, autovacuum can still be improved quite a lot, but as always this
can be done on a step-by-step basis.

>
> What I see in this discussion is a huge amount of "the grass must be
> greener on the other side" syndrome, and hardly any recognition that
> every technique has its downsides and complications. Furthermore,
> I do not believe that this project has the ability to support multiple
> fundamental storage models, as a number of people seem to be blithely
> suggesting. We're having a hard enough time debugging and optimizing
> *one* storage model. I think the correct path forward is to stick with
> the same basic storage model and vacuuming concept, and address the
> known performance issues with better-optimized vacuuming. No, it will
> never be perfect for every scenario, but we can certainly make it much
> better than it is now, without risking killing the project by
> introducing undebuggable, unmaintainable complexity.

While I'm not an expert on MVCC - it certainly seems that sticking to
the current storage model and continuing to improve on it (especially
wrt vacuum performance) gradually over time (as has happened for the
last few years) is a much better and safer approach than trying to do
something revolutionary which in theory might (or might not) be better
than the current approach for this or that workload.

PostgreSQL got a _LOT_ faster with each of the last releases, and by my
testing -HEAD is already significantly (20-30%) faster for some of our
apps than 8.1, and all that was achieved without radically redesigning a
proven (reliability-wise) storage engine.
Maybe, and only maybe, one day (or when somebody comes up with a usable
patch - as always) we will be at the point where we really need to think
about doing that, but for now there seems to be a lot of low-hanging
fruit left to improve for months and years to come.

Stefan


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Bruce Momjian" <bruce(at)momjian(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Mark Woodward" <pgsql(at)mohawksoft(dot)com>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 19:46:12
Message-ID: 36e682920606231246k284db2b0g7775e3ccea78944f@mail.gmail.com
Lists: pgsql-hackers

On 6/23/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> What I see in this discussion is a huge amount of "the grass must be
> greener on the other side" syndrome, and hardly any recognition that
> every technique has its downsides and complications.

I'm being totally objective. I don't think we should abandon
PostgreSQL's overall design at all, because we do perform INSERTs and
DELETEs much better than most systems. However, I've looked at many
systems and how they implement UPDATE so that it is a scalable
operation. Sure, there are costs and benefits to each implementation,
but I think we have some pretty brilliant people in this community and
can come up with an elegant design for scalable UPDATEs.

> Furthermore, I do not believe that this project has the ability to
> support multiple fundamental storage models, as a number of
> people seem to be blithely suggesting.

The fundamental storage model was not designed to be scalable. Many
improvements have been done to get it where it is today. The
improvements are certainly good, but the model itself isn't perfect,
and sometimes things need to be redesigned a bit every now and again.
To be clear though, I'm in no way suggesting that we trash and rewrite
PostgreSQL's storage architecture.

> we can certainly make it much
> better than it is now, without risking killing the project by
> introducing undebuggable, unmaintainable complexity.

I'm not saying we rewrite the entire system in C++ (as Firebird
did)... that was a risky and possibly project-killing move. I see us
changing the way a single operation works through community-oriented
planning and design. This approach seems to have worked for years...
and includes the MVCC implementation itself.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-23 21:27:30
Message-ID: 200606232127.k5NLRUJ23517@momjian.us
Lists: pgsql-hackers

Jonah H. Harris wrote:
> On 6/23/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > What I see in this discussion is a huge amount of "the grass must be
> > greener on the other side" syndrome, and hardly any recognition that
> > every technique has its downsides and complications.
>
> I'm being totally objective. I don't think we should abandon
> PostgreSQL's overall design at all, because we do perform INSERTs and
> DELETEs much better than most systems. However, I've looked at many
> systems and how they implement UPDATE so that it is a scalable
> operation. Sure, there are costs and benefits to each implementation,
> but I think we have some pretty brilliant people in this community and
> can come up with an elegant design for scalable UPDATEs.

I think the UPDATE case is similar to the bitmap index scan or perhaps
bitmap indexes on disk --- there are cases we know cannot be handled
well by our existing code, so we have added (or might add) these
features to try to address those difficult cases.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Rick Gigger <rick(at)alpinenetworking(dot)com>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 00:15:54
Message-ID: 8BD20EE1-DE53-46AB-854C-826E10ED16A8@alpinenetworking.com
Lists: pgsql-hackers

Just out of curiosity Mark, didn't you write your session daemon so
that you don't have to put sessions in postgres anymore? Or are you
just giving that as an example of a very wide, very heavily updated
table? My session tables have been an extreme case of this problem,
but no other table that I have is so adversely affected by this
behavior. My decision was not to pull postgres out entirely, just to
use other session handlers.

Rick

On Jun 22, 2006, at 7:59 AM, Mark Woodward wrote:

>> After a long battle with technology, pgsql(at)mohawksoft(dot)com ("Mark
>> Woodward"), an earthling, wrote:
>>>> Clinging to sanity, pgsql(at)mohawksoft(dot)com ("Mark Woodward")
>>>> mumbled into
>>>> her beard:
> [snip]
>>>>
>>>> 1. The index points to all the versions, until they get
>>>> vacuumed out.
>>>
>>> It can't point to "all" versions, it points to the last "current"
>>> version
>>> as updated by vacuum, or the first version of the row.
>>
>> No, it points to *all* the versions.
>>
>> Suppose I take a table with two rows:
>>
>> INFO: analyzing "public.test"
>> INFO: "test": 1 pages, 2 rows sampled, 2 estimated total rows
>> VACUUM
>>
>> Then, over and over, I remove and insert one entry with the same PK:
>>
>> sample=# delete from test where id = 2;insert into test (id)
>> values (2);
>> DELETE 1
>
> [snip]
>>
>> Now, I vacuum it.
>>
>> sample=# vacuum verbose analyze test;
>> INFO: vacuuming "public.test"
>> INFO: index "test_id_key" now contains 2 row versions in 2 pages
>> DETAIL: 10 index row versions were removed.
>> 0 index pages have been deleted, 0 are currently reusable.
>> CPU 0.00s/0.00u sec elapsed 0.00 sec.
>> INFO: "test": removed 10 row versions in 1 pages
>> DETAIL: CPU 0.00s/0.00u sec elapsed 0.00 sec.
>> INFO: "test": found 10 removable, 2 nonremovable row versions in
>> 1 pages
>> DETAIL: 0 dead row versions cannot be removed yet.
>> There were 0 unused item pointers.
>> 0 pages are entirely empty.
>> CPU 0.00s/0.00u sec elapsed 0.00 sec.
>> INFO: analyzing "public.test"
>> INFO: "test": 1 pages, 2 rows sampled, 2 estimated total rows
>> VACUUM
>>
>> Notice that the index contained 10 versions of that one row.
>>
>> It pointed to *ALL* the versions.
>
> Hmm, OK, then the problem is more serious than I suspected.
> This means that every index on a row has to be updated on every
> transaction that modifies that row. Is that correct?
>
> I am attaching some code that shows the problem with regard to
> applications such as web server session management, when run, each
> second
> the system can handle fewer and fewer connections. Here is a brief
> output:
>
> markw(at)ent:~/pgfoo$ ./footest
> 1307 sessions per second, elapsed: 1
> 1292 sessions per second, elapsed: 2
> 1287 sessions per second, elapsed: 3
> ....
> 1216 sessions per second, elapsed: 25
> 1213 sessions per second, elapsed: 26
> 1208 sessions per second, elapsed: 27
> ....
> 1192 sessions per second, elapsed: 36
> 1184 sessions per second, elapsed: 37
> 1183 sessions per second, elapsed: 38
> ....
> 1164 sessions per second, elapsed: 58
> 1170 sessions per second, elapsed: 59
> 1168 sessions per second, elapsed: 60
>
> As you can see, in about a minute at high load, this very simple table
> lost about 10% of its performance, and I've seen worse based on update
> frequency. Before you say this is an obscure problem, I can tell
> you it
> isn't. I have worked with more than a few projects that had to
> switch away
> from PostgreSQL because of this behavior.
>
> Obviously this is not a problem with small sites, but this is a real
> problem with an enterprise level web site with millions of visitors
> and
> actions a day. Quite frankly it is a classic example of something that
> does not scale. The more and more updates there are, the higher the
> load
> becomes. You can see it on "top" as the footest program runs.
>
> There has to be a more linear way of handling this scenario.
> <footest.c>
>


From: Rick Gigger <rick(at)alpinenetworking(dot)com>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 00:39:17
Message-ID: B930486B-E0F1-4D4F-BD1C-2C9F61703136@alpinenetworking.com
Lists: pgsql-hackers


On Jun 22, 2006, at 10:33 AM, Mark Woodward wrote:

>> Ühel kenal päeval, N, 2006-06-22 kell 09:59, kirjutas Mark
>> Woodward:
>>>> After a long battle with technology, pgsql(at)mohawksoft(dot)com ("Mark
>>>> Woodward"), an earthling, wrote:
>>>>>> Clinging to sanity, pgsql(at)mohawksoft(dot)com ("Mark Woodward")
>>>>>> mumbled
>>> into
>>>> It pointed to *ALL* the versions.
>>>
>>> Hmm, OK, then the problem is more serious than I suspected.
>>> This means that every index on a row has to be updated on every
>>> transaction that modifies that row. Is that correct?
>>
>> Yes.
>>
>>> I am attaching some code that shows the problem with regard to
>>> applications such as web server session management, when run, each
>>> second
>>> the system can handle fewer and fewer connections. Here is a brief
>>> output:
>>>
>>> markw(at)ent:~/pgfoo$ ./footest
>>> 1307 sessions per second, elapsed: 1
>>> 1292 sessions per second, elapsed: 2
>>> 1287 sessions per second, elapsed: 3
>>> ....
>>> 1216 sessions per second, elapsed: 25
>>> 1213 sessions per second, elapsed: 26
>>> 1208 sessions per second, elapsed: 27
>>> ....
>>> 1192 sessions per second, elapsed: 36
>>> 1184 sessions per second, elapsed: 37
>>> 1183 sessions per second, elapsed: 38
>>> ....
>>> 1164 sessions per second, elapsed: 58
>>> 1170 sessions per second, elapsed: 59
>>> 1168 sessions per second, elapsed: 60
>>>
>>> As you can see, in about a minute at high load, this very simple
>>> table
>>> lost about 10% of its performance, and I've seen worse based on
>>> update
>>> frequency. Before you say this is an obscure problem, I can tell
>>> you it
>>> isn't. I have worked with more than a few projects that had to
>>> switch
>>> away
>>> from PostgreSQL because of this behavior.
>>
>> You mean systems that are designed so exactly, that they can't
>> take 10%
>> performance change ?
>
> No, that's not really the point, performance degrades over time, in
> one
> minute it degraded 10%.
>
> The update to session ratio has a HUGE impact on PostgreSQL. If you
> have a
> thousand active sessions, it may take a minute to degrade 10% assuming
> some level of active vs operations per session per action.

How big are your sessions? Running with about 1000 sessions, a vacuum
of just the session table is so fast it is barely noticeable.
Vacuuming my session table every 5 minutes keeps it very, very
small and easy to vacuum, and performance degradation is not an
issue. I could probably do it every minute if I had to and it would
be fine. But my sessions are only about 5k on average.

What is a bigger concern for me is the massive amount of writes to
the disk that happen in postgres to make sure the data is safe. It's
just a waste of disk bandwidth for data that is transient
anyway.

To me, postgres (and RDBMSes in general) is just not good for
handling session data for web apps. Once again, isn't that why you
wrote mcache?

>
> If an active user causes a session update once a second, that is
> not too
> bad, but if an active user updates a session more often, then it is
> worse.
>
> Generally speaking, sessions aren't updated when they change, they are
> usually updated per HTTP request. The data in a session may not
> change,
> but the session handling code doesn't know this and simply updates
> anyway.

This problem is more or less specific to php, no? It reads the whole
session on session open and writes the whole thing on close. Because
of this I am looking into smarter ways of handling sessions, since
the session data RARELY changes: I am looking into ways of only doing
updates when the data actually changes. In fact, for a very similar
problem, where I had tons of tiny requests coming in that would NEVER
alter the sessions, I skipped session_start and used my own
session_touch function to update the timestamp on the session,
because that's all I needed to do. It saved TONS of wasted
overhead.

I don't mean to get off topic, but it seems like these sorts of
problems are better solved outside of postgres. I think your session
daemon is in fact the right approach here. If you have other tables
with similar problems that is one thing, but if it is just php session
tables then I think we need to look for a better use case before
looking into this.

>
> In a heavily AJAX site, you may have many smaller HTTP requests
> returning
> items in a page. So, a single page may consist of multiple HTTP
> requests.
> Worse yet, as a user drags an image around, there are lots of
> background
> requests being made. Each request typically means a session lookup
> and a
> session update. This is compounded by the number of active users.
> Since
> the object of a site is to have many active users, this is always a
> problem. It is less intrusive now that non-locking vacuum is there,
> but
> that doesn't mean it isn't a problem.

Once again, I think to run an "Enterprise" app (and by that I mean one
that scales well) you need "Enterprise"-class session management. The
php model is not good for this, and using postgres is not good for
this. It's just not the right tool for this job in my opinion. I would
think you could gain WAY more by using a more intelligent session
handling model than you could ever eke out of postgres for this
problem.

>
>>
>> Or just that they did not vacuum for so long, that performance was
>> less
>> than needed in the end?
>
> In an active site or application, vacuuming often enough to prevent
> this is, itself, a load on the system.

Once again, this is just anecdotal evidence, but for me that load is
so small it is virtually unnoticeable. The key here is that session
tables are SMALL, that is, less than a few thousand rows. Maybe you
are storing a lot more data in there than I am, but for me a vacuum
every 5 minutes solves the performance degradation problem. It is
still too much disk bandwidth, though, so I am just going to stop
storing sessions in postgres.

>
>>
>> btw, what did they switch to ?
>
> One switched to oracle and one is using a session handler I wrote
> for PHP.

Just switching the session handler sounds like a pretty good
solution to this problem to me.

just my $0.02

Rick


From: Rick Gigger <rick(at)alpinenetworking(dot)com>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: "PFC" <lists(at)peufeu(dot)com>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 01:00:57
Message-ID: E7EC710D-4D7E-4E6B-9164-004D0E7B1006@alpinenetworking.com
Lists: pgsql-hackers


On Jun 22, 2006, at 2:36 PM, Mark Woodward wrote:

>>
>>> What you seem not to grasp at this point is a large web-farm,
>>> about 10
>>> or
>>> more servers running PHP, Java, ASP, or even perl. The database is
>>> usually
>>> the most convenient and, aside from the particular issue we are
>>> talking
>>> about, best suited.
>>
>> The answer is sticky sessions: each user is assigned to one and
>> only one
>> webserver in the cluster and his session is maintained locally, in
>> RAM. No
>> locks, no need to manage distributed session...
>>
>>> I actually have a good number of years of experience in this
>>> topic, and
>>> memcached or file system files are NOT the best solutions for a
>>> server
>>> farm.
>>
>> If sessions are distributed, certainly, but if sessions are
>> sticky to
>> their own server ?
>
> And what if a particular server goes down? Or gets too high a
> percentage of
> the load?

Yes, I don't think that sticky sessions are the answer. But PHP's
session handling behavior could be greatly improved on.


From: mark(at)mark(dot)mielke(dot)cc
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 01:56:40
Message-ID: 20060624015640.GA9234@mark.mielke.cc
Lists: pgsql-hackers

On Fri, Jun 23, 2006 at 03:08:34PM -0400, Bruce Momjian wrote:
> Tom Lane wrote:
> > ...
> > suggesting. We're having a hard enough time debugging and optimizing
> > *one* storage model. I think the correct path forward is to stick with
> > the same basic storage model and vacuuming concept, and address the
> > known performance issues with better-optimized vacuuming. No, it will
> > never be perfect for every scenario, but we can certainly make it much
> > better than it is now, without risking killing the project by
> > introducing undebuggable, unmaintainable complexity.
> Well, are you suggesting we just stop improving the database? I am sure
> not. But, your suggestion is that we can't do better without incurring
> more complexity (true), and that complexity will not be worth it. I
> don't agree with that until I see some proposals, and shutting down
> discussion because they will add complexity or are fruitless seems
> unwise.

It sounds like everybody agrees that things need to be fixed, and genuinely
motivated people are trying to offer what they have to the table.

Tom already has enough on his plate, as do most others here - so unless
a competent champion can take up the challenge, discussion is all we have.

I'm not liking the "we should do it this way," "no, we should do it that."
My read of the situation is that both may be useful, and that both should
be pursued. But one set of people can't pursue both.

Is anyone who is able, able to take up this challenge? Perhaps more than one,
from both major directions? (vacuum on one side, and improved storage on
the other) Somebody with the time and skill, who can work through the
design discussions on one of the aspects?

I want to contribute soon, and this is the sort of thing that interests me -
but I still don't have time yet, and there would be no guarantee that I
succeeded. Somebody else? :-)

Cheers,
mark

--
mark(at)mielke(dot)cc / markm(at)ncf(dot)ca / markm(at)nortel(dot)com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Rick Gigger" <rick(at)alpinenetworking(dot)com>
Cc: "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 02:57:53
Message-ID: 18829.24.91.171.78.1151117873.squirrel@mail.mohawksoft.com
Lists: pgsql-hackers

> Just out of curiosity Mark, didn't you write your session daemon so
> that you don't have to put sessions in postgres anymore?

The original project started as a shared key/value system for a beowulf
cluster in the late 90s, but got reworked into MSession, a session handler
for PHP, when I worked with Stig.

> Or are you
> just giving that as an example of a very wide, very heavily updated
> table? My session tables have been an extreme case of this problem,
> but no other table that I have is so adversely affected by this
> behavior. My decision was not to pull postgres out entirely, just
> using other session handlers.

I have been working as a consultant since 2001, and prior to that, as CTO
at a dot.com startup. MSession (the previous name) was used to
circumvent shortcomings in PostgreSQL, specifically the problem we are
talking about.

As a consultant, I have to convince the customer that "all is well." My
MCache system does not guarantee that no session data is lost, and neither
does memcached or any other non-ACID system.

The technical arguments we can make, no matter how correct, leave us on
the defensive when asked "What if the server crashes, do you lose data?"
Of course the answer is "yes." Then we get drawn into a conversation about
transient and unimportant data vs. persistent and valuable data. At which
point you've lost the customer. A solid SQL database is the de facto
standard; perhaps not the best choice, but unavoidable.

The update behavior of PostgreSQL is probably the *last* serious issue.
Debate all you want; vacuum mitigates the problem to varying degrees, but
fixing the problem will be a huge win. If the update behavior gets fixed,
I can't think of a single issue with postgresql that would be a show
stopper.

>
> Rick
>
> On Jun 22, 2006, at 7:59 AM, Mark Woodward wrote:
>
>>> After a long battle with technology, pgsql(at)mohawksoft(dot)com ("Mark
>>> Woodward"), an earthling, wrote:
>>>>> Clinging to sanity, pgsql(at)mohawksoft(dot)com ("Mark Woodward")
>>>>> mumbled into
>>>>> her beard:
>> [snip]
>>>>>
>>>>> 1. The index points to all the versions, until they get
>>>>> vacuumed out.
>>>>
>>>> It can't point to "all" versions, it points to the last "current"
>>>> version
>>>> as updated by vacuum, or the first version of the row.
>>>
>>> No, it points to *all* the versions.
>>>
>>> Suppose I take a table with two rows:
>>>
>>> INFO: analyzing "public.test"
>>> INFO: "test": 1 pages, 2 rows sampled, 2 estimated total rows
>>> VACUUM
>>>
>>> Then, over and over, I remove and insert one entry with the same PK:
>>>
>>> sample=# delete from test where id = 2;insert into test (id)
>>> values (2);
>>> DELETE 1
>>
>> [snip]
>>>
>>> Now, I vacuum it.
>>>
>>> sample=# vacuum verbose analyze test;
>>> INFO: vacuuming "public.test"
>>> INFO: index "test_id_key" now contains 2 row versions in 2 pages
>>> DETAIL: 10 index row versions were removed.
>>> 0 index pages have been deleted, 0 are currently reusable.
>>> CPU 0.00s/0.00u sec elapsed 0.00 sec.
>>> INFO: "test": removed 10 row versions in 1 pages
>>> DETAIL: CPU 0.00s/0.00u sec elapsed 0.00 sec.
>>> INFO: "test": found 10 removable, 2 nonremovable row versions in
>>> 1 pages
>>> DETAIL: 0 dead row versions cannot be removed yet.
>>> There were 0 unused item pointers.
>>> 0 pages are entirely empty.
>>> CPU 0.00s/0.00u sec elapsed 0.00 sec.
>>> INFO: analyzing "public.test"
>>> INFO: "test": 1 pages, 2 rows sampled, 2 estimated total rows
>>> VACUUM
>>>
>>> Notice that the index contained 10 versions of that one row.
>>>
>>> It pointed to *ALL* the versions.
>>
>> Hmm, OK, then the problem is more serious than I suspected.
>> This means that every index on a row has to be updated on every
>> transaction that modifies that row. Is that correct?
>>
>> I am attaching some code that shows the problem with regard to
>> applications such as web server session management, when run, each
>> second
>> the system can handle fewer and fewer connections. Here is a brief
>> output:
>>
>> markw(at)ent:~/pgfoo$ ./footest
>> 1307 sessions per second, elapsed: 1
>> 1292 sessions per second, elapsed: 2
>> 1287 sessions per second, elapsed: 3
>> ....
>> 1216 sessions per second, elapsed: 25
>> 1213 sessions per second, elapsed: 26
>> 1208 sessions per second, elapsed: 27
>> ....
>> 1192 sessions per second, elapsed: 36
>> 1184 sessions per second, elapsed: 37
>> 1183 sessions per second, elapsed: 38
>> ....
>> 1164 sessions per second, elapsed: 58
>> 1170 sessions per second, elapsed: 59
>> 1168 sessions per second, elapsed: 60
>>
>> As you can see, in about a minute at high load, this very simple table
>> lost about 10% of its performance, and I've seen worse based on update
>> frequency. Before you say this is an obscure problem, I can tell
>> you it
>> isn't. I have worked with more than a few projects that had to
>> switch away
>> from PostgreSQL because of this behavior.
>>
>> Obviously this is not a problem with small sites, but this is a real
>> problem with an enterprise level web site with millions of visitors
>> and
>> actions a day. Quite frankly it is a classic example of something that
>> does not scale. The more and more updates there are, the higher the
>> load
>> becomes. You can see it on "top" as the footest program runs.
>>
>> There has to be a more linear way of handling this scenario.
>> <footest.c>
>>
>


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 06:54:56
Message-ID: 449CE1C0.7020309@Yahoo.com
Lists: pgsql-hackers

On 6/23/2006 3:10 PM, Mark Woodward wrote:

> This is NOT an "in-place" update. The whole MVCC strategy of keeping old
> versions around doesn't change. The only thing that does change is one
> level of indirection. Rather than keep references to all versions of all
> rows in indexes, keep only a reference to the first or "key" row of each
> row, and have the first version of a row form the head of a linked list to
> subsequent versions of each row. The list will be in descending order.

Where exactly do you intend to keep all those links (for a table with N
indexes)?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: mark(at)mark(dot)mielke(dot)cc
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 07:29:47
Message-ID: 449CE9EB.7060403@Yahoo.com
Lists: pgsql-hackers

On 6/23/2006 9:56 PM, mark(at)mark(dot)mielke(dot)cc wrote:

> On Fri, Jun 23, 2006 at 03:08:34PM -0400, Bruce Momjian wrote:
>> Tom Lane wrote:
>> > ...
>> > suggesting. We're having a hard enough time debugging and optimizing
>> > *one* storage model. I think the correct path forward is to stick with
>> > the same basic storage model and vacuuming concept, and address the
>> > known performance issues with better-optimized vacuuming. No, it will
>> > never be perfect for every scenario, but we can certainly make it much
>> > better than it is now, without risking killing the project by
>> > introducing undebuggable, unmaintainable complexity.
>> Well, are you suggesting we just stop improving the database? I am sure
>> not. But, your suggestion is that we can't do better without incurring
>> more complexity (true), and that complexity will not be worth it. I
>> don't agree with that until I see some proposals, and shutting down
>> discussion because they will add complexity or are fruitless seems
>> unwise.
>
> It sounds like everybody agrees that things need to be fixed, and genuinely
> motivated people are trying to offer what they have to the table.

One single core team member responds vaguely in a way you feel is
supportive of your case, and you conclude that "everybody agrees"?
Sorry, excuse me?

There are a couple of side effects of this "update in place" issue that
aren't even mentioned yet. Nobody with any significant in-depth
knowledge of the Postgres non-overwriting storage engine concept seems
to suggest any move towards a storage system that does updates in place
and requires "undo" operations in case of a process/system failure.
You're ready to "fix" all those places to support the undo you need? You
must have a big table.

Jan

>
> Tom already has enough on his plate, as do most others here - so unless
> a competent champion can take up the challenge, discussion is all we have.
>
> I'm not liking the "we should do it this way," "no, we should do it that."
> My read of the situation is that both may be useful, and that both should
> be pursued. But one set of people can't pursue both.
>
> Is any who is able, able to take up this challenge? Perhaps more than one,
> from both major directions? (vacuum on one side, and improved storage on
> the other) Somebody with the time and skill, who can work through the
> design discussions on one of the aspects?
>
> I want to contribute soon, and this is the sort of thing that interests me -
> but I still don't have time yet, and there would be no guarantee that I
> succeeded. Somebody else? :-)
>
> Cheers,
> mark
>

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: mark(at)mark(dot)mielke(dot)cc
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 08:32:20
Message-ID: 20060624083220.GA13154@mark.mielke.cc
Lists: pgsql-hackers

On Sat, Jun 24, 2006 at 03:29:47AM -0400, Jan Wieck wrote:
> >It sounds like everybody agrees that things need to be fixed, and genuinely
> >motivated people are trying to offer what they have to the table.
> One singe core team member responds vaguely in a way, you feel being
> supportive of your case, and you conclude that "everybody agrees"?
> Sorry, x'use me?

> There are a couple of side effects on this "update in place" issue that
> aren't even mentioned yet. Nobody with any significant in depth
> knowledge of the Postgres non-overwriting storage engine concept seems
> to suggest any move towards a storage system, that does updates in place
> that require "undo" operations in case of a process/system failure.
> You're ready to "fix" all those places to support the undo you need? You
> must have a big table.

Jan: Who on the list has claimed that nothing is broken and nothing needs
to be improved? Are you making this claim?

Shutting down ideas is not constructive. Encouraging those with ideas to
step up and do more than talk could be.

Cheers,
mark

--
mark(at)mielke(dot)cc / markm(at)ncf(dot)ca / markm(at)nortel(dot)com __________________________
. . _ ._ . . .__ . . ._. .__ . . . .__ | Neighbourhood Coder
|\/| |_| |_| |/ |_ |\/| | |_ | |/ |_ |
| | | | | \ | \ |__ . | | .|. |__ |__ | \ |__ | Ottawa, Ontario, Canada

One ring to rule them all, one ring to find them, one ring to bring them all
and in the darkness bind them...

http://mark.mielke.cc/


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: Mark Woodward <pgsql(at)mohawksoft(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 09:40:23
Message-ID: Pine.OSF.4.61.0606241215230.44498@kosh.hut.fi
Lists: pgsql-hackers

On Fri, 23 Jun 2006, Jonah H. Harris wrote:

> On 6/23/06, Mark Woodward <pgsql(at)mohawksoft(dot)com> wrote:
>> Rather than keep references to all versions of all
>> rows in indexes, keep only a reference to the first or "key" row of each
>> row, and have the first version of a row form the head of a linked list to
>> subsequent versions of each row. The list will be in descending order.
>
> By all means, please go ahead and try it because it's not quite that
> easy. You're going to run into serious locking and contention issues
> this way. In the end, it's not much better than running a sequential
> scan to query a row that's been updated several thousand times on a
> table that hasn't been vacuumed... follow that pointer :)

Can you elaborate on what kind of locking and contention issues you're
thinking of?

You could update the index tuple to point to a newer version of the row
when an index scan determines that the heap tuple it points to is not
visible to anyone. We already check that to update the XMAX_COMMITTED hint
bit. Updating the index tuple comes with a cost, of course, but
alleviates the "follow that pointer" issue.

The biggest challenge that I see is that an index scan would somehow
need to know when to follow the t_ctid chain and when not. If you follow
the pointer and there's another index tuple for the row, the scan could
see the same tuple twice. Some kind of bookkeeping would be needed to
solve that.

Also, vacuuming would become a bit more complex, since it would need to
update the index tuples to point to newer row versions instead of just
removing them.

All in all, I think this solution to the "an update needs to update all
indexes, even when none of the indexed columns changed" issue requires
fewer changes than implementing Oracle-style rollback segments and/or an
undo log.

- Heikki


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Jan Wieck" <JanWieck(at)Yahoo(dot)com>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 12:14:10
Message-ID: 18564.24.91.171.78.1151151250.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On 6/23/2006 3:10 PM, Mark Woodward wrote:
>
>> This is NOT an "in-place" update. The whole MVCC strategy of keeping old
>> versions around doesn't change. The only thing that does change is one
>> level of indirection. Rather than keep references to all versions of all
>> rows in indexes, keep only a reference to the first or "key" row of each
>> row, and have the first version of a row form the head of a linked list
>> to
>> subsequent versions of each row. The list will be in descending order.
>
> Where exactly do you intend to keep all those links (for a table with N
> indexes)?
>

I'm probably mistaken, but aren't there already forward references in
tuples to later versions? If so, I'm only suggesting reversing the order
and referencing the latest version.


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 12:18:08
Message-ID: 20060624121808.GB24111@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jun 24, 2006 at 08:14:10AM -0400, Mark Woodward wrote:
> > On 6/23/2006 3:10 PM, Mark Woodward wrote:
> >
> >> This is NOT an "in-place" update. The whole MVCC strategy of keeping old
> >> versions around doesn't change. The only thing that does change is one
> >> level of indirection. Rather than keep references to all versions of all
> >> rows in indexes, keep only a reference to the first or "key" row of each
> >> row, and have the first version of a row form the head of a linked list
> >> to
> >> subsequent versions of each row. The list will be in descending order.
> >
> > Where exactly do you intend to keep all those links (for a table with N
> > indexes)?
> >
>
> I'm probably mistaken, but aren't there already forward references in
> tuples to later versions? If so, I'm only suggesting reversing the order
> and referencing the latest version.

You can't do that. The links exist so that in READ COMMITTED mode you
can always find the newest version. You would need to add additional
links to go backwards.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 12:36:05
Message-ID: 20060624123605.GC24111@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jun 23, 2006 at 03:10:39PM -0400, Mark Woodward wrote:
> This is NOT an "in-place" update. The whole MVCC strategy of keeping old
> versions around doesn't change. The only thing that does change is one
> level of indirection. Rather than keep references to all versions of all
> rows in indexes, keep only a reference to the first or "key" row of each
> row, and have the first version of a row form the head of a linked list to
> subsequent versions of each row. The list will be in descending order.

I thought of another issue with this. If you move away from storing
each row in the indexes, you can pretty much forget bitmap index scans.
They rely on every row being represented, not just a subset. Until you
go to the heap you don't know if a tuple will match,
which is precisely what the bitmap scan is trying to avoid. You could
follow the links, but that destroys the nice sequential access
properties.

> In the vast majority of cases, the overhead of this action will be
> trivial. In an unmodified row, you're there. In a modified row, you have
> one extra lookup. In extreme cases, you may have to go back a few
> versions, but I don't see that as a common behavior.

I wonder if looking at old versions is really all that uncommon. A
large reporting query which runs for hours will probably be looking at
a lot of old versions. These are the queries that will be hit the
hardest.

If you're trying to avoid index bloat, I wonder if it wouldn't be
better to tackle this from the other end. In indexes, allow a row to
carry multiple CTIDs. That way a new version only requires adding six
more bytes, rather than a whole new tuple.
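
Schematically, such an index entry might look like the struct below.
This is purely illustrative (not the real on-disk format), but it shows
why each additional version costs only about six bytes:

#include <stdint.h>

/* 6 bytes of payload per heap pointer (the compiler may pad). */
typedef struct { uint32_t block; uint16_t offset; } Ctid;

typedef struct {
    uint16_t nctids;     /* how many row versions this key entry covers */
    /* the indexed key value itself would be stored here */
    Ctid     ctids[1];   /* grows by one 6-byte pointer per new version */
} MultiCtidEntry;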

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 12:43:13
Message-ID: Pine.OSF.4.61.0606241536290.44498@kosh.hut.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, 24 Jun 2006, Mark Woodward wrote:

> I'm probably mistaken, but aren't there already forward references in
> tuples to later versions? If so, I'm only suggesting reversing the order
> and referencing the latest version.

I thought I understood your idea, but now you lost me again. I thought
what you want is that the older heap tuple has a pointer to the
newer one. Which it already has: it's called t_ctid.

Can you try to explain more carefully how the whole thing would work?
What would an index tuple point to? What pointers would a heap tuple
have? What would an index scan do to find the row version it's interested
in? What exactly would an update do?

- Heikki


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
Cc: "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>, "Jan Wieck" <janwieck(at)yahoo(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 13:13:28
Message-ID: 36e682920606240613i518a022bg80a4dfc1c06448f9@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/24/06, Mark Woodward <pgsql(at)mohawksoft(dot)com> wrote:
> Currently it looks like this:
>
> ver001->ver002->ver003->...-verN
>
> That's what t_ctid does now, right? Well, that's sort of stupid. Why not
> have it do this:
>
> ver001->verN->...->ver003->ver002->|

Heh, because that's crazy. The first time you insert a key into the
index it will point to v1 of a tuple... say after 5 updates you have
v2,v3,v4,v5... your c_tid pointer chain looks like v1
(original)->v2->v3->v4->v5 (newest). However, your whole idea is based
on not having to do another index insert for unchanged keys, so the
index still points to v1... which means you have to follow the c_tid
chain to get to the newest version just like a sequential scan. I
don't see how you think you can reverse-pointer it.

> This will speed up almost *all* queries when there are more than two
> versions of rows.

Nope.

> When you vacuum, simply make the latest version (verN) the key row (ver001).

How are you going to do this without a ton of locking... remember, the
index is pointing to v1 with a tid... so you'll have to physically
move the newest version v5 to v1's tid from wherever it was... like a
vacuum full on steroids. Unless of course, you rebuild the index...
but that's not a solution either.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
Cc: "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>, "Jan Wieck" <janwieck(at)yahoo(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 13:14:24
Message-ID: 36e682920606240614g605afcf8y6afe5a7addc206b3@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/24/06, Jonah H. Harris <jonah(dot)harris(at)gmail(dot)com> wrote:

Grr... need coffee... s/c_tid/ctid/g

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>
Cc: "Jan Wieck" <janwieck(at)yahoo(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 13:23:28
Message-ID: 18926.24.91.171.78.1151155408.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Sat, 24 Jun 2006, Mark Woodward wrote:
>
>> I'm probably mistaken, but aren't there already forward references in
>> tuples to later versions? If so, I'm only suggesting reversing the order
>> and referencing the latest version.
>
> I thought I understood your idea, but now you lost me again. I thought
> what you want is that the older heap tuple has a pointer to the
> newer one. Which it already has: it's called t_ctid.

Perfect!
>
> Can you try to explain more carefully how the whole thing would work?
> What would an index tuple point to? What pointers would a heap tuple
> have? What would an index scan do to find the row version it's interested
> in? What exactly would an update do?

Since we already allocate space for some notion of a linked list, all
I'm suggesting is that we reverse the order, sort of. Currently it looks
like this:

ver001->ver002->ver003->...-verN

That's what t_ctid does now, right? Well, that's sort of stupid. Why not
have it do this:

ver001->verN->...->ver003->ver002->|
^---------------------------------/

This will speed up almost *all* queries when there are more than two
versions of rows.

OK, here is the behavior of an update:
(1) Find the latest version of the row
(2) Duplicate row and modify as per plan
(3) Set the t_ctid of the new row to the last "latest"
(4) Set the t_ctid of the first row to that of the new row
(5) Attempt to index the row
(6) If the first version of the row (ver001) is already in the index, don't
modify the index; otherwise, add the new version (just as before)

When you vacuum, simply make the latest version (verN) the key row (ver001).
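
In rough C, steps (1) through (6) might look something like this sketch
(the Tuple type and every name here are invented for illustration; this
is not the actual PostgreSQL code):

typedef struct Tuple Tuple;
struct Tuple {
    Tuple *t_ctid;          /* successor pointer, as discussed above */
    /* column data would live here */
};

/* Hypothetical update path for the reversed chain. */
static void update_row(Tuple *ver001, Tuple *newver)
{
    /* (1) the latest version is whatever ver001 points at,
       or ver001 itself for a never-updated row */
    Tuple *latest = ver001->t_ctid ? ver001->t_ctid : ver001;
    /* (2) the caller has already duplicated and modified the
       row into newver, as per plan */
    newver->t_ctid = latest;    /* (3) new row -> previous "latest" */
    ver001->t_ctid = newver;    /* (4) key row -> new latest */
    /* (5)+(6) insert an index entry only for indexed values not
       already covered by ver001 (not shown) */
}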

There are, no doubt, issues that need to be resolved (I can think of a
couple off the top of my head), but overall I think it is workable. I
don't think this will affect performance in the simple case, and it
should improve performance in the cases where there are more than one or
two versions of a row.


From: "Jochem van Dieten" <jochemd(at)gmail(dot)com>
To: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
Cc: "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 14:12:03
Message-ID: f96a9b830606240712m6fb00637g560715e197ae6c00@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/24/06, Mark Woodward wrote:
>
> ver001->verN->...->ver003->ver002->|
> ^---------------------------------/
>
> This will speed up almost *all* queries when there are more than two
> versions of rows.
>
> OK, here is the behavior of an update:
> (1) Find the latest version of the row
> (2) Duplicate row and modify as per plan
> (3) Set the t_ctid of the new row to the last "latest"
> (4) Set the t_ctid of the first row to that of the new row
> (5) Attempt to index the row
> (6) If the first version of the row (ver001) is already in the index, don't
> modify the index; otherwise, add the new version (just as before)

I am not sure I understand your algorithm. If we take as a starting
point the following situation of a fresh tuple, in very schematic form
it looks like:

Heap:
TID   T_CTID  XMIN  XMAX  Col1  Col2
xxx1  xxx1    ttt1  null  1     1

Index on Col1:
1 xxx1

Index on Col2:
1 xxx1

Now, after an update to this tuple changing the Col2 field, in what
state should the heap, index 1 and index 2 be? If I understand you
correctly, you want it to be:

Heap:
TID   T_CTID  XMIN  XMAX  Col1  Col2
xxx1  xxx2    ttt1  ttt2  1     1
xxx2  xxx1    ttt2  null  1     2

Index on Col1:
1 xxx2

Index on Col2:
1 xxx1
2 xxx2

Three questions:
1. Do I understand your intention correctly?
2. Could you extend this for an update that increments Col2 (because
the T_CTID logic gets a bit unclear for me there)?
3. The net benefit of this would be one fewer entry in the index on Col1?

Jochem


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Jan Wieck <janwieck(at)yahoo(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 14:12:27
Message-ID: 20060624141227.GD24111@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jun 24, 2006 at 09:23:28AM -0400, Mark Woodward wrote:
> > Can you try to explain more carefully how the whole thing would work?
> > What would an index tuple point to? What pointers would a heap tuple
> > have? What would an index scan do to find the row version it's interested
> > in? What exactly would an update do?
>
>
> Since we already allocate space for some notion of a linked list, all
> I'm suggesting is that we reverse the order, sort of. Currently it looks
> like this:
>
> ver001->ver002->ver003->...-verN
>
> That's what t_ctid does now, right? Well, that's sort of stupid. Why not
> have it do this:
>
> ver001->verN->...->ver003->ver002->|
> ^---------------------------------/

You don't say where the index points or the order, but I'm assuming
from your diagram that ver1 is the oldest, verN is the newest.
Currently there is an index entry for each version, but in your version
there is only an index entry for ver1, right?

> This will speed up almost *all* queries when there are more than two
> versions of rows.
>
> OK, here is the behavior of an update:
> (1) Find the latest version of the row

You mean, find the version of the row which satisfies your snapshot. If
the version pointed to by the index is it, you're done. Otherwise you
follow the chain. The most common case is one step, because ver001
is likely to be invisible.

> (2) Duplicate row and modify as per plan
> (3) Set the t_ctid of the new row to the last "latest"
> (4) Set the t_ctid of the first row to that of the new row
> (5) Attempt to index the row
> (6) If the first version of the row (ver001) is already in the index, don't
> modify the index; otherwise, add the new version (just as before)

This looks OK, I guess. I wouldn't know about locking...

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
Cc: "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>, "Jan Wieck" <janwieck(at)yahoo(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 14:16:10
Message-ID: 36e682920606240716p249030a1l3788437b0af18b4d@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/24/06, Mark Woodward <pgsql(at)mohawksoft(dot)com> wrote:
> In the scenario, as previously outlined:
>
> ver001->verN->...->ver003->ver2->|
> ^-----------------------------/

So you want to always keep an old version around?

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>, "Jan Wieck" <janwieck(at)yahoo(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 14:21:02
Message-ID: 18543.24.91.171.78.1151158862.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On 6/24/06, Mark Woodward <pgsql(at)mohawksoft(dot)com> wrote:
>> Currently it looks like this:
>>
>> ver001->ver002->ver003->...-verN
>>
>> That's what t_ctid does now, right? Well, that's sort of stupid. Why not
>> have it do this:
>>
>> ver001->verN->...->ver003->ver002->|
>
> Heh, because that's crazy. The first time you insert a key into the
> index it will point to v1 of a tuple... say after 5 updates you have
> v2,v3,v4,v5... your c_tid pointer chain looks like v1
> (original)->v2->v3->v4-v5 (newest). However, your whole idea is based
> on not having to do another index insert for unchanged keys, so the
> index still points to v1... which means you have to follow the c_tid
> chain to get to the newest version just like a sequential scan. I
> don't see how you think you can reverse pointer it.

In the scenario, as previously outlined:

ver001->verN->...->ver003->ver2->|
^-----------------------------/

The index points to version 1 (ver001) which points to the latest version
(verN).

>
>> This will speed up almost *all* queries when there are more than two
>> versions of rows.
>
> Nope.

Of course it will.

>
>> When you vacuum, simply make the latest version (verN) the key row
>> (ver001).
>
> How are you going to do this without a ton of locking... remember, the
> index is pointing to v1 with a tid... so you'll have to physically
> move the newest version v5 to v1's tid from wherever it was... like a
> vacuum full on steroids. Unless of course, you rebuild the index...
> but that's not a solution either.

I don't understand how you can assume this. In fact, it will probably
reduce locking and disk IO by not having to modify indexes.


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>, "Jan Wieck" <janwieck(at)yahoo(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 14:53:43
Message-ID: 18682.24.91.171.78.1151160823.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On 6/24/06, Mark Woodward <pgsql(at)mohawksoft(dot)com> wrote:
>> In the scenario, as previously outlined:
>>
>> ver001->verN->...->ver003->ver2->|
>> ^-----------------------------/
>
> So you want to always keep an old version around?

Prior to vacuum, it will be there anyway, and after vacuum, the new
version will become ver001.


From: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
To: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
Cc: "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>, "Jan Wieck" <janwieck(at)yahoo(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 15:37:02
Message-ID: 36e682920606240837g4c508ef3s62dcbadbfff5d5fe@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/24/06, Mark Woodward <pgsql(at)mohawksoft(dot)com> wrote:
> > On 6/24/06, Mark Woodward <pgsql(at)mohawksoft(dot)com> wrote:
> >> In the scenario, as previously outlined:
> >>
> >> ver001->verN->...->ver003->ver2->|
> >> ^-----------------------------/
> >
> > So you want to always keep an old version around?
>
> Prior to vacuum, it will be there anyway, and after vacuum, the new
> version will become ver001.

So you do intend to move verN into ver001's slot? What about the
other conditions you had mentioned where you have to follow
PostgreSQL's current behavior? How are you going to have a pointer
chain in that case?

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation | fax: 732.331.1301
33 Wood Ave S, 2nd Floor | jharris(at)enterprisedb(dot)com
Iselin, New Jersey 08830 | http://www.enterprisedb.com/


From: PFC <lists(at)peufeu(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 16:59:24
Message-ID: op.tbnthavhcigqcu@apollo13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> What I see in this discussion is a huge amount of "the grass must be
> greener on the other side" syndrome, and hardly any recognition that
> every technique has its downsides and complications.

Sure ;)

MVCC generates dead rows by its very nature; however, I see two trends
here:

1* A large transaction that updates/deletes many rows.
For instance, suppose you UPDATE an entire table whose size is larger than
memory.

Old row versions have to be kept somewhere until commit, be it in the
table itself or in some accessory undo-log.
So, there will be a lot of hard disk grinding anyway, be it MVCC or
Oracle-style, or whatever. MVCC will bloat the table and indexes, then
VACUUM will shrink them. Update-in-place systems will bloat an undo log.

It seems to me the current MVCC+VACUUM is the right tool for this job,
requiring about the same amount of IO as the others.
Vacuum scans sequentially, so it's the best way to process large volumes
of data.

2* Many short transactions updating rows in a table, like the sessions
problem, for instance.

Current VACUUM sucks for this case; I guess that's known.

-------

So, we have two different problems, and one tool which is adapted to one
problem only. Should the tool (Vacuum) be fixed to handle both problems,
making it more complex and difficult to maintain, or should another tool
be created specifically for the second problem?
Problem 2 is very different from problem 1. The only case when they meet
is when there is a continuous stream of small updates running concurrently
with a long transaction.
So, what is the ideal tool for case 2?

We'd want a vacuuming machine that can be run very often, or even better,
continuously.
The table can be large, larger than the disk cache, so scanning it is not
an option.
The updates are probably randomly distributed across the table. Therefore,
VACUUMing a long time after these transactions are committed and the pages
are no longer in cache would require a lot of random seeks, which is also
bad.
Besides, we want this vacuum to be continuous and transparent.

The best time to vacuum pages is, thus, when they are still in the
background writer's memory, or the disk cache, waiting to be flushed to
disk. There, they can be re-read, vacuumed and re-written with no seek
penalty, only additional WAL traffic. However, the amount of WAL traffic in
bytes/s is less important than the frequency of WAL syncs. Emitting more
WAL data shouldn't be a problem if those sync writes are coalesced with
the sync writes of current transactions.

So, I guess the best solution for case 2 is to have the background writer
perform on-the-fly VACUUM:

An UPDATE or DELETE transaction hands its dirty pages over to the
bgwriter to be written. It also tells the bgwriter the ID of the current
transaction and flags specifying whether they contain candidate dead rows.
The bgwriter should have a sufficiently large buffer in order to be able
to keep these pages in memory until all the transactions that can see the
dead rows in these pages are finished.
Then, the pages are vacuumed and written.
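
In stub C, the per-page bookkeeping might look like this (all names
invented; the real bgwriter works on shared buffers, not a private
queue like this):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct {
    void         *page;           /* the dirty page handed over */
    TransactionId writer_xid;     /* transaction that dirtied it */
    bool          may_have_dead;  /* candidate dead rows inside? */
} PendingPage;

/* Vacuum the page in memory only once no running transaction can
   still see its dead rows; otherwise it is written out as-is and
   left for a later, regular VACUUM. */
static bool ready_to_vacuum(const PendingPage *p, TransactionId oldest_xmin)
{
    return p->may_have_dead && p->writer_xid < oldest_xmin;
}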

The key is the size of the buffer. It should be large enough to contain
enough pages so that it is actually possible to vacuum something out of
them before writing them. However, if the buffer capacity is exceeded (for
instance, because there is a long-running transaction), this is not a
problem: the pages are simply written to disk normally; they will contain
dead rows, which will need to be handled later by the standard VACUUM.

I think this would maximize code reuse by using the current bgwriter's
architecture... did I miss something?


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>
Cc: "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>, "Jan Wieck" <janwieck(at)yahoo(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 18:24:31
Message-ID: 18804.24.91.171.78.1151173471.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On 6/24/06, Mark Woodward <pgsql(at)mohawksoft(dot)com> wrote:
>> > On 6/24/06, Mark Woodward <pgsql(at)mohawksoft(dot)com> wrote:
>> >> In the scenario, as previously outlined:
>> >>
>> >> ver001->verN->...->ver003->ver2->|
>> >> ^-----------------------------/
>> >
>> > So you want to always keep an old version around?
>>
>> Prior to vacuum, it will be there anyway, and after vacuum, the new
>> version will become ver001.
>
> So you do intend to move verN into ver001's slot? What about the
> other conditions you had mentioned where you have to follow
> PostgreSQL's current behavior? How are you going to have a pointer
> chain in that case?

Who said anything about moving anything? When vacuum comes along, it
cleans out previous versions of rows. Very little will change.


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 19:04:43
Message-ID: 1151175883.3884.30.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On one fine day, Fri, 2006-06-23 at 13:08, Tom Lane wrote:
> Csaba Nagy <nagy(at)ecircle-ag(dot)com> writes:
> >> Surprisingly its mostly WAL traffic, the heap/index pages themselves are
> >> often not yet synced to disk by time of vacuum, so no additional traffic
> >> there. If you had made 5 updates per page and then vacuum it, then you
> >> make effectively 1 extra WAL write meaning 20% increase in WAL traffic.
>
> > Is this also holding about read traffic ? I thought vacuum will make a
> > full table scan... for big tables a full table scan is always badly
> > influencing the performance of the box. If the full table scan would be
> > avoided, then I wouldn't mind running vacuum in a loop...
>
> If you're doing heavy updates of a big table then it's likely to end up
> visiting most of the table anyway, no? There is talk of keeping a map
> of dirty pages, but I think it'd be a win for infrequently-updated
> tables, not ones that need constant vacuuming.
>
> I think a lot of our problems in this area could be solved with fairly
> straightforward tuning efforts on the existing autovacuum
> infrastructure. In particular, someone should be looking into
> recommendable default vacuum-cost-delay settings so that a background
> vacuum doesn't affect performance too much.

One thing that would help updates quite a lot in some scenarios is
keeping the pages only partially filled, so that most updates could keep
the new version in the same page. I think that has also been discussed
as an option to vacuum and maybe as part of initial inserts. Maybe some
of it even ended up as a todo item.

> Another problem with the
> current autovac infrastructure is that it doesn't respond very well to
> the case where there are individual tables that need constant attention
> as well as many that don't. If you have N databases then you can visit
> a particular table at most once every N*autovacuum_naptime seconds, and
> *every* table in the entire cluster gets reconsidered at that same rate.
> I'm not sure if we need the ability to have multiple autovac daemons
> running at the same time,

My patch enabling effective continuous vacuum of fast-update tables,
while still being able to vacuum huge, slowly changing ones, is still not
applied. Without that patch there is no reason to vacuum the small and
fast-changing tables while vacuum on bigger tables is running, as it
won't clean out dead tuples anyway.

> but we definitely could use something with a
> more flexible table-visiting pattern. Perhaps it would be enough to
> look through the per-table stats for each database before selecting the
> database to autovacuum in each cycle, instead of going by "least
> recently autovacuumed".
>
> Bottom line: there's still lots of low-hanging fruit. Why are people
> feeling that we need to abandon or massively complicate our basic
> architecture to make progress?

Maybe we could start from reusing the index tuples which point to
invisible tuples? The index is not MVCC anyway, so maybe it is easier
to do in-place replacement there?

This probably has the same obstacles which have prevented us from
removing those in the first place (removing instead of marking as
invisible). Does it cause some locking issues? Or does it go against
some other constraints of our index lookups?

I think that just setting the invisible bit in an index leaf node causes
nearly as much disk IO as removing the node.

If we could delete/reuse old index tuples, it would solve a sizable
chunk of index-growth problem, especially for cases where referenced key
value does not change.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 19:05:49
Message-ID: 1151175950.3884.31.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On one fine day, Fri, 2006-06-23 at 17:27, Bruce Momjian wrote:
> Jonah H. Harris wrote:
> > On 6/23/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> > > What I see in this discussion is a huge amount of "the grass must be
> > > greener on the other side" syndrome, and hardly any recognition that
> > > every technique has its downsides and complications.
> >
> > I'm being totally objective. I don't think we should abandon
> > PostgreSQL's overall design at all, because we do perform INSERTs and
> > DELETEs much better than most systems. However, I've looked at many
> > systems and how they implement UPDATE so that it is a scalable
> > operation. Sure, there are costs and benefits to each implementation,
> > but I think we have some pretty brilliant people in this community and
> > can come up with an elegant design for scalable UPDATEs.
>
> I think the UPDATE case is similar to the bitmap index scan or perhaps
> bitmap indexes on disk --- there are cases we know can not be handled
> well by our existing code, so we have added (or might add) these
> features to try to address those difficult cases.

Not really. Bitmap index scan and bitmap indexes are both new additions
working well within the existing framework.

While the problem of slowdown on frequent updates is real, the suggested
fix is just plain wrong, as it is based on someone's faulty assumption
about how index lookups work, and a very much simplified view of how
different parts of the system work to implement MVCC.

The original fix he "suggests" was aimed at that imagined behaviour and
thus ignored all the real problems of such a change.

All the subsequent suggestions were variations of the first ones, and
failed to address or even research any of the problems brought up.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 19:09:46
Message-ID: 449D8DFA.7010700@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/24/2006 9:23 AM, Mark Woodward wrote:

>> On Sat, 24 Jun 2006, Mark Woodward wrote:
>>
>>> I'm probably mistaken, but aren't there already forward references in
>>> tuples to later versions? If so, I'm only suggesting reversing the order
>>> and referencing the latest version.
>>
>> I thought I understood your idea, but now you lost me again. I thought
>> what you want is that the older heap tuple has a pointer to the
>> newer one. Which it already has: it's called t_ctid.
>
> Perfect!
>>
>> Can you try to explain more carefully how the whole thing would work?
>> What would an index tuple point to? What pointers would a heap tuple
>> have? What would an index scan do to find the row version it's interested
>> in? What exactly would an update do?
>
>
> Since we already allocate space for some notion of a linked list, all
> I'm suggesting is that we reverse the order, sort of. Currently it looks
> like this:
>
> ver001->ver002->ver003->...-verN
>
> That's what t_ctid does now, right? Well, that's sort of stupid. Why not
> have it do this:
>
> ver001->verN->...->ver003->ver002->|
> ^---------------------------------/
>
> This will speed up almost *all* queries when there are more than two
> versions of rows.
>
> OK, here is the behavior of an update:
> (1) Find the latest version of the row
> (2) Duplicate row and modify as per plan
> (3) Set the t_ctid of the new row to the last "latest"
> (4) Set the t_ctid of the first row to that of the new row
> (5) Attempt to index the row
> (6) If the first version of the row (ver001) is already in the index, don't
> modify the index; otherwise, add the new version (just as before)
>
> When you vacuum, simply make the latest version (verN) the key row (ver001).

This isn't done "simply". Currently, vacuum collects a trivial array of
ctids it is removing and every now and then does a bulk remove of the
index tuples pointing to them. Now let's consider a table with two
indexed columns with the following row versions resulting from an insert
and 3 updates to that same row:

v1: a,b
v2: a,c
v3: a,d
v4: b,d

In your new scheme, there would be two index tuples for column 1 (one
pointing to v1, one pointing to v4) and 3 index tuples for column 2 (one
for each different value pointing to v1, v2 and v3). Did I get that
right so far?

If vacuum now can remove v1, it has to update index 1 to point to v2 and
remove the pointer to v1 from index 2. If it can remove v1 and v2, it
has to update index 1 to point to v3 and remove v1 and v2 from index 2.
If it can remove v1, v2 and v3 it must delete the index 1 tuple pointing
to v1, delete the index 2 entries pointing to v1 and v2 and update the
index 2 entry for v3 to point to v4. Figuring out which index tuples to
remove and which ones to update can only be done by comparing each and
every indexed column's old and new values. To do so, vacuum will have to
fetch all the row versions, which can be scattered all over the place,
with all possible locking issues including but not limited to deadlocks.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Mark Woodward <pgsql(at)mohawksoft(dot)com>, Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 19:44:41
Message-ID: 449D9629.5050608@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/22/2006 2:37 PM, Alvaro Herrera wrote:

> Adding back pgsql-hackers.
>
> Mark Woodward wrote:
>> > Mark Woodward wrote:
>> >
>> >> Hmm, OK, then the problem is more serious than I suspected.
>> >> This means that every index on a row has to be updated on every
>> >> transaction that modifies that row. Is that correct?
>> >
>> > Add an index entry, yes.
>> >
>> >> I am attaching some code that shows the problem with regard to
>> >> applications such as web server session management, when run, each
>> >> second
>> >> the system can handle fewer and fewer connections. Here is a brief
>> >> output:
>> >> [...]
>> >> There has to be a more linear way of handling this scenario.
>> >
>> > So vacuum the table often.
>>
>> That fixes the symptom, not the problem. The problem is performance
>> steadily degrades over time.
>
> No, you got it backwards. The performance degradation is the symptom.
> The problem is that there are too many dead tuples in the table. There
> is one way to solve that problem -- remove them, which is done by
> running vacuum.

Precisely.

> There are some problems with vacuum itself, that I agree with. For
> example it would be good if a long-running vacuum wouldn't affect a
> vacuum running in another table because of the long-running transaction
> effect it has. It would be good if vacuum could be run partially over a
> table. It would be good if there was a way to speed up vacuum by using
> a dead space map or something.

It would be good if vacuum wouldn't waste time on blocks that don't have
any possible work in them. Vacuum has two main purposes. A) remove dead
rows and B) freeze xids. Once a block has zero deleted rows and all xids
are frozen, there is nothing to do with this block and vacuum should
skip it until a transaction updates that block.

This requires 2 bits per block, which is 32K per 1G segment of a heap.
Clearing the bits is done when the block is marked dirty. This way
vacuum would not waste any time and IO on huge, slowly changing tables.
That part, sequentially scanning huge tables that didn't change much, is
what keeps us from running vacuum every couple of seconds.
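
The arithmetic and bit-twiddling are simple; a sketch (two bits per 8K
block, meaning "no dead rows" and "all xids frozen", both cleared when
the block is dirtied):

#include <stdint.h>

#define BLCKSZ     8192
#define SEG_BLOCKS (1024u * 1024u * 1024u / BLCKSZ)   /* 131072 blocks/1G */

/* 2 bits per block, 4 blocks per byte: 32768 bytes = 32K per segment. */
static uint8_t vacuum_map[SEG_BLOCKS / 4];

static void block_dirtied(uint32_t blk)    /* clear both bits */
{
    vacuum_map[blk / 4] &= (uint8_t)~(3u << ((blk % 4) * 2));
}

static void block_done(uint32_t blk)       /* no dead rows, all frozen */
{
    vacuum_map[blk / 4] |= (uint8_t)(3u << ((blk % 4) * 2));
}

static int vacuum_can_skip(uint32_t blk)
{
    return ((vacuum_map[blk / 4] >> ((blk % 4) * 2)) & 3u) == 3u;
}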

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 20:10:15
Message-ID: 1151179816.3884.43.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On one fine day, Sat, 2006-06-24 at 15:44, Jan Wieck wrote:

> >> That fixes the symptom, not the problem. The problem is performance
> >> steadily degrades over time.
> >
> > No, you got it backwards. The performance degradation is the symptom.
> > The problem is that there are too many dead tuples in the table. There
> > is one way to solve that problem -- remove them, which is done by
> > running vacuum.
>
> Precisely.
>
> > There are some problems with vacuum itself, that I agree with. For
> > example it would be good if a long-running vacuum wouldn't affect a
> > vacuum running in another table because of the long-running transaction
> > effect it has. It would be good if vacuum could be run partially over a
> > table. It would be good if there was a way to speed up vacuum by using
> > a dead space map or something.
>
> It would be good if vacuum wouldn't waste time on blocks that don't have
> any possible work in them. Vacuum has two main purposes. A) remove dead
> rows and B) freeze xids. Once a block has zero deleted rows and all xids
> are frozen, there is nothing to do with this block and vacuum should
> skip it until a transaction updates that block.
>
> This requires 2 bits per block, which is 32K per 1G segment of a heap.
> Clearing the bits is done when the block is marked dirty. This way
> vacuum would not waste any time and IO on huge, slowly changing tables.
> That part, sequentially scanning huge tables that didn't change much, is
> what keeps us from running vacuum every couple of seconds.

Seems like a plan.

Still, there is another problem which is not solved by the map approach
alone, at least with the current implementation of vacuum.

This is the fact that we need to do a full scan over the index(es) to
clean up pointers to removed tuples. And huge tables tend to have huge
indexes.

As indexes have no MVCC info inside them, it may be possible to start
reusing index entries pointing to rows that are invisible to all running
transactions. Currently we just mark these index entries as dead, but
maybe there is a way to reuse them. This could solve the index bloat
problem for many cases.

Another possible solution for indexes with mostly dead pointers is doing
a reindex, but this will become possible only after we have implemented
a concurrent, non-blocking CREATE INDEX.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Hannu Krosing <hannu(at)skype(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 20:52:23
Message-ID: 20060624205223.GA32408@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Sat, Jun 24, 2006 at 10:04:43PM +0300, Hannu Krosing wrote:
> Maybe we could start from reusing the index tuples which point to
> invisible tuples ? The index is not MVCC anyway, so maybe it is easier
> to do in-place replacement there ?
>
> This probably has the same obstacles which have prevented us from
> removing those in the first place (removing instead of marking as
> invisible). Does it cause some locking issues ? Or does it go against
> some other constraints of our index lookups ?

The problem with updating an index is that you have to do it in a way
that concurrent scans (forwards and backwards) don't get confused
because the tuple they stopped on vanished.

AIUI, the current approach is two-step. The first time round you mark
it deleted but don't actually delete it. Thus, any scan currently
stopped on that tuple won't have a problem. Sometime later you remove
the actual tuple, once you know there's no scan stopped on it (because
no scan will deliberately stop on a deleted tuple).

I forget the actual locking steps that ensure this though.
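
As a toy model of the two steps (ignoring the locking entirely; the
names are made up):

#include <stdbool.h>

typedef struct { bool dead; bool removed; } IdxEntry;

/* Step 1: mark only, so a scan currently paused on this entry
   still finds it in place. */
static void mark_dead(IdxEntry *e) { e->dead = true; }

/* Step 2, later: physically remove, which is safe because no scan
   will have deliberately stopped on an entry already marked dead. */
static void remove_entry(IdxEntry *e, bool no_scan_stopped_here)
{
    if (e->dead && no_scan_stopped_here)
        e->removed = true;
}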

> If we could delete/reuse old index tuples, it would solve a sizable
> chunk of index-growth problem, especially for cases where referenced key
> value does not change.

I think we recently changed the code to always scan an index a page at
a time so maybe scans no longer stop in the middle of a page anymore...
Or perhaps that was VACUUM only.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


From: Daniel Xavier de Sousa <danielucg(at)yahoo(dot)com(dot)br>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Hannu Krosing <hannu(at)skype(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Buffer for inner and outer table
Date: 2006-06-24 21:41:11
Message-ID: 20060624214111.70577.qmail@web52402.mail.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi,

Please, can somebody tell me where Postgres controls the buffer for the
inner and outer table when it executes a nested loop join? I would like
to know how to change the size of this buffer and see all statistics
about it.

Another question: how can I see the page accesses (in RAM and on disk)
when it executes a nested loop join?

Thank you,
Daniel Xavier de Sousa





From: Christopher Browne <cbbrowne(at)acm(dot)org>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 21:57:20
Message-ID: 87ejxeb4jj.fsf@wolfe.cbbrowne.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martha Stewart called it a Good Thing when JanWieck(at)Yahoo(dot)com (Jan Wieck) wrote:
> On 6/22/2006 2:37 PM, Alvaro Herrera wrote:
>
>> Adding back pgsql-hackers.
>> Mark Woodward wrote:
>>> > Mark Woodward wrote:
>>> >
>>> >> Hmm, OK, then the problem is more serious than I suspected.
>>> >> This means that every index on a row has to be updated on every
>>> >> transaction that modifies that row. Is that correct?
>>> >
>>> > Add an index entry, yes.
>>> >
>>> >> I am attaching some code that shows the problem with regard to
>>> >> applications such as web server session management, when run, each
>>> >> second
>>> >> the system can handle fewer and fewer connections. Here is a brief
>>> >> output:
>>> >> [...]
>>> >> There has to be a more linear way of handling this scenario.
>>> >
>>> > So vacuum the table often.
>>> That fixes the symptom, not the problem. The problem is performance
>>> steadily degrades over time.
>> No, you got it backwards. The performance degradation is the
>> symptom.
>> The problem is that there are too many dead tuples in the table. There
>> is one way to solve that problem -- remove them, which is done by
>> running vacuum.
>
> Precisely.
>
>> There are some problems with vacuum itself, that I agree with. For
>> example it would be good if a long-running vacuum wouldn't affect a
>> vacuum running in another table because of the long-running transaction
>> effect it has. It would be good if vacuum could be run partially over a
>> table. It would be good if there was a way to speed up vacuum by using
>> a dead space map or something.
>
> It would be good if vacuum wouldn't waste time on blocks that don't
> have any possible work in them. Vacuum has two main purposes. A)
> remove dead rows and B) freeze xids. Once a block has zero deleted
> rows and all xids are frozen, there is nothing to do with this block
> and vacuum should skip it until a transaction updates that block.
>
> This requires 2 bits per block, which is 32K per 1G segment of a
> heap. Clearing the bits is done when the block is marked dirty. This
> way vacuum would not waste any time and IO on huge, slowly changing
> tables. That part, sequentially scanning huge tables that didn't
> change much, is what keeps us from running vacuum every couple of
> seconds.

This is, in effect, the "VACUUM Space Map."

I see one unfortunate thing about that representation: it would in
effect require that non-frozen pages be kept in the VSM for potentially
a long time.

Based on *present* VACUUM strategy, at least.

Would it not be the case, here, that any time a page could be
"frozen," it would have to be? In effect, we are always trying to run
VACUUM FREEZE?
--
output = ("cbbrowne" "@" "gmail.com")
http://cbbrowne.com/info/finances.html
Rules of the Evil Overlord #72. "If all the heroes are standing
together around a strange device and begin to taunt me, I will pull
out a conventional weapon instead of using my unstoppable superweapon
on them." <http://www.eviloverlord.com/>


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Daniel Xavier de Sousa <danielucg(at)yahoo(dot)com(dot)br>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Buffer for inner and outer table
Date: 2006-06-24 22:05:45
Message-ID: 20060624220545.GB5316@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Daniel Xavier de Sousa wrote:

> Can somebody tell me where Postgres controls the buffer for the
> inner and outer table when it executes a nested loop join? I would
> like to know how to change the size of this buffer and see all
> statistics about it

There is no such buffer. Buffers used in scans are kept in
shared_buffers, just like for everything else.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Rick Gigger <rick(at)alpinenetworking(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 23:13:35
Message-ID: 20060624231334.GC5316@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Mark Woodward wrote:

> The update behavior of PostgreSQL is probably the *last* serious issue.
> Debate all you want; vacuum mitigates the problem to varying levels, but
> fixing the problem will be a huge win. If the update behavior gets fixed,
> I can't think of a single issue with postgresql that would be a show
> stopper.

Nah, it's just *your* pet peeve. Everyone has theirs. Some people may
share yours, of course. I agree it's a problem, but from there to
saying "it's _the last_ issue" there's a lot of distance.

Your idea of reusing a tuple's self pointer (t_ctid) does not work BTW,
because the self pointer must point to self. In the case where the
pointer does not point to exactly the same tuple, it must point to a
newer version. If you change that invariant, a lot of things break; see
for example heap_get_latest_tid.
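
That invariant is exactly what makes chain-following terminate; in
miniature (toy types, not the real code):

/* A tuple whose t_ctid points to itself is the newest version. */
typedef struct Tup { struct Tup *t_ctid; } Tup;

static Tup *latest_version(Tup *t)
{
    while (t->t_ctid != t)     /* self-pointer means "no successor" */
        t = t->t_ctid;
    return t;
}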

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Hannu Krosing <hannu(at)skype(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-24 23:36:05
Message-ID: 200606242336.k5ONa5b04286@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hannu Krosing wrote:
> On a fine day, Friday, 2006-06-23 at 13:08, Tom Lane wrote:
> > Csaba Nagy <nagy(at)ecircle-ag(dot)com> writes:
> > >> Surprisingly it's mostly WAL traffic; the heap/index pages themselves are
> > >> often not yet synced to disk by the time of vacuum, so no additional traffic
> > >> there. If you had made 5 updates per page and then vacuum it, you
> > >> effectively make 1 extra WAL write, meaning a 20% increase in WAL traffic.
> >
> > > Is this also holding about read traffic ? I thought vacuum will make a
> > > full table scan... for big tables a full table scan is always badly
> > > influencing the performance of the box. If the full table scan would be
> > > avoided, then I wouldn't mind running vacuum in a loop...
> >
> > If you're doing heavy updates of a big table then it's likely to end up
> > visiting most of the table anyway, no? There is talk of keeping a map
> > of dirty pages, but I think it'd be a win for infrequently-updated
> > tables, not ones that need constant vacuuming.
> >
> > I think a lot of our problems in this area could be solved with fairly
> > straightforward tuning efforts on the existing autovacuum
> > infrastructure. In particular, someone should be looking into
> > recommendable default vacuum-cost-delay settings so that a background
> > vacuum doesn't affect performance too much.
>
> One thing that would help updates quite a lot in some scenarios is
> keeping the pages only partially-filled, so that most updates could keep
> the new version in the same page. I think that has also been discussed
> as an option to vacuum and maybe as part of initial inserts. Maybe some
> of it even ended up as a todo item.

We have a patch in the queue for index fillfactor which will be in 8.2.
I am also hoping that frequently updated rows will migrate out to the
empty pages.
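
As a rough sketch of what a fillfactor buys here, assuming an invented
page_has_room() helper rather than anything from the actual patch:
initial inserts stop filling a page early, so a later UPDATE can
usually keep the new row version on the same page.

    #include <stdbool.h>
    #include <stdio.h>

    #define BLCKSZ 8192    /* PostgreSQL's default page size */

    /* With fillfactor 70, inserts stop at 70% of the page; the held-back
     * 30% is room for new versions created by same-page UPDATEs. */
    static bool page_has_room(int bytes_used, int tuple_len, int fillfactor)
    {
        int limit = (BLCKSZ * fillfactor) / 100;
        return bytes_used + tuple_len <= limit;
    }

    int main(void)
    {
        /* a page 5600 bytes full, fillfactor 70 => limit 5734 bytes */
        printf("%s\n", page_has_room(5600, 200, 70)
                           ? "stays on page"
                           : "goes to another page");
        return 0;
    }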

> > Another problem with the
> > current autovac infrastructure is that it doesn't respond very well to
> > the case where there are individual tables that need constant attention
> > as well as many that don't. If you have N databases then you can visit
> > a particular table at most once every N*autovacuum_naptime seconds, and
> > *every* table in the entire cluster gets reconsidered at that same rate.
> > I'm not sure if we need the ability to have multiple autovac daemons
> > running at the same time,
>
> My patch enabling effective continuous vacuum of fast-update tables,
> while still being able to vacuum huge slowly changing ones, is still not
> applied. Without that patch there is no reason to vacuum the small and
> fast-changing tables while vacuum on bigger tables is running, as it
> won't clean out dead tuples anyway.

I think it will be applied, but I am looking for someone else to eyeball
it since Tom has some concerns.

> > but we definitely could use something with a
> > more flexible table-visiting pattern. Perhaps it would be enough to
> > look through the per-table stats for each database before selecting the
> > database to autovacuum in each cycle, instead of going by "least
> > recently autovacuumed".
> >
> > Bottom line: there's still lots of low-hanging fruit. Why are people
> > feeling that we need to abandon or massively complicate our basic
> > architecture to make progress?
>
> Maybe we could start from reusing the index tuples which point to
> invisible tuples ? The index is not MVCC anyway, so maybe it is easier
> to do in-place replacement there ?
>
> This probably has the same obstacles which have prevented us from
> removing those in the first place (removing instead of marking as
> invisible). Does it cause some locking issues ? Or does it go against
> some other constraints of our index lookups ?
>
> I think that just setting the invisible bit in an index leaf node causes
> nearly as much disk io as removing the node.
>
> If we could delete/reuse old index tuples, it would solve a sizable
> chunk of index-growth problem, especially for cases where referenced key
> value does not change.

I think heap _and_ index reuse is the only useful direction. Index or
heap reuse alone seems too marginal for the added complexity.
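
A minimal sketch of the index-only reuse idea, with invented types and
names rather than PostgreSQL code: if a dead index entry already
carries the same key, repoint it at the new heap tuple instead of
inserting a second entry.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct { int block, offset; } Tid;

    typedef struct {
        char key[16];   /* the indexed value */
        Tid  heap_tid;  /* heap location this entry points at */
        bool dead;      /* known to reference only invisible tuples */
    } IndexEntry;

    /* The "easy" half of reuse: recycle a dead entry in place when the
     * key is unchanged. Nothing extra must be remembered or cleaned out
     * later, which is what makes this simpler than heap reuse. */
    static bool try_reuse_entry(IndexEntry *e, const char *key, Tid new_tid)
    {
        if (e->dead && strcmp(e->key, key) == 0) {
            e->heap_tid = new_tid;
            e->dead = false;
            return true;        /* reused in place, no new entry */
        }
        return false;           /* caller does a normal index insert */
    }

    int main(void)
    {
        IndexEntry e = { "42", {3, 7}, true };
        printf("%s\n", try_reuse_entry(&e, "42", (Tid){5, 1})
                           ? "entry recycled"
                           : "fresh insert needed");
        return 0;
    }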

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-25 06:49:32
Message-ID: 1151218173.3884.47.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On a fine day, Saturday, 2006-06-24 at 19:36, Bruce Momjian wrote:
> Hannu Krosing wrote:
> > On a fine day, Friday, 2006-06-23 at 13:08, Tom Lane wrote:

> > >
> > > Bottom line: there's still lots of low-hanging fruit. Why are people
> > > feeling that we need to abandon or massively complicate our basic
> > > architecture to make progress?
> >
> > Maybe we could start from reusing the index tuples which point to
> > invisible tuples ? The index is not MVCC anyway, so maybe it is easier
> > to do in-place replacement there ?
> >
> > This probably has the same obstacles which have prevented us from
> > removing those in the first place (removing instead of marking as
> > invisible). Does it cause some locking issues ? Or does it go against
> > some other constraints of our index lookups ?
> >
> > I think that just setting the invisible bit in an index leaf node causes
> > nearly as much disk io as removing the node.
> >
> > If we could delete/reuse old index tuples, it would solve a sizable
> > chunk of index-growth problem, especially for cases where referenced key
> > value does not change.
>
> I think heap _and_ index reuse is the only useful direction. Index or
> heap reuse alone seems too marginal for the added complexity.

Sure, but index reuse seems a lot easier, as there is nothing additional
to remember or clean out when doing it.

When reusing a heap tuple you have to clean out all index entries
pointing to it.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Jan Wieck" <JanWieck(at)Yahoo(dot)com>
Cc: "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-25 10:52:24
Message-ID: 18595.24.91.171.78.1151232744.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On 6/24/2006 9:23 AM, Mark Woodward wrote:
>
>>> On Sat, 24 Jun 2006, Mark Woodward wrote:
>>>
>>>> I'm probably mistaken, but aren't there already forward references in
>>>> tuples to later versions? If so, I'm only suggesting reversing the
>>>> order and referencing the latest version.
>>>
>>> I thought I understood your idea, but now you lost me again. I thought
>>> what you want is that the older heap tuple has a pointer to the
>>> newer one. Which it already has, it's called t_ctid.
>>
>> Perfect!
>>>
>>> Can you try to explain more carefully how the whole thing would work?
>>> What would an index tuple point to? What pointers would a heap tuple
>>> have? What would an index scan do to find the row version it's
>>> interested
>>> in? What exactly would an update do?
>>
>>
>> Since we already allocate space for some notion of linked list, then all
>> I'm suggesting is we reverse the order, sort of. Currently it looks like
>> this:
>>
>> ver001->ver002->ver003->...->verN
>>
>> That's what t_ctid does now, right? Well, that's sort of stupid. Why not
>> have it do this:
>>
>> ver001->verN->...->ver003->ver002->|
>> ^---------------------------------/
>>
>> This will speed up almost *all* queries when there are more than two
>> versions of a row.
>>
>> OK, here is the behavior of an update:
>> (1) Find the latest version of the row
>> (2) Duplicate row and modify as per plan
>> (3) Set the t_ctid of the new row to the last "latest"
>> (4) Set the t_ctid of the first row to that of the new row
>> (5) Attempt to index the row
>> (6) If the first version of the row (ver001) is already in the index,
>> don't modify the index; otherwise, add the new version (just as before)
>>
>> When you vacuum, simply make the latest version (verN) the key row
>> (ver001).
>
> This isn't done "simply". Currently, vacuum collects a trivial array of
> ctid's it is removing and every now and then does a bulk remove of the
> index tuples pointing to them. Now let's consider a table with two
> indexed columns with the following row versions resulting from an insert
> and 3 updates to that same row:
>
> v1: a,b
> v2: a,c
> v3: a,d
> v4: b,d
>
> In your new scheme, there would be two index tuples for column 1 (one
> pointing to v1, one pointing to v4) and 3 index tuples for column 2 (one
> for each different value pointing to v1, v2 and v3). Did I get that
> right so far?
>
> If vacuum now can remove v1, it has to update index 1 to point to v2 and
> remove the pointer to v1 from index 2. If it can remove v1 and v2, it
> has to update index 1 to point to v3 and remove v1 and v2 from index 2.
> If it can remove v1, v2 and v3 it must delete the index 1 tuple pointing
> to v1, delete the index 2 entries pointing to v1 and v2 and update the
> index 2 entry for v3 to point to v4. Figuring out which index tuples to
> remove and which ones to update can only be done by comparing each and
> every indexed column's old and new values. To do so, vacuum will have to
> fetch all the row versions, which can be scattered all over the place,
> with all possible locking issues including but not limited to deadlocks.

I'm not sure why vacuum can't run similarly to the way it does now.


From: Daniel Xavier de Sousa <danielucg(at)yahoo(dot)com(dot)br>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Buffer for inner and outer table
Date: 2006-06-25 15:06:36
Message-ID: 20060625150636.90296.qmail@web52415.mail.yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hi Álvaro,

Thank you for your answer.
I thought that Postgres could manage the space for the outer and inner
tables, because some articles have shown how this makes a difference.

Alvaro, please: what about the count of pages that Postgres reads
during an inner-join query (in RAM and on disk)? Do you know anything
about that?

Thanks
Daniel

Alvaro Herrera <alvherre(at)commandprompt(dot)com> wrote: Daniel Xavier de Sousa wrote:

> Can somebody tell me where Postgres controls the buffer for the
> inner and outer table when it executes Nest_loop_join? I would like
> to know how to change the size of this buffer and see all statistics
> about it

There is no such buffer. Buffers used in scans are kept in
shared_buffers, just like for everything else.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support




From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Hannu Krosing <hannu(at)skype(dot)net>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-25 16:27:10
Message-ID: 200606251627.k5PGRAU03585@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hannu Krosing wrote:
> > > Maybe we could start from reusing the index tuples which point to
> > > invisible tuples ? The index is not MVCC anyway, so maybe it is easier
> > > to do in-place replacement there ?
> > >
> > > This probably has the same obstacles which have prevented us from
> > > removing those in the first place (removing instead of marking as
> > > invisible). Does it cause some locking issues ? Or does it go against
> > > some other constraints of our index lookups ?
> > >
> > > I think that just setting the invisible bit in an index leaf node causes
> > > nearly as much disk io as removing the node.
> > >
> > > If we could delete/reuse old index tuples, it would solve a sizable
> > > chunk of index-growth problem, especially for cases where referenced key
> > > value does not change.
> >
> > I think heap _and_ index reuse is the only useful direction. Index or
> > heap reuse alone seems too marginal for the added complexity.
>
> Sure, but index reuse seems a lot easier, as there is nothing additional
> to remember or clean out when doing it.

Yes, seems so. TODO added:

* Reuse index tuples that point to heap tuples that are not visible to
anyone?

> When reusing a heap tuple you have to clean out all index entries
> pointing to it.

Well, not for an UPDATE with no key changes that stays on the same page, if we do that.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-25 17:02:05
Message-ID: 449EC18D.2000605@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/25/2006 6:52 AM, Mark Woodward wrote:
>> On 6/24/2006 9:23 AM, Mark Woodward wrote:
>>
>>>> On Sat, 24 Jun 2006, Mark Woodward wrote:
>>>>
>>>>> I'm probably mistaken, but aren't there already forward references in
>>>>> tuples to later versions? If so, I'm only suggesting reversing the
>>>>> order and referencing the latest version.
>>>>
>>>> I thought I understood your idea, but now you lost me again. I thought
>>>> what you want is that the older heap tuple has a pointer to the
>>>> newer one. Which it already has, it's called t_ctid.
>>>
>>> Perfect!
>>>>
>>>> Can you try to explain more carefully how the whole thing would work?
>>>> What would an index tuple point to? What pointers would a heap tuple
>>>> have? What would an index scan do to find the row version it's
>>>> interested
>>>> in? What exactly would an update do?
>>>
>>>
>>> Since we already allocate space for some notion of linked list, then all
>>> I'm suggesting is we reverse the order, sort of. Currently it looks like
>>> this:
>>>
>>> ver001->ver002->ver003->...->verN
>>>
>>> That's what t_ctid does now, right? Well, that's sort of stupid. Why not
>>> have it do this:
>>>
>>> ver001->verN->...->ver003->ver002->|
>>> ^---------------------------------/
>>>
>>> This will speed up almost *all* queries when there are more than two
>>> versions of a row.
>>>
>>> OK, here is the behavior of an update:
>>> (1) Find the latest version of the row
>>> (2) Duplicate row and modify as per plan
>>> (3) Set the t_ctid of the new row to the last "latest"
>>> (4) Set the t_ctid of the first row to that of the new row
>>> (5) Attempt to index the row
>>> (6) If the first version of the row (ver001) is already in the index,
>>> don't modify the index; otherwise, add the new version (just as before)
>>>
>>> When you vacuum, simply make the latest version (verN) the key row
>>> (ver001).
>>
>> This isn't done "simply". Currently, vacuum collects a trivial array of
>> ctid's it is removing and every now and then does a bulk remove of the
>> index tuples pointing to them. Now let's consider a table with two
>> indexed columns with the following row versions resulting from an insert
>> and 3 updates to that same row:
>>
>> v1: a,b
>> v2: a,c
>> v3: a,d
>> v4: b,d
>>
>> In your new scheme, there would be two index tuples for column 1 (one
>> pointing to v1, one pointing to v4) and 3 index tuples for column 2 (one
>> for each different value pointing to v1, v2 and v3). Did I get that
>> right so far?
>>
>> If vacuum now can remove v1, it has to update index 1 to point to v2 and
>> remove the pointer to v1 from index 2. If it can remove v1 and v2, it
>> has to update index 1 to point to v3 and remove v1 and v2 from index 2.
>> If it can remove v1, v2 and v3 it must delete the index 1 tuple pointing
>> to v1, delete the index 2 entries pointing to v1 and v2 and update the
>> index 2 entry for v3 to point to v4. Figuring out which index tuples to
>> remove and which ones to update can only be done by comparing each and
>> every indexed column's old and new values. To do so, vacuum will have to
>> fetch all the row versions, which can be scattered all over the place,
>> with all possible locking issues including but not limited to deadlocks.
>
> I'm not sure why vacuum can't run similarly to the way it does now.

What's that supposed to mean?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-25 17:09:35
Message-ID: 449EC34F.6000004@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/25/2006 12:27 PM, Bruce Momjian wrote:

> Hannu Krosing wrote:
>> > > Maybe we could start from reusing the index tuples which point to
>> > > invisible tuples ? The index is not MVCC anyway, so maybe it is easier
>> > > to do in-place replacement there ?
>> > >
>> > > This probably has the same obstacles which have prevented us from
>> > > removing those in the first place (removing instead of marking as
>> > > invisible). Does it cause some locking issues ? Or does it go against
>> > > some other constraints of our index lookups ?
>> > >
>> > > I think that just setting the invisible bit in an index leaf node causes
>> > > nearly as much disk io as removing the node.
>> > >
>> > > If we could delete/reuse old index tuples, it would solve a sizable
>> > > chunk of index-growth problem, especially for cases where referenced key
>> > > value does not change.
>> >
>> > I think heap _and_ index reuse is the only useful direction. Index or
>> > heap reuse alone seems too marginal for the added complexity.
>>
>> Sure, but index reuse seems a lot easier, as there is nothing additional
>> to remember or clean out when doing it.
>
> Yes, seems so. TODO added:
>
> * Reuse index tuples that point to heap tuples that are not visible to
> anyone?
>
>> When reusing a heap tuple you have to clean out all index entries
>> pointing to it.
>
> Well, not for an UPDATE with no key changes that stays on the same page, if we do that.
>

An update that results in all the same values of every indexed column of
a known deleted invisible tuple. This reused tuple can by definition not
be the one currently updated. So unless it is a table without a primary
key, this assumes that at least 3 versions of the same row exist within
the same block. How likely is that to happen?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Hannu Krosing <hannu(at)skype(dot)net>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-25 17:30:45
Message-ID: 449EC845.1000608@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/24/2006 4:10 PM, Hannu Krosing wrote:

> On a fine day, Saturday, 2006-06-24 at 15:44, Jan Wieck wrote:
>
>> >> That fixes the symptom, not the problem. The problem is performance
>> >> steadily degrades over time.
>> >
>> > No, you got it backwards. The performance degradation is the symptom.
>> > The problem is that there are too many dead tuples in the table. There
>> > is one way to solve that problem -- remove them, which is done by
>> > running vacuum.
>>
>> Precisely.
>>
>> > There are some problems with vacuum itself, that I agree with. For
>> > example it would be good if a long-running vacuum wouldn't affect a
>> > vacuum running in another table because of the long-running transaction
>> > effect it has. It would be good if vacuum could be run partially over a
>> > table. It would be good if there was a way to speed up vacuum by using
>> > a dead space map or something.
>>
>> It would be good if vacuum wouldn't waste time on blocks that don't have
>> any possible work in them. Vacuum has two main purposes. A) remove dead
>> rows and B) freeze xids. Once a block has zero deleted rows and all xids
>> are frozen, there is nothing to do with this block and vacuum should
>> skip it until a transaction updates that block.
>>
>> This requires 2 bits per block, which is 32K per 1G segment of a heap.
>> Clearing the bits is done when the block is marked dirty. This way
>> vacuum would not waste any time and IO on huge slow changing tables.
>> That part, sequentially scanning huge tables that didn't change much is
>> what keeps us from running vacuum every couple of seconds.
>
> Seems like a plan.
>
> Still, there is another problem which is not solved by map approach
> only, at least with current implementation of vacuum.
>
> This is the fact that we need to do full scan over index(es) to clean up
> pointers to removed tuples. And huge tables tend to have huge indexes.

Right, now that you say it I remember why this wasn't as easy as it
sounded at the beginning.

Obviously, short of a full sequential scan, there is no way to find an
index tuple other than doing an index scan. So vacuum would have to
estimate, based on the bitmaps, whether it could be beneficial (huge
table, few vacuumable pages) to actually remove/flag single index tuples
before removing the heap tuple. This can be done in advance of removing
the heap tuple because the index tuples might not be there to begin with.

However, that is a very costly thing to do and not trivial to implement.
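
For concreteness, a sketch of the two-bits-per-block map quoted above,
with invented names; the arithmetic matches the 32K per 1G segment
mentioned there.

    #include <stdint.h>
    #include <stdio.h>

    /* 1 GB segment / 8 KB blocks = 131072 blocks; 2 bits each = 32 KB.
     * Bit 0 set: the block has no dead rows. Bit 1 set: all xids in the
     * block are frozen. Both set: vacuum skips the block until it is
     * dirtied again. */
    #define BLOCKS_PER_SEG 131072

    static uint8_t vmap[BLOCKS_PER_SEG / 4];   /* 32768 bytes */

    static void block_dirtied(uint32_t blk)    /* clear both bits */
    {
        vmap[blk / 4] &= (uint8_t)~(3u << ((blk % 4) * 2));
    }

    static void block_fully_vacuumed(uint32_t blk)  /* set both bits */
    {
        vmap[blk / 4] |= (uint8_t)(3u << ((blk % 4) * 2));
    }

    static int vacuum_can_skip(uint32_t blk)
    {
        return ((vmap[blk / 4] >> ((blk % 4) * 2)) & 3) == 3;
    }

    int main(void)
    {
        block_fully_vacuumed(9);
        block_dirtied(10);
        printf("block 9: %s, block 10: %s\n",
               vacuum_can_skip(9) ? "skip" : "visit",
               vacuum_can_skip(10) ? "skip" : "visit");
        return 0;
    }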

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-25 18:24:02
Message-ID: 200606251824.k5PIO2f21011@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> >> Sure, but index reuse seems a lot easier, as there is nothing additional
> >> to remember or clean out when doing it.
> >
> > Yes, seems so. TODO added:
> >
> > * Reuse index tuples that point to heap tuples that are not visible to
> > anyone?
> >
> >> When reusing a heap tuple you have to clean out all index entries
> >> pointing to it.
> >
> > Well, not for an UPDATE with no key changes that stays on the same page, if we do that.
> >
>
> An update that results in all the same values of every indexed column of
> a known deleted invisible tuple. This reused tuple can by definition not
> be the one currently updated. So unless it is a table without a primary
> key, this assumes that at least 3 versions of the same row exist within
> the same block. How likely is that to happen?

Good question. You take the current tuple, and make another one on the
same page. Later, an update can reuse the original tuple if it is no
longer visible to anyone (by changing the item id), so you only need two
tuples, not three. My hope is that a repeated update would eventually
move to a page that has enough free space for two (or more) versions.

Does that help explain it?

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-25 19:21:27
Message-ID: 449EE237.2030009@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/25/2006 2:24 PM, Bruce Momjian wrote:

> Jan Wieck wrote:
>> >> Sure, but index reuse seems a lot easier, as there is nothing additional
>> >> to remember or clean out when doing it.
>> >
>> > Yes, seems so. TODO added:
>> >
>> > * Reuse index tuples that point to heap tuples that are not visible to
>> > anyone?
>> >
>> >> When reusing a heap tuple you have to clean out all index entries
>> >> pointing to it.
>> >
>> > Well, not for an UPDATE with no key changes that stays on the same page, if we do that.
>> >
>>
>> An update that results in all the same values of every indexed column of
>> a known deleted invisible tuple. This reused tuple can by definition not
>> be the one currently updated. So unless it is a table without a primary
>> key, this assumes that at least 3 versions of the same row exist within
>> the same block. How likely is that to happen?
>
> Good question. You take the current tuple, and make another one on the
> same page. Later, an update can reuse the original tuple if it is no
> longer visible to anyone (by changing the item id), so you only need two
> tuples, not three. My hope is that a repeated update would eventually
> move to a page that has enough free space for two (or more) versions.
>
> Does that help explain it?
>

That's exactly what I meant. You need space for 3 or more tuple versions
within one page and the luck that one of them is invisible at the time
of the update. I don't know how likely or unlikely this is in reality,
but it doesn't sound very promising to me so far.

Another problem with this is that even if you find such a row, it doesn't
spare you the index traversal. The dead row whose item id you're reusing
might have resulted from an insert that aborted or crashed before it
finished creating all index entries. Or some of its index entries might
already be flagged known dead, and you better reset those flags.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-25 19:42:01
Message-ID: 1151264521.4096.14.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On a fine day, Sunday, 2006-06-25 at 14:24, Bruce Momjian wrote:
> Jan Wieck wrote:
> > >> Sure, but index reuse seems a lot easier, as there is nothing additional
> > >> to remember or clean out when doing it.
> > >
> > > Yes, seems so. TODO added:
> > >
> > > * Reuse index tuples that point to heap tuples that are not visible to
> > > anyone?
> > >
> > >> When reusing a heap tuple you have to clean out all index entries
> > >> pointing to it.
> > >
> > > Well, not for an UPDATE with no key changes that stays on the same page, if we do that.
> > >
> >
> > An update that results in all the same values of every indexed column of
> > a known deleted invisible tuple. This reused tuple can by definition not
> > be the one currently updated. So unless it is a table without a primary
> > key, this assumes that at least 3 versions of the same row exist within
> > the same block. How likely is that to happen?
>
> Good question. You take the current tuple, and make another one on the
> same page. Later, an update can reuse the original tuple if it is no
> longer visible to anyone (by changing the item id), so you only need two
> tuples, not three. My hope is that a repeated update would eventually
> move to a page that has enough free space for two (or more) versions.

I can confirm that this is exactly what happens when running an
update-heavy load with frequent vacuums. Eventually most rows get their
own db pages or share the same page with 2-3 rows. And there will be
lots of unused (filled up, or cleaned and not yet reused) pages.

The overall performance could be made a little better by tuning the
system to not put more than N new rows on the same page at initial
insert or when the row moves to a new page during an update. Currently
several new rows are initially put on the same page and then move around
during repeated updates until they slow(ish)ly claim their own page.

> Does that help explain it?
>
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-25 19:51:28
Message-ID: 1151265089.4096.20.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On a fine day, Sunday, 2006-06-25 at 06:52, Mark Woodward wrote:

> I'm not sure why vacuum can't run similarly to the way it does now.

What do you mean ?

Currently vacuum runs a three-step process:

1) run a full scan over the heap and collect all dead tuple ctids from
the heap

2) run full scans over all indexes of the relation and remove any
pointers pointing to dead tuples

3) run another full scan over the heap and remove the tuples in the
list collected in step 1

There are no modifications done to live tuples (OK, they *may* get
frozen if they are above a certain age, but that is not relevant to the
current discussion).
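
Schematically, the three passes look like this; a self-contained toy in
C where arrays stand in for the heap and a single index, and none of
the names are actual PostgreSQL code:

    #include <stdbool.h>
    #include <stdio.h>

    #define NTUPLES 8

    typedef struct { int id; bool dead; bool present; } HeapTuple;
    typedef struct { int key; int heap_slot; bool present; } IndexTuple;

    static HeapTuple heap[NTUPLES];
    static IndexTuple idx[NTUPLES];

    static void lazy_vacuum(void)
    {
        int dead_slots[NTUPLES], ndead = 0;

        /* pass 1: full heap scan, collect the slots of dead tuples */
        for (int i = 0; i < NTUPLES; i++)
            if (heap[i].present && heap[i].dead)
                dead_slots[ndead++] = i;

        /* pass 2: full scan of each index, drop pointers to those slots */
        for (int i = 0; i < NTUPLES; i++)
            for (int d = 0; d < ndead; d++)
                if (idx[i].present && idx[i].heap_slot == dead_slots[d])
                    idx[i].present = false;

        /* pass 3: second heap pass, reclaim the collected slots;
         * live tuples are never modified */
        for (int d = 0; d < ndead; d++)
            heap[dead_slots[d]].present = false;
    }

    int main(void)
    {
        for (int i = 0; i < NTUPLES; i++) {
            heap[i] = (HeapTuple){ i, i % 2 == 0, true }; /* even: dead */
            idx[i] = (IndexTuple){ i, i, true };
        }
        lazy_vacuum();
        for (int i = 0; i < NTUPLES; i++)
            if (heap[i].present)
                printf("tuple %d kept, index entry %s\n",
                       heap[i].id, idx[i].present ? "kept" : "dropped");
        return 0;
    }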

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-25 21:18:10
Message-ID: 200606252118.k5PLIAj19750@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> >> An update that results in all the same values of every indexed column of
> >> a known deleted invisible tuple. This reused tuple can by definition not
> >> be the one currently updated. So unless it is a table without a primary
> >> key, this assumes that at least 3 versions of the same row exist within
> >> the same block. How likely is that to happen?
> >
> > Good question. You take the current tuple, and make another one on the
> > same page. Later, an update can reuse the original tuple if it is no
> > longer visible to anyone (by changing the item id), so you only need two
> > tuples, not three. My hope is that a repeated update would eventually
> > move to a page that has enough free space for two (or more) versions.
> >
> > Does that help explain it?
> >
>
> That's exactly what I meant. You need space for 3 or more tuple versions
> within one page and the luck that one of them is invisible at the time
> of the update. I don't know how likely or unlikely this is in reality,
> but it doesn't sound very promising to me so far.

Why three? I explained using only two heap tuples:

[item1]...[tuple1]

becomes on UPDATE:
---------->
[item1]...[tuple1][tuple2]
----->

on another UPDATE, if tuple1 is no longer visible:

------------------>
[item1]...[tuple1][tuple2]
<------

> Another problem with this is that even if you find such a row, it doesn't
> spare you the index traversal. The dead row whose item id you're reusing
> might have resulted from an insert that aborted or crashed before it
> finished creating all index entries. Or some of its index entries might
> already be flagged known dead, and you better reset those flags.

You can only reuse heap rows that were created and expired by committed
transactions. In fact, you can only UPDATE a row that was created by a
committed transaction. You cannot _reuse_ any row, but only a row that
is being UPDATEd. Also, it cannot be known dead because we are in
the process of updating it.

I am thinking my idea was not fully understood. Hopefully this email
helps.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Hannu Krosing <hannu(at)skype(dot)net>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-25 21:19:34
Message-ID: 200606252119.k5PLJYN20108@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hannu Krosing wrote:
> On a fine day, Sunday, 2006-06-25 at 14:24, Bruce Momjian wrote:
> > Jan Wieck wrote:
> > > >> Sure, but index reuse seems a lot easier, as there is nothing additional
> > > >> to remember or clean out when doing it.
> > > >
> > > > Yes, seems so. TODO added:
> > > >
> > > > * Reuse index tuples that point to heap tuples that are not visible to
> > > > anyone?
> > > >
> > > >> When reusing a heap tuple you have to clean out all index entries
> > > >> pointing to it.
> > > >
> > > > Well, not for an UPDATE with no key changes that stays on the same page, if we do that.
> > > >
> > >
> > > An update that results in all the same values of every indexed column of
> > > a known deleted invisible tuple. This reused tuple can by definition not
> > > be the one currently updated. So unless it is a table without a primary
> > > key, this assumes that at least 3 versions of the same row exist within
> > > the same block. How likely is that to happen?
> >
> > Good question. You take the current tuple, and make another one on the
> > same page. Later, an update can reuse the original tuple if it is no
> > longer visible to anyone (by changing the item id), so you only need two
> > tuples, not three. My hope is that a repeated update would eventually
> > > move to a page that has enough free space for two (or more) versions.
>
> I can confirm that this is exactly what happens when running an
> update-heavy load with frequent vacuums. Eventually most rows get their
> own db pages or share the same page with 2-3 rows. And there will be
> lots of unused (filled up, or cleaned and not yet reused) pages.

Right, that was my guess because heavily updated rows start to move
around in the table, and because UPDATE tries to stay on the same page,
once the row hits a mostly-empty page, it stays there.

> The overall performance could be made a little better by tuning the
> system to not put more than N new rows on the same page at initial
> insert or when the row moves to a new page during an update. Currently
> several new rows are initially put on the same page and then move around
> during repeated updates until they slow(ish)ly claim their own page.

We have a fillfactor patch that will be in 8.2.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 02:08:50
Message-ID: 449F41B2.2000906@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/25/2006 5:18 PM, Bruce Momjian wrote:

> Jan Wieck wrote:
>> >> An update that results in all the same values of every indexed column of
>> >> a known deleted invisible tuple. This reused tuple can by definition not
>> >> be the one currently updated. So unless it is a table without a primary
>> >> key, this assumes that at least 3 versions of the same row exist within
>> >> the same block. How likely is that to happen?
>> >
>> > Good question. You take the current tuple, and make another one on the
>> > same page. Later, an update can reuse the original tuple if it is no
>> > longer visible to anyone (by changing the item id), so you only need two
>> > tuples, not three. My hope is that a repeated update would eventually
>> > move to a page that has enough free space for two (or more) versions.
>> >
>> > Does that help explain it?
>> >
>>
>> That's exactly what I meant. You need space for 3 or more tuple versions
>> within one page and the luck that one of them is invisible at the time
>> of the update. I don't know how likely or unlikely this is in reality,
>> but it doesn't sound very promising to me so far.
>
> Why three? I explained using only two heap tuples:

For some reason I counted in the new tuple ... sorry about that. Yes, it can
work with two tuples.

>
> [item1]...[tuple1]
>
> becomes on UPDATE:
> ---------->
> [item1]...[tuple1][tuple2]
> ----->
>
> on another UPDATE, if tuple1 is no longer visible:
>
> ------------------>
> [item1]...[tuple1][tuple2]
> <------
>
>> Another problem with this is that even if you find such a row, it doesn't
>> spare you the index traversal. The dead row whose item id you're reusing
>> might have resulted from an insert that aborted or crashed before it
>> finished creating all index entries. Or some of its index entries might
>> already be flagged known dead, and you better reset those flags.
>
> You can only reuse heap rows that were created and expired by committed
> transactions. In fact, you can only UPDATE a row that was created by a
> committed transaction. You cannot _reuse_ any row, but only a row that
> is being UPDATEd. Also, it cannot be known dead because we are in
> the process of updating it.

Now you lost me. What do you mean "a row that is being UPDATEd"? The row
(version) being UPDATEd right now cannot be expired, or why would you
update that one? And if your transaction rolls back later, the row you
update right now must be the one surviving.

Any row that was created by a committed transaction does indeed have all
the index entries created. But if it is deleted and expired, that means
that the transaction that stamped xmax has committed and is outside of
every existing snapshot. You can only reuse a slot that is used by a
tuple that satisfies the vacuum snapshot. And a tuple that satisfies
that snapshot potentially has index entries flagged as known dead.
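
A sketch of that reuse test with invented types; this is only the shape
of the check, not PostgreSQL's actual visibility code:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t Xid;

    typedef struct {
        Xid  xmin, xmax;        /* creating / deleting transaction ids */
        bool xmin_committed;
        bool xmax_committed;
    } TupleMeta;

    /* A slot is safely reusable only if the deleting transaction
     * committed before the oldest xid any live snapshot can still see,
     * i.e. the tuple is outside of every existing snapshot. */
    static bool slot_reusable(const TupleMeta *t, Xid oldest_snapshot_xid)
    {
        return t->xmin_committed
            && t->xmax != 0
            && t->xmax_committed
            && t->xmax < oldest_snapshot_xid;
    }

    int main(void)
    {
        TupleMeta old = { 100, 150, true, true };
        return slot_reusable(&old, 200) ? 0 : 1;   /* exit 0: reusable */
    }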

> I am thinking my idea was not fully understood. Hopefully this email
> helps.

I must be missing something because I still don't see how it can work.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 02:12:01
Message-ID: 200606260212.k5Q2C1S10045@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> > [item1]...[tuple1]
> >
> > becomes on UPDATE:
> > ---------->
> > [item1]...[tuple1][tuple2]
> > ----->
> >
> > on another UPDATE, if tuple1 is no longer visible:
> >
> > ------------------>
> > [item1]...[tuple1][tuple2]
> > <------
> >
> >> Another problem with this is that even if you find such a row, it doesn't
> >> spare you the index traversal. The dead row whose item id you're reusing
> >> might have resulted from an insert that aborted or crashed before it
> >> finished creating all index entries. Or some of its index entries might
> >> already be flagged known dead, and you better reset those flags.
> >
> > You can only reuse heap rows that were created and expired by committed
> > transactions. In fact, you can only UPDATE a row that was created by a
> > committed transaction. You cannot _reuse_ any row, but only a row that
> > is being UPDATEd. Also, it cannot be known dead because we are in
> > the process of updating it.
>
> Now you lost me. What do you mean "a row that is being UPDATEd"? The row
> (version) being UPDATEd right now cannot be expired, or why would you
> update that one? And if your transaction rolls back later, the row you
> update right now must be the one surviving.

It can only be a non-visible version of the row earlier in the UPDATE
chain, not the actual one being updated.

> Any row that was created by a committed transaction does indeed have all
> the index entries created. But if it is deleted and expired, that means
> that the transaction that stamped xmax has committed and is outside of
> every existing snapshot. You can only reuse a slot that is used by a
> tuple that satisfies the vacuum snapshot. And a tuple that satisfies
> that snapshot has potentially index entries flagged known dead.

When you are using the update chaining, you can't mark that index row as
dead because it actually points to more than one row on the page, some
are non-visible, some are visible.

> > I am thinking my idea was not fully understood. Hopefully this email
> > helps.
>
> I must be missing something because I still don't see how it can work.

I just posted pseudo-code. Hope that helps.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Jan Wieck <JanWieck(at)Yahoo(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 05:17:29
Message-ID: 449F6DE9.4070101@Yahoo.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On 6/25/2006 10:12 PM, Bruce Momjian wrote:
> When you are using the update chaining, you can't mark that index row as
> dead because it actually points to more than one row on the page, some
> are non-visible, some are visible.

Back up the truck ... you mean in the current code base we have heap
tuples that are visible in index scans because of heap tuple chaining
but without index tuples pointing directly at them?

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck(at)Yahoo(dot)com #


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 05:49:28
Message-ID: 20060626054927.GA6341@alvh.no-ip.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> On 6/25/2006 10:12 PM, Bruce Momjian wrote:
> >When you are using the update chaining, you can't mark that index row as
> >dead because it actually points to more than one row on the page, some
> >are non-visible, some are visible.
>
> Back up the truck ... you mean in the current code base we have heap
> tuples that are visible in index scans because of heap tuple chaining
> but without index tuples pointing directly at them?

I don't know where this idea came from, but it's not true. All heap
tuples, dead or otherwise, have index entries. Unless the idea is to
extend update chaining to mean something different from the current
meaning.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 06:34:20
Message-ID: Pine.OSF.4.61.0606260931380.106743@kosh.hut.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 26 Jun 2006, Jan Wieck wrote:

> On 6/25/2006 10:12 PM, Bruce Momjian wrote:
>> When you are using the update chaining, you can't mark that index row as
>> dead because it actually points to more than one row on the page, some
>> are non-visible, some are visible.
>
> Back up the truck ... you mean in the current code base we have heap tuples
> that are visible in index scans because of heap tuple chaining but without
> index tuples pointing directly at them?

In current code, no. Every heap tuple has corresponding index tuples.

In Bruce's proposal, yes. You would have heap tuples without index tuples
pointing directly at them. An index scan could only find them by following
t_ctid chains.

Correct me if I understood you incorrectly, Bruce.

- Heikki


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Jan Wieck <JanWieck(at)Yahoo(dot)com>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 10:38:00
Message-ID: 200606261038.k5QAc0726546@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Jan Wieck wrote:
> On 6/25/2006 10:12 PM, Bruce Momjian wrote:
> > When you are using the update chaining, you can't mark that index row as
> > dead because it actually points to more than one row on the page, some
> > are non-visible, some are visible.
>
> Back up the truck ... you mean in the current code base we have heap
> tuples that are visible in index scans because of heap tuple chaining
> but without index tuples pointing directly at them?

No, this would be new code added. The basic idea is that with the new
same-page update chaining, a single index entry points to the head of a
chain, not to a single tuple, so you can't mark that entry as pointing
to dead rows if any of the tuples in the chain are visible.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 10:39:45
Message-ID: 200606261039.k5QAdjZ27014@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Alvaro Herrera wrote:
> Jan Wieck wrote:
> > On 6/25/2006 10:12 PM, Bruce Momjian wrote:
> > >When you are using the update chaining, you can't mark that index row as
> > >dead because it actually points to more than one row on the page, some
> > >are non-visible, some are visible.
> >
> > Back up the truck ... you mean in the current code base we have heap
> > tuples that are visible in index scans because of heap tuple chaining
> > but without index tuples pointing directly at them?
>
> I don't know where this idea came from, but it's not true. All heap
> tuples, dead or otherwise, have index entries. Unless the idea is to
> extend update chaining to mean something different from the current
> meaning.

It does mean something different. Single-Index-Tuple Chain (SITC) is a
special type of update chaining where the updates are all on the same
row, and a single index entry points to the entire chain.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Jan Wieck <JanWieck(at)Yahoo(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 11:17:31
Message-ID: 200606261117.k5QBHVT02863@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> On Mon, 26 Jun 2006, Jan Wieck wrote:
>
> > On 6/25/2006 10:12 PM, Bruce Momjian wrote:
> >> When you are using the update chaining, you can't mark that index row as
> >> dead because it actually points to more than one row on the page, some
> >> are non-visible, some are visible.
> >
> > Back up the truck ... you mean in the current code base we have heap tuples
> > that are visible in index scans because of heap tuple chaining but without
> > index tuples pointing directly at them?
>
> In current code, no. Every heap tuple has corresponding index tuples.
>
> In Bruce's proposal, yes. You would have heap tuples without index tuples
> pointing directly at them. An index scan could only find them by following
> t_ctid chains.
>
> Correct me if I understood you incorrectly, Bruce.

Correct! We use the same pointers used by normal UPDATEs, except we set
a bit on the old tuple indicating it is a single-index tuple, and we
don't create index entries for the new tuple. Index scan routines will
need to be taught about the new chains, but because only one tuple in
the chain is visible to a single backend, the callers should not need to
be modified.

(All tuples in the chain have page item ids. It is just that when they
are freed, the pointers are adjusted so the index points to the chain
head.)

One problem is that once you find the row you want to update, it is
difficult to see if it is part of a single-index chain because there are
only forward pointers, so I think we have to scan the entire page to
find the chains. To reduce that overhead, I am thinking we free the
non-visible tuples only when the page has no more free space. This
allows us to free not just our own non-visible tuples, but perhaps
others as well.

We have never been able to free non-visible tuples before because of
index cleanup overhead, but with single-index chains we can, and that
reduces the requirements of vacuum for many workloads.
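
A toy version of that pruning step, with invented structures; in
PostgreSQL itself the page item id would be redirected rather than any
index entry changed, and chain_head stands in for that here:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAXTUP 16

    typedef struct {
        bool used;       /* slot holds a tuple */
        bool visible;    /* some snapshot can still see it */
        int  next;       /* same-page successor, -1 for none */
    } Slot;

    typedef struct {
        Slot slots[MAXTUP];
        int  chain_head; /* slot the single index entry reaches */
    } Page;

    /* Free leading chain members no snapshot can see and repoint the
     * head, so the one index entry keeps finding the live part of the
     * chain, all without touching the index. */
    static void prune_chain(Page *p)
    {
        int head = p->chain_head;
        while (head != -1 && !p->slots[head].visible
               && p->slots[head].next != -1) {
            int next = p->slots[head].next;
            p->slots[head].used = false;  /* reusable for new versions */
            head = next;
        }
        p->chain_head = head;
    }

    int main(void)
    {
        Page p = { .chain_head = 0 };
        p.slots[0] = (Slot){ true, false, 1 };   /* dead version */
        p.slots[1] = (Slot){ true, false, 2 };   /* dead version */
        p.slots[2] = (Slot){ true, true, -1 };   /* live version */
        prune_chain(&p);
        printf("index entry now reaches slot %d\n", p.chain_head);
        return 0;
    }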

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Zeugswetter Andreas DCP SD" <ZeugswetterA(at)spardat(dot)at>
To: "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>
Cc: "Bruce Momjian" <bruce(at)momjian(dot)us>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Mark Woodward" <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 12:34:34
Message-ID: E1539E0ED7043848906A8FF995BDA579011F003A@m0143.s-mxs.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> > On 6/25/2006 10:12 PM, Bruce Momjian wrote:
> > > When you are using the update chaining, you can't mark that index
> > > row as dead because it actually points to more than one row on the
> > > page, some are non-visible, some are visible.
> >
> > Back up the truck ... you mean in the current code base we have heap
> > tuples that are visible in index scans because of heap tuple chaining
> > but without index tuples pointing directly at them?
>
> I don't know where this idea came from, but it's not true.
> All heap tuples, dead or otherwise, have index entries.

When using CITC you would be "reusing" the index tuples from the current
heap tuple, so you can only reuse free space or a dead member of a CITC
chain. You cannot reuse a dead tuple that is not a member of a CITC
chain, because that one has separate (invalid) index tuples pointing at
it.

Part of the trick was moving slots (==ctid) around, so I still do not
really see how you can represent the CITC chain as part of the update
chain. Unless you intend to break dead parts of the update chain? Maybe
that is ok?

Andreas


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Zeugswetter Andreas DCP SD <ZeugswetterA(at)spardat(dot)at>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Jan Wieck <JanWieck(at)Yahoo(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 12:45:15
Message-ID: 200606261245.k5QCjFP04373@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zeugswetter Andreas DCP SD wrote:
>
> > > On 6/25/2006 10:12 PM, Bruce Momjian wrote:
> > > > When you are using the update chaining, you can't mark that index
> > > > row as dead because it actually points to more than one row on the
> > > > page, some are non-visible, some are visible.
> > >
> > > Back up the truck ... you mean in the current code base we have heap
> > > tuples that are visible in index scans because of heap tuple chaining
> > > but without index tuples pointing directly at them?
> >
> > I don't know where this idea came from, but it's not true.
> > All heap tuples, dead or otherwise, have index entries.
>
> When using CITC you would be "reusing" the index tuples from the current
> heap tuple, so you can only reuse free space or a dead member of a CITC
> chain. You cannot reuse a dead tuple that is not a member of a CITC
> chain, because that one has separate (invalid) index tuples pointing at
> it.
>
> Part of the trick was moving slots (==ctid) around, so I still do not
> really see how you can represent the CITC chain as part of the update
> chain. Unless you intend to break dead parts of the update chain? Maybe
> that is ok?

Yes, you have to remove dead (non-visible) parts of the update chain.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Jan Wieck <JanWieck(at)Yahoo(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 12:56:16
Message-ID: 20060626125616.GE24611@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
> Correct! We use the same pointers used by normal UPDATEs, except we set
> a bit on the old tuple indicating it is a single-index tuple, and we
> don't create index entries for the new tuple. Index scan routines will
> need to be taught about the new chains, but because only one tuple in
> the chain is visible to a single backend, the callers should not need to
> be modified.

I suppose we would also change the index_getmulti() function to return
a set of ctids plus flags so the caller knows to follow the chains,
right? And for bitmap index scans you would only remember the page in
the case of such a tuple, since you can't be sure the exact ctid you've
got is the one you want.

Seems doable.
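
Something like this, I imagine (a hypothetical struct sketch; the actual
index_getmulti() interface deals in bare ItemPointers):

#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for ItemPointerData: block plus line pointer. */
typedef struct ItemPointer {
    uint32_t block;
    uint16_t offset;
} ItemPointer;

/* Extended result: the flag tells the caller this ctid is the head of
 * an update chain, so it must follow t_ctid links --- or, for a bitmap
 * scan, fall back to remembering the whole page. */
typedef struct IndexTid {
    ItemPointer tid;
    bool        follow_chain;
} IndexTid;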

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Jan Wieck <JanWieck(at)Yahoo(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 13:07:43
Message-ID: 200606261307.k5QD7hF09297@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
> > Correct! We use the same pointers used by normal UPDATEs, except we set
> > a bit on the old tuple indicating it is a single-index tuple, and we
> > don't create index entries for the new tuple. Index scan routines will
> > need to be taught about the new chains, but because only one tuple in
> > the chain is visible to a single backend, the callers should not need to
> > be modified.
>
> I suppose we would also change the index_getmulti() function to return
> a set of ctids plus flags so the caller knows to follow the chains,
> right? And for bitmap index scans you would only remember the page in
> the case of such a tuple, since you can't be sure the exact ctid you've
> got is the one you want.
>
> Seems doable.

Yes, it is just an issue of where you want to add the complexity ---
scan the entire page when there is no free space, or only on UPDATE.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Hannu Krosing" <hannu(at)skype(dot)net>
Cc: "Bruce Momjian" <bruce(at)momjian(dot)us>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 13:10:03
Message-ID: 18313.24.91.171.78.1151327403.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Fri, 2006-06-23 at 17:27, Bruce Momjian wrote:
>> Jonah H. Harris wrote:
>> > On 6/23/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> > > What I see in this discussion is a huge amount of "the grass must be
>> > > greener on the other side" syndrome, and hardly any recognition that
>> > > every technique has its downsides and complications.
>> >
>> > I'm being totally objective. I don't think we should abandon
>> > PostgreSQL's overall design at all, because we do perform INSERTs and
>> > DELETEs much better than most systems. However, I've looked at many
>> > systems and how they implement UPDATE so that it is a scalable
>> > operation. Sure, there are costs and benefits to each implementation,
>> > but I think we have some pretty brilliant people in this community and
>> > can come up with an elegant design for scalable UPDATEs.
>>
>> I think the UPDATE case is similar to the bitmap index scan or perhaps
>> bitmap indexes on disk --- there are cases we know can not be handled
>> well by our existing code, so we have added (or might add) these
>> features to try to address those difficult cases.
>
> Not really. Bitmap index scan and bitmap index are both new additions
> working well within the existing framework.
>
> While the problem of slowdown on frequent updates is real, the suggested
> fix is just plain wrong, as it is based on someone's faulty assumption
> about how index lookup works, and a very much simplified view of how
> different parts of the system work to implement MVCC.

Yes, the suggestion was based on MVCC concepts, not a particular
implementation.
>
> The original fix he "suggests" was aimed at that imagined behaviour and
> thus ignored all the real problems of such a change.

The original suggestion was nothing more than a hypothetical for the
purpose of discussion.

The problem was the steady degradation of performance on frequent updates.
That was the point of discussion. I brought up "one possible way" to
start a "brain storm." The discussion then morphed into criticizing the
example and not addressing the problem.

Anyway, I think some decent discussion about the problem did happen, and
that is good.


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc: "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>, "Jan Wieck" <janwieck(at)yahoo(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 13:17:26
Message-ID: 18329.24.91.171.78.1151327846.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Heikki Linnakangas wrote:
>> On Mon, 26 Jun 2006, Jan Wieck wrote:
>>
>> > On 6/25/2006 10:12 PM, Bruce Momjian wrote:
>> >> When you are using the update chaining, you can't mark that index
>> >> row as dead because it actually points to more than one row on the
>> >> page, some are non-visible, some are visible.
>> >
>> > Back up the truck ... you mean in the current code base we have heap
>> > tuples that are visible in index scans because of heap tuple chaining
>> > but without index tuples pointing directly at them?
>>
>> In current code, no. Every heap tuple has corresponding index tuples.
>>
>> In Bruce's proposal, yes. You would have heap tuples without index
>> tuples pointing directly at them. An index scan could only find them by
>> following t_ctid chains.
>>
>> Correct me if I understood you incorrectly, Bruce.
>
> Correct! We use the same pointers used by normal UPDATEs, except we set
> a bit on the old tuple indicating it is a single-index tuple, and we
> don't create index entries for the new tuple. Index scan routines will
> need to be taught about the new chains, but because only one tuple in
> the chain is visible to a single backend, the callers should not need to
> be modified.
>
> (All tuples in the chain have page item ids. It is just that when they
> are freed, the pointers are adjusted so the index points to the chain
> head.)
>
> One problem is that once you find the row you want to update, it is
> difficult to see if it is part of a single-index chain because there are
> only forward pointers, so I think we have to scan the entire page to
> find the chains. To reduce that overhead, I am thinking we free the
> non-visible tuples only when the page has no more free space. This
> allows us to free not just our own non-visible tuples, but perhaps
> others as well.

This sort of incorporates the "vacuum row" I suggested.

>
> We have never been able to free non-visible tuples before because of
> index cleanup overhead, but with single-index chains, we can, and reduce
> the requirements of vacuum for many workloads.
>

This is great!


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Jan Wieck <JanWieck(at)Yahoo(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 13:18:16
Message-ID: 200606261318.k5QDIGN10086@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:
> Martijn van Oosterhout wrote:
> > On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
> > > Correct! We use the same pointers used by normal UPDATEs, except we set
> > > a bit on the old tuple indicating it is a single-index tuple, and we
> > > don't create index entries for the new tuple. Index scan routines will
> > > need to be taught about the new chains, but because only one tuple in
> > > the chain is visible to a single backend, the callers should not need to
> > > be modified.
> >
> > I suppose we would also change the index_getmulti() function to return
> > a set of ctids plus flags so the caller knows to follow the chains,
> > right? And for bitmap index scans you would only remember the page in
> > the case of such a tuple, since you can't be sure the exact ctid you've
> > got is the one you want.
> >
> > Seems doable.
>
> Yes, it is just an issue of where you want to add the complexity ---
> scan the entire page when there is no free space, or only on UPDATE.

Oh, and because you want to do this when doing an update via sequential
scan as well as an index scan, I am thinking you might need to do the
per-page method because you might not have even seen the head of the
chain yet. With an index scan, finding the head is easy, but for a
sequential scan, it seems more difficult, and we don't have any free
space in the tail of the chain to maintain a pointer to the head.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>
Cc: Martijn van Oosterhout <kleptog(at)svana(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Jan Wieck <JanWieck(at)Yahoo(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 13:37:11
Message-ID: 200606261337.k5QDbBH11403@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Bruce Momjian wrote:
> Bruce Momjian wrote:
> > Martijn van Oosterhout wrote:
> > > On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
> > > > Correct! We use the same pointers used by normal UPDATEs, except we set
> > > > a bit on the old tuple indicating it is a single-index tuple, and we
> > > > don't create index entries for the new tuple. Index scan routines will
> > > > need to be taught about the new chains, but because only one tuple in
> > > > the chain is visible to a single backend, the callers should not need to
> > > > be modified.
> > >
> > > I suppose we would also change the index_getmulti() function to return
> > > a set of ctids plus flags so the caller knows to follow the chains,
> > > right? And for bitmap index scans you would only remember the page in
> > > the case of such a tuple, since you can't be sure the exact ctid you've
> > > got is the one you want.
> > >
> > > Seems doable.
> >
> > Yes, it is just an issue of where you want to add the complexity ---
> > scan the entire page when there is no free space, or only on UPDATE.
>
> Oh, and because you want to do this when doing an update via sequential
> scan as well as an index scan, I am thinking you might need to do the
> per-page method because you might not have even seen the head of the
> chain yet. With an index scan, finding the head is easy, but for a
> sequential scan, it seems more difficult, and we don't have any free
> space in the tail of the chain to maintain a pointer to the head.

Thinking some more, there will need to be a bit to uniquely identify the
head of a CITC. With that, you could just scan the page tuples looking
for CITC heads, check those to see if they are not visible, and reuse
them, rather than doing a full page reorganization where all free space
is collected in the middle of the page. That should limit the overhead
of reuse.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 14:15:54
Message-ID: 1151331354.3885.18.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2006-06-26 at 14:56, Martijn van Oosterhout wrote:
> On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
> > Correct! We use the same pointers used by normal UPDATEs, except we set
> > a bit on the old tuple indicating it is a single-index tuple, and we
> > don't create index entries for the new tuple. Index scan routines will
> > need to be taught about the new chains, but because only one tuple in
> > the chain is visible to a single backend, the callers should not need to
> > be modified.
>
> I suppose we would also change the index_getmulti() function to return
> a set of ctids plus flags so the caller knows to follow the chains,
> right?

It is probably better to always return the pointer to the head of CITC
chain (the one an index points to) and do extra visibility checks and
chain-following on each access. This would keep the change internal to
tuple fetching functions.

> And for bitmap index scans you would only remember the page in
> the case of such a tuple, since you can't be sure the exact ctid you've
> got is the one you want.

no, you should only use the pointer to the CITC head outside tuple access
functions. And this pointer to the CITC head is what is always passed to
those access functions/macros.

The VACUUM would run its passes thus:

pass 1: run over heap, collect pointers to single dead tuples, and fully
dead CITC chains (fully dead = no live tuples in the chain). Clean up
old tuples from CITC chains and move live tuples around so that CITC
points to oldest possibly visible (not vacuumed) tuple. Doing this there
frees us from the need to collect a separate set of pointers for those. Or
have you planned that old tuples from CITC chains are collected on the
go/as needed? Of course we could do both.

pass 2: clean indexes based on ctid from pass 1

pass 3: clean heap based on ctid from pass 1

If you do it this way, you don't need to invent new data structures to
pass extra info about CITC internals to passes 2 and 3.

One more thing - when should the free space map be notified about free
space in pages with CITC chains?

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 14:47:53
Message-ID: 1151333274.3885.24.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2006-06-26 at 09:10, Mark Woodward wrote:
> > On Fri, 2006-06-23 at 17:27, Bruce Momjian wrote:
> >> Jonah H. Harris wrote:
> >> > On 6/23/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> >> > > What I see in this discussion is a huge amount of "the grass must be
> >> > > greener on the other side" syndrome, and hardly any recognition that
> >> > > every technique has its downsides and complications.
> >> >
> >> > I'm being totally objective. I don't think we should abandon
> >> > PostgreSQL's overall design at all, because we do perform INSERTs and
> >> > DELETEs much better than most systems. However, I've looked at many
> >> > systems and how they implement UPDATE so that it is a scalable
> >> > operation. Sure, there are costs and benefits to each implementation,
> >> > but I think we have some pretty brilliant people in this community and
> >> > can come up with an elegant design for scalable UPDATEs.
> >>
> >> I think the UPDATE case is similar to the bitmap index scan or perhaps
> >> bitmap indexes on disk --- there are cases we know can not be handled
> >> well by our existing code, so we have added (or might add) these
> >> features to try to address those difficult cases.
> >
> > Not really. Bitmap index scan and bitmap index are both new additions
> > working well within the existing framework.
> >
> > While the problem of slowdown on frequent updates is real, the suggested
> > fix is just plain wrong, as it is based on someone's faulty assumption
> > about how index lookup works, and a very much simplified view of how
> > different parts of the system work to implement MVCC.
>
> Yes, the suggestion was based on MVCC concepts, not a particular
> implementation.

On the contrary - afaik, it was loosely based on how Oracle does it with
its rollback segments, only assuming that rollback segments are kept in
heap and that indexes point only to the oldest row version :p

> > The original fix he "suggests" was aimed at that imagined behaviour and
> > thus ignored all the real problems of such a change.
> >
> > The original suggestion was nothing more than a hypothetical for the
> > purpose of discussion.
> >
> > The problem was the steady degradation of performance on frequent updates.
> > That was the point of discussion. I brought up "one possible way" to
> > start a "brain storm." The discussion then morphed into criticizing the
> > example and not addressing the problem.

The problem is heatedly discussed every 3-4 months.

> Anyway, I think some decent discussion about the problem did happen, and
> that is good.

Agreed.

Maybe this _was_ the best way to bring up the discussion again.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Hannu Krosing <hannu(at)skype(dot)net>
Cc: postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 14:50:26
Message-ID: 200606261450.k5QEoQh03466@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hannu Krosing wrote:
> On Mon, 2006-06-26 at 14:56, Martijn van Oosterhout wrote:
> > On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
> > > Correct! We use the same pointers used by normal UPDATEs, except we set
> > > a bit on the old tuple indicating it is a single-index tuple, and we
> > > don't create index entries for the new tuple. Index scan routines will
> > > need to be taught about the new chains, but because only one tuple in
> > > the chain is visible to a single backend, the callers should not need to
> > > be modified.
> >
> > I suppose we would also change the index_getmulti() function to return
> > a set of ctids plus flags so the caller knows to follow the chains,
> > right?
>
> It is probably better to always return the pointer to the head of CITC
> chain (the one an index points to) and do extra visibility checks and
> chain-following on each access. This would keep the change internal to
> tuple fetching functions.

So index_getnext() traverses the chain and returns one member per call.
Makes sense. Just realize that a single index entry is now returning
multiple tuples. We will need some record keeping to track that.
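
Roughly this kind of state, I mean (toy types and invented names; the
real bookkeeping would live in the index scan descriptor):

#include <stddef.h>

typedef struct Tuple { struct Tuple *next; } Tuple;  /* t_ctid-style link */

/* Per-scan memory of where we are inside the current chain. */
typedef struct ChainScanState {
    Tuple *current;             /* next chain member to hand back */
} ChainScanState;

/* One member per call; NULL means the chain is spent and the scan
 * should advance to the next index entry. */
static Tuple *
chain_getnext(ChainScanState *state)
{
    Tuple *result = state->current;

    if (result != NULL)
        state->current = result->next;
    return result;
}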

> > And for bitmap index scans you would only remember the page in
> > the case of such a tuple, since you can't be sure the exact ctid you've
> > got is the one you want.
>
> no, you should only use the pointer to the CITC head outside tuple access
> functions. And this pointer to the CITC head is what is always passed to
> those access functions/macros.
>
> The VACUUM would run its passes thus:
>
> pass 1: run over heap, collect pointers to single dead tuples, and fully
> dead CITC chains (fully dead = no live tuples in the chain). Clean up
> old tuples from CITC chains and move live tuples around so that CITC
> points to oldest possibly visible (not vacuumed) tuple. Doing this there
> frees us from the need to collect a separate set of pointers for those. Or
> have you planned that old tuples from CITC chains are collected on the
> go/as needed? Of course we could do both.

Non-visible CITC members should be freed during an UPDATE on the same
page, so vacuum doesn't have to be involved.

> pass 2: clean indexes based on ctid from pass 1
>
> pass 3: clean heap based on ctid from pass 1
>
> If you do it this way, you don't need to invent new data structures to
> pass extra info about CITC internals to passes 2 and 3.
>
> One more thing - when should the free space map be notified about free
> space in pages with CITC chains?

Uh, well, I am thinking we only free CITC space when we are going to use
it for an UPDATE, rather than freeing things while doing some other
operation. It is good to keep the cleanup overhead out of the main path
as much as possible.

Also, seems I can't spell algorithms very well:

Definition: Single-Index-Tuple Chain (SITC)
-
Thinking of vacuum, right now it does these cleanups:

o non-visible UPDATEs on the same page with no key changes
o non-visible UPDATEs on the same page with key changes
o non-visible UPDATEs on different pages
o DELETEs
o aborted transactions

The big question is what percentage of dead space is the first one? My
guess is 65%.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 14:58:44
Message-ID: 20060626145844.GI24611@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 26, 2006 at 10:50:26AM -0400, Bruce Momjian wrote:
> > > I suppose we would also change the index_getmulti() function to return
> > > a set of ctids plus flags so the caller knows to follow the chains,
> > > right?
> >
> > It is probably better to always return the pointer to the head of CITC
> > chain (the one an index points to) and do extra visibility checks and
> > chain-following on each access. This would keep the change internal to
> > tuple fetching functions.
>
> So index_getnext() traverses the chain and returns one member per call.
> Makes sense. Just realize that a single index entry is now returning
> multiple tuples. We will need some record keeping to track that.

Yes, and for index_getmulti (which doesn't visit the heap at all) we'll
have to change all the users of that (which aren't many, I suppose).
It's probably worth making a utility function to expand them.
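
The utility function could be as simple as this sketch (toy types,
hypothetical name):

#include <stddef.h>

typedef struct Tuple { struct Tuple *next; } Tuple;  /* t_ctid-style link */

/* Flatten a chain head into its members so existing index_getmulti
 * callers keep seeing plain tuples; returns how many were written. */
static size_t
expand_chain(Tuple *head, Tuple **out, size_t max)
{
    size_t n = 0;

    for (Tuple *t = head; t != NULL && n < max; t = t->next)
        out[n++] = t;
    return n;
}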

I'm still confused about where bitmap index scans fit into all of this.
Is preserving the sequential scan aspect of these a goal with this new
setup?

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 14:58:53
Message-ID: 1151333933.3885.31.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2006-06-26 at 10:50, Bruce Momjian wrote:
> Hannu Krosing wrote:
> > On Mon, 2006-06-26 at 14:56, Martijn van Oosterhout wrote:
> > > On Mon, Jun 26, 2006 at 07:17:31AM -0400, Bruce Momjian wrote:
> > > > Correct! We use the same pointers used by normal UPDATEs, except we set
> > > > a bit on the old tuple indicating it is a single-index tuple, and we
> > > > don't create index entries for the new tuple. Index scan routines will
> > > > need to be taught about the new chains, but because only one tuple in
> > > > the chain is visible to a single backend, the callers should not need to
> > > > be modified.
> > >
> > > I suppose we would also change the index_getmulti() function to return
> > > a set of ctids plus flags so the caller knows to follow the chains,
> > > right?
> >
> > It is probably better to always return the pointer to the head of CITC
> > chain (the one an index points to) and do extra visibility checks and
> > chain-following on each access. This would keep the change internal to
> > tuple fetching functions.
>
> So index_getnext() traverses the chain and returns one member per call.
> Makes sense. Just realize that a single index entry is now returning
> multiple tuples. We will need some record keeping to track that.

Maybe we need to push visibility checks further down, so that
index_getnext() returns only the one heap row that is visible.

> > > And for bitmap index scans you would only remember the page in
> > > the case of such a tuple, since you can't be sure the exact ctid you've
> > > got is the one you want.
> >
> > no, you should only use the pointer to the CITC head outside tuple access
> > functions. And this pointer to the CITC head is what is always passed to
> > those access functions/macros.
> >
> > The VACUUM would run its passes thus:
> >
> > pass 1: run over heap, collect pointers to single dead tuples, and fully
> > dead CITC chains (fully dead = no live tuples in the chain). Clean up
> > old tuples from CITC chains and move live tuples around so that CITC
> > points to oldest possibly visible (not vacuumed) tuple. Doing this there
> > frees us from the need to collect a separate set of pointers for those. Or
> > have you planned that old tuples from CITC chains are collected on the
> > go/as needed? Of course we could do both.
>
> Non-visible CITC members should be freed during an UPDATE on the same
> page, so vacuum doesn't have to be involved.

Ok.

> > pass 2: clean indexes based on ctid from pass 1
> >
> > pass 3: clean heap based on ctid from pass 1
> >
> > If you do it this way, you don't need to invent new data structures to
> > pass extra info about CITC internals to passes 2 and 3.
> >
> > One more thing - when should the free space map be notified about free
> > space in pages with CITC chains?
>
> Uh, well, I am thinking we only free CITC space when we are going to use
> it for an UPDATE, rather than free things while doing an operation. It
> is good to keep the cleanup overhead out of the main path as much as
> possible.

So vacuum should only remove dead CITC chains and leave the ones with
live tuples to CITC internal use?

That would also suggest that pages having live CITC chains and less than
N% of free space should not be reported to FSM.
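
Something like this policy check is what I have in mind (a sketch; the
10% stands in for the N% above):

#include <stdbool.h>
#include <stddef.h>

/* Report a page to the free space map only if it has no live chain,
 * or its free space clears the threshold anyway. */
static bool
report_to_fsm(size_t free_bytes, size_t page_size, bool has_live_chain)
{
    if (!has_live_chain)
        return free_bytes > 0;
    return free_bytes * 100 >= page_size * 10;
}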

> Also, seems I can't spell algorithms very well:
>
> Definition: Single-Index-Tuple Chain (SITC)
> -
> Thinking of vacuum, right now it does these cleanups:
>
> o non-visible UPDATEs on the same page with no key changes
> o non-visible UPDATEs on the same page with key changes
> o non-visible UPDATEs on different pages
> o DELETEs
> o aborted transactions
>
> The big question is what percentage of dead space is the first one? My
> guess is 65%.

Can be from 0% to 99.9%, very much dependent on application.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com



From: Hannu Krosing <hannu(at)skype(dot)net>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 15:15:58
Message-ID: 1151334958.3885.33.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2006-06-26 at 16:58, Martijn van Oosterhout wrote:
> On Mon, Jun 26, 2006 at 10:50:26AM -0400, Bruce Momjian wrote:
> > > > I suppose we would also change the index_getmulti() function to return
> > > > a set of ctids plus flags so the caller knows to follow the chains,
> > > > right?
> > >
> > > It is probably better to always return the pointer to the head of CITC
> > > chain (the one an index points to) and do extra visibility checks and
> > > chain-following on each access. This would keep the change internal to
> > > tuple fetching functions.
> >
> > So index_getnext() traverses the chain and returns one member per call.
> > Makes sense. Just realize that a single index entry is now returning
> > multiple tuples. We will need some record keeping to track that.
>
> Yes, and for index_getmulti (which doesn't visit the heap at all) we'll
> have to change all the users of that (which aren't many, I suppose).
> It's probably worth making a utility function to expand them.
>
> I'm still confused about where bitmap index scans fit into all of this.
> Is preserving the sequential scan aspect of these a goal with this new
> setup?

Bitmap index scan does not have to change much - only the function that
gets a tuple by its ctid must be able to trace forward chains within the
page.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 15:29:27
Message-ID: 200606261529.k5QFTRY23824@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Mon, Jun 26, 2006 at 10:50:26AM -0400, Bruce Momjian wrote:
> > > > I suppose we would also change the index_getmulti() function to return
> > > > a set of ctids plus flags so the caller knows to follow the chains,
> > > > right?
> > >
> > > It is probably better to always return the pointer to the head of CITC
> > > chain (the one an index points to) and do extra visibility checks and
> > > chain-following on each access. This would keep the change internal to
> > > tuple fetching functions.
> >
> > So index_getnext() traverses the chain and returns one member per call.
> > Makes sense. Just realize that a single index entry is now returning
> > multiple tuples. We will need some record keeping to track that.
>
> Yes, and for index_getmulti (which doesn't visit the heap at all) we'll
> have to change all the users of that (which aren't many, I suppose).
> It's probably worth making a utility function to expand them.
>
> I'm still confused about where bitmap index scans fit into all of this.
> Is preserving the sequential scan aspect of these a goal with this new
> setup?

No. I was just pointing out that if you get to the tuple via an index,
you get handed the head of the SITC via the index tuple, but if you are
doing a sequential scan, you don't get it, so you have to find it, or
any other non-visible SITC header.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Hannu Krosing <hannu(at)skype(dot)net>
Cc: postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 15:31:15
Message-ID: 200606261531.k5QFVFp24010@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Hannu Krosing wrote:
> > > pass 3: clean heap based on ctid from pass 1
> > >
> > > If you do it this way, you don't need to invent new data structures to
> > > pass extra info about CITC internals to passes 2 and 3.
> > >
> > > One more thing - when should the free space map be notified about free
> > > space in pages with CITC chains?
> >
> > Uh, well, I am thinking we only free CITC space when we are going to use
> > it for an UPDATE, rather than freeing things while doing some other
> > operation. It is good to keep the cleanup overhead out of the main path
> > as much as possible.
>
> So vacuum should only remove dead CITC chains and leave the ones with
> live tuples to CITC internal use?

Yes, it has to. What else would it do? Add index entries?

> That would also suggest that pages having live CITC chains and less than
> N% of free space should not be reported to FSM.

Parts of the CITC that are not visible can be used for free space by
vacuum, but the visible part is left alone.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: PFC <lists(at)peufeu(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC, and compression
Date: 2006-06-26 15:43:26
Message-ID: op.tbrfaonicigqcu@apollo13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


There was some talk lately about compression.
With a bit of lateral thinking I guess this can be used to contain the
bloat induced by updates.
Of course this is just my hypothesis.

Compression in indexes:

Instead of storing (value, tuple identifier) keys in the indexes, store
(value, [tuple identifier list]); i.e. all tuples which have the same
indexed value are referenced by the same index tuple, instead of having
one index tuple per actual tuple.
The length of the list would of course be limited to the space actually
available on an index page; if many rows have the same indexed value,
several index tuples would be generated so that index tuples fit on index
pages.
This would make the index smaller (more likely to fit in RAM) at the cost
of a little CPU overhead for index modifications, but would make the index
scans actually use less CPU (no need to compare the indexed value on each
table tuple).
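
For instance, a sketch of such an index entry (hypothetical layout with
an int key for simplicity; real index tuples are variable-size):

#include <stdint.h>

typedef struct ItemPointer {
    uint32_t block;
    uint16_t offset;
} ItemPointer;

/* One index tuple per distinct key: every heap row sharing the key
 * hangs off the same entry, up to whatever fits on the index page;
 * overflow simply starts another entry with the same key. */
typedef struct CompressedIndexTuple {
    int32_t     key;        /* the indexed value */
    uint16_t    ntids;      /* number of heap tuples sharing it */
    ItemPointer tids[];     /* flexible array of heap pointers */
} CompressedIndexTuple;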

Compression in data pages:

The article that circulated on the list suggested several types of
compression: offset, dictionary, etc. The point is that several row
versions on the same page can be compressed well because these versions
probably have similar column values.

Just a thought...


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: PFC <lists(at)peufeu(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC, and compression
Date: 2006-06-26 15:45:05
Message-ID: 200606261545.k5QFj5d15409@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

PFC wrote:
>
> There was some talk lately about compression.
> With a bit of lateral thinking I guess this can be used to contain the
> bloat induced by updates.
> Of course this is just my hypothesis.
>
> Compression in indexes:
>
> Instead of storing (value, tuple identifier) keys in the indexes, store
> (value, [tuple identifier list]); i.e. all tuples which have the same
> indexed value are referenced by the same index tuple, instead of having
> one index tuple per actual tuple.
> The length of the list would of course be limited to the space actually
> available on an index page; if many rows have the same indexed value,
> several index tuples would be generated so that index tuples fit on index
> pages.
> This would make the index smaller (more likely to fit in RAM) at the cost
> of a little CPU overhead for index modifications, but would make the index
> scans actually use less CPU (no need to compare the indexed value on each
> table tuple).

What about increasing the size of an existing index entry? Can that be
done easily when a new row is added?

> Compression in data pages:
>
> The article that circulated on the list suggested several types of
> compression: offset, dictionary, etc. The point is that several row
> versions on the same page can be compressed well because these versions
> probably have similar column values.
>
> Just a thought...

I would be worried about the overhead of doing the compression and
decompression.

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: PFC <lists(at)peufeu(dot)com>
To: "Bruce Momjian" <bruce(at)momjian(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC, and compression
Date: 2006-06-26 15:55:08
Message-ID: op.tbrft6k7cigqcu@apollo13
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> What about increasing the size of an existing index entry? Can that be
> done easily when a new row is added?

I'd say it looks pretty much like inserting a new index tuple...
Say "value" is the indexed column.

Find the first page in the index featuring "value".
1: If there is space on the page,
       add the tuple id to the list of the corresponding index entry
       (just like creating a new index tuple, but using less space).
   else
       look at the next page.
       If the next page has an index tuple with the same indexed value,
           goto 1
       else
           insert a new page and create an index tuple on it.

> I would be worried about the overhead of doing the compression and
> decompression.

The compression methods mentioned in the article that was passed around
on the list seemed pretty fast. The test database went from IO-limited
to CPU-limited (and a lot faster).


From: "Zeugswetter Andreas DCP SD" <ZeugswetterA(at)spardat(dot)at>
To: "Bruce Momjian" <bruce(at)momjian(dot)us>, "PostgreSQL-development" <pgsql-hackers(at)postgreSQL(dot)org>
Cc: "Martijn van Oosterhout" <kleptog(at)svana(dot)org>, "Heikki Linnakangas" <hlinnaka(at)iki(dot)fi>, "Jan Wieck" <JanWieck(at)Yahoo(dot)com>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Csaba Nagy" <nagy(at)ecircle-ag(dot)com>, "Mark Woodward" <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, "Christopher Browne" <cbbrowne(at)acm(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 16:06:00
Message-ID: E1539E0ED7043848906A8FF995BDA579011F007F@m0143.s-mxs.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers


> > head of the chain yet. With an index scan, finding the head is easy,
> > but for a sequential scan, it seems more difficult, and we don't have
> > any free space in the tail of the chain to maintain a pointer to the
> > head.
>
> Thinking some more, there will need to be a bit to uniquely
> identify the head of a CITC.

I don't think so. It would probably be sufficient to impose an order on
the CITC.
e.g. the oldest tuple version in the CITC is the head.
(An idea just in case we can't spare a bit :-)

Andreas


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Zeugswetter Andreas DCP SD <ZeugswetterA(at)spardat(dot)at>
Cc: PostgreSQL-development <pgsql-hackers(at)postgreSQL(dot)org>, Martijn van Oosterhout <kleptog(at)svana(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Jan Wieck <JanWieck(at)Yahoo(dot)com>, Hannu Krosing <hannu(at)skype(dot)net>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Csaba Nagy <nagy(at)ecircle-ag(dot)com>, Mark Woodward <pgsql(at)mohawksoft(dot)com>, "Jonah H(dot) Harris" <jonah(dot)harris(at)gmail(dot)com>, Christopher Browne <cbbrowne(at)acm(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 16:21:01
Message-ID: 200606261621.k5QGL1c11454@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Zeugswetter Andreas DCP SD wrote:
>
> > > head of the chain yet. With an index scan, finding the head is easy,
> > > but for a sequential scan, it seems more difficult, and we don't have
> > > any free space in the tail of the chain to maintain a pointer to the
> > > head.
> >
> > Thinking some more, there will need to be a bit to uniquely
> > identify the head of a CITC.
>
> I don't think so. It would probably be sufficient to impose an order on
> the CITC.
> e.g. the oldest tuple version in the CITC is the head.
> (An idea just in case we can't spare a bit :-)

Well, if we need to scan the page quickly, having the bit, or a bit
combination that can only occur on the head, is helpful. What we usually
do is combine a SITC bit with another bit that is never set on ordinary
SITC members; that combination marks the head, and you use macros to do
the tests properly. We do this already in a number of cases.
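
For illustration, with made-up bit values (the real flags would live in
the tuple's infomask):

#define SITC_MEMBER    0x0001   /* tuple belongs to a SITC chain */
#define SITC_HEAD_BIT  0x0002   /* never set on ordinary members */

#define SitcIsMember(mask)  (((mask) & SITC_MEMBER) != 0)
#define SitcIsHead(mask) \
    (((mask) & (SITC_MEMBER | SITC_HEAD_BIT)) == \
     (SITC_MEMBER | SITC_HEAD_BIT))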

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
To: Mark Woodward <pgsql(at)mohawksoft(dot)com>
Cc: Christopher Kings-Lynne <chris(dot)kings-lynne(at)calorieking(dot)com>, Rod Taylor <pg(at)rbt(dot)ca>, Hannu Krosing <hannu(at)skype(dot)net>, Christopher Browne <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-26 16:46:45
Message-ID: 20060626164645.GM93655@pervasive.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, Jun 23, 2006 at 06:37:01AM -0400, Mark Woodward wrote:
> While we all know session data is, at best, ephemeral, people still want
> some sort of persistence, thus, you need a database. For mcache I have a
> couple plugins that have a wide range of options, from read/write at
> startup and shut down, to full write-through cache to a database.
>
> In general, my clients don't want this, they want the database to store
> their data. When you try to explain to them that a database may not be the
> right place to store this data, they ask why; sadly they have little hope
> of understanding the nuances and remain unconvinced.

Have you done any benchmarking between a site using mcache and one not?
I'll bet there's a huge difference, which translates into hardware $$.
That's something managers can understand.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby(at)pervasive(dot)com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461


From: Hannu Krosing <hannu(at)skype(dot)net>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-27 08:48:06
Message-ID: 1151398087.3516.18.camel@localhost.localdomain
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, 2006-06-26 at 11:31, Bruce Momjian wrote:
> Hannu Krosing wrote:
> > > > pass 3: clean heap based on ctid from pass 1
> > > >
> > > > If you do it this way, you don't need to invent new data structures to
> > > > pass extra info about CITC internals to passes 2 and 3.
> > > >
> > > > One more thing - when should the free space map be notified about free
> > > > space in pages with CITC chains?
> > >
> > > Uh, well, I am thinking we only free CITC space when we are going to use
> > > it for an UPDATE, rather than freeing things while doing some other
> > > operation. It is good to keep the cleanup overhead out of the main path
> > > as much as possible.
> >
> > So vacuum should only remove dead CITC chains and leave the ones with
> > live tuples to CITC internal use?
>
> Yes, it has to. What else would it do? Add index entries?

No, clean out the dead part.

But this would probably add the page to FSM - do we want that?

Also, this cleaning should probably be done in pass 1, so we don't have to
carry the ctids of tuples which have no index entries around to passes 2
and 3. This has the downside of possibly writing the heap page twice, so
maybe we don't want it.

> > That would also suggest that pages having live CITC chains and less than
> > N% of free space should not be reported to FSM.
>
> Parts of the CITC that are not visible can be used for free space by
> vacuum, but the visible part is left alone.
>
--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Hannu Krosing" <hannu(at)skype(dot)net>
Cc: "postgres hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-27 12:48:19
Message-ID: 18357.24.91.171.78.1151412499.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Mon, 2006-06-26 at 09:10, Mark Woodward wrote:
>> > On Fri, 2006-06-23 at 17:27, Bruce Momjian wrote:
>> >> Jonah H. Harris wrote:
>> >> > On 6/23/06, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> >> > > What I see in this discussion is a huge amount of "the grass must
>> >> > > be greener on the other side" syndrome, and hardly any recognition
>> >> > > that every technique has its downsides and complications.
>> >> >
>> >> > I'm being totally objective. I don't think we should abandon
>> >> > PostgreSQL's overall design at all, because we do perform INSERTs
>> >> > and DELETEs much better than most systems. However, I've looked at
>> >> > many systems and how they implement UPDATE so that it is a scalable
>> >> > operation. Sure, there are costs and benefits to each
>> >> > implementation, but I think we have some pretty brilliant people in
>> >> > this community and can come up with an elegant design for scalable
>> >> > UPDATEs.
>> >>
>> >> I think the UPDATE case is similar to the bitmap index scan or
>> >> perhaps bitmap indexes on disk --- there are cases we know can not be
>> >> handled well by our existing code, so we have added (or might add)
>> >> these features to try to address those difficult cases.
>> >
>> > Not really. Bitmap index scan and bitmap index are both new additions
>> > working well within the existing framework.
>> >
>> > While the problem of slowdown on frequent updates is real, the
>> > suggested fix is just plain wrong, as it is based on someone's faulty
>> > assumption about how index lookup works, and a very much simplified
>> > view of how different parts of the system work to implement MVCC.
>>
>> Yes, the suggestion was based on MVCC concepts, not a particular
>> implementation.
>
> On the contrary - afaik, it was loosely based on how Oracle does it with
> its rollback segments, only assuming that rollback segments are kept in
> heap and that indexes point only to the oldest row version :p

Well, give me a little more credit than that. Yes, Oracle did play a
small part in my thinking, but only inasmuch as "they can't do it, why
can't we?" The problem was how to make finding the most recent tuple more
efficient, and to keep tuples that will never be used from hurting
performance, without excessive locking or moving data around.

It was just a quick idea. Bruce's solution, you have to admit, is
somewhat similar.

>
>> > The original fix he "suggests" was aimed at that imagined behaviour
>> > and thus ignored all the real problems of such a change.
>>
>> The original suggestion was nothing more than a hypothetical for the
>> purpose of discussion.
>>
>> The problem was the steady degradation of performance on frequent
>> updates. That was the point of discussion. I brought up "one possible
>> way" to start a "brain storm." The discussion then morphed into
>> criticizing the example and not addressing the problem.
>
> The problem is heatedly discussed every 3-4 months.

And yet, here we are again.

>
>> Anyway, I think some decent discussion about the problem did happen, and
>> that is good.
>
> Agreed.
>
> Maybe this _was_ the best way to bring up the discussion again.

I have a way, for better or worse, I guess, of stirring up the pot. :-)

Cry as we may about MySQL, I have a sneaking suspicion that this is
one of the issues that puts PostgreSQL at a serious disadvantage.

While heavily updated rows are a single type of problem, these days I
think *most* database deployments are as back-ends for web sites. This
problem is *very* critical to that type of application, which is
probably why PostgreSQL has difficulty in that space.

If PostgreSQL can be made *not* to suffer performance degradation on
heavily updated rows, then that is really the last issue in the way of it
being a completely credible medium to large enterprise back end. This,
combined with its amazing programmability, should make it unstoppable.


From: "Mark Woodward" <pgsql(at)mohawksoft(dot)com>
To: "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>
Cc: "Christopher Kings-Lynne" <chris(dot)kings-lynne(at)calorieking(dot)com>, "Rod Taylor" <pg(at)rbt(dot)ca>, "Hannu Krosing" <hannu(at)skype(dot)net>, "Christopher Browne" <cbbrowne(at)acm(dot)org>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-27 13:03:55
Message-ID: 18037.24.91.171.78.1151413435.squirrel@mail.mohawksoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> On Fri, Jun 23, 2006 at 06:37:01AM -0400, Mark Woodward wrote:
>> While we all know session data is, at best, ephemeral, people still want
>> some sort of persistence, thus, you need a database. For mcache I have a
>> couple plugins that have a wide range of options, from read/write at
>> startup and shut down, to full write-through cache to a database.
>>
>> In general, my clients don't want this, they want the database to store
>> their data. When you try to explain to them that a database may not be
>> the right place to store this data, they ask why; sadly they have little
>> hope of understanding the nuances and remain unconvinced.
>
> Have you done any benchmarking between a site using mcache and one not?
> I'll bet there's a huge difference, which translates into hardware $$.
> That's something managers can understand.
>

The last benchmark I did was at the pure data level, a couple of years
ago: PostgreSQL could handle about 800 session transactions a second but
degraded over time, while MCache was up around 7500 session transactions
a second and held steady. I should dig up that code and make it available
on my site.

I have a couple of users who tell me that their sites couldn't work
without it, not even with MySQL.


From: Martijn van Oosterhout <kleptog(at)svana(dot)org>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-27 14:20:41
Message-ID: 20060627142041.GC16840@svana.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Mon, Jun 26, 2006 at 11:29:27AM -0400, Bruce Momjian wrote:
> > Yes, and for index_getmulti (which doesn't visit the heap at all) we'll
> > have to change all the users of that (which aren't many, I suppose).
> > It's probably worth making a utility function to expand them.
> >
> > I'm still confused where bitmap index scan fit into all of this. Is
> > preserving the sequential scan aspect of these a goal with this new
> > setup?
>
> No. I was just pointing out that if you get to the tuple via an index,
> you get handed the head of the SITC via the index tuple, but if you are
> doing a sequential scan, you don't get it, so you have to find it, or
> any other non-visible SITC header.

Ok, but it remains true that you can only have one SITC per tuple. So
if you have 5 indexes on a table, any SITC will only join tuples that
didn't change any values in any of the indexed columns. That's probably
not a big deal though; indexed columns aren't likely to be the ones
changing much.

So, for the bitmap scan you have to make sure that, within a single
transaction, scanning multiple indexes provides the same SITC for each
set of tuples, even in the face of concurrent updates. Otherwise the
BitmapAnd will incorrectly throw them out.

That should be doable, if you only change the head of the SITC on
VACUUM. I'm not sure if that's what's being suggested right now.

Have a nice day,
--
Martijn van Oosterhout <kleptog(at)svana(dot)org> http://svana.org/kleptog/
From each according to his ability. To each according to his ability to litigate.


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Martijn van Oosterhout <kleptog(at)svana(dot)org>
Cc: Hannu Krosing <hannu(at)skype(dot)net>, postgres hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: vacuum, performance, and MVCC
Date: 2006-06-27 16:21:40
Message-ID: 200606271621.k5RGLea27251@momjian.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Martijn van Oosterhout wrote:
> On Mon, Jun 26, 2006 at 11:29:27AM -0400, Bruce Momjian wrote:
> > > Yes, and for index_getmulti (which doesn't visit the heap at all) we'll
> > > have to change all the users of that (which aren't many, I suppose).
> > > It's probably worth making a utility function to expand them.
> > >
> > > I'm still confused where bitmap index scan fit into all of this. Is
> > > preserving the sequential scan aspect of these a goal with this new
> > > setup?
> >
> > No. I was just pointing out that if you get to the tuple via an index,
> > you get handed the head of the SITC via the index tuple, but if you are
> > doing a sequential scan, you don't get it, so you have to find it, or
> > any other non-visible SITC header.
>
> Ok, but it remains true that you can only have one SITC per tuple. So
> if you have 5 indexes on a table, any SITC will only join tuples that
> didn't change any values in any of the indexed columns. That's probably
> not a big deal though; indexed columns aren't likely to be the ones
> changing much.

Right.

> So, for the bitmap scan you have to make sure that, within a single
> transaction, scanning multiple indexes provides the same SITC for each
> set of tuples, even in the face of concurrent updates. Otherwise the
> BitmapAnd will incorrectly throw them out.

The index points to the item id on the page, and that never changes,
even if the head tuple changes later.
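
In toy form (made-up structures; in reality this is the page's line
pointer array):

#include <stdint.h>

/* The index stores (block, item id); the item id slot is what gets
 * retargeted when the chain head moves, so index entries never change. */
typedef struct PageSlot {
    uint16_t tuple_offset;  /* where the current head lives on the page */
} PageSlot;

typedef struct Page {
    PageSlot slots[256];    /* item ids; index entries point at these */
} Page;

static void
redirect_slot(Page *page, int itemid, uint16_t new_head_offset)
{
    page->slots[itemid].tuple_offset = new_head_offset;
}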

--
Bruce Momjian bruce(at)momjian(dot)us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +