Unexpected page allocation behavior on insert-only tables

From: Michael Renner <michael(dot)renner(at)amd(dot)co(dot)at>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Unexpected page allocation behavior on insert-only tables
Date: 2010-05-15 23:53:49
Message-ID: 4BEF340D.1010909@amd.co.at
Lists: pgsql-hackers

While preparing a replication test setup with 9.0beta1 I noticed strange
page allocation patterns which Andrew Gierth found interesting enough to
report here.

I've written a simple tool to generate traffic on a database [1], which
did about 30 TX/inserts per second to a table. Upon inspecting the data
in the table, I noticed the expected grouping of tuples which came from
a single backend to matching pages [2]. The strange part was that the
pages weren't completely filled but the backends seemed to jump
arbitrarily from one page to the next [3]. For the table in question
this resulted in about 10% wasted space.

After issuing a VACUUM on the table the free space map got updated (or
initialized?) and the backends used the remaining space in the pages,
though the spurious page allocation continued.
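
For anyone who wants to look at what the free space map actually knows
about the table, something along these lines works (a rough sketch,
assuming the pg_freespacemap contrib module is installed; "bid" is the
insert-only table from my test schema):

    -- per-block free space as currently recorded in the FSM
    SELECT * FROM pg_freespace('bid');

    -- after a manual VACUUM the partially filled pages show up with avail > 0
    VACUUM bid;
    SELECT * FROM pg_freespace('bid') WHERE avail > 0;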

best regards,
Michael

[1] https://workbench.amd.co.at/hg/pgworkshop/file/dc5ab49c99bb/pgexerciser

[2] E.g.:

(0,1) TX1
(0,2) TX5
(0,3) TX7
..
(1,1) TX2
(1,2) TX6
(1,3) TX9

etc.

[3] http://nopaste.narf.at/show/55/
Optimal usage seems to be 136 tuples per page for the table in question.
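
If someone wants to recompute the per-page counts without the pastebin,
the distribution can be pulled straight from the tuple IDs; a small
sketch, using the bid table from my schema:

    -- tuples per heap page, derived from the ctid
    SELECT (ctid::text::point)[0]::int AS block,
           count(*)                    AS tuples
    FROM   bid
    GROUP  BY 1
    ORDER  BY 1;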


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Michael Renner <michael(dot)renner(at)amd(dot)co(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unexpected page allocation behavior on insert-only tables
Date: 2010-05-16 00:16:47
Message-ID: 20352.1273969007@sss.pgh.pa.us
Lists: pgsql-hackers

Michael Renner <michael(dot)renner(at)amd(dot)co(dot)at> writes:
> I've written a simple tool to generate traffic on a database [1], which
> did about 30 TX/inserts per second to a table. Upon inspecting the data
> in the table, I noticed the expected grouping of tuples which came from
> a single backend to matching pages [2]. The strange part was that the
> pages weren't completely filled but the backends seemed to jump
> arbitrarily from one page to the next [3]. For the table in question
> this resulted in about 10% wasted space.

Which table would that be? The trigger-driven updates to "auction",
in particular, would certainly guarantee some amount of "wasted" space.

regards, tom lane


From: Michael Renner <michael(dot)renner(at)amd(dot)co(dot)at>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Unexpected page allocation behavior on insert-only tables
Date: 2010-05-16 00:24:36
Message-ID: 4BEF3B44.5010402@amd.co.at
Lists: pgsql-hackers

On 16.05.2010 02:16, Tom Lane wrote:
> Michael Renner<michael(dot)renner(at)amd(dot)co(dot)at> writes:
>> I've written a simple tool to generate traffic on a database [1], which
>> did about 30 TX/inserts per second to a table. Upon inspecting the data
>> in the table, I noticed the expected grouping of tuples which came from
>> a single backend to matching pages [2]. The strange part was that the
>> pages weren't completely filled but the backends seemed to jump
>> arbitrarily from one page to the next [3]. For the table in question
>> this resulted in about 10% wasted space.
>
> Which table would that be? The trigger-driven updates to "auction",
> in particular, would certainly guarantee some amount of "wasted" space.

Yeah, the auction table receives heavy updates and gets vacuumed regularly.

The behavior I showed was for the "bid" table, which only gets inserts
(and triggers the updates for the auction table).

best regards,
Michael


From: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
To: Michael Renner <michael(dot)renner(at)amd(dot)co(dot)at>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unexpected page allocation behavior on insert-only tables
Date: 2010-05-17 20:14:45
Message-ID: 1274127000-sup-992@alvh.no-ip.org
Lists: pgsql-hackers

Excerpts from Michael Renner's message of Sat May 15 20:24:36 -0400 2010:
> On 16.05.2010 02:16, Tom Lane wrote:
> > Michael Renner<michael(dot)renner(at)amd(dot)co(dot)at> writes:
> >> I've written a simple tool to generate traffic on a database [1], which
> >> did about 30 TX/inserts per second to a table. Upon inspecting the data
> >> in the table, I noticed the expected grouping of tuples which came from
> >> a single backend to matching pages [2]. The strange part was that the
> >> pages weren't completely filled but the backends seemed to jump
> >> arbitrarily from one page to the next [3]. For the table in question
> >> this resulted in about 10% wasted space.
> >
> > Which table would that be? The trigger-driven updates to "auction",
> > in particular, would certainly guarantee some amount of "wasted" space.
>
> Yeah, the auction table receives heavy updates and gets vacuumed regularly.
>
> The behavior I showed was for the "bid" table, which only gets inserts
> (and triggers the updates for the auction table).

I think this may be related to the smgr_targblock stuff; if the relcache
entry gets invalidated at the wrong time for whatever reason, the
"current page" could be abandoned in favor of extending the rel. This
has changed since 8.4; a quick perusal suggests it should be less
likely on 9.0 than on 8.4, but maybe there's something weird going on.
--


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc: Michael Renner <michael(dot)renner(at)amd(dot)co(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unexpected page allocation behavior on insert-only tables
Date: 2010-05-31 02:42:25
Message-ID: 24230.1275273745@sss.pgh.pa.us
Lists: pgsql-hackers

Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> writes:
> Excerpts from Michael Renner's message of Sat May 15 20:24:36 -0400 2010:
>>> I've written a simple tool to generate traffic on a database [1], which
>>> did about 30 TX/inserts per second to a table. Upon inspecting the data
>>> in the table, I noticed the expected grouping of tuples which came from
>>> a single backend to matching pages [2]. The strange part was that the
>>> pages weren't completely filled but the backends seemed to jump
>>> arbitrarily from one page to the next [3]. For the table in question
>>> this resulted in about 10% wasted space.

> I think this may be related to the smgr_targblock stuff; if the relcache
> entry gets invalidated at the wrong time for whatever reason, the
> "current page" could be abandoned in favor of extending the rel. This
> has changed since 8.4; a quick perusal suggests it should be less
> likely on 9.0 than on 8.4, but maybe there's something weird going on.

I found time to try this example finally. The behavior that I see in
HEAD is even worse than Michael describes: there is room for 136 rows
per block in the bid table, but most blocks have only a few rows. The
distribution after letting the exerciser run for 500 bids or so is
typically like this:

#rows   block#
  136        0
    6        1
    5        2
    4        3
    3        4
    5        5
    3        6
    1        7
    4        8
    4        9
  136       10
    6       11
    7       12
    9       13
    9       14
    7       15
    9       16
    7       17
    8       18
    5       19
  136       20
    2       21
    4       22
    4       23
    3       24
    5       25
    3       26
    4       27
    3       28
    2       29
    1       30

Examining the insertion timestamps and bidder numbers (client process
IDs), and correlating this with logged autovacuum activity, makes it
pretty clear what is going on. See the logic in
RelationGetBufferForTuple, and note that at no time do we have any FSM
data for the bid table:

1. Initially, all backends will decide to insert into block 0. They do
so until the block is full.

2. At that point, each active backend individually decides it needs to
extend the relation. They each create a new block and start inserting
into that one, each carefully not telling anyone else about the block
so as to avoid block-level insertion contention. In the above diagram,
blocks 1-9 are each created by a different backend and the rows inserted
into it come (mostly?) from just one backend. Block 10's first few rows
also come from the one backend that created it, but it doesn't manage to
fill the block entirely before ...

3. After awhile, autovacuum notices all the insert activity and kicks
off an autoanalyze on the bid table. When committed, this forces a
relcache flush for each other backend's relcache entry for "bid".
In particular, the smgr targblock gets reset.

4. Now, all the backends again decide to try to insert into the last
available block. So everybody jams into the partly-filled block 10,
until it gets filled.

5. Lather, rinse, repeat. Since there are exactly 10 active clients
(by default) in this test program, the repeat distance is exactly 10
blocks.
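
The same pattern is visible from plain SQL, without digging through the
attached exhibits; a sketch only -- the timestamp and bidder column
names below are placeholders, since the pgexerciser schema isn't
reproduced in this thread:

    -- tuples per block, how many distinct bidders wrote into it, and when
    -- ("inserted_at" and "bidder" are hypothetical column names)
    SELECT (ctid::text::point)[0]::int AS block,
           count(*)                    AS tuples,
           count(DISTINCT bidder)      AS distinct_bidders,
           min(inserted_at)            AS first_insert,
           max(inserted_at)            AS last_insert
    FROM   bid
    GROUP  BY 1
    ORDER  BY 1;

Blocks 1-9 should each show essentially a single bidder, while blocks
0, 10, 20, ... show all of them.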

The obvious thing to do about this would be to not reset targblock
on receipt of a relcache flush event, but we can *not* do that in the
general case. The reason that that gets reset is so that it's not
left pointing to a no-longer-existent block after a VACUUM truncation.
Maybe we could develop a way to distinguish truncation events from
others, but right now the sinval signaling mechanism can't do that.
This looks like there might be sufficient grounds to do something,
though.

Attached exhibits: contents of relevant columns of the bid table
and postmaster log entries for autovacuum actions during the run.

regards, tom lane

Attachment Content-Type Size
unknown_filename text/plain 27.0 KB
unknown_filename text/plain 1.9 KB

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Renner <michael(dot)renner(at)amd(dot)co(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unexpected page allocation behavior on insert-only tables
Date: 2010-05-31 03:44:02
Message-ID: AANLkTill9sP47NQyVuuEz4UICUcfY1aQJrDUXPVmoA1t@mail.gmail.com
Lists: pgsql-hackers

On Sun, May 30, 2010 at 10:42 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> pretty clear what is going on.  See the logic in
> RelationGetBufferForTuple, and note that at no time do we have any FSM
> data for the bid table:

Is this because, in the absence of updates or deletes, we never vacuum it?

> 4. Now, all the backends again decide to try to insert into the last
> available block.  So everybody jams into the partly-filled block 10,
> until it gets filled.

Would it be (a) feasible and (b) useful to inject some entropy into this step?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Takahiro Itagaki <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Renner <michael(dot)renner(at)amd(dot)co(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unexpected page allocation behavior on insert-only tables
Date: 2010-05-31 03:46:40
Message-ID: 20100531124640.F2AE.52131E4D@oss.ntt.co.jp
Lists: pgsql-hackers


Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> 3. After awhile, autovacuum notices all the insert activity and kicks
> off an autoanalyze on the bid table. When committed, this forces a
> relcache flush for each other backend's relcache entry for "bid".
> In particular, the smgr targblock gets reset.
>
> 4. Now, all the backends again decide to try to insert into the last
> available block. So everybody jams into the partly-filled block 10,
> until it gets filled.

The autovacuum process runs only analyze at step 3, but never vacuum,
because the workload is insert-only; partially filled pages are never
tracked by the free space map. We could trigger a vacuum when the
autoanalyze report says the table is low-density, but the additional
vacuum might be pure overhead in other cases.
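
The asymmetry is easy to confirm from the stats views; roughly (table
name taken from the test case):

    -- insert-only table: autoanalyze fires, autovacuum never does
    SELECT relname, n_tup_ins, n_tup_upd, n_tup_del,
           last_autovacuum, last_autoanalyze
    FROM   pg_stat_user_tables
    WHERE  relname = 'bid';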

> The obvious thing to do about this would be to not reset targblock
> on receipt of a relcache flush event

Even if we don't reset targblock, can we solve the issue when clients
connect and disconnect for each insert? New backends only check the end
of the table and extend it, just as in this case. If we are worried
about that worst case, we might need to track newly added pages in the
free space map. Of course, we could ignore the case, because frequent
connections and disconnections should always be avoided anyway.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Renner <michael(dot)renner(at)amd(dot)co(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unexpected page allocation behavior on insert-only tables
Date: 2010-05-31 04:03:59
Message-ID: 25101.1275278639@sss.pgh.pa.us
Lists: pgsql-hackers

Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Sun, May 30, 2010 at 10:42 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>> pretty clear what is going on. See the logic in
>> RelationGetBufferForTuple, and note that at no time do we have any FSM
>> data for the bid table:

> Is this because, in the absence of updates or deletes, we never vacuum it?

Right.

>> 4. Now, all the backends again decide to try to insert into the last
>> available block. So everybody jams into the partly-filled block 10,
>> until it gets filled.

> Would it be (a) feasible and (b) useful to inject some entropy into this step?

Maybe, but at least in this case, the insert rate is not fast enough
that contention for the block is worth worrying about. IMO this isn't
the part of the cycle that needs fixing.

I guess another path to a fix might be to allow the backends to record
new pages in the FSM immediately at creation. That might result in more
insert contention, but it'd avoid losing track of the free space
permanently, which is what is happening here (unless something happens
to cause a vacuum). One reason the current code doesn't do that is that
the old in-memory FSM couldn't efficiently support retail insertion of
single-page data, but the new FSM code hasn't got a problem with that.

regards, tom lane


From: Greg Stark <gsstark(at)mit(dot)edu>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Renner <michael(dot)renner(at)amd(dot)co(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unexpected page allocation behavior on insert-only tables
Date: 2010-05-31 05:09:49
Message-ID: AANLkTilTyP3TUBCypf8kspiXpUv1op0eLS-va-meXEBj@mail.gmail.com
Lists: pgsql-hackers

On Mon, May 31, 2010 at 3:42 AM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> note that at no time do we have any FSM
> data for the bid table:
>
>
> 3. After awhile, autovacuum notices all the insert activity and kicks
> off an autoanalyze on the bid table.  When committed, this forces a
> relcache flush for each other backend's relcache entry for "bid".
> In particular, the smgr targblock gets reset.

This is an analyze-only scan? Why does analyze need to issue a
relcache flush? Maybe we only need to issue one for an actual vacuum,
which would also populate the FSM?

--
greg


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>
Cc: Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Renner <michael(dot)renner(at)amd(dot)co(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unexpected page allocation behavior on insert-only tables
Date: 2010-05-31 05:23:37
Message-ID: 26120.1275283417@sss.pgh.pa.us
Lists: pgsql-hackers

Greg Stark <gsstark(at)mit(dot)edu> writes:
> This is an analyze-only scan? Why does analyze need to issue a
> relcache flush?

Directly: to cause other backends to pick up the updated pg_class row
(with new relpages/reltuples data).

Indirectly: to cause cached plans for the rel to be invalidated,
so that they can get replanned with updated pg_statistic entries.

So we can't just not have a relcache flush here. However, we
might be able to decouple targblock reset from the rest of it.
In particular, now that there's a distinction between smgr flush
and relcache flush, maybe we could associate targblock reset with
smgr flush (only) and arrange to not flush the smgr level during
ANALYZE --- basically, smgr flush would only be needed when truncating
or reassigning the relfilenode. I think this might work out nicely but
haven't chased the details.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Greg Stark <gsstark(at)mit(dot)edu>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Renner <michael(dot)renner(at)amd(dot)co(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unexpected page allocation behavior on insert-only tables
Date: 2010-05-31 20:47:39
Message-ID: 19116.1275338859@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> In particular, now that there's a distinction between smgr flush
> and relcache flush, maybe we could associate targblock reset with
> smgr flush (only) and arrange to not flush the smgr level during
> ANALYZE --- basically, smgr flush would only be needed when truncating
> or reassigning the relfilenode. I think this might work out nicely but
> haven't chased the details.

I looked into that a bit more and decided that it'd be a ticklish
change: the coupling between relcache and smgr cache is pretty tight,
and there just isn't any provision for having an smgr cache entry live
longer than its owning relcache entry. Even if we could fix it to
work reliably, this approach does nothing for the case where a backend
actually exits after filling just part of a new page, as noted by
Takahiro-san.

The next most promising fix is to have RelationGetBufferForTuple tell
the FSM about the new page immediately on creation. I made a draft
patch for that (attached). It fixes Michael's scenario nicely ---
all pages get filled completely --- and a simple test with pgbench
didn't reveal any obvious change in performance. However there is
clear *potential* for performance loss, due to both the extra FSM
access and the potential for increased contention because of multiple
backends piling into the same new page. So it would be good to do
some real performance testing on insert-heavy scenarios before we
consider applying this. Any volunteers?
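
A quick sanity check for whoever runs the tests: with the patch
applied, the per-block counts for bid should all come out at (or very
near) the 136-tuple maximum. Something like this summarizes it; a
sketch, reusing the ctid trick from earlier in the thread:

    -- min/avg/max tuples per heap page; only the block currently being
    -- filled should drag the minimum down
    SELECT min(tuples), round(avg(tuples), 1), max(tuples)
    FROM  (SELECT (ctid::text::point)[0]::int AS block, count(*) AS tuples
           FROM bid GROUP BY 1) AS per_block;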

Note: patch is against HEAD but should work in 8.4, if you reverse out
the use of the rd_targblock access macros.

regards, tom lane

Attachment Content-Type Size
use-fsm-for-new-page.patch text/x-patch 2.6 KB

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Renner <michael(dot)renner(at)amd(dot)co(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unexpected page allocation behavior on insert-only tables
Date: 2011-02-05 03:10:48
Message-ID: 201102050310.p153AmQ16713@momjian.us
Lists: pgsql-hackers

Tom Lane wrote:
> I wrote:
> > In particular, now that there's a distinction between smgr flush
> > and relcache flush, maybe we could associate targblock reset with
> > smgr flush (only) and arrange to not flush the smgr level during
> > ANALYZE --- basically, smgr flush would only be needed when truncating
> > or reassigning the relfilenode. I think this might work out nicely but
> > haven't chased the details.
>
> I looked into that a bit more and decided that it'd be a ticklish
> change: the coupling between relcache and smgr cache is pretty tight,
> and there just isn't any provision for having an smgr cache entry live
> longer than its owning relcache entry. Even if we could fix it to
> work reliably, this approach does nothing for the case where a backend
> actually exits after filling just part of a new page, as noted by
> Takahiro-san.
>
> The next most promising fix is to have RelationGetBufferForTuple tell
> the FSM about the new page immediately on creation. I made a draft
> patch for that (attached). It fixes Michael's scenario nicely ---
> all pages get filled completely --- and a simple test with pgbench
> didn't reveal any obvious change in performance. However there is
> clear *potential* for performance loss, due to both the extra FSM
> access and the potential for increased contention because of multiple
> backends piling into the same new page. So it would be good to do
> some real performance testing on insert-heavy scenarios before we
> consider applying this. Any volunteers?
>
> Note: patch is against HEAD but should work in 8.4, if you reverse out
> the use of the rd_targblock access macros.

Is this something we want to address or should I just add it to the
TODO?

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +


From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Greg Stark <gsstark(at)mit(dot)edu>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Michael Renner <michael(dot)renner(at)amd(dot)co(dot)at>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Unexpected page allocation behavior on insert-only tables
Date: 2011-02-17 03:22:25
Message-ID: 201102170322.p1H3MPE07664@momjian.us
Lists: pgsql-hackers

Tom Lane wrote:
> I wrote:
> > In particular, now that there's a distinction between smgr flush
> > and relcache flush, maybe we could associate targblock reset with
> > smgr flush (only) and arrange to not flush the smgr level during
> > ANALYZE --- basically, smgr flush would only be needed when truncating
> > or reassigning the relfilenode. I think this might work out nicely but
> > haven't chased the details.
>
> I looked into that a bit more and decided that it'd be a ticklish
> change: the coupling between relcache and smgr cache is pretty tight,
> and there just isn't any provision for having an smgr cache entry live
> longer than its owning relcache entry. Even if we could fix it to
> work reliably, this approach does nothing for the case where a backend
> actually exits after filling just part of a new page, as noted by
> Takahiro-san.
>
> The next most promising fix is to have RelationGetBufferForTuple tell
> the FSM about the new page immediately on creation. I made a draft
> patch for that (attached). It fixes Michael's scenario nicely ---
> all pages get filled completely --- and a simple test with pgbench
> didn't reveal any obvious change in performance. However there is
> clear *potential* for performance loss, due to both the extra FSM
> access and the potential for increased contention because of multiple
> backends piling into the same new page. So it would be good to do
> some real performance testing on insert-heavy scenarios before we
> consider applying this. Any volunteers?

I have added this TODO:

Allow concurrent inserts to use recently created pages rather than
creating new ones

* http://archives.postgresql.org/pgsql-hackers/2010-05/msg00853.php

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ It's impossible for everything to be true. +