From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: New FSM allocation policy
Date: 2008-08-29 07:40:23
Message-ID: 48B7A7E7.2040206@enterprisedb.com
Lists: pgsql-hackers

The way my rewritten FSM works at the moment is that pages are handed
out of the FSM in random order, to spread the load of multiple backends
across different pages. That's good for spreading the load, but while
observing a test case that inserts a lot of rows into an almost-empty
table with plenty of free space, it occurred to me that it's bad for a
single backend populating a table. Currently, the FSM hands out pages
in sequential order, so from the OS's point of view we're reading and
writing the pages sequentially. If the pages are handed out in random
order, we do random I/O instead.

Fortunately there's an easy fix for that. If we optimize
RecordAndGetPageWithFreeSpace so that it always returns the next page
when that page has enough space, we'll be doing sequential I/O again.
That's trivial as long as the next heap page is on the same FSM page,
and probably not too hard even if it's not. If we limit this
optimization to within the same FSM page, we'll effectively be filling
the table in 32 MB stripes.
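
Roughly like this (just a sketch of the idea, not actual patch code;
GetRecordedFreeSpace stands in here for a cheap FSM probe):

BlockNumber
RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,
                              Size oldSpaceAvail, Size spaceNeeded)
{
    BlockNumber nextPage = oldPage + 1;

    /* Record the now-known free space on the page we just filled. */
    RecordPageWithFreeSpace(rel, oldPage, oldSpaceAvail);

    /*
     * Prefer the physically next heap page if it has enough space,
     * so a single bulk-loading backend keeps doing sequential I/O.
     * Restricting the check to the same FSM page would keep this to
     * a single FSM page access.
     */
    if (GetRecordedFreeSpace(rel, nextPage) >= spaceNeeded)
        return nextPage;

    /* Otherwise fall back to the normal (randomized) FSM search. */
    return GetPageWithFreeSpace(rel, spaceNeeded);
}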

Thoughts?

I'm running more tests, and there's more issues that need discussion,
but I'll start separate threads for each. I'll also post an updated
patch separately.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New FSM allocation policy
Date: 2008-08-29 09:26:44
Message-ID: 871w082muj.fsf@oxford.xeocode.com
Lists: pgsql-hackers


Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:

> Fortunately there's an easy fix for that. If we optimize
> RecordAndGetPageWithFreeSpace so that it always returns the next page
> when that page has enough space, we'll be doing sequential I/O again.
> That's trivial as long as the next heap page is on the same FSM page,
> and probably not too hard even if it's not. If we limit this
> optimization to within the same FSM page, we'll effectively be filling
> the table in 32 MB stripes.

Starting from an arbitrary block, that would be a 16 MB stripe on
average, since the starting point is uniformly distributed within the
32 MB an FSM page covers.

One idea: we could scan the rest of the current FSM page and use the
first match.

Another: given the way your tree structure works, you could also
descend the tree with a "target" page, finding the first page after the
target with enough free space, if there is one. (Take the left branch
if it covers pages beyond the target and has enough free space; else
take the right branch if it has enough free space; else take the left
branch.)
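
To make that concrete, here's a toy standalone model of the descent
(the array layout and all names are illustrative only, not the actual
patch):

#include <stdio.h>
#include <stdint.h>

/*
 * Toy model of one FSM page: a perfect binary tree stored in an array,
 * where each inner node holds the maximum free space of its subtree
 * and the leaves hold per-heap-page free space.
 */
#define LEAVES 8                        /* must be a power of two */

static uint8_t tree[2 * LEAVES - 1];

/* Recompute the inner nodes from the leaves. */
static void
rebuild(void)
{
    for (int i = LEAVES - 2; i >= 0; i--)
    {
        uint8_t l = tree[2 * i + 1];
        uint8_t r = tree[2 * i + 2];
        tree[i] = (l > r) ? l : r;
    }
}

/*
 * Return the lowest leaf >= target with at least "needed" free space,
 * or -1 if there is none.  lo..hi is the leaf range under "node".
 */
static int
search_from(int node, int lo, int hi, int target, uint8_t needed)
{
    if (hi < target || tree[node] < needed)
        return -1;      /* subtree entirely before target, or too small */
    if (lo == hi)
        return lo;      /* a heap page with enough space */

    int mid = (lo + hi) / 2;
    int found = search_from(2 * node + 1, lo, mid, target, needed);
    if (found < 0)
        found = search_from(2 * node + 2, mid + 1, hi, target, needed);
    return found;
}

int
main(void)
{
    uint8_t space[LEAVES] = {3, 0, 5, 1, 0, 7, 2, 4};

    for (int i = 0; i < LEAVES; i++)
        tree[LEAVES - 1 + i] = space[i];
    rebuild();

    /* First page at or after page 3 with at least 4 units free: page 5. */
    int page = search_from(0, 0, LEAVES - 1, 3, 4);
    if (page < 0)       /* nothing after the target: wrap around */
        page = search_from(0, 0, LEAVES - 1, 0, 4);
    printf("allocated page %d\n", page);
    return 0;
}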

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New FSM allocation policy
Date: 2008-08-29 15:15:02
Message-ID: 6280.1220022902@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> One idea: we could scan the rest of the current FSM page and use the
> first match.

> Another: given the way your tree structure works, you could also
> descend the tree with a "target" page, finding the first page after the
> target with enough free space, if there is one. (Take the left branch
> if it covers pages beyond the target and has enough free space; else
> take the right branch if it has enough free space; else take the left
> branch.)

I think the trick here is how to also preserve the property that
different backends tend to be inserting into different pages. There may
be no very good way to do that without maintaining some shared state,
ie the last page handed out.

However, it would probably be sufficient to do that for some pretty
small number of tables, allowing a fixed-size shared memory area to be
sufficient.
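
Something shaped like the synchronized-seqscan hint array, say (all
names hypothetical, just to show the idea):

#define FSM_HINT_ENTRIES 64     /* fixed size, independent of table count */

typedef unsigned int Oid;       /* stand-ins for the usual typedefs */
typedef unsigned int BlockNumber;

typedef struct FsmInsertHint
{
    Oid         relid;          /* table this hint is for */
    BlockNumber lastPage;       /* last page handed out for inserts */
} FsmInsertHint;

/*
 * Lives in shared memory.  The entries are only hints, so two tables
 * hashing to the same slot just clobber each other's hint, which is
 * harmless: the FSM search still works, only less sequentially.
 */
static FsmInsertHint fsmInsertHints[FSM_HINT_ENTRIES];

static FsmInsertHint *
fsm_hint_for(Oid relid)
{
    return &fsmInsertHints[relid % FSM_HINT_ENTRIES];
}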

regards, tom lane


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New FSM allocation policy
Date: 2008-08-29 15:55:11
Message-ID: 48B81BDF.3060309@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
> Gregory Stark <stark(at)enterprisedb(dot)com> writes:
>> One idea: we could scan the rest of the current FSM page and use the
>> first match.
>
>> Another: given the way your tree structure works, you could also
>> descend the tree with a "target" page, finding the first page after
>> the target with enough free space, if there is one. (Take the left
>> branch if it covers pages beyond the target and has enough free space;
>> else take the right branch if it has enough free space; else take the
>> left branch.)
>
> I think the trick here is how to also preserve the property that
> different backends tend to be inserting into different pages.

Yep. If we just always look at the next page, there's the danger of
having multiple backends compete for the same pages again.

> There may
> be no very good way to do that without maintaining some shared state,
> ie the last page handed out.
>
> However, it would probably be sufficient to do that for some pretty
> small number of tables, allowing a fixed-size shared memory area to be
> sufficient.

... similar to how we handle synchronized seqscans. Yep, that's one
option. I wish we could get rid of the shared memory area and lock
altogether, though.

We could put an extra field on the FSM root page. Hmm, it doesn't
actually need to be a single global field, so we could have a field like
that on every FSM page, for better concurrency. So the algorithm for the
set+search operation becomes:

1. Lock FSM page for the old heap page X
2. Set the value for X
3. See if the page pointed to by (fsmpage->nextPtr++) has enough free
space. If it does, return that.
4. Otherwise, start searching from root, with random traversal.
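
In pseudo-C, with a hypothetical nextPtr field in the FSM page header
(every name below is made up for illustration; updating the upper
levels of the binary tree after step 2 is omitted):

BlockNumber
fsm_set_and_search(FSMPageHeader *fsmpage, int slotX, uint8 newValueX,
                   uint8 needed)
{
    BlockNumber result = InvalidBlockNumber;
    int         candidate;

    LockFSMPage(fsmpage);                  /* 1. lock FSM page for X */
    fsmpage->slots[slotX] = newValueX;     /* 2. set the value for X */

    /*
     * 3. Try the page the per-FSM-page hint points at.  Advancing the
     * hint even on success spreads concurrent backends onto different
     * pages.  It's only a hint, so it is never WAL-logged; if it's
     * stale or lost we simply fall back to a search.
     */
    candidate = fsmpage->nextPtr++ % FSM_SLOTS_PER_PAGE;
    if (fsmpage->slots[candidate] >= needed)
        result = fsm_slot_to_heap_block(fsmpage, candidate);

    UnlockFSMPage(fsmpage);
    return result;    /* 4. if invalid, caller searches from the root */
}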

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New FSM allocation policy
Date: 2008-08-29 16:09:39
Message-ID: 7138.1220026179@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> There may
>> be no very good way to do that without maintaining some shared state,
>> ie the last page handed out.

> We could put an extra field on the FSM root page. Hmm, it doesn't
> actually need to be a single global field, so we could have a field like
> that on every FSM page, for better concurrency.

Yeah, that would work, and it is only a hint so you needn't WAL-log
changing it.

regards, tom lane


From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Gregory Stark <stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New FSM allocation policy
Date: 2008-09-01 16:27:06
Message-ID: 1220286426.4371.176.camel@ebony.2ndQuadrant
Lists: pgsql-hackers


On Fri, 2008-08-29 at 18:55 +0300, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> >> One idea: we could scan the rest of the current FSM page and use
> >> the first match.
> >
> >> Another: given the way your tree structure works, you could also
> >> descend the tree with a "target" page, finding the first page after
> >> the target with enough free space, if there is one. (Take the left
> >> branch if it covers pages beyond the target and has enough free
> >> space; else take the right branch if it has enough free space; else
> >> take the left branch.)
> >
> > I think the trick here is how to also preserve the property that
> > different backends tend to be inserting into different pages.
>
> Yep. If we just always look at the next page, there's the danger of
> having multiple backends compete for the same pages again.

Can the FSM hand out page ranges? That way we could use the next-page
logic without fear of competition.
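
Something like this toy sketch, say (nothing like this is in the
patch):

#include <stdint.h>

/*
 * Find the first page with enough free space and extend it to a run of
 * up to "want" consecutive such pages, which one backend could then
 * claim and fill sequentially.  Returns the first page of the run and
 * sets *run_len, or returns -1 if no page qualifies.
 */
static int
find_free_run(const uint8_t *space, int npages, uint8_t needed,
              int want, int *run_len)
{
    for (int i = 0; i < npages; i++)
    {
        if (space[i] < needed)
            continue;

        /* Found a starting page; extend the run up to "want" pages. */
        int len = 1;
        while (len < want && i + len < npages && space[i + len] >= needed)
            len++;

        *run_len = len;
        return i;
    }

    *run_len = 0;
    return -1;
}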

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Gregory Stark <stark(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New FSM allocation policy
Date: 2008-09-05 08:46:28
Message-ID: 48C0F1E4.5000201@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
> Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com> writes:
>> Tom Lane wrote:
>>> There may
>>> be no very good way to do that without maintaining some shared state,
>>> ie the last page handed out.
>
>> We could put an extra field on the FSM root page. Hmm, it doesn't
>> actually need to be a single global field, so we could have a field like
>> that on every FSM page, for better concurrency.
>
> Yeah, that would work, and it is only a hint so you needn't WAL-log
> changing it.

Done. That seems to have helped a lot with the case of two concurrent
backends bulk-inserting into a table.

I'm now happy with the bulk-insert performance and behavior. I've been
using the attached test case to measure it. It uses pgbench to measure
the speed of bulk-inserting one million rows, with these variations:
- one client, starting with a freshly created table (i.e. the FSM is
empty and the table is extended)
- same, but with two clients
- one client, starting with a pre-existing but empty table, so that all
requests for free space are satisfied by the FSM and the table is not
extended
- same, but with two clients

Running that test on my laptop, with the data directory on a RAM drive
to measure just the CPU overhead and concurrency, the patched version
now looks to be on par with the current implementation. Judging by the
pattern in which pages are filled, observed by polling
pg_freespace('tmp') while the test runs, the pages are filled
sequentially, with the target pages of the two backends interleaved;
that's as expected, and it emulates the behavior of the current
implementation. I haven't tried larger, I/O-bound tests yet, but
because the access pattern is now the same as without the patch, I'd
expect performance to be comparable.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
fsm-lazy-2.patch.gz application/x-gzip 42.8 KB
populatebench.sh application/x-sh 1.0 KB

From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New FSM allocation policy
Date: 2008-09-06 02:43:30
Message-ID: 200809060243.m862hUq12489@momjian.us
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> The way my rewritten FSM works at the moment is that pages are handed
> out of the FSM in random order, to spread the load of multiple backends
> across different pages. That's good for spreading the load, but while
> observing a test case that inserts a lot of rows into an almost-empty
> table with plenty of free space, it occurred to me that it's bad for a
> single backend populating a table. Currently, the FSM hands out pages
> in sequential order, so from the OS's point of view we're reading and
> writing the pages sequentially. If the pages are handed out in random
> order, we do random I/O instead.
>
> Fortunately there's an easy fix for that. If we optimize
> RecordAndGetPageWithFreeSpace so that it always returns the next page
> when that page has enough space, we'll be doing sequential I/O again.
> That's trivial as long as the next heap page is on the same FSM page,
> and probably not too hard even if it's not. If we limit this
> optimization to within the same FSM page, we'll effectively be filling
> the table in 32 MB stripes.
>
> Thoughts?
>
> I'm running more tests, and there's more issues that need discussion,
> but I'll start separate threads for each. I'll also post an updated
> patch separately.

One other thing to keep in mind is that VACUUM can reduce a table's size
if the trailing blocks are empty, so there is some gain if the earlier
parts of the table are preferred for inserts.

--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +


From: Decibel! <decibel(at)decibel(dot)org>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New FSM allocation policy
Date: 2008-09-13 00:07:59
Message-ID: 7A8411CA-CCD3-4F27-93A6-3E995C12BB58@decibel.org
Lists: pgsql-hackers

On Sep 5, 2008, at 9:43 PM, Bruce Momjian wrote:
>> Fortunately there's an easy fix for that. If we optimize
>> RecordAndGetPageWithFreeSpace so that it always returns the next page
>> when that page has enough space, we'll be doing sequential I/O again.
>> That's trivial as long as the next heap page is on the same FSM page,
>> and probably not too hard even if it's not. If we limit this
>> optimization to within the same FSM page, we'll effectively be filling
>> the table in 32 MB stripes.
>>
>> Thoughts?
>>
>> I'm running more tests, and there's more issues that need discussion,
>> but I'll start separate threads for each. I'll also post an updated
>> patch separately.
>
> One other thing to keep in mind is that VACUUM can reduce a table's
> size if the trailing blocks are empty, so there is some gain if the
> earlier parts of the table are preferred for inserts.

Yeah; I would actually really, really like to see a mode you could set
on a table that says "I want to try to shrink this table". One of the
things that would mean is that the FSM should prefer pages at the
beginning of the heap.

Also related to this is the idea of asking the FSM for pages within a
specific range, so that you can try to maintain cluster order on a
table. You would look in the clustering index for the value closest to
your key, see where it is in the heap, and then ask for a page in that
neighborhood. (You'd probably want to look at more than just one index
tuple, but you get the idea.)
--
Decibel!, aka Jim C. Nasby, Database Architect decibel(at)decibel(dot)org
Give your computer some brain candy! www.distributed.net Team #1828


From: Heikki Linnakangas <heikki(dot)linnakangas(at)enterprisedb(dot)com>
To: "Decibel!" <decibel(at)decibel(dot)org>
Cc: Bruce Momjian <bruce(at)momjian(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: New FSM allocation policy
Date: 2008-09-13 06:59:59
Message-ID: 48CB64EF.9080406@enterprisedb.com
Lists: pgsql-hackers

Decibel! wrote:
> On Sep 5, 2008, at 9:43 PM, Bruce Momjian wrote:
>> One other thing to keep in mind is that VACUUM can reduce a table's size
>> if the trailing blocks are empty, so there is some gain if the earlier
>> parts of the table are preferred for inserts.
>
> Yeah; I would actually really, really like to see a mode you could set
> on a table that says "I want to try to shrink this table". One of the
> things that would mean is that the FSM should prefer pages at the
> beginning of the heap.

Not sure how exactly that should work, but it should be pretty easy to
do with the new FSM data structure. Perhaps VACUUM should just reset the
"next pointers" as it goes.

> Also related to this is the idea of asking the FSM for pages within a
> specific range, so that you can try to maintain cluster order on a
> table. You would look in the clustering index for the value closest to
> your key, see where it is in the heap, and then ask for a page in that
> neighborhood. (You'd probably want to look at more than just one index
> tuple, but you get the idea.)

The new FSM data structure should make that much easier to implement as
well, since it naturally supports the operation "give me a page with X
bytes of free space, as close as possible to page Y".

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com