Free Space Map data structure

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Free Space Map data structure
Date: 2008-04-08 08:33:20
Message-ID: 47FB2DD0.6030401@enterprisedb.com
Lists: pgsql-hackers

The last thread about Free Space Map evolved into a discussion about
whether the FSM and other kinds of auxiliary data should be stored
within the heap pages, in "map forks", or in auxiliary relfilenodes
attached to the relation. It seems the consensus was to go with the map
forks, but what I really wanted to discuss is the data structure needed
for the Free Space Map.

The FSM data structure needs to support two basic operations:

1. Fast lookup of page with >= X bytes of free space
2. Update of arbitrary, individual pages.

Our current code doesn't support 2, as we always update the FSM in bulk
after vacuum, but we will need that capability to be able to do partial
vacuums in the future.

Additionally, the data structure needs to be "pageable", so that we can
efficiently store it in pages that can be loaded in the buffer cache on
demand, and not require a fixed size shared memory block.

The simplest conceivable data structure is a straight array, where the
nth element is the free space on page n. That's easily pageable, and
provides O(1) lookup/update of an arbitrary page. Unfortunately, it's slow
to search for a page with X bytes of free space in that.

One brilliant idea I had is a binary heap/tree kind of structure, where
each heap page is represented by one leaf node. Each leaf node stores
the amount of free space on the corresponding heap page. Each non-leaf
node stores the max. amount of free space in any of its children. So the
top-most node immediately tells the max. amount of free space on *any*
page, which means that finding out that there's no suitable page is an
O(1) operation, which is good when you're inserting into a full relation.
When you're looking for X bytes, you traverse the tree down a path with
nodes >= X.

For example:

9
4 9
2 4 0 9

The leaf nodes correspond to the heap pages, so page #0 has 2 units of free
space, page #1 has 4, page #2 is full and page #3 has 9.

Let's work through a couple of examples:

To search for a page with 3 bytes of free space, start from the top. We
see that the topmost node has value 9, so we know there is a page
somewhere with enough space. Traverse down the tree, to a child node with
value >= 3. In this case that's true for both children, so we can pick
either one. Traverse down from that node similarly, until we hit the
bottom.

To update the free space on page #1 to 3, you look up the leaf node
corresponding to that page, which is easy if we store the tree in an array.
We update the 4 on that node to 3, and walk up the tree updating the
parents. In this case, we update the parent of that node to 3, and stop,
because the value in the top node is higher than that. The lookup/update is
therefore an O(log N) operation.
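
To make that concrete, here is a minimal Python sketch of the idea (purely
illustrative, not a proposal for the actual implementation; the tree is a
flat array with the internal nodes first and one leaf slot per heap page):

class FsmTree:
    def __init__(self, npages):
        self.n = npages
        # slots 0 .. n-2 are internal nodes, slots n-1 .. 2n-2 are the leaves
        self.nodes = [0] * (2 * npages - 1)

    def update(self, page, free):
        """Set the free space of a heap page and ripple the max upwards."""
        i = self.n - 1 + page
        self.nodes[i] = free
        while i > 0:
            parent = (i - 1) // 2
            new_max = max(self.nodes[2 * parent + 1], self.nodes[2 * parent + 2])
            if self.nodes[parent] == new_max:
                break                            # nothing above this changes
            self.nodes[parent] = new_max
            i = parent

    def search(self, x):
        """Return some heap page with >= x bytes free, or None."""
        if self.nodes[0] < x:                    # root holds the max over *any* page
            return None
        i = 0
        while i < self.n - 1:                    # descend towards a leaf
            left, right = 2 * i + 1, 2 * i + 2
            i = left if self.nodes[left] >= x else right
        return i - (self.n - 1)

Updating pages 0..3 to 2, 4, 0 and 9 produces exactly the example tree
above (nodes == [9, 4, 9, 2, 4, 0, 9]), and search(3) then returns page #1.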

Unfortunately, this data structure isn't easily pageable. It still seems
like a good and simple structure within a page, but we need a way to
scale it.

If we use one byte to store the amount of free space on a page (I think
that's enough granularity), you can fit a tree like that with about 8000
nodes, with 4000 leaf nodes, in one 8k block. That means that we can
address 4000 heap pages, or 32MB of heap, with one FSM page like that.

To go above that, we can introduce upper pages, where the leaf nodes
refer not to heap pages but to other FSM pages. The addressing gets
tricky, though. Because the number of upper pages required depends on
the depth of the tree, we can't just divide heap page number by 4000 to
get to the right FSM page. I think that's solvable, by always storing
the upper pages at predetermined places in the FSM file, but you'll need
a bit of logic to get from heap page # to FSM page #.
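
Just to make the addressing arithmetic concrete, a naive Python sketch
(assuming all upper-level FSM pages sit in one block at the start of the
file and each leaf-level FSM page covers 4000 heap pages; the tricky part
is precisely that the size of that upper block depends on the depth of the
tree, so the real mapping needs more logic than this):

LEAVES_PER_FSM_PAGE = 4000

def naive_fsm_location(heap_page, n_upper_fsm_pages):
    # leaf-level FSM page holding this heap page, and the slot within it
    fsm_page = n_upper_fsm_pages + heap_page // LEAVES_PER_FSM_PAGE
    slot = heap_page % LEAVES_PER_FSM_PAGE
    return fsm_page, slot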

Thoughts? Any other data structures that better fill the bill?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Brendan Jurd" <direvus(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-08 08:54:43
Message-ID: 37ed240d0804080154t4558806bo599ddbffd29e5676@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 8, 2008 at 6:33 PM, Heikki Linnakangas wrote:
> For example:
>
> 9
> 4 9
> 2 4 0 9
>
> The leaf nodes correspond to the heap pages, so page #0 has 2 units of free
> space, page #1 has 4, page #2 is full and page #3 has 9.
>
> Let's work through a couple of examples:
>
> To search for a page with 3 bytes of free space, start from the top. We see
> that the topmost node has value 9, so we know there is a page somewhere with
> enough space.

If I understand your design correctly, this claim isn't true. If the
topmost node reports 9 bytes free, could you not have seven pages each
with 1 byte free, and an eighth page with 2 bytes free?

9
4 5
2 2 2 3
1 1 1 1 1 1 1 2

So you'd actually end up walking two levels down the left hand side of
the tree, discovering not enough space, and then walking two levels
down the right hand side to again discover not enough space.

(Apologies if the above is somehow misguided)

Otherwise seems like a cool structure.

Cheers,
BJ


From: "Dave Page" <dpage(at)pgadmin(dot)org>
To: "Brendan Jurd" <direvus(at)gmail(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-08 08:59:08
Message-ID: 937d27e10804080159i10e9bacckeeb7404ec5374835@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 8, 2008 at 9:54 AM, Brendan Jurd <direvus(at)gmail(dot)com> wrote:
> If I understand your design correctly, this claim isn't true. If the
> topmost node reports 9 bytes free, could you not have seven pages each
> with 1 byte free, and an eighth page with 2 bytes free?
>
> 9
> 4 5
> 2 2 2 3
> 1 1 1 1 1 1 1 2
>
> So you'd actually end up walking two levels down the left hand side of
> the tree, discovering not enough space, and then walking two levels
> down the right hand side to again discover not enough space.

As I read it, each node takes the value of the largest child, not the
sum of the children.

--
Dave Page
EnterpriseDB UK: http://www.enterprisedb.com


From: Hannu Krosing <hannu(at)krosing(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-08 09:26:07
Message-ID: 1207646767.8153.31.camel@huvostro
Lists: pgsql-hackers


On Tue, 2008-04-08 at 09:33 +0100, Heikki Linnakangas wrote:
> The last thread about Free Space Map evolved into a discussion about
> whether the FSM and other kinds of auxiliary data should be stored
> within the heap pages, in "map forks", or in auxiliary relfilenodes
> attached to the relation. It seems the consensus was to go with the map
> forks, but what I really wanted to discuss is the data structure needed
> for the Free Space Map.
>
> The FSM data structure needs to support two basic operations:
>
> 1. Fast lookup of page with >= X bytes of free space
> 2. Update of arbitrary, individual pages.
>
> Our current code doesn't support 2, as we always update the FSM in bulk
> after vacuum, but we will need that capability to be able to do partial
> vacuums in the future.
>
> Additionally, the data structure needs to be "pageable", so that we can
> efficiently store it in pages that can be loaded in the buffer cache on
> demand, and not require a fixed size shared memory block.
>
> The simplest conceivable data structure is a straight array, where the
> nth element is the free space on page n. That's easily pageable, and
> provides O(1) lookup/update of an arbitrary page. Unfortunately, it's slow
> to search for a page with X bytes of free space in that.

> One brilliant idea I had is a binary heap/tree kind of structure, where
> each heap page is represented by one leaf node. Each leaf node stores
> the amount of free space on the corresponding heap page. Each non-leaf
> node stores the max. amount of free space in any of its children. So the
> top-most node immediately tells the max. amount of free space on *any*
> page, which means that finding out that there's no suitable page is an
> O(1) operation, which is good when you're inserting into a full relation.
> When you're looking for X bytes, you traverse the tree down a path with
> nodes >= X.
>
> For example:
>
> 9
> 4 9
> 2 4 0 9
>
> The leaf nodes correspond to the heap pages, so page #0 has 2 units of free
> space, page #1 has 4, page #2 is full and page #3 has 9.
>
> Let's work through a couple of examples:
>
> To search for a page with 3 bytes of free space, start from the top. We
> see that the topmost node has value 9, so we know there is a page
> somewhere with enough space. Traverse down the tree, to a child node with
> value >= 3. In this case that's true for both children, so we can pick
> either one. Traverse down from that node similarly, until we hit the
> bottom.
>
> To update the free space on page #1 to 3, you look up the leaf node
> corresponding to that page, which is easy if we store the tree in an array.
> We update the 4 on that node to 3, and walk up the tree updating the
> parents. In this case, we update the parent of that node to 3, and stop,
> because the value in the top node is higher than that. The lookup/update is
> therefore an O(log N) operation.
>
>
> Unfortunately, this data structure isn't easily pageable. It still seems
> like a good and simple structure within a page, but we need a way to
> scale it.

Also, we should aim at not needing to write anything for "initially full"
pages, so that there will be no lock contention for parallel bulkloads.

That should be easily achieved if we initialize all pages to "full" and
do the tree update only if there is some room (above a certain
threshold) left after doing an insert.

In other words, pages filled by bulkload should not go into the FSM.
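
A sketch of what that check could look like at insert time (the threshold
and names are invented, just to illustrate the idea):

FSM_UPDATE_THRESHOLD = 256          # hypothetical cutoff

def maybe_record_free_space(fsm, page_no, free_after_insert):
    # pages that are still (nearly) full, e.g. freshly bulkloaded ones,
    # never touch the FSM tree at all
    if free_after_insert > FSM_UPDATE_THRESHOLD:
        fsm.update(page_no, free_after_insert)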

> If we use one byte to store the amount of free space on a page (I think
> that's enough granularity), you can fit a tree like that with about 8000
> nodes, with 4000 leaf nodes, in one 8k block. That means that we can
> address 4000 heap pages, or 32MB of heap, with one FSM page like that.
>
> To go above that, we can introduce upper pages, where the leaf nodes
> refer not to heap pages but to other FSM pages. The addressing gets
> tricky, though. Because the number of upper pages required depends on
> the depth of the tree, we can't just divide heap page number by 4000 to
> get to the right FSM page. I think that's solvable, by always storing
> the upper pages at predetermined places in the FSM file, but you'll need
> a bit of logic to get from heap page # to FSM page #.

If we could rely on sparse files, then we could just extend the
tree-in-array idea to pages. Two levels of pages (4000) can cover
256Gb of data: ((8k*8k)/2) leaves * 8kb page size.

The pages actually containing any meaningful data can be computed; the
rest should not be touched.

The same is true for in-page updates - if we have only 4 pages in the heap,
there is no need to update the whole first branch (12 levels); just
update the page and its 2 parents, and when searching, start from that
parent. When adding the 5th page, extend the tree upward and start future
searches from the 3rd-level grandparent.

Probably we could do without sparse files, if we find an efficient way
to compute the "add order" of leaf and parent pages for the above algorithm.

> Thoughts? Any other data structures that better fill the bill?

> --
> Heikki Linnakangas
> EnterpriseDB http://www.enterprisedb.com
>


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-08 10:35:48
Message-ID: 2e78013d0804080335n461a782cx39f9818cd8d47ff7@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 8, 2008 at 2:03 PM, Heikki Linnakangas
<heikki(at)enterprisedb(dot)com> wrote:

> Our current code doesn't support 2, as we always update the FSM in bulk
> after vacuum, but we will need that capability to be able to do partial
> vacuums in the future.
>

+1 for that. This will be extremely useful to update FSM after HOT prune/defrag.
Right now, the page free space is only available for subsequent UPDATEs in
the page.

>
> Thoughts? Any other data structures that better fill the bill?
>

Can we not use bitmaps to track approximate rather than exact free
space ? For example, we can have 8 or 16 buckets of free space.
A page with 1-32 bytes free goes into bucket 1, a page with 33-64 bytes
free goes into bucket 2, and so on. If you want a page with X bytes free,
look into the next larger bucket. (the ranges can vary here)

This would give us O(1) time for FSM updates. To speed up lookup, we
can use hierarchical bitmaps. Let's work through an example for 8 buckets:

At the segment level, we have one FSM page. That can summarize FSM
info for ~8000 heap pages (1 bit per page per bucket). In addition, we
can have a first-level hierarchy in the same page where one bit, say,
summarizes info for 100 pages. So if there is a page with a 'bucket' worth
of free space in the first 100 pages of the segment, the corresponding
bit in the first-level hierarchy will be set. In this case, the first-level
hierarchy would need 80 bits per bucket, i.e. 80 bytes for all 8 buckets.

We can then extend the concept of segments to say, 'extents'. One FSM page
per extent summarizes the FSM info for segments in that extent. With the same
logic as above, we can have ~8000 segments per extent.

And then one FSM page to summarize info for all extents. I guess at that
level we would run out of maximum supported file size anyways.
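
A rough sketch of the segment-level lookup described above (8 buckets, one
bit per page per bucket, plus one summary bit per group of 100 pages; all
names here are invented for illustration):

PAGES_PER_SEGMENT = 8000
GROUP_SIZE = 100
N_GROUPS = PAGES_PER_SEGMENT // GROUP_SIZE

def find_page(page_bits, summary_bits, want_bucket):
    # page_bits[bucket][page] and summary_bits[bucket][group] are 0/1 flags
    for bucket in range(want_bucket, 8):          # this bucket or any larger one
        for group in range(N_GROUPS):
            if not summary_bits[bucket][group]:
                continue                           # no candidate among these 100 pages
            start = group * GROUP_SIZE
            for page in range(start, start + GROUP_SIZE):
                if page_bits[bucket][page]:
                    return page
    return None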

Well, this could be completely orthogonal to suggestions you are seeking,
but nevertheless I had this thought for quite some time. So just wanted
to speak about it :-)

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: Hannu Krosing <hannu(at)krosing(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-08 10:38:58
Message-ID: 1207651138.8153.55.camel@huvostro
Lists: pgsql-hackers


On Tue, 2008-04-08 at 12:26 +0300, Hannu Krosing wrote:

> Probably we could do without sparse files, if we find an efficient way
> to compute the "add order" of leaf and parent pages for above algorithm.

if we always add only the minimal needed set of parents then the order
will look something like

1: 0
2: 1
3: (0-1)
4: 2
5: (0-3)
6: 3
7: (2-3)
8: 4
9: (0-7)
10: 5
11: (4-5)
12: 6
13: (4-7)
14: 7
15: (6-7)

seems pretty regular :)

and is probably easy to expand into pages

------------
Will work on this a little more


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-08 10:50:43
Message-ID: 87d4p062nw.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:

> For example:
>
> 9
> 4 9
> 2 4 0 9

It occurs to me now that it actually wouldn't be so easy to ripple up
changes due to the amount of space *decreasing*. Consider if you reduce
the leaf 9 to an 8. You want to decrease its parent to 8, but first you have to
check that its sibling isn't also 9. Again, when you get to the grandparent you
have to check that the "4" isn't a 9. As long as you're on one page that's
cheap, but if it means paging in another page of the FSM that could be
annoying.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!


From: Hannu Krosing <hannu(at)krosing(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-08 11:55:26
Message-ID: 1207655726.8153.58.camel@huvostro
Lists: pgsql-hackers


On Tue, 2008-04-08 at 13:38 +0300, Hannu Krosing wrote:
> On Tue, 2008-04-08 at 12:26 +0300, Hannu Krosing wrote:
>
> > Probably we could do without sparse files, if we find an efficient way
> > to compute the "add order" of leaf and parent pages for above algorithm.
>
> if we always add only the minimal needed set of parents then the order
> will look something like
>
> 1: 0
> 2: 1
> 3: (0-1)
> 4: 2
> 5: (0-3)
> 6: 3
> 7: (2-3)
> 8: 4
> 9: (0-7)
> 10: 5
> 11: (4-5)
> 12: 6
> 13: (4-7)
> 14: 7
> 15: (6-7)
>
> seems pretty regular :)
>
> and is probably easy to expand into pages
>
> ------------
> Will work on this a little more

index  tree node  binary  mask   mask bits
0      0          00000
1      0-1                0000?  1
2      1          00001
3      0-3                000??  2
4      2          00010
5      2-3                0001?  1
6      3          00011
7      0-7                00???  3
8      4          00100
9      4-5                0010?  1
10     5          00101
11     4-7                001??  2
12     6          00110
13     6-7                0011?  1
14     7          00111
15     0-15               0????  4
16     8          01000
17     8-9                0100?  1
18     9          01001
19     8-11               010??  2
20     10         01010
21     10-11              0101?  1
22     11         01011
23     8-15               01???  3
24     12         01100
25     12-13              0110?  1
26     13         01101
27     12-15              011??  2
28     14         01110
29     14-15              0111?  1
30     15         01111
31     0-31               ?????  5
32     16         10000
...

seems to be a simple pattern

for each internal node row (every second row), take the binary representation
of the next row, start from the end, and replace all zeros up to and including
the rightmost one with mask bits.

mask 011?? means (data and 11100) == 01100
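
For a quick throwaway illustration of that pattern in Python (the range of
leaves covered by each array slot, in the numbering above):

def covered_leaves(i):
    if i % 2 == 0:                               # even slots are the leaves
        return (i // 2, i // 2)
    k = ((i + 1) & -(i + 1)).bit_length() - 1    # trailing zeros of i+1 = mask bits
    lo = ((i + 1) >> (k + 1)) << k
    return (lo, lo + (1 << k) - 1)               # e.g. covered_leaves(11) == (4, 7)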

I will try to work out some experimental/sample code in python for find
and update in the evening.

---------------
Hannu


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-08 16:34:02
Message-ID: 2597.1207672442@sss.pgh.pa.us
Lists: pgsql-hackers

Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
> ... what I really wanted to discuss is the data structure needed
> for the Free Space Map.

> The FSM data structure needs to support two basic operations:
> 1. Fast lookup of page with >= X bytes of free space
> 2. Update of arbitrary, individual pages.
> Our current code doesn't support 2, as we always update the FSM in bulk
> after vacuum, but we will need that capability to be able to do partial
> vacuums in the future.

Note that the lack of such infrastructure is mainly because we didn't
need it, not because it couldn't be done. But in any case the current
implementation is a loser, agreed.

> One brilliant idea I had is a binary heap/tree kind of structure, where
> each heap page is represented by one leaf node.

I'm worried about a couple of things here:

* Loss of concurrency in access to the FSM itself, due to updates having
to change large amounts of interrelated entries, and due to all
processes always wanting to touch the root first.

* Loss of concurrency in subsequent heap updates. The current FSM gets
one thing right: if different backends ask for space, they will (if at
all possible) be given different heap pages to insert into, and
therefore will not suffer page-level contention while they do their
insertions. I don't see what you'll do to ensure a comparable behavior
with this structure.

> Unfortunately, this data structure isn't easily pageable. It still seems
> like a good and simple structure within a page, but we need a way to
> scale it.

I suggest that you *not* scale it. It seems possibly useful to have
this kind of structure within a single map page, but trying to go above
that will result in too much contention IMHO.

regards, tom lane


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Hannu Krosing <hannu(at)krosing(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-08 18:44:10
Message-ID: 47FBBCFA.30908@enterprisedb.com
Lists: pgsql-hackers

Hannu Krosing wrote:
> On Tue, 2008-04-08 at 12:26 +0300, Hannu Krosing wrote:
>
>> Probably we could do without sparse files, if we find an efficient way
>> to compute the "add order" of leaf and parent pages for above algorithm.
>
> if we always add only the minimal needed set of parents then the order
> will look something like
>
> 1: 0
> 2: 1
> 3: (0-1)
> 4: 2
> 5: (0-3)
> 6: 3
> 7: (2-3)
> 8: 4
> 9: (0-7)
> 10: 5
> 11: (4-5)
> 12: 6
> 13: (4-7)
> 14: 7
> 15: (6-7)
>
> seems pretty regular :)
>
> and is probably easy to expand into pages

That's what I thought when I started thinking about this, but even after
spending a good few hours sketching it on a whiteboard, I still couldn't
figure out what the actual formula behind that is. I feel like I'm
missing some obvious textbook algorithm here, so please help me out
if you can spot a pattern in that. If there isn't one, we could build a
lookup table for enough levels by hand, but...

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-08 18:52:01
Message-ID: 47FBBED1.20206@enterprisedb.com
Lists: pgsql-hackers

Pavan Deolasee wrote:
> Can we not use bitmaps to track approximate rather than exact free
> space ? For example, we can have 8 or 16 buckets of free space.
> A page with 1-32 bytes free goes into bucket 1, a page with 33-64 bytes
> free goes into bucket 2, and so on. If you want a page with X bytes free,
> look into the next larger bucket. (the ranges can vary here)

Yep, that's actually what I was planning to do, except I was thinking of
using 8 bits per page, or 256 buckets, because that makes the code a
little bit simpler. 16 buckets would probably be enough in practice, though.

Also, the free space doesn't necessarily need to be divided linearly
into buckets; we could, for example, use 8 buckets like:

0 32
1 64
2 128
3 256
4 512
5 1024
6 2048
7 4096
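
(For illustration, assuming each number above is the upper bound of its
bucket, with anything larger landing in the top bucket, the mapping would
be something like:)

def bucket_for(free_bytes):
    limits = [32, 64, 128, 256, 512, 1024, 2048, 4096]
    for bucket, limit in enumerate(limits):
        if free_bytes <= limit:
            return bucket
    return len(limits) - 1          # clamp larger amounts into the top bucket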

> This would give us O(1) time for FSM updates. To speed up lookup, we
> can use hierarchical bitmaps. Let's work through an example for 8 buckets:
>
> At the segment level, we have one FSM page. That can summarize FSM
> info for ~8000 heap pages (1 bit per page per bucket). In addition, we
> can have a first-level hierarchy in the same page where one bit, say,
> summarizes info for 100 pages. So if there is a page with a 'bucket' worth
> of free space in the first 100 pages of the segment, the corresponding
> bit in the first-level hierarchy will be set. In this case, the first-level
> hierarchy would need 80 bits per bucket, i.e. 80 bytes for all 8 buckets.
>
> We can then extend the concept of segments to say, 'extents'. One FSM page
> per extent summarizes the FSM info for segments in that extent. With the same
> logic as above, we can have ~8000 segments per extent.
>
> And then one FSM page to summarize info for all extents. I guess at that
> level we would run out of maximum supported file size anyways.

Well, if you add the higher levels, we're no longer talking about O(1),
but O(log n) (for a pretty steep logarithm), just like my proposal.

> Well, this could be completely orthogonal to suggestions you are seeking,
> but nevertheless I had this thought for quite some time.

It's actually not that orthogonal. You describe it as a hierarchical
bitmap, but it's essentially the same idea as the binary heap/tree, just
with way more than 2 children per parent.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-09 04:47:49
Message-ID: 2e78013d0804082147q5c6a8dabs28567bf56c37c271@mail.gmail.com
Lists: pgsql-hackers

On Wed, Apr 9, 2008 at 12:22 AM, Heikki Linnakangas
<heikki(at)enterprisedb(dot)com> wrote:

>
> Well, if you add the higher levels, we're no longer talking about O(1), but
> O(log n) (for a pretty steep logarithm), just like my proposal.
>

For updates, I would still call it O(1) because the number of levels is limited
to a very small number.

>
> It's actually not that orthogonal. You describe it as a hierarchical
> bitmap, but it's essentially the same idea as the binary heap/tree, just
> with way more than 2 children per parent.
>

Yes, I agree.

Btw, I agree with Tom's point about the lock contention on the higher levels
for FSM updates. What we can do is, during normal operations, update FSM
pages with only a SHARE lock - so the information can best be considered
only as a hint. Hopefully, if we are only doing bit flipping, we won't go wrong
often. And periodically, VACUUM would correct any mistakes in the FSM info.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com


From: Hannu Krosing <hannu(at)krosing(dot)net>
To: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-09 12:49:58
Message-ID: 1207745398.7189.21.camel@huvostro
Lists: pgsql-hackers


On Wed, 2008-04-09 at 10:17 +0530, Pavan Deolasee wrote:
> On Wed, Apr 9, 2008 at 12:22 AM, Heikki Linnakangas
> <heikki(at)enterprisedb(dot)com> wrote:
>
> >
> > Well, if you add the higher levels, we're no longer talking about O(1), but
> > O(log n) (for a pretty steep logarithm), just like my proposal.
> >
>
> For updates, I would still call it O(1) because the number of levels is limited
> to a very small number.
>
>
> >
> > It's actually not that orthogonal. You describe it as a hierarchical
> > bitmap, but it's essentially the same idea as the binary heap/tree, just
> > with way more than 2 children per parent.
> >
>
> Yes, I agree.
>
> Btw, I agree with Tom's point about the lock contention on the higher levels
> for FSM updates. What we can do is, during normal operations, update FSM
> pages with only a SHARE lock - so the information can best be considered
> only as a hint. Hopefully, if we are only doing bit flipping, we won't go wrong
> often. And periodically, VACUUM would correct any mistakes in the FSM info.

Saving 1 byte is an atomic op, so if we update going from branches to
root and read from root to branches, then we can probably detect
inconsistencies when reading and then re-read.

What I'm more concerned about is L1 cache swapping between several
processors/cores, which could lead to context switch storms like we used to
have in locking.

anyway, one thing to do for sure is to avoid storing info for new/last
page(s) appended by any process before that process is done filling it,
as that would lead to excessive up-tree max_free updates.

BTW, I'm pretty sure I have figured out the method for storing minimal
required binary tree as an array, where adding each page adds exactly
one upper node. The node added is often not the immediate parent, but is
always the one needed for covering all nodes.

I just have to write it up in an understandable way and then you all can
look at it and tell if it is something well-known from Knuth or Date ;)

--------------
Hannu


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Hannu Krosing <hannu(at)krosing(dot)net>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-09 18:09:34
Message-ID: 47FD065E.7030704@enterprisedb.com
Lists: pgsql-hackers

Hannu Krosing wrote:
> Saving 1 byte is an atomic op

Unfortunately, it's not. Most if not all modern CPUs will perform one
byte modification as "load word" + "modify word" + "save word".

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-09 19:08:27
Message-ID: 47FD142B.1010407@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
> Heikki Linnakangas <heikki(at)enterprisedb(dot)com> writes:
>> ... what I really wanted to discuss is the data structure needed
>> for the Free Space Map.
>
>> The FSM data structure needs to support two basic operations:
>> 1. Fast lookup of page with >= X bytes of free space
>> 2. Update of arbitrary, individual pages.
>> Our current code doesn't support 2, as we always update the FSM in bulk
>> after vacuum, but we will need that capability to be able to do partial
>> vacuums in the future.
>
> Note that the lack of such infrastructure is mainly because we didn't
> need it, not because it couldn't be done.

Yep. I actually remember seeing a function in freespace.c in CVS history
to do partial updates, but it was removed at some point because it
wasn't needed.

>> One brilliant idea I had is a binary heap/tree kind of structure, where
>> each heap page is represented by one leaf node.
>
> I'm worried about a couple of things here:
>
> * Loss of concurrency in access to the FSM itself, due to updates having
> to change large amounts of interrelated entries, and due to all
> processes always wanting to touch the root first.

When searching for free space, we start from the top, and go down from
there. We only need to take a shared lock on (the page that contains)
the upper nodes, so that shouldn't become a bottleneck.

When we update a node, we can stop propagating the change upwards as
soon as we reach a node whose value doesn't need to change. For
example:

5
5 3
4 5 3
4 3 2 5 2 3

To update the first leaf node from 4 to 2, we need to update its parent
to 3. But we don't need to update the node above that (the 5), and we can
stop at that point without accessing the root.

We probably want to mark new pages as full, as Hannu suggested, to avoid
repeatedly updating the FSM during bulk loads.

Given that the current FSM is protected by a single lock, shared by all
relations, and always taken in exclusive mode, I'm not too worried. I'm
not sure how long an FSM page needs to be locked in my scheme compared
to the current FSM, but at least there will be a separate FSM for each
relation. I do want to beat the current FSM in terms of scalability,
though; I think I've seen FreeSpaceLock become contended in some tests.

As a little optimization, we could optimistically start from somewhere
other than the top node when searching for free space. Perhaps from the
top of the FSM page that contained our previous rd_targetblock. I don't
think we need to, though.

> * Loss of concurrency in subsequent heap updates. The current FSM gets
> one thing right: if different backends ask for space, they will (if at
> all possible) be given different heap pages to insert into, and
> therefore will not suffer page-level contention while they do their
> insertions. I don't see what you'll do to ensure a comparable behavior
> with this structure.

Whenever we have a choice to traverse to either the left or the right child
because there are heap pages with enough free space on both sides, we can
randomize that decision to achieve the spreading.
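
(A throwaway sketch of that randomized descent; 'nodes' is the in-page max
tree stored as an array with the internal nodes first and the n leaves in
the last n slots:)

import random

def search_spread(nodes, n, x):
    if nodes[0] < x:
        return None                              # no page anywhere has enough space
    i = 0
    while i < n - 1:                             # still at an internal node
        left, right = 2 * i + 1, 2 * i + 2
        candidates = [c for c in (left, right) if nodes[c] >= x]
        i = random.choice(candidates)            # spread concurrent inserters
    return i - (n - 1)                           # heap page number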

This actually brings up another nice property this data structure has:
instead of randomizing, we can choose the child node that's "closer" to
some specific heap page. That can be used to keep an updated tuple
version close to the old version, or to maintain cluster order. That's a
goal that's at least somewhat at odds with the goal of spreading the
inserts, of course, but the data structure has the flexibility, which is
nice.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Hannu Krosing <hannu(at)krosing(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-09 20:55:18
Message-ID: 1207774518.6560.4.camel@huvostro
Lists: pgsql-hackers


On Wed, 2008-04-09 at 21:09 +0300, Heikki Linnakangas wrote:
> Hannu Krosing wrote:
> > Saving 1 byte is an atomic op
>
> Unfortunately, it's not. Most if not all modern CPUs will perform one
> byte modification as "load word" + "modify word" + "save word".

Hmm, maybe I should change my design to modify the page free info and its
parent together ?

or what is word ? 2 bytes ? 4 bytes ? even 8 bytes ?

------------
Hannu


From: PFC <lists(at)peufeu(dot)com>
To: "Hannu Krosing" <hannu(at)krosing(dot)net>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-09 22:36:41
Message-ID: op.t9c7rfkhcigqcu@apollo13.peufeu.com
Lists: pgsql-hackers


About the FSM :

Would it be possible to add a flag marking pages where all tuples are
visible to all transactions ? (kinda like frozen I think)
This could be useful to implement index-only scans, for count(), or to
quickly skip rows when OFFSET is used, or to use only the index when the
selected columns are all in the index. Of course if the page is flagged as
"may contain updated tuples", then it would have to look in the heap. But,
for tables that are not randomly updated (for instance tables that are
mostly appended to, like forum posts, or logs, or the huge archive table,
etc) it could save a lot of heap lookups and IO.


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PFC <lists(at)peufeu(dot)com>
Cc: Hannu Krosing <hannu(at)krosing(dot)net>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-10 05:18:40
Message-ID: 47FDA330.1080007@enterprisedb.com
Lists: pgsql-hackers

PFC wrote:
>
> About the FSM :
>
> Would it be possible to add a flag marking pages where all tuples
> are visible to all transactions ? (kinda like frozen I think)

Ah, the visibility map. That's another line of discussion. The current
plan is to not tie that to the FSM, but implement it separately. There's
some infrastructure changes that are needed for both, like the "map
forks" (see recent FSM discussions), which is why we need to have a
design for FSM as well before we start implementing the visibility map.

It's definitely something I want to do for 8.4. Here's my rough plan:

1. Common "map forks" support
2. Rewrite FSM
3. Implement visibility map, to allow partial vacuums
4. Implement index-only scans, using the visibility map.

We'll see how far I get in the 8.4 cycle. Any help is welcome, of course
:-).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: PFC <lists(at)peufeu(dot)com>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "Hannu Krosing" <hannu(at)krosing(dot)net>, "Pavan Deolasee" <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-10 08:58:42
Message-ID: op.t9d0j4jzcigqcu@apollo13.peufeu.com
Lists: pgsql-hackers

> PFC wrote:
>> About the FSM :
>> Would it be possible to add a flag marking pages where all tuples
>> are visible to all transactions ? (kinda like frozen I think)
>
> Ah, the visibility map. That's another line of discussion. The current
> plan is to not tie that to the FSM, but implement it separately. There's
> some infrastructure changes that are needed for both, like the "map
> forks" (see recent FSM discussions), which is why we need to have a
> design for FSM as well before we start implementing the visibility map.
>
> It's definitely something I want to do for 8.4.

Ahhhhhh, yes, yes, yes ;) yes !!!!!

> Here's my rough plan:
> 1. Common "map forks" support
> 2. Rewrite FSM
> 3. Implement visibility map, to allow partial vacuums
> 4. Implement index-only scans, using the visibility map.

Throwing another idea that is related to partial vacuums (perhaps ?):
Consider a table that is often inserted into (archive, forum posts, log,
whatever), and we want to CLUSTER it.
In that case it would be beneficial to only cluster the "tail" of the
table, where all the recently inserted rows are.

Example :
- Table is clustered.
- Insert rows in random order, update some, delete some, etc, supposing
inserts happen at the end
- Table now looks like "head":[clustered part with some holes] plus
"tail":[rows in random order]
- As long as the "tail" fits in disk cache, the random order isn't a
problem.
- So, when the "tail" reaches a certain size :
- Grab it, sort it by cluster order, write it again in the heap
- Update the indexes in a manner similar to VACUUM (ie. bulk update)
- Table now looks like "head":[clustered part with some holes] plus
"tail":[clustered]
This does not remove the holes in the "head", but is this really a
problem ? In this usage scenario, I don't think so. Regular CLUSTER could
also be run, much less frequently than before, and it will also be much
faster since the rows are approximately in-order already.

This approach is complementary to the "auto-cluster" approach where the
index is asked "where should I insert that row ?" (this will be used to
fill the holes). Auto-cluster will work well in tables that are updated
very often. But starting from an empty table, or an already clustered
table, or in a mostly-insert scenario, the index will have no idea where
to put that row...

The goodness of this approach is that
- As long as the "tail" fits in RAM, sorting it is going to be very fast
(unlike the current CLUSTER).
- Bulk index updates will also be fast as long as the list of changes to
apply to the index fits in memory.
- Therefore it will block the table for much less time than good old
CLUSTER.
- Therefore it will get more use ;)

How to make it non-locking ?
- Doing something like this in pseudo SQL :
INSERT INTO table SELECT * FROM (DELETE FROM table WHERE date > last time
we did this RETURNING *) ORDER BY cluster_columns;
VACUUM;

That is, take the "tail" of the table (as above), sort it, insert it back
in big chunks, and mark the old rows as deleted just like a regular delete
would have done. Then VACUUM.

In this case you now have :
"head":[clustered part with some holes] + "big hole" + "tail":[clustered
rows]

Is the "big hole" a problem ? Probably not, it will be marked as free
space by VACUUM and used for new inserts. A week later we get this :
"head":[clustered part with some holes] + [rows in random order]
+ "tail":[clustered rows]
Repeating the process above will make the "tail" grow and the "hole" will
stay more or less in the same place.

Another way to do it is to use partitions :
- "archive" table
- "current" table

Periodically the rows from "current" are transferred to the "archive"
table and sorted in the process. Then "current" is truncated.
This works, but it is blocking, and you have the overhead from
partitioning...


From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Hannu Krosing <hannu(at)krosing(dot)net>
Cc: Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-10 12:05:31
Message-ID: 47FE028B.7040802@enterprisedb.com
Lists: pgsql-hackers

Hannu Krosing wrote:
> On Wed, 2008-04-09 at 21:09 +0300, Heikki Linnakangas wrote:
>> Hannu Krosing wrote:
>>> Saving 1 byte is an atomic op
>> Unfortunately, it's not. Most if not all modern CPUs will perform one
>> byte modification as "load word" + "modify word" + "save word".
>
> Hmm, maybe I should change my design to modify the page free info and its
> parent together ?
>
> or what is word ? 2 bytes ? 4 bytes ? even 8 bytes ?

It depends on the architecture; I believe it's 4 bytes on 32-bit
architectures and 8 bytes on 64-bit ones, typically. But we have to work
on all of them.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Hannu Krosing <hannu(at)krosing(dot)net>
To: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-10 20:04:45
Message-ID: 1207857885.6837.8.camel@huvostro
Lists: pgsql-hackers


> BTW, I'm pretty sure I have figured out the method for storing minimal
> required binary tree as an array, where adding each page adds exactly
> one upper node. The node added is often not the immediate parent, but is
> always the one needed for covering all nodes.
>
> I just have to write it up in an understandable way and then you all can
> look at it and tell if it is something well-known from Knuth or Date ;)

Find sample code attached:

not optimal at all, meant as proof-of-concept.

just run the file to see how it works, some comments in the beginning.

currently it interleaves leaf and internal nodes, but it may be better
for some uses (like seqscanning leaf nodes when searching for clustered
pos) to separate them, for example having the 1st 4k for leaf nodes and the 2nd
for internal nodes.

also, I think that having a fan-out factor above 2 (making it more like
b-tree instead of binary tree) would be more efficient due to CPU
caches, but it takes some more work to figure it out.

Of course unless someone recognizes the algorithm and can point to
textbook example ;)

---------------
Hannu

Attachment Content-Type Size
bin_min_tree.py text/x-python 4.8 KB

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Hannu Krosing <hannu(at)krosing(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-11 08:07:32
Message-ID: 47FF1C44.2070301@enterprisedb.com
Lists: pgsql-hackers

Hannu Krosing wrote:
>> BTW, I'm pretty sure I have figured out the method for storing minimal
>> required binary tree as an array, where adding each page adds exactly
>> one upper node. The node added is often not the immediate parent, but is
>> always the one needed for covering all nodes.
>>
>> I just have to write it up in an understandable way and then you all can
>> look at it and tell if it is something well-known from Knuth or Date ;)
>
> Find sample code attached:
>
> not optimal at all, meant as proof-of-concept.
>
> just run the file to see how it works, some comments in the beginning.
>
> currently it interleaves leaf and internal nodes, but it may be better
> for some uses (like seqscanning leaf nodes when searching for clustered
> pos) to separate them, for example having the 1st 4k for leaf nodes and the 2nd
> for internal nodes.
>
> also, I think that having a fan-out factor above 2 (making it more like
> b-tree instead of binary tree) would be more efficient due to CPU
> caches, but it takes some more work to figure it out.

At least it would be more storage efficient, as you wouldn't need as
many non-leaf nodes. You would need more comparisons when traversing or
updating the tree, but as you pointed out that might be very cheap
because of cache effects.

Within a page, the traditional array representation, where the root is at
position 0 and its children are at 1, 2 and so forth, is actually OK. When
the depth of the tree changes, you'll need to memcpy data around and
rebuild parent nodes, but that's acceptable. Or we can fix the depth of
the tree to the max. that fits on a page, and fill it with zeros when
it's not full; all but the rightmost pages are always full anyway.

Scaling into multiple pages, we don't want to move nodes across page
boundaries when extending the relation, so we do need something like
your scheme for that. Taking the upper nodes into account, one FSM page
can hold free space information for ~4k heap (or lower-level FSM) pages.
IOW, we have a fanout of 4k across pages. With a fanout like that, we
only need 3-4 levels to address over 500 TB of data.
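
(Back-of-the-envelope arithmetic for that, assuming ~4000 leaf slots per
FSM page and 8 kB heap pages:)

SLOTS_PER_FSM_PAGE = 4000
BLCKSZ = 8192

for levels in (1, 2, 3):
    heap_bytes = SLOTS_PER_FSM_PAGE ** levels * BLCKSZ
    print(levels, heap_bytes)    # ~32 MB, ~131 GB and ~524 TB of heap respectively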

I've attached a diagram illustrating what I have in mind. In the
diagram, each page can hold only 7 nodes, but in reality that would be ~
BLCKSZ/2, or the 4k with default block size. The heap consists of 10
pages, the bottom leaf nodes correspond to the heap pages. The leaf nodes
of the upper FSM page store the max of the lower FSM pages; they should
match the top nodes of the lower FSM pages. The rightmost nodes in the
tree, colored grey, are unused.

I'm pretty happy with this structure. The concurrency seems reasonably
good, though I will need to figure out how exactly the locking should work.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
image/png 17.6 KB

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: Hannu Krosing <hannu(at)krosing(dot)net>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Free Space Map data structure
Date: 2008-04-12 13:02:49
Message-ID: 4800B2F9.70305@enterprisedb.com
Lists: pgsql-hackers

Hannu Krosing wrote:
> index  tree node  binary  mask   mask bits
> 0      0          00000
> 1      0-1                0000?  1
> 2      1          00001
> 3      0-3                000??  2
> 4      2          00010
> 5      2-3                0001?  1
> ...
> 31     0-31               ?????  5
> 32     16         10000
> ...
>
> seems to be a simple pattern

Oh, I see. I was thinking that the tree would be of the same height on
all branches, but your method seems simpler. Though it means that
existing nodes need to be assigned to new parents as the tree grows.

Now how to extend this to n-ary trees? Assuming we use a normal binary
tree stored in an array the usual way within a page, we can treat the FSM
pages as nodes with a fanout of (BLCKSZ - size of headers)/2.

Thanks to the large fanout, the height of that tree is not large, max 3
levels with default block size (*) and not much more than that with
smaller block sizes. One idea is to use a separate "fork" for each
level; that would keep the math simple.

(*) We need to cover max 2^32 heap pages == 2^32 leaf nodes. You can fit
~4000 leaf nodes on a page. 4000^3 > 2^32.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com