Free Space Map data structure

From: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Free Space Map data structure
Date: 2008-04-08 08:33:20
Message-ID: 47FB2DD0.6030401@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

The last thread about Free Space Map evolved into discussion about
whether the FSM and other kinds of auxiliary data should be stored
within the heap pages, in "map forks", or auxiliary relfilenodes
attached to the relation. It seems the consensus was to go with the map
forks, but what I really wanted to discuss is the data structure needed
for the Free Space Map.

The FSM data structure needs to support two basic operations:

1. Fast lookup of page with >= X bytes of free space
2. Update of arbitrary, individual pages.

Our current code doesn't support 2, as we always update the FSM in bulk
after vacuum, but we will need that capability to be able to do partial
vacuums in the future.

Additionally, the data structure needs to be "pageable", so that we can
efficiently store it in pages that can be loaded in the buffer cache on
demand, and not require a fixed size shared memory block.

The simplest conceivable data structure is a straight array, where nth
element is the free space on page n. That's easily pageable, and
provides O(1) lookup/update of arbitrary page. Unfortunately it's slow
to search for a page with X bytes of free space in that.

One brilliant idea I had, is a binary heap/tree kind of structure, where
each heap page is represented by one leaf node. Each leaf node stores
the amount of free space on the corresponding heap apge. Each non-leaf
node stores the max. amount of free space in any of its children. So the
top-most node immediately tells the max. amount of free space on *any*
page, which means that to find out that there's no suitable page is a
O(1) operation, which is good when you're inserting to a full relation.
When you're looking for X bytes, you traverse the tree down a path with
nodes > X.

For example:

9
4 9
2 4 0 9

The leaf nodes correspond the heap pages, so page #0 has 2 units of free
space, page #1 has 4, page #1 is full and page has 9.

Let's work through a couple of examples:

To search for a page with 3 bytes of free space, start from the top. We
see that the topmost node has value 9, so we know there is a page
somewhere with enough space. Traverse down the tree, to the node where X
>= 3. In this case, that's the left child, but if it was true for both,
we could pick either one. Traverse down from that node similarly, until
we hit the bottom.

To update the free space on page #1 to 3, you look up the leaf node
corresponding that page, which is easy if we store the tree in an array.
We update the 4 on that node to 3, and walk up the tree updating the
parents. In this case, we update the parent of that node to 3, and stop,
because the value in top node is higher than that. The lookup/update is
therefore O(log N) operation.

Unfortunately, this data structure isn't easily pageable. It still seems
like a good and simple structure within a page, but we need a way to
scale it.

If we use one byte to store the amount of free space on a page (I think
that's enough granularity), you can fit a tree like that with about 8000
nodes, with 4000 leaf nodes, in one 8k block. That means that we can
address 4000 heap pages, or 32MB of heap, with one FSM page like that.

To go above that, we can introduce upper pages, where the leaf nodes
refer not to heap pages but to other FSM pages. The addressing gets
tricky, though. Because the number of upper pages required depends on
the depth of the tree, we can't just divide heap page number by 4000 to
get to the right FSM page. I think that's solvable, by always storing
the upper pages at predetermined places in the FSM file, but you'll need
a bit of logic to get from heap page # to FSM page #.

Thoughts? Any other data structures that better fill the bill?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Brendan Jurd 2008-04-08 08:54:43 Re: Free Space Map data structure
Previous Message Dimitri Fontaine 2008-04-08 07:59:11 Re: COPY Transform support