From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Rewriting Free Space Map
Date: 2008-03-16 21:57:09
Message-ID: 47DD97B5.5020309@enterprisedb.com
Lists: pgsql-hackers

I've started working on revamping Free Space Map, using the approach
where we store a map of heap pages on every nth heap page. What we need
now is discussion on the details of how exactly it should work.

Here's my rough plan for the implementation. It's long, sorry. I'm fairly
confident in it, but please let me know if you see any problems or
have any suggestions or better ideas.

Heap FSM
--------

The FSM is stored in the special area of every nth heap page. When
extending the relation, the heapam checks if the block number of the new
page is one that belongs to the FSM. If it is, it lets the FSM
initialize the page by calling InitFSMPage() on it, and extends the relation
again to get another, normal heap page.

I chose the "every nth page is an FSM page" approach, rather than using
a separate relfilenode, which I also considered. The separate
relfilenode approach has some advantages, like being able to scan all
FSM pages in a sequential fashion, but it involves a fair amount of
catalog and buffer manager changes.

It's convenient that the FSM uses up the whole page, leaving no room for
heap tuples. It simplifies the locking, as the FSM doesn't need to worry
about the possibility that the caller is already holding a lock on the
same page.

In an FSM page, there's one byte for each of the next N heap pages,
starting from the FSM page. That one byte stores the amount of free
space on the corresponding heap page, in BLCKSZ/256 byte precision (32
bytes with default block size).

The mapping of free space to these 256 "buckets" wouldn't necessarily
have to be linear; we could, for example, have a single bucket for
pages with more than BLCKSZ/2 bytes of free space and divide the rest
linearly into 16-byte buckets, but let's keep it simple for now. Of
course, we could also just use 2 bytes per page and store the free space
exactly, but 32-byte precision seems like enough to me.
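
For illustration, the linear mapping boils down to something like this (the
function names are invented here, not code from the patch):

#include "postgres.h"

#define FSM_CAT_SIZE    (BLCKSZ / 256)  /* 32 bytes with the default 8K block */

/* Convert free bytes to the one-byte bucket, rounding down so that we
 * never overestimate the available space. */
static uint8
fsm_space_to_cat(Size avail)
{
    return (uint8) Min(avail / FSM_CAT_SIZE, 255);
}

/* And back to a conservative (lower-bound) byte count. */
static Size
fsm_cat_to_space(uint8 cat)
{
    return (Size) cat * FSM_CAT_SIZE;
}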

Index FSM
---------

Indexes use a similar scheme, but since we only need to keep track of
whether a page is used or not, we only need one bit per page. If the
amount of free space on pages is interesting for an indexam in the
future, it can use the heap FSM implementation instead. Or no FSM at
all, like the hash indexam.
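
As a rough sketch of that bitmap (helper names invented; "slot" is the page's
offset within the range covered by one index FSM page):

#include "postgres.h"
#include "storage/block.h"

static void
indexfsm_set_avail(uint8 *map, BlockNumber slot, bool avail)
{
    if (avail)
        map[slot / 8] |= (uint8) (1 << (slot % 8));
    else
        map[slot / 8] &= (uint8) ~(1 << (slot % 8));
}

static bool
indexfsm_get_avail(const uint8 *map, BlockNumber slot)
{
    return (map[slot / 8] & (1 << (slot % 8))) != 0;
}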

To use the index FSM, the indexam needs to leave every nth page alone,
like in the heap. The B-tree assumes that the metapage is at block 0,
but that's also where we would like to store the first index FSM page.
To overcome that, we can make the index FSM special area a little bit
smaller, so that the B-tree metadata fits on the same page as the FSM
information. That will be wasted space on FSM pages other than block 0,
but we're only talking about 24 bytes per FSM page, and we only need one
FSM page per ~512 MB of index pages (with default BLCKSZ).

Performance
-----------

The patch I'm working on currently uses a naive way to find a page in
the FSM. To find a page with X bytes of free space, it scans the FSM
pages until a suitable page is found, and it always starts the scan from
the beginning, which is far from optimal. Worse, when there's no page
with enough free space, it still needs to scan all the FSM pages just to
find out that we need to extend the relation.

To speed things up, we're going to need some mechanism to avoid that.
First of all, we need to somehow cache the information that "there's no
page with >= X bytes left", to avoid fruitless scanning. To speed up the
case when there are only a few pages with enough free space, we can keep a
limited size list of such pages in addition to the map.

This information needs to be in shared memory, either on heap pages
like the FSM pages, managed by the buffer manager, or in a separate
shmem block. I would like to go with normal bufmgr-managed pages, as
fixed-size memory blocks have their problems, and the cached
information should be stored to disk as well.

Let's have one special page in the heap file, called the Free Space List
(FSL) page, in addition to the normal FSM pages. It has the following
structure:

struct {
    bit         anypages[256];

    struct {
        BlockNumber blockno;
        uint8       freespace;
    } freespacelist[];          /* as many entries as fit on the page */
}

Remember that we track the free space on each page using one byte, IOW,
each page falls into one of 256 buckets of free space. In the anypages
bitmap, we have one bit per bucket indicating "are there any pages with
this much free space". When we look for a page with X bytes, we check the
bits from the bucket corresponding to X bytes upwards, and if none of them
are set we know not to bother scanning the FSM pages.
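
In code, the check could look something like this (a sketch only; the name and
the packed bitmap layout are assumptions, not part of the patch):

#include "postgres.h"

/* Does the anypages bitmap claim that some page has at least 'needed_cat'
 * worth of free space?  needed_cat is the bucket corresponding to the
 * requested number of bytes; one bit per bucket, packed 8 per byte. */
static bool
fsl_might_have_page(const uint8 *anypages, int needed_cat)
{
    int     cat;

    for (cat = needed_cat; cat < 256; cat++)
    {
        if (anypages[cat / 8] & (1 << (cat % 8)))
            return true;        /* some page falls into this bucket or better */
    }
    return false;               /* no point in scanning the FSM pages */
}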

To speed up the scan where there is space, we keep a simple list of
pages with free space. This list is actually like the current FSM, but
here we only use it as a small cache of the FSM pages. VACUUM and any
other operations that update the FSM can put pages on the list when
there are free slots.

We can store the FSL page at a magic location, say block #100. For
relations smaller than that, there's no need for the FSL and we might as
well scan the FSM page. I'm not sure if we should have more than one FSL
page for large tables.

I'm not sure yet how the locking of FSL and FSM pages should work. It
shouldn't be too hard, though, as the FSM/FSL information is just a hint
anyway. We do need to take care that we don't permanently lose track of
any significant amount of free space: once we can do partial vacuums
using the visibility map, VACUUM might never visit one part of a table
even if other parts of it are frequently updated.

Page allocation algorithm
-------------------------

There are many different ways we can hand out pages from the FSM. Possible
strategies, some of which are at odds with each other, include:

1. Try to spread out inserts of different backends, to avoid contention
2. Prefer low-numbered pages, to increase the chance of being able to
truncate in VACUUM.
3. Reserve pages with lots of free space for big allocations, prefer
almost full pages for small allocations. To use all space more efficiently.
4. If the table is clustered, try to use pages close to those with
similar values.
5. On UPDATE, try to use pages close to the old tuple.
6. Prefer pages that are currently in cache, to avoid I/O.

The current FSM tries to achieve only 1, and there haven't been many
complaints, so I wouldn't worry too much about the other goals. 4 and 6
would need some major changes to the indexam and buffer manager
interfaces, respectively, and 3 could lead to more I/O when you do a lot
of small inserts.

We can spread inserts of different backends in the Free Space List by
moving a page to the end of the list when it's handed out, or removing
it from there altogether. When the FSL is empty, we can vary the place
where we start to scan the FSM.

To prefer low-numbered pages, to increase the chance of being able to
truncate the relation later, we can favor lower numbered blocks in
VACUUM when we decide which blocks to put into the FSL. And we can also
bias the starting point of where we start to scan the FSM when the FSL
is empty.

Visibility Map
--------------

So far I've only talked about the FSM, but it's important to consider
how the Visibility Map fits into the scheme. My current thinking is that
there will be one bit per heap page in the visibility map. The exact
semantics of that one bit is still not totally clear, but that's not
important right now.

There are a few alternatives:
1. Mix the visibility map with the FSM, stealing one bit of every FSM
byte. There would then be 7 bits for storing how much space there is on
each page, and 1 bit indicating the visibility.

2. Allocate part of the FSM pages for the visibility map. For example, the
first 1/9 of the page for the VM, and 8/9 for the FSM. The VM part would be a
straight bitmap with one bit per page, and the FSM part would use one
byte per page.

3. Use different pages for the VM, for example every nth page for VM,
and every (n+1)th page for FSM.

I'm leaning towards 2 at the moment. 3 is intriguing as well, though,
because it would help with potential lock contention on the VM/FSM pages.

Per-chunk relfrozenxid
----------------------
I'm imagining that we would set a bit in the VM if a page has no dead
tuples at all, making it possible to use it for index-only scans. Such a
page is also uninteresting for VACUUM. However, we would still need to
scan the whole table to freeze tuples.

To alleviate that, we could allocate some space in the FSM pages,
indicating a smallest Xid in a range of heap pages. IOW, like
relfrozenxid, but with a granularity of N pages, rather than the whole
relation. You would still have to scan that range of pages to update it,
but that's much better than the whole relation.
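
Purely as a layout sketch (the chunk size and all names here are made up):

#include "postgres.h"

#define FSM_XID_CHUNK_SIZE  1024    /* heap pages per tracked range */

typedef struct FSMChunkXids
{
    /* Oldest unfrozen xid within each FSM_XID_CHUNK_SIZE-page range covered
     * by this FSM page.  VACUUM would update an entry after scanning only
     * that range; shown with length 1 for brevity, really one per chunk. */
    TransactionId   chunk_minxid[1];
} FSMChunkXids;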

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-16 22:31:29
Message-ID: 6552.1205706689@sss.pgh.pa.us
Lists: pgsql-hackers

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
> I've started working on revamping Free Space Map, using the approach
> where we store a map of heap pages on every nth heap page. What we need
> now is discussion on the details of how exactly it should work.

You're cavalierly waving away a whole boatload of problems that will
arise as soon as you start trying to make the index AMs play along
with this :-(. Hash for instance has very narrow-minded ideas about
page allocation within its indexes.

Also, I don't think that "use the special space" will scale to handle
other kinds of maps such as the proposed dead space map. (This is
exactly why I said the other day that we need a design roadmap for all
these ideas.)

The idea that's becoming attractive to me while contemplating the
multiple-maps problem is that we should adopt something similar to
the old Mac OS idea of multiple "forks" in a relation. In addition
to the main data fork which contains the same info as now, there could
be one or more map forks which are separate files in the filesystem.
They are named by relfilenode plus an extension, for instance a relation
with relfilenode NNN would have a data fork in file NNN (plus perhaps
NNN.1, NNN.2, etc) and a map fork named something like NNN.map (plus
NNN.map.1 etc as needed). We'd have to add one more field to buffer
lookup keys (BufferTag) to disambiguate which fork the referenced page
is in. Having bitten that bullet, though, the idea trivially scales to
any number of map forks with potentially different space requirements
and different locking and WAL-logging requirements.
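
To make that concrete, the sketch in my head is roughly this (the particular
fork names below are only examples, not a settled list):

#include "postgres.h"
#include "storage/block.h"
#include "storage/relfilenode.h"

typedef enum ForkNumber
{
    MAIN_FORKNUM = 0,           /* the data fork, same contents as today */
    FSM_FORKNUM,                /* free space map */
    VISIBILITYMAP_FORKNUM       /* visibility / dead space map */
} ForkNumber;

/* Buffer lookup key: the fork number would be the one new field. */
typedef struct BufferTag
{
    RelFileNode rnode;          /* physical relation identifier */
    ForkNumber  forkNum;        /* which fork within the relation */
    BlockNumber blockNum;       /* block number within the fork */
} BufferTag;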

Another possible advantage is that a new map fork could be added to an
existing table without much trouble. Which is certainly something we'd
need if we ever hope to get update-in-place working.

The main disadvantage I can see is that for very small tables, the
percentage overhead from multiple map forks of one page apiece is
annoyingly high. However, most of the point of a map disappears if
the table is small, so we might finesse that by not creating any maps
until the table has reached some minimum size.

regards, tom lane


From: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 00:33:02
Message-ID: 20080317003301.GA18580@alvh.no-ip.org
Lists: pgsql-hackers

Tom Lane wrote:

> The idea that's becoming attractive to me while contemplating the
> multiple-maps problem is that we should adopt something similar to
> the old Mac OS idea of multiple "forks" in a relation. In addition
> to the main data fork which contains the same info as now, there could
> be one or more map forks which are separate files in the filesystem.

I think something similar could be used to store tuple visibility bits
separately from heap tuple data itself, so +1 to this idea.

(The rough idea in my head was that you can do an indexscan and look
up visibility bits without having to pull the whole heap along; and
visibility updates are also cheaper, whether they come from indexscans
or heap scans. Of course, the implicit cost is that a seqscan needs to
fetch the visibility pages, too; and the locking is more complex.)

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


From: Hannu Krosing <hannu(at)krosing(dot)net>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Luke Lonergan <llonergan(at)greenplum(dot)com>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 10:28:45
Message-ID: 1205749726.7381.23.camel@huvostro
Lists: pgsql-hackers


On Sun, 2008-03-16 at 21:33 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
>
> > The idea that's becoming attractive to me while contemplating the
> > multiple-maps problem is that we should adopt something similar to
> > the old Mac OS idea of multiple "forks" in a relation. In addition
> > to the main data fork which contains the same info as now, there could
> > be one or more map forks which are separate files in the filesystem.

Aren't we in a way doing this for indexes?

> I think something similar could be used to store tuple visibility bits
> separately from heap tuple data itself, so +1 to this idea.

Not just "bits", but whole visibility info (xmin,xmax,tmin,tmax, plus
bits) should be stored separately.

A separate "fork" for visibility should be organized as a b-tree index
(as we already have well-honed mechanisms for dealing with those
effectively) but visibility fork is stored in a compressed form by
storing ranges of all-visible or all-deleted tuples as two endpoints
only and also the tree is reorganized when possible similar to what we
currently do for HOT updates.

This will keep the visibility index really small for cases with few
updates, most likely one or two pages regardless of table size.

One important difference from indexes is that visibility info should be
stored first, before writing data to heap and creating ordinary index
entries.

> (The rough idea in my head was that you can do an indexscan and look
> up visibility bits without having to pull the whole heap along; and
> visibility updates are also cheaper, whether they come from indexscans
> or heap scans. Of course, the implicit cost is that a seqscan needs to
> fetch the visibility pages, too; and the locking is more complex.)

Another cost is heavy inserting/updating, where there will probably be
more lock contention: because visibility info is generally smaller, the
visibility info for new tuples will more often land on the same
visibility pages.

Of course, with visibility info in a separate fork, very narrow tables
will have the ratios reversed - for a one-byte-wide table the visibility
info will be a few times bigger than actual data, at least initially before
compression has kicked in.

--------------------
Hannu


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Hannu Krosing <hannu(at)krosing(dot)net>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Luke Lonergan <llonergan(at)greenplum(dot)com>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 13:29:31
Message-ID: 8163.1205760571@sss.pgh.pa.us
Lists: pgsql-hackers

Hannu Krosing <hannu(at)krosing(dot)net> writes:
> On Sun, 2008-03-16 at 21:33 -0300, Alvaro Herrera wrote:
>> Tom Lane wrote:
>>> The idea that's becoming attractive to me while contemplating the
>>> multiple-maps problem is that we should adopt something similar to
>>> the old Mac OS idea of multiple "forks" in a relation.

> Aren't we in a way doing this for indexes?

Not really --- indexes are closer to being independent entities, since
they have their own relfilenode values, own pg_class entries, etc. What
I'm imagining here is something that's so tightly tied to the core heap
that there's no value in managing it as a distinct entity, thus the idea
of same relfilenode with a different extension. The existence of
multiple forks in a relation wouldn't be exposed at all at the SQL
level.

>> I think something similar could be used to store tuple visibility bits
>> separately from heap tuple data itself, so +1 to this idea.

> Not just "bits", but whole visibility info (xmin,xmax,tmin,tmax, plus
> bits) should be stored separately.

I'm entirely un-sold on this idea, but yeah it would be something that
would be possible to experiment with once we have a multi-fork
infrastructure.

regards, tom lane


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 13:49:29
Message-ID: 47DE76E9.8060009@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
>> I've started working on revamping Free Space Map, using the approach
>> where we store a map of heap pages on every nth heap page. What we need
>> now is discussion on the details of how exactly it should work.
>
> You're cavalierly waving away a whole boatload of problems that will
> arise as soon as you start trying to make the index AMs play along
> with this :-(.

It doesn't seem very hard. An indexam wanting to use FSM needs a little
bit of code where the relation is extended, to let the FSM initialize
FSM pages. And then there's the B-tree metapage issue I mentioned. But
that's all, AFAICS.

> Hash for instance has very narrow-minded ideas about
> page allocation within its indexes.

Hash doesn't use FSM at all.

> Also, I don't think that "use the special space" will scale to handle
> other kinds of maps such as the proposed dead space map. (This is
> exactly why I said the other day that we need a design roadmap for all
> these ideas.)

It works for anything that scales linearly with the relation itself. The
proposed FSM and visibility map both fall into that category.

A separate file is certainly more flexible. I was leaning towards that
option originally
(http://archives.postgresql.org/pgsql-hackers/2007-11/msg00142.php) for
that reason.

> The idea that's becoming attractive to me while contemplating the
> multiple-maps problem is that we should adopt something similar to
> the old Mac OS idea of multiple "forks" in a relation. In addition
> to the main data fork which contains the same info as now, there could
> be one or more map forks which are separate files in the filesystem.
> They are named by relfilenode plus an extension, for instance a relation
> with relfilenode NNN would have a data fork in file NNN (plus perhaps
> NNN.1, NNN.2, etc) and a map fork named something like NNN.map (plus
> NNN.map.1 etc as needed). We'd have to add one more field to buffer
> lookup keys (BufferTag) to disambiguate which fork the referenced page
> is in. Having bitten that bullet, though, the idea trivially scales to
> any number of map forks with potentially different space requirements
> and different locking and WAL-logging requirements.

Hmm. You also need to teach at least xlog.c and xlogutils.c about the
map forks, for full page images and the invalid page tracking. I also
wonder what the performance impact of extending BufferTag is.

My original thought was to have a separate RelFileNode for each of the
maps. That would require no smgr or xlog changes, and not very many
changes in the buffer manager, though I guess you'd need more catalog
changes. You had doubts about that on the previous thread
(http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but
the "map forks" idea certainly seems much more invasive than that.

I like the "map forks" idea; it groups the maps nicely at the filesystem
level, and I can see it being useful for all kinds of things in the
future. The question is, is it really worth the extra code churn? If you
think it is, I can try that approach.

> Another possible advantage is that a new map fork could be added to an
> existing table without much trouble. Which is certainly something we'd
> need if we ever hope to get update-in-place working.

Yep.

> The main disadvantage I can see is that for very small tables, the
> percentage overhead from multiple map forks of one page apiece is
> annoyingly high. However, most of the point of a map disappears if
> the table is small, so we might finesse that by not creating any maps
> until the table has reached some minimum size.

Yeah, the map fork idea is actually better than the "every nth heap
page" approach from that point of view.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 14:11:39
Message-ID: 8800.1205763099@sss.pgh.pa.us
Lists: pgsql-hackers

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
> Tom Lane wrote:
>> You're cavalierly waving away a whole boatload of problems that will
>> arise as soon as you start trying to make the index AMs play along
>> with this :-(.

> It doesn't seem very hard.

The problem is that the index AMs are no longer in control of what goes
where within their indexes, which has always been their prerogative to
determine. The fact that you think you can kluge btree to still work
doesn't mean that it will work for other AMs.

>> Also, I don't think that "use the special space" will scale to handle
>> other kinds of maps such as the proposed dead space map. (This is
>> exactly why I said the other day that we need a design roadmap for all
>> these ideas.)

> It works for anything that scales linearly with the relation itself. The
> proposed FSM and visibility map both fall into that category.

It can work only with prearrangement among all the maps that are trying
to coexist in the same special space. Every time a new one comes along,
we'd have to reconsider the space allocation, re-optimize tradeoffs,
and force an initdb that could not possibly be implemented in-place.

If we had a short list of maps in mind with no real prospect of
changing, then I think this would be acceptable; but that doesn't seem
to be the case. It's certainly foolish to start detail design on
something like this when we don't even have the "roadmap" I asked
for about what sorts of maps we are agreed we want to have.

>> The idea that's becoming attractive to me while contemplating the
>> multiple-maps problem is that we should adopt something similar to
>> the old Mac OS idea of multiple "forks" in a relation.

> Hmm. You also need to teach at least xlog.c and xlogutils.c about the
> map forks, for full page images and the invalid page tracking.

Well, you'd have to teach them something anyway, for any incarnation
of maps that they might need to update.

> I also wonder what the performance impact of extending BufferTag is.

That's a fair objection, and obviously something we'd need to check.
But I don't recall seeing hash_any so high on any profile that I think
it'd be a big problem.

> My original thought was to have a separate RelFileNode for each of the
> maps. That would require no smgr or xlog changes, and not very many
> changes in the buffer manager, though I guess you'd need more catalog
> changes. You had doubts about that on the previous thread
> (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but
> the "map forks" idea certainly seems much more invasive than that.

The main problems with that are (a) the need to expose every type of map
in pg_class and (b) the need to pass all those relfilenode numbers down
to pretty low levels of the system. The nice thing about the fork idea
is that you don't need any added info to uniquely identify what relation
you're working on. The fork numbers would be hard-wired into whatever
code needed to know about particular forks. (Of course, these same
advantages apply to using special space in an existing file. I'm
just suggesting that we can keep these advantages without buying into
the restrictions that special space would have.)

regards, tom lane


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Alvaro Herrera <alvherre(at)commandprompt(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 16:07:20
Message-ID: 1205770040.4285.213.camel@ebony.site
Lists: pgsql-hackers

On Sun, 2008-03-16 at 21:33 -0300, Alvaro Herrera wrote:
> Tom Lane wrote:
>
> > The idea that's becoming attractive to me while contemplating the
> > multiple-maps problem is that we should adopt something similar to
> > the old Mac OS idea of multiple "forks" in a relation. In addition
> > to the main data fork which contains the same info as now, there could
> > be one or more map forks which are separate files in the filesystem.
>
> I think something similar could be used to store tuple visibility bits
> separately from heap tuple data itself, so +1 to this idea.
>
> (The rough idea in my head was that you can do an indexscan and look
> up visibility bits without having to pull the whole heap along; and
> visibility updates are also cheaper, whether they come from indexscans
> or heap scans. Of course, the implicit cost is that a seqscan needs to
> fetch the visibility pages, too; and the locking is more complex.)

I very much like the idea of a generic method for including additional
bulk metadata for a relation (heap or index). That neatly provides a
general infrastructure for lots of clever things such as dead space,
visibility or other properties, while at the same time maintaining
modularity.

Can we call them "maps" or "metadata maps"? "forks" sounds weird.

We don't need to assume anything about the maps themselves at this
stage, so we might imagine tightly coupled maps that are always updated
as a relation changes, or loosely coupled maps that are lazily updated
by background processes. Autovacuum then becomes the vehicle by which we
execute "map maintenance" procedures, defined according to which AMs are
installed and what relation options are set. So we have a completely
generalised data/metadata storage infrastructure.

Sensibly arranged this could provide an entry point for powerful new
features within existing and future index AMs. It also sounds like it
might avoid a whole class of bugs and special cases that I regrettably
foresee would be unavoidable in Heikki's proposal.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 16:16:14
Message-ID: 873aqpcomp.fsf@oxford.xeocode.com
Lists: pgsql-hackers

"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
>> My original thought was to have a separate RelFileNode for each of the
>> maps. That would require no smgr or xlog changes, and not very many
>> changes in the buffer manager, though I guess you'd need more catalog
>> changes. You had doubts about that on the previous thread
>> (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but
>> the "map forks" idea certainly seems much more invasive than that.
>
> The main problems with that are (a) the need to expose every type of map
> in pg_class and (b) the need to pass all those relfilenode numbers down
> to pretty low levels of the system. The nice thing about the fork idea
> is that you don't need any added info to uniquely identify what relation
> you're working on. The fork numbers would be hard-wired into whatever
> code needed to know about particular forks. (Of course, these same
> advantages apply to using special space in an existing file. I'm
> just suggesting that we can keep these advantages without buying into
> the restrictions that special space would have.)

One advantage of using separate relfilenodes would be that if we need to
regenerate a map we could do it in a new relfilenode and swap it in like we do
with heap rewrites.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!


From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 16:51:01
Message-ID: 47DEA175.1040405@dunslane.net
Lists: pgsql-hackers

Gregory Stark wrote:
> "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:
>
>
>> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
>>
>>> My original thought was to have a separate RelFileNode for each of the
>>> maps. That would require no smgr or xlog changes, and not very many
>>> changes in the buffer manager, though I guess you'd need more catalog
>>> changes. You had doubts about that on the previous thread
>>> (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but
>>> the "map forks" idea certainly seems much more invasive than that.
>>>
>> The main problems with that are (a) the need to expose every type of map
>> in pg_class and (b) the need to pass all those relfilenode numbers down
>> to pretty low levels of the system. The nice thing about the fork idea
>> is that you don't need any added info to uniquely identify what relation
>> you're working on. The fork numbers would be hard-wired into whatever
>> code needed to know about particular forks. (Of course, these same
>> advantages apply to using special space in an existing file. I'm
>> just suggesting that we can keep these advantages without buying into
>> the restrictions that special space would have.)
>>
>
> One advantage of using separate relfilenodes would be that if we need to
> regenerate a map we could do it in a new relfilenode and swap it in like we do
> with heap rewrites.
>
>

Why can't you just do that with a different extension and file rename?
You'd need an exclusive lock while swapping in the new map, but you need
that anyway, IIRC, and this way you don't even need a catalog change.

cheers

andrew


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Gregory Stark <stark(at)enterprisedb(dot)com>
Cc: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 17:08:29
Message-ID: 14761.1205773709@sss.pgh.pa.us
Lists: pgsql-hackers

Gregory Stark <stark(at)enterprisedb(dot)com> writes:
> One advantage of using separate relfilenodes would be that if we need to
> regenerate a map we could do it in a new relfilenode and swap it in like we do
> with heap rewrites.

You could probably do that using a temporary fork number, if the
situation ever came up.

regards, tom lane


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 17:23:46
Message-ID: 15230.1205774626@sss.pgh.pa.us
Lists: pgsql-hackers

Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
>> Tom Lane wrote:
>>> The idea that's becoming attractive to me while contemplating the
>>> multiple-maps problem is that we should adopt something similar to
>>> the old Mac OS idea of multiple "forks" in a relation.

> Can we call them "maps" or "metadata maps"? "forks" sounds weird.

I'm not wedded to "forks", that's just the name that was used in the
only previous example I've seen. Classic Mac had a "resource fork"
and a "data fork" within each file.

Don't think I like "maps" though, as (a) that prejudges what the
alternate forks might be used for, and (b) the name fails to be
inclusive of the data fork. Other suggestions anyone?

BTW, thinking about the Mac precedent a little more, I believe
the way they grafted that Classic concept onto BSD was that
applications (which the user thinks of as single objects) are
now directories with multiple files inside them. Probably it'd be
overkill to think of turning each relation into a subdirectory, but
then again maybe not?

regards, tom lane


From: David Fetter <david(at)fetter(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 18:25:18
Message-ID: 20080317182518.GI8834@fetter.org
Lists: pgsql-hackers

On Mon, Mar 17, 2008 at 01:23:46PM -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> >> Tom Lane wrote:
> >>> The idea that's becoming attractive to me while contemplating
> >>> the multiple-maps problem is that we should adopt something
> >>> similar to the old Mac OS idea of multiple "forks" in a
> >>> relation.
>
> > Can we call them "maps" or "metadata maps"? "forks" sounds weird.
>
> I'm not wedded to "forks", that's just the name that was used in the
> only previous example I've seen. Classic Mac had a "resource fork"
> and a "data fork" within each file.
>
> Don't think I like "maps" though, as (a) that prejudges what the
> alternate forks might be used for, and (b) the name fails to be
> inclusive of the data fork. Other suggestions anyone?

Segment? Section? Module?

Cheers,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david(dot)fetter(at)gmail(dot)com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 18:29:41
Message-ID: 1205778581.4285.241.camel@ebony.site
Lists: pgsql-hackers

On Mon, 2008-03-17 at 13:23 -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> >> Tom Lane wrote:
> >>> The idea that's becoming attractive to me while contemplating the
> >>> multiple-maps problem is that we should adopt something similar to
> >>> the old Mac OS idea of multiple "forks" in a relation.
>
> > Can we call them "maps" or "metadata maps"? "forks" sounds weird.
>
> I'm not wedded to "forks", that's just the name that was used in the
> only previous example I've seen. Classic Mac had a "resource fork"
> and a "data fork" within each file.

Layer? Slab? Sheet? Strata/um? Overlay?

Layer makes sense to me because of the way GIS and CAD systems work.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com

PostgreSQL UK 2008 Conference: http://www.postgresql.org.uk


From: "Jochem van Dieten" <jochemd(at)gmail(dot)com>
To: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 18:45:06
Message-ID: f96a9b830803171145u500d43b0o8190a659df8e45fb@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 17, 2008 at 6:23 PM, Tom Lane wrote:
> Simon Riggs writes:
>>
>> Can we call them "maps" or "metadata maps"? "forks" sounds weird.
>
> I'm not wedded to "forks", that's just the name that was used in the
> only previous example I've seen. Classic Mac had a "resource fork"
> and a "data fork" within each file.

Microsoft / NTFS calls them Data Streams:
http://www.wikistc.org/wiki/Alternate_data_streams

Jochem


From: "Mark Cave-Ayland" <mark(dot)cave-ayland(at)siriusit(dot)co(dot)uk>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 18:54:28
Message-ID: 64383.86.3.91.160.1205780068.squirrel@ra.siriusit.co.uk
Lists: pgsql-hackers


On Mon, 2008-03-17 at 13:23 -0400, Tom Lane wrote:

> I'm not wedded to "forks", that's just the name that was used in the
> only previous example I've seen. Classic Mac had a "resource fork"
> and a "data fork" within each file.
>
> Don't think I like "maps" though, as (a) that prejudges what the
> alternate forks might be used for, and (b) the name fails to be
> inclusive of the data fork. Other suggestions anyone?

I believe that in the world of NTFS the concept is called "streams":
http://support.microsoft.com/kb/105763.

HTH,

Mark.

--
Mark Cave-Ayland
Sirius Corporation - The Open Source Experts
http://www.siriusit.co.uk
T: +44 870 608 0063


From: Gregory Stark <stark(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 19:02:04
Message-ID: 87hcf5b2dv.fsf@oxford.xeocode.com
Lists: pgsql-hackers


"Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us> writes:

> I'm not wedded to "forks", that's just the name that was used in the
> only previous example I've seen. Classic Mac had a "resource fork"
> and a "data fork" within each file.

fwiw forks are not unique to MacOS, c.f.:
http://en.wikipedia.org/wiki/Fork_%28filesystem%29

However I'm not sure reusing any of these terms is such a hot idea. All it's
going to do is confuse someone into thinking we're actually talking about HFS
forks or NTFS data streams or whatever. Better to pick a term that isn't
already being used for such things so people don't get misled.

> BTW, thinking about the Mac precedent a little more, I believe
> the way they grafted that Classic concept onto BSD was that
> applications (which the user thinks of as single objects) are
> now directories with multiple files inside them. Probably it'd be
> overkill to think of turning each relation into a subdirectory, but
> then again maybe not?

Well there are upsides and downsides. Many OSes have difficulties when you
have many files in a single directory. This would tend to reduce that. On the
other hand it would drastically increase the number of directory files the OS
has to keep track of and the total number of inodes being referenced.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 19:26:26
Message-ID: 47DEC5E2.6090103@enterprisedb.com
Lists: pgsql-hackers

Tom Lane wrote:
> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
>> Tom Lane wrote:
>>> You're cavalierly waving away a whole boatload of problems that will
>>> arise as soon as you start trying to make the index AMs play along
>>> with this :-(.
>
>> It doesn't seem very hard.
>
> The problem is that the index AMs are no longer in control of what goes
> where within their indexes, which has always been their prerogative to
> determine. The fact that you think you can kluge btree to still work
> doesn't mean that it will work for other AMs.

Well, it does work with all the existing AMs AFAICS. I do agree with the
general point; it'd certainly be cleaner, more modular and more flexible
if the AMs didn't need to know about the existence of the maps.

>>> The idea that's becoming attractive to me while contemplating the
>>> multiple-maps problem is that we should adopt something similar to
>>> the old Mac OS idea of multiple "forks" in a relation.
>
>> Hmm. You also need to teach at least xlog.c and xlogutils.c about the
>> map forks, for full page images and the invalid page tracking.
>
> Well, you'd have to teach them something anyway, for any incarnation
> of maps that they might need to update.

Umm, the WAL code doesn't care where the pages it operates on came from.
Sure, we'll need rmgr-specific code that knows what to do with the maps,
but the full page image code would work without changes with the
multiple RelFileNode approach.

The essential change with the map fork idea is that a RelFileNode no
longer uniquely identifies a file on disk (ignoring the segmentation
which is handled in smgr for now). Anything that operates on
RelFileNodes, without any higher level information of what it is, needs
to be modified to use RelFileNode+forkid instead. That includes at least
the buffer manager, smgr, and the full page image code in xlog.c.

It's probably a pretty mechanical change, even though it affects a lot
of code. We'd probably want to have a new struct, let's call it
PhysFileId for now, for RelFileNode+forkid, and basically replace all
occurrences of RelFileNode with PhysFileId in smgr, bufmgr and xlog code.
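
Something like this, just to put a shape on the name (hypothetical, of course):

#include "postgres.h"
#include "storage/relfilenode.h"

/* A relation plus which fork within it; would replace bare RelFileNode in
 * smgr, bufmgr and the full-page-image code in xlog. */
typedef struct PhysFileId
{
    RelFileNode rnode;          /* the relation, as identified today */
    int         forkid;         /* which fork (0 = main data fork) */
} PhysFileId;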

>> I also wonder what the performance impact of extending BufferTag is.
>
> That's a fair objection, and obviously something we'd need to check.
> But I don't recall seeing hash_any so high on any profile that I think
> it'd be a big problem.

I do remember seeing hash_any in some oprofile runs. But that's fairly
easy to test: we don't need to actually implement any of the stuff,
other than add a field to BufferTag, and run pgbench.

>> My original thought was to have a separate RelFileNode for each of the
>> maps. That would require no smgr or xlog changes, and not very many
>> changes in the buffer manager, though I guess you'd need more catalog
>> changes. You had doubts about that on the previous thread
>> (http://archives.postgresql.org/pgsql-hackers/2007-11/msg00204.php), but
>> the "map forks" idea certainly seems much more invasive than that.
>
> The main problems with that are (a) the need to expose every type of map
> in pg_class and (b) the need to pass all those relfilenode numbers down
> to pretty low levels of the system.

(a) is certainly a valid point. Regarding (b), I don't think the low
level stuff (I assume you mean smgr, bufmgr, bgwriter, xlog by that)
would need to be passed any additional relfilenode numbers. Or rather,
they already work with relfilenodes, and they don't need to know whether
the relfilenode is for an index, a heap, or an FSM attached to something
else. The relfilenodes would be in RelationData, and we already have
that around whenever we do anything that needs to differentiate between
those.

Another consideration is which approach is easiest to debug. The "map
fork" approach seems better on that front, as you can immediately see
from the PhysFileId if a page is coming from an auxiliary map or the
main data portion. That might turn out to be handy in the buffer manager
or bgwriter as well; they don't currently have any knowledge of what a
page contains.

> The nice thing about the fork idea
> is that you don't need any added info to uniquely identify what relation
> you're working on. The fork numbers would be hard-wired into whatever
> code needed to know about particular forks. (Of course, these same
> advantages apply to using special space in an existing file. I'm
> just suggesting that we can keep these advantages without buying into
> the restrictions that special space would have.)

I don't see that advantage. All the higher-level code that cares which
relation you're working on already has the Relation around. The
lower-level stuff doesn't care.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: "Dawid Kuroczko" <qnex42(at)gmail(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "Simon Riggs" <simon(at)2ndquadrant(dot)com>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-17 19:34:42
Message-ID: 758d5e7f0803171234m76678d52m61aa79ea84abee84@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 17, 2008 at 6:23 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> I'm not wedded to "forks", that's just the name that was used in the
> only previous example I've seen. Classic Mac had a "resource fork"
> and a "data fork" within each file.
>
> Don't think I like "maps" though, as (a) that prejudges what the
> alternate forks might be used for, and (b) the name fails to be
> inclusive of the data fork. Other suggestions anyone?

Shadow? As each err, fork trails each relfilenode? (Or perhaps "shade").

Hints? As something more generic than "map"?

Regards,
Dawid


From: tomas(at)tuxteam(dot)de
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Heikki Linnakangas <heikki(at)enterprisedb(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-18 08:03:07
Message-ID: 20080318080307.GB29147@www.trapp.net
Lists: pgsql-hackers

On Mon, Mar 17, 2008 at 01:23:46PM -0400, Tom Lane wrote:
> Simon Riggs <simon(at)2ndquadrant(dot)com> writes:
> >> Tom Lane wrote:
> >>> The idea that's becoming attractive to me while contemplating the
> >>> multiple-maps problem is that we should adopt something similar to
> >>> the old Mac OS idea of multiple "forks" in a relation.
>
> > Can we call them "maps" or "metadata maps"? "forks" sounds weird.

Actually, I do like "forks", but to add a little bit diversity:

facets? aspects?

FWIW, the idea of mapping a relation to a directory is quite compelling.

Regards
--
tomás


From: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
To: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-20 20:47:55
Message-ID: 47E2CD7B.3090600@enterprisedb.com
Lists: pgsql-hackers

Heikki Linnakangas wrote:
> Tom Lane wrote:
>> "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
>>> I also wonder what the performance impact of extending BufferTag is.
>>
>> That's a fair objection, and obviously something we'd need to check.
>> But I don't recall seeing hash_any so high on any profile that I think
>> it'd be a big problem.
>
> I do remember seeing hash_any in some oprofile runs. But that's fairly
> easy to test: we don't need to actually implement any of the stuff,
> other than add a field to BufferTag, and run pgbench.

I tried that. hash_any wasn't significant on pgbench, but I was able to
construct a test case where it is. It goes like this:

BEGIN;
CREATE TEMPORARY TABLE foo (id int4);
INSERT INTO foo SELECT a FROM generate_series(1, 10000) a;
INSERT INTO foo SELECT * FROM foo;
INSERT INTO foo SELECT * FROM foo;
INSERT INTO foo SELECT * FROM foo;
... (repeat multiple times)

oprofile says that hash_any consumes ~7 % of CPU time on that test on my
laptop.

More precisely, on CVS HEAD it takes between 7.1-7.2%. After extending
BufferTag with one uint32, it takes 7.4-7.5%. So the effect is
measurable if you try hard enough, but not anything to get worried about.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Heikki Linnakangas" <heikki(at)enterprisedb(dot)com>
Cc: "PostgreSQL-development" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Rewriting Free Space Map
Date: 2008-03-20 20:59:21
Message-ID: 23294.1206046761@sss.pgh.pa.us
Lists: pgsql-hackers

"Heikki Linnakangas" <heikki(at)enterprisedb(dot)com> writes:
> More precisely, on CVS HEAD it takes between 7.1-7.2%. After extending
> BufferTag with one uint32, it takes 7.4-7.5%. So the effect is
> measurable if you try hard enough, but not anything to get worried about.

And if we adopt the allegedly-faster hash_any that's in the patch
queue, the difference should get smaller yet.

regards, tom lane