Storage Model for Partitioning

Lists: pgsql-hackers
From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Storage Model for Partitioning
Date: 2008-01-11 10:25:40
Message-ID: 1200047140.4266.972.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

In my striving towards more effective partitioning for Postgres, I see
we have one main decision to make and that all other sub-tasks are
driven directly by this one issue. The issue is: At what point we store
the data within the existing storage model? We discussed this in 2005
when I started to discuss what became constraint exclusion and it
remains the core issue from which all other tasks are driven.

If we can establish the basics of how a table can be split into
partitions, then that allows work to progress on the other issues.

I'd like some guidance from the senior crew on this, which is hopefully
possible without getting embroiled in all the details of partitioning,
most of which are more straightforward technical issues.

The current storage model for a table is that above the smgr layer a
table looks like a single continuous range of blocks, while below the
smgr layer it is in fact a set of segment files.

Given where we are now, how should we change the storage model to
support partitioning? If at all. We have two types of requirement:
- manageability - tablespaces for each partition etc..
- performance - excluding partitions etc..

I've argued from multiple sides of the fence, so I'm trying to present a
neutral view to allow us to take the best route forward.

The basic options are these:

0. Do Nothing - we don't want any of the other options.

1. Partitions are Contiguous Ranges of Blocks
As proposed for segment exclusion based partitioning

2. Partitions are Tables
As used by current constraint exclusion based partitioning.

3. Partitions are RelFileNodes, but not Tables

4. Some Other Choice

In more detail...

1. Partitions are Contiguous Ranges of Blocks

Partitions are a simple subset of a table, i.e. a contiguous range of
blocks within the main block range of the table. That allows us to
maintain the current smgr model, which then allows...

- allows RI via SHARE locks
- avoids the need for complex planner changes
- allows unique indexes
- allows global indexes (because we only have one table)
- works naturally with synchronous scans and buffer recycling

Doing partitioning this way means we would (as a trivial example) assign
blocks 1-10 as partition 1, blocks 11-20 as partition 2 etc.. There are
two sub-options of this basic idea:

a) Dynamic Partitioning - we define partitioning based around what is
already in the table, rather than trying to force the data to a
"correct" partition. No changes to the current storage model. Useful,
but it doesn't do all that everybody wants.

- allows automated table extension, so works automatically with Slony
- allows partition wise merge joins
- but not easily usable with declarative partitioning

b) Fixed partitioning - we define partitions as static ranges of blocks,
which may leave us with holes in the range of BlockNumbers, plus each
partition has a maximum size that it cannot expand beyond. Probably
unacceptable.

2. Partitions are Tables

Current situation.

This means we have to change
- Nested loop joins work with partitions, so an IndexScan must be able
to cross partitions within the target table
- indexes, so they can refer to more than one partition
- share locking in the executor
- changes to allow synchronous scans and buffer recycling
- automatic partition creation required
- single DDL declaration from the Parent table

3. Partitions are RelFileNodes, but not Tables

We allow a table to have multiple RelFileNodes, when explicitly declared
that way.

This means we have to change
- DDL changes to allow TABLE level changes to apply to all RelFileNodes,
while PARTITION level changes to apply to only one RelFileNode
- indexes, so they can refer to more than one partition
- share locking in the executor
- changes to allow synchronous scans and buffer recycling

There *are* other changes not mentioned here that are required for
partitioning, which although complex are less doubtful.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: "Pavel Stehule" <pavel(dot)stehule(at)gmail(dot)com>
To: "Simon Riggs" <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Storage Model for Partitioning
Date: 2008-01-11 10:59:14
Message-ID: 162867790801110259v7fc706f5l6966815df380f921@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

>
> 3. Partitions are RelFileNodes, but not Tables
>
> We allow a table to have multiple RelFileNodes, when explicitly declared
> that way.
>
> This means we have to change
> - DDL changes to allow TABLE level changes to apply to all RelFileNodes,
> while PARTITION level changes to apply to only one RelFileNode
> - indexes, so they can refer to more than one partition
> - share locking in the executor
> - changes to allow synchronous scans and buffer recycling
>
> There *are* other changes not mentioned here that are required for
> partitioning, which although complex are less doubtful.
>
It can be used for global temporary tables with some modification

table defines structure and partitions defines contents for sessions.

Regards
Pavel Stehule
> Simon Riggs
> 2ndQuadrant http://www.2ndQuadrant.com
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
> http://www.postgresql.org/about/donate
>


From: Richard Huxton <dev(at)archonet(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Storage Model for Partitioning
Date: 2008-01-11 11:34:17
Message-ID: 47875439.2050309@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

More dumb user questions...

Simon Riggs wrote:
> 1. Partitions are Contiguous Ranges of Blocks
>
> Partitions are a simple subset of a table, i.e. a contiguous range of
> blocks within the main block range of the table.

> b) Fixed partitioning - we define partitions as static ranges of blocks,
> which may leave us with holes in the range of BlockNumbers, plus each
> partition has a maximum size that it cannot expand beyond. Probably
> unacceptable.

Clearly not going to make anyone happy if you can't fit 1.1GB of data
into your partition.

Is the following basically the same as option #3 (multiple RelFileNodes)?

1. Make an on-disk "chunk" much smaller (e.g. 64MB). Each chunk is a
contigous range of blocks.
2. Make a table-partition (implied or explicit constraints) map to
multiple "chunks".
That would reduce fragmentation (you'd have on average 32MB's worth of
blocks wasted per partition) and allow for stretchy partitions at the
cost of an extra layer of indirection.

For the single-partition case you'd not need to split the file of
course, so it would end up looking much like the current arrangement.

--
Richard Huxton
Archonet Ltd


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Storage Model for Partitioning
Date: 2008-01-11 11:48:18
Message-ID: 1200052098.7528.46.camel@PCD12478
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2008-01-11 at 11:34 +0000, Richard Huxton wrote:
> 1. Make an on-disk "chunk" much smaller (e.g. 64MB). Each chunk is a
> contigous range of blocks.
> 2. Make a table-partition (implied or explicit constraints) map to
> multiple "chunks".
> That would reduce fragmentation (you'd have on average 32MB's worth of
> blocks wasted per partition) and allow for stretchy partitions at the
> cost of an extra layer of indirection.

This sounds almost like some kind of "clustering index", where the index
contains ranges pointing to blocks of data... if the same index is also
used for inserting (i.e. the free space map is a partial "cluster index"
on blocks with free space), that would be a coarse clustering solution I
guess...

Cheers,
Csaba.


From: Richard Huxton <dev(at)archonet(dot)com>
To: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Storage Model for Partitioning
Date: 2008-01-11 11:57:42
Message-ID: 478759B6.4080505@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Csaba Nagy wrote:
> On Fri, 2008-01-11 at 11:34 +0000, Richard Huxton wrote:
>> 1. Make an on-disk "chunk" much smaller (e.g. 64MB). Each chunk is a
>> contigous range of blocks.
>> 2. Make a table-partition (implied or explicit constraints) map to
>> multiple "chunks".
>> That would reduce fragmentation (you'd have on average 32MB's worth of
>> blocks wasted per partition) and allow for stretchy partitions at the
>> cost of an extra layer of indirection.
>
> This sounds almost like some kind of "clustering index", where the index
> contains ranges pointing to blocks of data... if the same index is also
> used for inserting (i.e. the free space map is a partial "cluster index"
> on blocks with free space), that would be a coarse clustering solution I
> guess...

Which is roughly what Simon's original "Dynamic Partitioning" would be
if it became visible at the planner level (unless I've misunderstood). I
was picturing it in my head as a two-dimensional bitmap with
value-ranges along one axis and block-ranges along the other. Of course,
unlike other indexes it needs visibility information to be of any use.

Thinking about it, I'm not sure how my thinking would affect a
full-table seq-scan. You'd not get blocks back in-order throughout the
scan - would that matter?

--
Richard Huxton
Archonet Ltd


From: Csaba Nagy <nagy(at)ecircle-ag(dot)com>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Storage Model for Partitioning
Date: 2008-01-11 12:04:31
Message-ID: 1200053071.7528.53.camel@PCD12478
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

> Which is roughly what Simon's original "Dynamic Partitioning" would be
> if it became visible at the planner level (unless I've misunderstood). I
> was picturing it in my head as a two-dimensional bitmap with
> value-ranges along one axis and block-ranges along the other. Of course,
> unlike other indexes it needs visibility information to be of any use.

But why not have it as a normal index of ranges ? I'm not familiar with
the GIST extensions, but this sounds like a set of records (segments in
Simon's terms) indexed by their interval position on a line... isn't
that covered by some GIST index type ?

> Thinking about it, I'm not sure how my thinking would affect a
> full-table seq-scan. You'd not get blocks back in-order throughout the
> scan - would that matter?

That could be covered by something like the bitmap scan, just on coarser
level, the bits covering segments instead of blocks.

Cheers,
Csaba.


From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Richard Huxton <dev(at)archonet(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Storage Model for Partitioning
Date: 2008-01-11 12:15:46
Message-ID: 1200053746.4266.1042.camel@ebony.site
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

On Fri, 2008-01-11 at 11:34 +0000, Richard Huxton wrote:

> Is the following basically the same as option #3 (multiple RelFileNodes)?
>
> 1. Make an on-disk "chunk" much smaller (e.g. 64MB). Each chunk is a
> contigous range of blocks.
> 2. Make a table-partition (implied or explicit constraints) map to
> multiple "chunks".
> That would reduce fragmentation (you'd have on average 32MB's worth of
> blocks wasted per partition) and allow for stretchy partitions at the
> cost of an extra layer of indirection.
>
> For the single-partition case you'd not need to split the file of
> course, so it would end up looking much like the current arrangement.

We need to think about the "data model" of the storage layer. Space
itself isn't the issue, its the assumptions that all of the other
subsystems currently make about what how a table is structured, indexed,
accessed and manipulated.

Currently: Table 1:M Segments

Option 1: Table 1:M Segments and *separately* Table 1:M Partitions, so
partitions are always have a maximum size. The size just changes the
impact, doesn't change the impact of holes, max sizes etc.
e.g. empty table with 10 partitions would be
a) 0 bytes in 1 file
b) 0 bytes in 1 file, plus 9GB in 9 files all full of empty blocks

e.g. table with 10 partitions each of 1.5GB would be
a) 15 GB in 15 files
b) hit max size limit of partition: ERROR

Option 2: Table 1:M Child Tables 1:M Segments
e.g. empty table with 10 partitions would be
0 bytes in each of 10 files

e.g. table with 10 partitions each of 1.5GB would be
15GB in 10 groups of 2 files

Option 3: Table 1:M Nodes 1:M Segments
e.g. empty table with 10 partitions would be
0 bytes in each of 10 files

e.g. table with 10 partitions each of 1.5GB would be
15GB in 10 groups of 2 files

So 1b) seems definitely out.

The implications of 2 and 3 are what I'm worried about, which is why the
shortcomings of 1a) seem acceptable currently.

--
Simon Riggs
2ndQuadrant http://www.2ndQuadrant.com


From: Richard Huxton <dev(at)archonet(dot)com>
To: Simon Riggs <simon(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Storage Model for Partitioning
Date: 2008-01-11 13:26:13
Message-ID: 47876E75.3040001@archonet.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Lists: pgsql-hackers

Simon Riggs wrote:
> On Fri, 2008-01-11 at 11:34 +0000, Richard Huxton wrote:
>
>> Is the following basically the same as option #3 (multiple RelFileNodes)?
>>
>> 1. Make an on-disk "chunk" much smaller (e.g. 64MB). Each chunk is a
>> contigous range of blocks.
>> 2. Make a table-partition (implied or explicit constraints) map to
>> multiple "chunks".
>> That would reduce fragmentation (you'd have on average 32MB's worth of
>> blocks wasted per partition) and allow for stretchy partitions at the
>> cost of an extra layer of indirection.
>>
>> For the single-partition case you'd not need to split the file of
>> course, so it would end up looking much like the current arrangement.
>
> We need to think about the "data model" of the storage layer. Space
> itself isn't the issue, its the assumptions that all of the other
> subsystems currently make about what how a table is structured, indexed,
> accessed and manipulated.

Which was why I was thinking you'd want to maintain indexes etc.
thinking in terms of a table being a contiguous set of blocks, with the
mapping to an actual on-disk block taking place below that level. (If
I've understood you).

> Currently: Table 1:M Segments
>
> Option 1: Table 1:M Segments and *separately* Table 1:M Partitions, so
> partitions are always have a maximum size. The size just changes the
> impact, doesn't change the impact of holes, max sizes etc.
> e.g. empty table with 10 partitions would be
> a) 0 bytes in 1 file
> b) 0 bytes in 1 file, plus 9GB in 9 files all full of empty blocks

Well, presumably 0GB in 10 files, but 10GB-worth of block-numbers
"pre-allocated".

> e.g. table with 10 partitions each of 1.5GB would be
> a) 15 GB in 15 files

With the limitation that any given partition might contain a mix of
data-ranges (e.g. 2005 lies half in partition 2 and half in partition 3).

> b) hit max size limit of partition: ERROR

In the case of 1b, you could have a segment mapping to more than 1
partition, avoiding the error. So 2004 data is in partition 1, 2005 is
in partitions 2,3 (where 3 is half empty), 2006 is in partition 4.
However, this does mean you've got a lot of wasted block numbers. If you
were using explicit (fixed) partitioning and chose a bad set of criteria
your maximum table size could be substantially reduced.

> Option 2: Table 1:M Child Tables 1:M Segments
> e.g. empty table with 10 partitions would be
> 0 bytes in each of 10 files
>
> e.g. table with 10 partitions each of 1.5GB would be
> 15GB in 10 groups of 2 files

Cross-table indexes and constraints would be useful outside of the
current scenario.

> Option 3: Table 1:M Nodes 1:M Segments
> e.g. empty table with 10 partitions would be
> 0 bytes in each of 10 files
>
> e.g. table with 10 partitions each of 1.5GB would be
> 15GB in 10 groups of 2 files

Ah, so this does seem to be roughly the same as I was rambling about.
This would presumably mean that rather than (table, block #) specifying
the location of a row you'd need (table, node #, block #).

> So 1b) seems definitely out.
>
> The implications of 2 and 3 are what I'm worried about, which is why the
> shortcomings of 1a) seem acceptable currently.

--
Richard Huxton
Archonet Ltd